Technical SNAFU round-up

Jul 05, 2009 03:32


When in doubt, micro-reboot:
My Fujitsu T5010 has long had issues with its AuthenTec fingerprint reader. In particular, it will just stop working, seemingly at random, every few days or so. By process of elimination, I determined that restarting ATService, a service daemon that ships with the reader, resolves the issue until it resurfaces a few days later. This is a pain to do repeatedly, particularly since I can't do it when it's most needed: at the login screen, before logging back on.

Kludge-solution: prophylactically restart the service at regular intervals.
Task Scheduler can do this pretty cleanly, and it takes care of the annoying business of supplying Vista administrator privileges. I configured a custom task to run 10 seconds after every login and issue the commands "net stop atservice" and "net start atservice" with administrator privileges. The problem isn't solved, but now it occurs much less frequently, and it automatically resolves itself shortly after every flare-up. (I tried to be smarter about it and make the problem go away completely by restarting the service shortly before I would need it to log in, but this proved too dicey to get right. In particular, the restart often disrupted the connection the OmniPass login software had already made to the fingerprint reader driver.)
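For anyone who would rather point the scheduled task at a script than at two separate commands, here is a minimal Python sketch of the same stop-then-start sequence. It is only an illustration, not what I actually configured: the function names and the injectable `run` parameter are made up for the example, and the only thing taken from the setup above is the pair of `net stop` / `net start` commands.

```python
import subprocess

# Service name as it appears in the "net stop" / "net start" commands above.
SERVICE_NAME = "atservice"


def restart_commands(service=SERVICE_NAME):
    """Return the two commands that stop and then start the service."""
    return [["net", "stop", service], ["net", "start", service]]


def restart_service(service=SERVICE_NAME, run=subprocess.run):
    """Run the stop command, then the start command, and return both exit codes.

    A failed stop (for example, the service was already stopped) does not
    prevent the start from being attempted.
    `run` is a hypothetical hook so the sequence can be exercised without
    touching a real service; by default it is subprocess.run.
    """
    codes = []
    for command in restart_commands(service):
        result = run(command, capture_output=True, text=True)
        codes.append(result.returncode)
    return codes
```

Calling `restart_service()` only makes sense on Windows from an elevated context, which is exactly what the scheduled task provides.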

802.11, still half-baked:
The office recently installed several Meraki dual-band access points for a trial run comparing them with the old Cisco access points. I soon began experiencing extremely bad network behavior on my laptop, particularly (I think) at night -- regular drop-outs, long ping times, etc. After ruling out interference and signal strength problems (using a WiSpy and an AP scanner), I observed that my Intel WiFi Link 5300 AGN card was using 11n rates, and by trial and error, showed that disabling 11n completely solved the problem. 11n is still pre-standard but widely implemented. Unfortunately, as this makes clear, it's also wrongly implemented. Making higher bitrates available should never cause throughput or reliability to fall. Only a broken bitrate selection algorithm lets this happen.
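To make that last claim concrete, here is a toy Python sketch of a rate selector that picks whichever rate has the best expected goodput (bitrate times delivery probability). This is not how any real driver or firmware chooses rates, and the rates and probabilities are invented for illustration. The point is only that a selector of this shape can never do worse when more rates are offered, because the old rates are still among the candidates.

```python
def expected_goodput(rate_mbps, delivery_prob):
    """Expected useful throughput: nominal bitrate scaled by delivery probability."""
    return rate_mbps * delivery_prob


def pick_rate(delivery_prob_by_rate):
    """Return the rate (in Mbps) with the highest expected goodput."""
    return max(
        delivery_prob_by_rate,
        key=lambda rate: expected_goodput(rate, delivery_prob_by_rate[rate]),
    )


# Hypothetical numbers: legacy 11a/g rates deliver well, 11n rates do not.
legacy = {36: 0.95, 54: 0.90}
with_11n = {**legacy, 130: 0.20, 300: 0.05}

best_legacy = pick_rate(legacy)
best_with_11n = pick_rate(with_11n)
```

With these made-up numbers the selector sticks with 54 Mbps in both cases. Enabling 11n only hurts if the algorithm keeps choosing a fast rate that mostly fails.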

USB did *what*?:
This is perhaps the weirdest of the bunch. My laptop had lately begun to hang frequently during suspend -- a pain in the ass when I want to go home! I've had vaguely similar issues in the past, but none of the old causes applied. A chance observation led me to the proximate cause: Plug and Play was hanging. Shortly before a hang during suspend, it was always impossible to add or remove devices. Suspend and resume rely on Plug and Play, which is tightly coupled with power management, being in working order.

In Windows, Plug and Play is essentially single-threaded. The key operations of enumerating, adding, and removing devices are handled in the kernel under a single set of locks that essentially serializes the process. Unfortunately, this means that if a single driver hangs on a PnP request, the entire process of handling device churn grinds to a halt. Alas, the only sure way to cure a driver hang is to reboot.
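The effect of that serialization can be mimicked in a few lines of Python. This is a loose analogy, not a model of the actual kernel code: a single `threading.Lock` stands in for the PnP locks, the device names are arbitrary, and the timeout exists only so the demo terminates (a real waiter would simply wait forever).

```python
import threading


def handle_request(name, log, lock, hang=None, acquired=None, timeout=0.2):
    """Process one device request under the shared lock, or report it blocked."""
    if not lock.acquire(timeout=timeout):
        log.append((name, "blocked"))
        return
    try:
        if acquired is not None:
            acquired.set()   # tell the demo that the lock is now held
        if hang is not None:
            hang.wait()      # simulate a driver that does not complete
        log.append((name, "done"))
    finally:
        lock.release()


def demo():
    lock = threading.Lock()        # stand-in for the single set of PnP locks
    log = []
    hang = threading.Event()
    acquired = threading.Event()
    stuck = threading.Thread(
        target=handle_request, args=("keyboard", log, lock, hang, acquired)
    )
    stuck.start()
    acquired.wait()                      # the hung request now owns the lock
    handle_request("mouse", log, lock)   # cannot proceed: reported as blocked
    hang.set()                           # stand-in for the reboot that clears the hang
    stuck.join()
    handle_request("mouse", log, lock)   # succeeds once the lock is free
    return log
```

Running `demo()` shows the unrelated "mouse" request stalling behind the hung "keyboard" request and completing only after the hang is cleared.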

Why was a driver hanging in the first place, though? This itself is a serious failure. I used the kernel debugger and some intuitive leaps of faith to trace through the stalled PnP requests (somewhat complicated because much of the data had been paged out due to idleness), which led me to the device node representing the small USB hub I use to conveniently plug in my keyboard, mouse, and PS2->USB converter (for a KVM) all at once. What's strange about this cluster is that it's all covered by standard, heavily used, in-box Microsoft drivers: USBHUB, HIDUSB (for "Human Interface Devices"), and USBCCGP (for composite devices). In-box MS drivers are rarely found responsible for trouble; it's almost always 3rd party drivers that fail.

Once blame was pretty well localized, I found the problem was easily reproducible. Simply plug and un-plug the hub several times, and on about the third try, one of the devices on the hub will fail to come up. Unplug and re-plug the device, and it will fix itself. But then, unplug the hub, and the removal hangs. While the PS2->USB converter attracted my early suspicions because it showed up explicitly in the kernel dumps and it's a piece of cheap, known-unreliable hardware, it was quickly exonerated by pulling it from the hub. Similarly, the mouse was also unnecessary for triggering the hang. But both the keyboard and the hub were necessary. An empty hub was harmless. The keyboard connected directly might fail to come up, but it never caused a hang. Together, however, the failure was consistent.

So... the two most commonly used USB drivers were interacting with a run-of-the-mill generic hub and a run-of-the-mill Dell keyboard, in a way they never used to do on this or any prior computer, to cause PnP requests to hang. WTF??

My suspicion is that a hardware failure is involved and that it is triggering a corner case bug in one of the drivers... but damned if I can guess why. In the meantime, I'm just going to avoid that hub. :P Stay tuned...

bugs, operating systems
