
Lost while troubleshooting? Retrace your steps.


Our topic today: the value of retracing your steps. Now illustrated with embarrassing anecdotes!

Have you ever had one of those days? A day where you're not quite at the top of your game, and you're not picking up on things the way you normally would? Or a day when you get a little too self-assured, and you miss something obvious? I'll bet you have. We all stumble every now and again. It's these days that should remind us of our "fundamentals" — those standard practices that make our work easier.

Double-check things that might have recently changed.

A few weeks ago, I ended up placing a call to tech support myself. I thought for sure that I was seeing an outage of our university's central mail system. In the department I work for, we run a number of our own network servers, but we all read mail from the cluster maintained by the central computing service. I hadn't seen any new messages for about 18 hours: none in my desktop mail client, none on my BlackBerry, and none in the web mail system. This was strange, since I have several servers that mail me their status updates every night. I sent myself a test message from an off-campus email account to confirm that new messages weren't being delivered.

Feeling pretty sure of myself, I called the central helpdesk hotline, expecting that the on-call tech would admit, "We're having a problem, and the system should be up soon." That didn't happen. "Well, we haven't had any other reports of an outage," he told me. The tech sent a message to my account, and a log trace showed that it was delivered to my mail queue without a problem. "Have you changed anything about your account recently?" he asked me. And I smacked my forehead.

The day before, at the close of a long shift, I had made some tweaks to one of the mail rules I have in place on my account. Being a little wiped out after the trials of the day, I made a rookie's mistake: I used an OR statement in my filter instead of an AND statement. This turned my finely honed scalpel, which I intended to catch only a very specific type of junk message, into a sledgehammer that sorted every email I received into my spam folder. My problem was easily fixed, once the on-call tech had reminded me of the cardinal rule of troubleshooting: rule out any recent changes first. If I'd had the presence of mind to retrace my steps from the day before, I'd have found my own mistake, and avoided that serving of "crow."
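For anyone who wants to see how one word does that much damage, here is a minimal Python sketch. It is not my actual mail rule: the two conditions, the address, and the sample messages are all hypothetical, chosen only to show how a condition that is true for nearly every message behaves when it is joined with OR instead of AND.

```python
from dataclasses import dataclass


@dataclass
class Message:
    recipient: str
    subject: str


# Hypothetical address, for illustration only.
MY_ADDRESS = "me@example.edu"


def addressed_to_me(msg: Message) -> bool:
    # True for almost everything that lands in my mailbox.
    return msg.recipient == MY_ADDRESS


def looks_like_junk(msg: Message) -> bool:
    # Hypothetical narrow test for one specific kind of junk message.
    return "special offer" in msg.subject.lower()


def intended_rule(msg: Message) -> bool:
    # The scalpel: both conditions must hold.
    return addressed_to_me(msg) and looks_like_junk(msg)


def mistaken_rule(msg: Message) -> bool:
    # The sledgehammer: either condition is enough, and the first
    # one is true for nearly every message.
    return addressed_to_me(msg) or looks_like_junk(msg)


inbox = [
    Message(MY_ADDRESS, "Nightly status: server01 OK"),
    Message(MY_ADDRESS, "Special offer just for you"),
    Message(MY_ADDRESS, "Test from off-campus"),
]

spam_intended = [m.subject for m in inbox if intended_rule(m)]
spam_mistaken = [m.subject for m in inbox if mistaken_rule(m)]

print("AND rule files as spam:", spam_intended)
print("OR rule files as spam: ", spam_mistaken)
```

With the AND rule, only the one junk message is filed as spam. With the OR rule, the status update and the test message go with it, which is roughly what an empty inbox looks like from the outside.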

Make sure that you can reproduce the problem...more than once.

Just yesterday, I encountered one of those problems that had me banging my head against the wall, at least until I went back to "square one." I was setting up a new laptop for a staff member, and had just finished dumping our standard software image onto the computer. The image was a tested build for that make and model of machine, and I was just going to apply some recent patches when I discovered that Windows couldn't detect the laptop's on-board network card. This was strange, since I had just cloned my software image onto the laptop over that Ethernet connection.

I proceeded to check the device manager for the missing network interface. I reinstalled the appropriate network drivers, and the PCI bus drivers, just to be thorough. I compared the hardware specifications of the new laptop to the ones I had previously recorded for that model line to make sure that there hadn't been a subtle hardware revision in the last month. I checked the BIOS and Windows' networking control panel to make sure that the card wasn't disabled in either of those places.

I started to get frustrated. I was being methodical, logical, all the things you're supposed to be when troubleshooting, and I still couldn't find the source of the problem. That's when it occurred to me that I should try flashing the laptop back to the manufacturer's software build. I did so, and there was my missing network interface!

For a moment, my faith in my custom software build was shaken. By going back to the OEM configuration, had I discovered a problem with our custom image? I quickly came to my senses, realizing that for months I'd had that software build running soundly on a dozen laptops in the office. So I shrugged, and I tried reimaging the laptop again with our in-house build...and everything worked fine. My phantom network card was detected on startup, and performed fine through a diagnostic.

I'll certainly keep an eye on that laptop for a few weeks. Maybe what I was seeing was the preview of a hardware failure to come. But if the problem doesn't recur, I'll sleep fine without an explanation for what was going on. A problem solved without a clear explanation is still one less problem. Retracing my steps in this instance saved me from making a mountain out of a molehill, because my next tactic would have had me reinstalling Windows by hand and then testing the driver interactions one by one.

Hopefully, every problem you encounter is easy to solve. Me, every now and again I'll get one that makes me want to turn in my multi-tool. When the work gets hard, and the next step is unclear, that should be our cue to examine the situation again, from the beginning. Sometimes, looking back can illuminate the way forward.

Because you'll catch that obvious thing you overlooked. Ha!
