
The server crash that wasn't


The first sign of trouble was the paging and the frantic calls to my extension. "Tim, the server is down. Help!" Those are words that a network administrator dreads hearing. I ran to the server room to check things out. I had been working on another server a few minutes earlier, copying some archived e-mail files to an external USB drive. The application the users were complaining about was on a different server. This just did not make sense.

Being the professional that I am, I started at the physical layer. The server was running and I could log in, but it was not seeing the rest of the network. Bad NIC? Bad switch port? Nah... just a loose network cable. This is not supposed to happen. Could vibration have knocked it out? No. I quickly realized that this particular cable was one I had intended to replace a long time ago. It had a broken connector clip. I must have knocked it loose a few minutes earlier.

I quickly plugged it back in and made a mental note to replace that cable some evening when most of the users are off-line. For now, they were back up and running. Or were they? The calls continued: "We're still getting the same Btrieve errors." We use an old application that runs on P-SQL (Pervasive Btrieve, for those who remember it from the old Novell days). It hates lost connections and throws errors immediately. I was hoping the database had not become corrupted.

I warned everybody that I was going to reboot the application server and restart the application. Not that the warning mattered much: there was nothing they could do, since they had already lost their connections. I could only hope they hadn't lost any critical data. A few minutes later all was well. The fire was out. The users were happily working and nobody remembered that there had been a problem. My heart slowed down and I got back to other tasks. Disaster averted. No server crash.

Ultimately, the problem was my fault and should not have happened. A loose cable and a nudge while plugging a USB device into the server below it set the wheels in motion. I've had worse things happen, like losing two drives in a RAID array when a fan failed and monitoring was not turned on. How about you? What's your server crash or network failure horror story? I would love to read about it. A simple loss of a DSL line doesn't count. That happens too frequently.
