The first sign of trouble was the paging and the frantic calls to my extension. "Tim, the server is down. Help!" Those are dreaded words that no network administrator ever wants to hear. I run to the server room to check things out. I had been working on another server a few minutes earlier, copying some archived e-mail files to an external USB drive. The application the users were complaining about was on a different server. This just does not make sense.
Being the professional that I am, I quickly start at the physical layer. The server is running; I can log in, but it's not seeing the rest of the network. Bad NIC? Bad switch port? Nah... just a loose network cable. This is not supposed to happen. Could vibration have knocked it out? No, I quickly realize that this particular cable is one that I had intended to replace a long time ago. It has a broken connector clip. I must have knocked it loose a few minutes earlier.
I quickly plug it back in and make a mental note to replace that cable some evening when most of the users are off-line. For now, they are back up and running. Or are they? The calls continue: "We're still getting the same Btrieve errors." We use an old application that runs on P-SQL, Pervasive Btrieve, for those who remember it from the old Novell days. It hates lost connections and throws errors immediately. I'm hoping the database has not become corrupted.
I warn everybody that I'm going to reboot the application server. The warning doesn't matter much, as there is nothing they can do. They have already lost their connections. I can only hope they didn't lose any critical data. A few minutes later all is well. The fire is out. The users are happily working and nobody remembers that there was a problem. My heart slows down and I get back to other tasks. Disaster averted. No server crash.
Ultimately, the problem was my fault and should not have happened. A loose cable and a nudge while plugging in a USB device on the server below it set the wheels in motion. I've had worse things happen, like losing two drives out of a RAID array when a fan failed and monitoring was not turned on. How about you? What's your server crash or network failure horror story? I would love to read about it. A simple loss of a DSL line doesn't count. That happens too frequently.