
10 ways to survive a critical system outage

System outages are inevitable when working in IT. The trick is to get through the experience while keeping your wits and reputation intact.

Just about every IT professional has had that moment. That one where you realize something your company relies on — which you support — is now down, out, or dead. Whether it's email, Internet access, the phone system, or a shipping program your warehouse depends on, a vital link has just been severed and you're the surgeon who has to repair it. For the purpose of this article, I'll refer to the vital service as "OMG" since that's probably what's on your mind right now.

Words can't describe that sinking feeling in the pit of your stomach. Suddenly, the routine frustrations of the day seem irrelevant. Resetting that accounting temp's forgotten password for the fifth time in an hour would be a sunny picnic, and you long for the serenity of troubleshooting the CEO's virus-ridden laptop.

How you deal with a system outage defines where the rubber meets the road in your career. Although the next few minutes (or hours) are probably the least amount of fun you'll have outside of a dentist's chair, how you react can mean the difference between being the hero or the goat. Here are 10 steps to guide your way through the experience so you can don the red cape after it's all over.

1: Don't panic

Yes, this is borrowed from Douglas Adams, but it is universal. Even if you caused the problem, the worst thing you can do now is waste time fretting over how much trouble you'll be in or who's going to get upset with you. You may be petrified that this will become a career-ending move, but you can't worry about that right now. Your best chance is to correct what went wrong, and you can only do that by staying technical rather than getting emotional. Remember that this isn't a life or death situation (unless of course you support a hospital with patients at risk of a medical emergency — and if anything, that will make your ability to stay calm even more crucial). As the saying goes, fix it and then have a heart attack.

Now more than ever you'll need your focus, so push back any fear you may have and stick to the facts at hand: Something is broken and you need to get to work.

2: Notify your users any way you can

You need to get the word out to your users (and your boss) that OMG is down. Of course, your primary means of communication (email, for instance) may be OMG itself, in which case an email to the company announcing the outage can't possibly go out. Use instant messaging. Call department heads and ask them to tell their staff. Get a coworker to walk around informing people. Put up a sign saying "OMG is down as of [time]. Working on it now!" The last thing you want is heads popping up out of cubicles to ask, "Hey, is OMG down?" and "Did you know OMG isn't working?" You'll quickly get worn out reciting the same response when you need to roll up your sleeves and hold onto the fire hose.

3: Get a bouncer

Notifying your users is the first step. Getting someone to help answer their queries when they show up is the second. A sign announcing that you know about the problem will satisfy those who just want to make sure you're aware of it, but higher-ups asking for status updates and ETAs need a human face. Someone has to be that face while you're in the trench fixing things. Even if you recruit the receptionist or an HR intern, there must be a buffer between you and the community, at least until things are stable.

4: Be prepared for the politics

Now that you've notified your user community and posted a go-between, you need to be aware of one more thing as you enter the fray. You're about to find out who your friends are. I have seen some very interesting personalities revealed in pressure situations like these. That type-A manager who seems so obsessive and nitpicky may just show up to calmly ask what he can do to help or plan out alternatives. That arrogant coworker you thought might take perverse pleasure in seeing your dirty hands may turn out to be your most sympathetic ear. And that VP who seemed so congenial and charming may suddenly transform into a Gestapo nightmare, coming over to brush past your bouncer and bellow, "Do you understand how important OMG is?"

If this happens, take it with a grain of salt: They're upset about the problem and you're the target at the moment. You can't control them, but you can control yourself. It's understandable that they may be angry; money is at stake here. But don't get drawn into a screaming match. Hold the sarcasm and politely inform them, "We can figure out what went wrong and who's to blame later, but I need to get this fixed now." When the dust settles, you'll be the one without regrets. Also, resist (if you can) having to give minute-by-minute status updates to management. It will only increase your stress and reduce your technical capabilities by splitting your concentration.

5: Document everything

Whew! Four tips already and none of them about how to actually fix the problem. That's because you need to pave the way to fix it with the best possible approach and environment.

Take a moment to document exactly what happened: what you were doing (if applicable), what commands were run, and what occurred, including all error messages. I can guarantee you that after all this excitement fades away, your memory will resemble Bonnie and Clyde's car after their infamous fatal run-in with the police. It's not enough to just get things up and running, as I'll discuss later. Write down your activities on a piece of paper, type them in a computer text file, or even just speak them aloud to a coworker taking dictation for you.

As you proceed through the resolution process, document what you find and what you attempt to do to correct the situation. Did you restart services? Restart the box? Update a registry key? Plug something in somewhere else? Put it all in there. Make no mistake. This process could save both you and your company. And back out any changes that don't work, so you don't build the groundwork for more problems later!
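If scribbling on paper isn't your style, a few lines of script can do the note-taking for you. What follows is only a minimal sketch of a timestamped, append-only incident log. The `IncidentLog` class, the file name, and the `observed`/`tried`/`reverted` labels are all names I made up for illustration; use whatever fits your shop.

```python
from datetime import datetime, timezone


class IncidentLog:
    """Append-only, timestamped notes for one outage (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def note(self, kind, text, when=None):
        """Append one line such as '<UTC time> [tried] restarted the service'."""
        if when is None:
            when = datetime.now(timezone.utc)
        line = f"{when.isoformat(timespec='seconds')} [{kind}] {text}"
        # Open in append mode so earlier notes are never overwritten.
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")
        return line


if __name__ == "__main__":
    log = IncidentLog("omg-outage.log")  # assumed file name
    log.note("observed", "OMG unreachable; error text copied below")
    log.note("tried", "restarted the OMG service")
    log.note("reverted", "undid registry change, it made no difference")
```

Every entry carries a UTC timestamp, so when your memory fails you later, the order of events won't.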

6: Lay out all the facts of the situation

Whether you caused the initial problem or not, you may find yourself preoccupied with making sure you're not the one without a chair when the music stops. It's understandable when you work in IT to be concerned that people will point fingers at you after an outage. "What'd you do?" is usually the first question a user asks after an emergency. You may be perceived as responsible for "not seeing this coming."

Regardless, don't try to fix the problem while simultaneously trying to cover anything up. Not only is it unethical, but it will slow down or confuse the resolution process, and maybe even make it worse! Besides, a third party (or your boss) will quite likely be able to read between the lines and see what really happened. Systems record events, keep log files, and in some cases even audit administrator actions. In the end, something happened to cause OMG to go down, so laying down a smokescreen helps no one. Every week I hear about someone who faked academic credentials on their resume, got a prestigious position they weren't entitled to, and wound up embarrassed and terminated once the truth came out, long after they thought they'd pulled it off. Don't be that guy (or woman); if you hide what happened, the same thing will happen to you.

7: Don't push all the buttons at once

I think we all have the tendency to hit a button again when the first attempt didn't appear to work. This is why I see people hammering the button at a crosswalk. It's also why printers spit out umpteen copies of the same document once a jam has been cleared; someone figured if clicking File and then Print didn't work the first 16 times, 17 might be their magic number.

During a system outage you'll want to get things up and running as quickly as you can. However, if there are four things you can try, don't try them all at once. Whatever relief you gain in the short term if it works will be replaced by guilt later, because you won't really know what fixed the problem. Granted, this isn't the time to conduct a slow 'n' lazy case study on a Sunday afternoon, but it isn't the time to rush willy-nilly either. You have to identify the smoking gun to truly consider the problem solved.
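For the script-minded, the one-change-at-a-time discipline can be sketched roughly like this. It is an illustration under assumptions, not a tool to point at production: the function name, the `(name, apply, revert)` tuples, and the `is_healthy` check are all hypothetical stand-ins for whatever your fixes and your "is OMG back?" test really are.

```python
def try_fixes_one_at_a_time(fixes, is_healthy):
    """Apply candidate fixes in order, checking health after each one.

    fixes: list of (name, apply, revert) tuples; the callables are
           hypothetical placeholders for real remediation steps.
    is_healthy: callable returning True once the service works again.

    Returns the name of the fix after which the service became healthy,
    or None if nothing on the list helped.
    """
    for name, apply, revert in fixes:
        apply()
        if is_healthy():
            return name  # this single change is the smoking gun
        revert()  # back out a change that did not help before moving on
    return None
```

Because each failed attempt is reverted before the next one starts, whatever finally works is the only change left in place, and you know exactly what to write in your notes.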

8: Go straight to the top

When you're spinning your wheels and don't see an easy fix, don't hesitate to call support. (You DO have support on your products, right? If not, get it today. Beg if you have to.) It may mean waiting on hold or sitting by the phone expecting a call back. But get that support ticket created even if you simply think you might need it.

I'll admit that in my younger and more stubborn days of system administration, I assumed I could fix anything with enough time and determination. "Calling support is just waving the white flag!" I would say. It may well be the case that you can fix anything in the long run, but 30 hours of determination is less valuable than two hours on a paid support call. Case in point: A recent Microsoft support incident at a client site cost $259 and the issue was resolved within 45 minutes. Frantic Googling and crawling down the rabbit hole by trying Suggestion A, Suggestion B, Suggestion C (on forums populated with amateurs trying to be helpful, rather than professionals who know the ins and outs of the product) could have taken 10 times as long, if the resolution even appeared at all.

The client would have lost far more in revenue and productivity if not for investing that $259. It's not a matter of pride — it's a matter of good business decisions, and you're there to support the business. The best part is that now your problem is in other hands alongside your own and you have someone else to share the worry.

9: Perform a post-mortem

Good news! The problem was found and OMG came back up fine. You notified everyone that the situation was normal again (or at least as normal as it may be at your workplace). People began returning to work and the whole experience was just a bad dream, right?

Not so fast. You can't go have a beer yet. You need to keep your sleeves rolled up and put together all the notes you took. Tell all relevant parties what happened, what you did to fix it, and how you will ensure that it doesn't happen again. Set up alerts or processes that give better advance notice of these kinds of issues. The worst possible time to develop a contingency plan is during an actual outage, so include what you will do if the issue DOES return and the fix doesn't work then. This is what any smart boss wants to see most right now, and so do his or her bosses.
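As for those alerts, even a bare-bones check beats finding out from the users. Here is a minimal sketch, assuming your critical services listen on TCP ports; the host names and the service list are made up, and a real monitoring product will do far more than this.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services, probe=port_is_open):
    """Return the names of services whose probe failed.

    services: dict mapping a label to a (host, port) pair.
    probe: callable(host, port) -> bool; swappable for testing.
    """
    return [name for name, (host, port) in services.items()
            if not probe(host, port)]


if __name__ == "__main__":
    # Hypothetical hosts; replace with the systems your company relies on.
    down = check_services({
        "email": ("mail.example.internal", 25),
        "intranet": ("web.example.internal", 443),
    })
    if down:
        print("ALERT: not responding:", ", ".join(down))
```

Run something like this on a schedule from a machine that doesn't depend on the services being checked, and send the alert over a channel that isn't OMG itself.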

10: Don't lose faith in yourself

It's often said that users don't call the help desk to report that their computers are working well. In that same vein, when you work in IT, it's easy to feel like you're only as successful as the events of that day. A system outage can be a huge blow to an IT pro's confidence, even if you've got a string of amazing successes under your belt. Try to be philosophical. You just had five hours of email downtime and you've been there five years? Well, an average of an hour per year of email downtime is a pretty good track record.

Stay positive. You got the job done, after all. When you go home, don't start questioning your talents and capabilities (unless you did something really reckless like yanking all the electrical plugs out of the back of a server rack; such a thing would clearly indicate that IT is the wrong career for you). As in #9, figure out how you can make the most out of this episode. Study more so you can learn about the subject and be prepared for similar events. Build good coworker relationships so you can count on your fellow employees in an emergency like this.
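If you want to put a number on that five-hours-in-five-years track record, the arithmetic is simple. This is a back-of-the-envelope sketch that assumes the service was supposed to be up around the clock and ignores leap years; the function names are mine.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_per_year(downtime_hours, years):
    """Average hours of downtime per year."""
    return downtime_hours / years


def availability_percent(downtime_hours, years):
    """Percentage of the period during which the service was up."""
    total_hours = years * HOURS_PER_YEAR
    return 100.0 * (1 - downtime_hours / total_hours)


if __name__ == "__main__":
    print(downtime_per_year(5, 5))                # 1.0 hour per year
    print(round(availability_percent(5, 5), 3))   # 99.989
```

Five hours out of 43,800 works out to roughly 99.989 percent availability, which is a number worth remembering the next time you feel like the goat.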

Be ready...

Always remember that IT professionals are paid to maintain order in a disordered universe, and it's up to you to remain stalwart so you can tackle the next system outage. It's coming. Will you be ready?

Other tips?

Have you ever found yourself scrambling to resolve a major system meltdown? What additional steps would you recommend for keeping the chaos to a minimum and resolving the problem as quickly as possible?



About

Scott Matteson is a senior systems administrator and freelance technical writer who also performs consulting work for small organizations. He resides in the Greater Boston area with his wife and three children.
