
10 ways to survive a critical system outage

System outages are inevitable when working in IT. The trick is to get through the experience while keeping your wits and reputation intact.

Dealing with a critical outage


Just about every IT professional has had that moment. That one where you realize something your company relies on -- which you support -- is now down, out, or dead. Whether it's email, Internet access, the phone system, or a shipping program your warehouse depends on, a vital link has just been severed and you're the surgeon who has to repair it. For the purpose of this article, I'll refer to the vital service as "OMG" since that's probably what's on your mind right now.

Words can't describe that sinking feeling in the pit of your stomach. Suddenly, the routine frustrations of the day seem irrelevant. Resetting that accounting temp's forgotten password for the fifth time in an hour would be a sunny picnic, and you long for the serenity of troubleshooting the CEO's virus-ridden laptop.

How you deal with a system outage defines where the rubber meets the road in your career. Although the next few minutes (or hours) are probably the least amount of fun you'll have outside of a dentist's chair, how you react can mean the difference between being the hero or the goat. Here are 10 steps to guide your way through the experience so you can don the red cape after it's all over.

1: Don't Panic

Yes, this is borrowed from Douglas Adams, but it is universal. Even if you caused the problem, the worst thing you can do now is waste time fretting over how much trouble you'll be in or who's going to get upset with you. You may be petrified that this will become a career-ending move, but you can't worry about that right now. Your best chance is to correct what went wrong, and you can only do that by staying technical rather than getting emotional. Remember that this isn't a life or death situation (unless of course you support a hospital with patients at risk of a medical emergency -- and if anything, that will make your ability to stay calm even more crucial). As the saying goes, fix it and then have a heart attack.

Now more than ever you'll need your focus, so push back any fear you may have and stick to the facts at hand: Something is broken and you need to get to work.

2: Notify your users any way you can

You need to get the word out to your users (and your boss) that OMG is down. Of course, your primary means of communication (email, for instance) may be OMG itself, in which case an email announcing the outage can't go out. Use instant messaging. Call department heads and ask them to tell their staff. Get a coworker to walk around informing people. Put up a sign saying "OMG is down as of [time]. Working on it now!" The last thing you want is heads popping up out of cubicles to ask, "Hey, is OMG down?" and "Did you know OMG isn't working?" You'll quickly get worn out reciting the same response when you need to roll up your sleeves and hold onto the fire hose.

3: Get a bouncer

Notifying your users is the first step. Getting someone to help answer their queries when they show up is the second. A sign announcing that you know about the problem will satisfy those who just want to make sure you're aware of it, but higher-ups asking for status updates and ETAs need a human face. Someone has to be that face while you're in the trench fixing things. Even if you recruit the receptionist or an HR intern, there must be a buffer between you and the community, at least until things are stable.

4: Be prepared for the politics

Now that you've notified your user community and posted a go-between, you need to be aware of one more thing as you enter the fray. You're about to find out who your friends are. I have seen some very interesting personalities revealed in pressure situations like these. That type-A manager who seems so obsessive and nitpicky may just show up to calmly ask what he can do to help or plan out alternatives. That arrogant co-worker you thought might take perverse pleasure in seeing your dirty hands may turn out to be your most sympathetic ear. And that VP who seemed so congenial and charming may suddenly transform into a Gestapo nightmare, coming over to brush past your bouncer and bellow, "Do you understand how important OMG is?"

If this happens, take it with a grain of salt: They're upset about the problem and you're the target at the moment. You can't control them, but you can control yourself. It's understandable that they may be angry -- money is at stake here. But don't get drawn into a screaming match. Hold the sarcasm and politely inform them, "We can figure out what went wrong and who's to blame later, but I need to get this fixed now." When the dust settles, you'll be the one without regrets. Also, resist (if you can) giving minute-by-minute status updates to management. It will only increase your stress and reduce your technical capabilities by splitting your concentration.

5: Document everything

Whew! Four tips already and none of them about how to actually fix the problem. That's because you need to pave the way to fix it with the best possible approach and environment.

Take a moment to document exactly what happened: what you were doing (if applicable), what commands were run, and what occurred, including all error messages. I can guarantee you that after all this excitement fades away, your memory will resemble Bonnie and Clyde's car after their infamous fatal run-in with the police. It's not enough to just get things up and running, as I'll discuss later. Write down your activities on a piece of paper, type them in a computer text file, or even just speak them aloud to a coworker taking dictation for you.

As you proceed through the resolution process, document what you find and what you attempt to do to correct the situation. Did you restart services? Restart the box? Update a registry key? Plug something in somewhere else? Put it all in there. Make no mistake. This process could save both you and your company. And back out any changes that don't work, so you don't build the groundwork for more problems later!
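The running record described above can be as simple as a timestamped text file. Here is a minimal Python sketch of that idea, not a prescribed tool; the file name `omg_incident.log` and the helper `log_step` are made up for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical file name; keep it somewhere that does not depend on the broken system.
LOG_FILE = Path("omg_incident.log")


def log_step(action: str, result: str = "", log_file: Path = LOG_FILE) -> str:
    """Append one timestamped line describing what was tried and what happened."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    line = f"{stamp} | {action} | {result}"
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

A call such as `log_step("Restarted the mail service", "still not accepting connections")` adds one line to the file. Because the file is only ever appended to, the order of events survives even when your memory of them doesn't.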

6: Lay out all the facts of the situation

Whether you caused the initial problem or not, you may find yourself preoccupied with making sure you're not the one without a chair when the music stops. It's understandable when you work in IT to be concerned that people will point fingers at you after an outage. "What'd you do?" is usually the first question a user asks after an emergency. You may be perceived as responsible for "not seeing this coming."

Regardless, don't try to fix the problem while simultaneously trying to cover anything up. Not only is it unethical, but it will slow down or confuse the resolution process -- maybe even make it worse! Besides, a third party (or your boss) will quite likely be able to read between the lines and see what really happened. Systems record events, keep log files, and in some cases even audit administrator actions. In the end, something happened to cause OMG to go down, so laying down a smokescreen helps no one. Every week I hear about someone who faked academic credentials on their resume, got a prestigious position they weren't entitled to, and wound up embarrassed and terminated once the truth came out, long after they thought they'd pulled it off. Don't be that guy (or woman). The truth will catch up with you, too.

7: Don't push all the buttons at once

I think we all have the tendency to hit a button again when the first attempt didn't appear to work. This is why I see people hammering the button at a crosswalk. It's also why printers spit out umpteen copies of the same document once a jam has been cleared; someone figured if clicking File and then Print didn't work the first 16 times, 17 might be their magic number.

During a system outage you'll want to get things up and running as quickly as you can. However, if there are four things you can try, don't try them all at once. Whatever relief you gain in the short term, if it works, will be replaced by guilt later because you won't really know what fixed the problem. Granted, this isn't the time to conduct a slow n' lazy case study on a Sunday afternoon, but it isn't the time to rush willy-nilly either. You have to identify the smoking gun to truly consider the problem solved.
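One way to picture the one-change-at-a-time approach is a loop that applies a single candidate fix, checks whether the service has recovered, and backs the change out if it hasn't. This is only a sketch: `Fix`, `try_fixes_one_at_a_time`, and the health check are hypothetical names, and the real fixes would be whatever commands suit your environment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Fix:
    """One candidate fix: how to apply it and how to back it out."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def try_fixes_one_at_a_time(
    fixes: List[Fix], is_healthy: Callable[[], bool]
) -> Optional[str]:
    """Apply fixes in order, one at a time; return the name of the one that worked."""
    for fix in fixes:
        fix.apply()
        if is_healthy():
            return fix.name  # this change is the one that restored service
        fix.rollback()  # back out a change that did not help
    return None  # nothing worked; time to escalate
```

Because only one change is live at any moment, a successful health check points at exactly one fix, which is the smoking gun you want for the write-up afterwards.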

8: Go straight to the top

When you're spinning your wheels and don't see an easy fix, don't hesitate to call support. (You DO have support on your products, right? If not, get it today. Beg if you have to.) It may mean waiting on hold or sitting by the phone expecting a call back. But get that support ticket created even if you simply think you might need it.

I'll admit that in my younger and more stubborn days of system administration, I assumed I could fix anything with enough time and determination. "Calling support is just waving the white flag!" I would say. It may well be the case that you can fix anything in the long run, but 30 hours of determination is less valuable than two hours on a paid support call. Case in point: A recent Microsoft support incident at a client site cost $259 and the issue was resolved within 45 minutes. Frantic Googling and crawling down the rabbit hole by trying Suggestion A, Suggestion B, Suggestion C (on forums populated with amateurs trying to be helpful, rather than professionals who know the ins and outs of the product) could have taken 10 times as long, if the resolution even appeared at all.

The client would have lost far more in revenue and productivity if not for investing that $259. It's not a matter of pride -- it's a matter of good business decisions, and you're there to support the business. The best part is that now your problem is in other hands alongside your own and you have someone else to share the worry.

9: Perform a post-mortem

Good news! The problem was found and OMG came back up fine. You notified everyone that the situation was normal again (or at least as normal as it may be at your workplace). People began returning to work and the whole experience was just a bad dream, right?

Not so fast. You can't go have a beer yet. You need to keep your sleeves rolled up and put together all the notes you took. Notify all relevant parties of what happened, what you did to fix it, and how you will ensure that it doesn't happen again. Set up alerts or processes for better advance notice of these kinds of issues. The worst possible time to develop a contingency plan is during an actual outage, so include what you will do if the issue DOES return and the fix doesn't work then. This is what any smart boss wants to see most right now -- and so do his or her bosses.

10: Don't lose faith in yourself

It's often said that users don't call the help desk to report that their computers are working well. In that same vein, when you work in IT, it's easy to feel like you're only as successful as the events of that day. A system outage can be a huge blow to an IT pro's confidence, even if you've got a string of amazing successes attached to your belt. Try to be philosophical. You just had five hours of email downtime and you've been there five years? Well, an average of an hour per year of email downtime is a pretty good track record.
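To put numbers on that, here is a rough Python sketch of the arithmetic. It assumes a service that is expected to be up around the clock and 365-day years; the function name `availability_percent` is just for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # assumes 24x7 service and ignores leap years


def availability_percent(downtime_hours: float, years: float) -> float:
    """Share of total hours the service was up, as a percentage."""
    total_hours = years * HOURS_PER_YEAR
    return 100.0 * (1.0 - downtime_hours / total_hours)


# The example from the text: five hours of downtime over five years.
downtime_per_year = 5 / 5
uptime = availability_percent(downtime_hours=5, years=5)
print(f"{downtime_per_year:.0f} hour(s) of downtime per year, {uptime:.3f}% availability")
```

Under those assumptions, five hours out of 43,800 works out to just under 99.99 percent availability, which is a useful figure to have in your back pocket when someone asks how reliable email has been.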

Stay positive. You got the job done, after all. When you go home, don't start questioning your talents and capabilities (unless you did something really reckless like yanking all the electrical plugs out of the back of a server rack; such a thing would clearly indicate that IT is the wrong career for you). As in #9, figure out how you can make the most out of this episode. Study more so you can learn about the subject and be prepared for similar events. Build good coworker relationships so you can count on your fellow employees in an emergency like this.

Be ready...

Always remember that IT professionals are paid to maintain order in a disordered universe, and it's up to you to remain stalwart so you can tackle the next system outage. It's coming. Will you be ready?

Other tips?

Have you ever found yourself scrambling to resolve a major system meltdown? What additional steps would you recommend for keeping the chaos to a minimum and resolving the problem as quickly as possible?



About

Scott Matteson is a senior systems administrator and freelance technical writer who also performs consulting work for small organizations. He resides in the Greater Boston area with his wife and three children.

11 comments
monica

Great article. The weakest link in the chain is us, the human beings. An additional step, based on experience with my clients: educate, educate, educate everyone again and again about how to prevent outages caused by email/internet-related viruses and overloaded inboxes.

medfordmel

Maintain current documentation for everything within your environment.  You'll thank yourself later.  Include support contact information for every system.  Maintain internal and external contact information for department heads.


Before EVERY configuration change, back up and/or document the current configuration, document a sequential plan for all GUI steps, CLI commands, and physical changes (cable swaps, etc.), document a sequential rollback plan, BEFORE you make any changes.  If the change is successful, immediately back up and/or document the new configuration.


A well-maintained disaster preparation plan is crucial.  Assess per-hour and per-day opportunity cost associated with the loss of each network resource,  and prioritize accordingly.  Make plans for each network resource and for different types of failure.


Identify and include dependencies.  Know which network resources and staff will be affected when server A, application B, or switch C is down.


Keep copies of your documentation and disaster preparation plan in a secure offsite location.  We once had a fire in our building, and our DR guy showed up at the front door in a panic because the disaster plan was on his desk - on the 19th floor.  Don't be that guy.


Provide as many redundant systems as your budget allows, starting in the most critical areas.  Cluster and virtualize whatever you can, and make and verify frequent backups - regular backups and pre-change backups.  Rotate copies of current backups to a secure offsite location.


As someone else mentioned, make certain that you have installation media available to rebuild every system.  Where possible, keep copies of installation media in a secure offsite location.


I know we rarely make time (or allot budget dollars) for such things, but when something goes wrong, you'll really wish you had a documented plan in advance.  It will save your company money, and will make you a hero when (not if) something breaks.

Louis Thompson

Good advice.

Really every company should have a Business Recovery Plan.  This should be updated every 90 days.

There is no way an IT manager should not know what/where/when/how to repair any aspect of their network.

Stuff breaks, servers go down, the weather is not your friend, and we all have had issues beyond our control happen in the course of our careers.  Having a recovery plan says you want to succeed and move on from the problem.


syhprum

Surely the most important thing is to find some fall guy who gets the blame and gets fired while you get all the brownie points for fixing the problem!

ogoody50

The first item in the first reply is the best advice. I didn't find the article very helpful in practical terms. Every company is different in size, protocol, system setup for email and access to data, etc., so it would be hard to cover in one article. I work for a relatively small (non-profit) company but we have had a couple of these instances.  Our email is cloud based (Rackspace) so that isn't too much of a problem even for those still using the Outlook interface rather than straight web access.  The server is backed up to the cloud (Rackspace again) and to several external drives on different computers set for daily, weekly, and monthly backup.

Most recently our only server bit the dust.  NO PROBLEM! Employees were told to save all new work to their workstation drives. I set up the most recent existing backup on another workstation and made it available to the network for existing data access. In less than a day and a half the new server was up and running. No downtime, and all new data was transferred to the server drive without incident.

Don't panic is good advice, but it is easier not to with a good disaster plan :)

sirrahnosaj

- Have a good Disaster Recovery Plan and keep it up to date. Review (and update where needed) after every major system change.

- Keep an up to date contact list for all relevant support staff (inside the DRP)  that you may need to speak with during an emergency (phone numbers, email addresses, whatever).

- During the incident, get someone to notify the backup guys (if you have any) ASAP; organising the right tapes for restores can take some time.

- If you need to do a bare metal restore make sure you have the right disks ready on hand.

-  Make sure you have someone available onsite asap for cable changes, h/w checks, etc.

- If you're not the person under the pump, offer assistance, if not needed at the time then keep an ear out. If he/she is the last one in the office, don't just walk out without checking for any help needed.

- If you're in for the long haul, get naps where possible (e.g. during restores) or try and share the load with someone else. Make sure you have good handovers (what you've done, what is happening now, what needs to happen next).

- as per the article, documenting steps taken is critical, recording exact timings can help too...for various reasons - comparing against log files for results of actions, determining timings if you have to repeat tasks, producing a final report, etc.

- remember to eat (relatively healthy food). And drink sports drinks (tempting to have coffee...mmm coffee... but it's fool's gold).

- If you seem to be going nowhere with vendor support, escalate internally, they can get on to the customer reps and get it up from the script jockeys and on to the specialists.

- try and do a reality check with someone else technical that can review your steps taken. A second eye can always help. Easy to miss something when you are tired/distracted.

Luis Manuel Antunes

:) if you work in a bank and need to process stock market orders that solution won't fit at all :)

Padgett Justin

Submit my letter of resignation... Kidding, always make sure you are prepared for several outages. When I first started in Networking I wasn't. Now I know to have a backup firewall/router (whichever your organization uses) with the current configuration installed. If a server is down, try to revert to another with the same role (we had physical servers, but I introduced virtual... that came in handy when I needed it). If your computer crashed and you were writing some code... well, I just don't know! haha

dmritchie2

Don't let the Powers That Be (tm) convince you that making a phone call to support is more important than shutting down your servers as the power backup systems are preparing to die and power off all of your VMs.

rm.squires

@sirrahnosaj

Article has some good general advice and above is a great addition to the article :)
