Just about every IT professional
has had that moment. That one where you realize something your company relies
on — which you support — is now down, out, or dead. Whether it’s email, Internet
access, the phone system, or a shipping program your warehouse depends on, a
vital link has just been severed and you’re the surgeon who has to repair it. For
the purpose of this article, I’ll refer to the vital service as “OMG”
since that’s probably what’s on your mind right now.

Words can’t describe that
sinking feeling in the pit of your stomach. Suddenly, the routine frustrations
of the day seem irrelevant. Resetting that accounting temp’s forgotten password
for the fifth time in an hour would be a sunny picnic, and you long for the
serenity of troubleshooting the CEO’s virus-ridden laptop.

How you deal with a system
outage is where the rubber meets the road in your career. Although the
next few minutes (or hours) are probably the least amount of fun you’ll have
outside of a dentist’s chair, how you react can mean the difference between
being the hero or the goat. Here are 10 steps to guide your way through the
experience so you can don the red cape after it’s all over.

1: Don’t panic

Yes, this is borrowed from
Douglas Adams, but it is universal. Even if you caused the problem, the worst
thing you can do now is waste time fretting over how much trouble you’ll be in
or who’s going to get upset with you. You may be petrified that this will
become a career-ending move, but you can’t worry about that right now. Your
best chance is to correct what went wrong, and you can only do that by staying
technical rather than getting emotional. Remember that this isn’t a life or
death situation (unless of course you support a hospital with patients at risk
of a medical emergency — and if anything, that will make your ability to stay
calm even more crucial). As the
saying goes, fix it and then have a heart attack.

Now more than ever you’ll need
your focus, so push back any fear you may have and stick to the facts at hand: Something
is broken and you need to get to work.

2: Notify your users any way you can

You need to get the word out to
your users (and your boss) that OMG is down. Of course, your primary means of
communication (email, for instance) may be OMG itself, in which case a
company-wide email announcing the outage can’t go out.
Use instant messaging. Call department heads and ask them to tell
their staff. Get a coworker to walk around informing people. Put up a sign
saying “OMG is down as of [time]. Working on it now!” The last thing
you want is heads popping up out of cubicles to ask, “Hey, is OMG down?”
and “Did you know OMG isn’t working?”
You’ll quickly get worn out reciting the same response when you need to
roll up your sleeves and hold onto the fire hose.

3: Get a bouncer

Notifying your users is the first
step. Getting someone to help answer their queries when they show up is the
second. A sign announcing that you know about the problem will satisfy those
who just want to make sure you’re aware of it, but higher-ups asking for status
updates and ETAs need a human face. Someone has to be that face while you’re in
the trench fixing things. Even if you recruit the receptionist or an HR intern,
there must be a buffer between you and the
community, at least until things are stable.

4: Be prepared for the politics

Now that you’ve notified your
user community and posted a go-between, you need to be aware of one more thing
as you enter the fray. You’re about to find out who your friends are. I have
seen some very interesting personalities revealed in pressure situations like
these. That type-A manager who seems so obsessive and nitpicky may just show up
to calmly ask what he can do to help or plan out alternatives. That arrogant
co-worker you thought might take perverse pleasure in seeing your dirty hands
may turn out to be your most sympathetic ear. And that VP who seemed so
congenial and charming may suddenly transform into a Gestapo nightmare, coming
over to brush past your bouncer and bellow, “Do you understand how
important OMG is?”

If this happens, take it with a
grain of salt: They’re upset about the problem and you’re the target at the
moment. You can’t control them, but you can control yourself. It’s
understandable that they may be angry; money is at stake here. But don’t get
drawn into a screaming match. Hold the sarcasm and politely inform them, “We
can figure out what went wrong and who’s to blame later, but I need to get this
fixed now.” When the dust settles,
you’ll be the one without regrets. Also, resist (if you can) having to give
minute-by-minute status updates to management. It will only increase your
stress and reduce your technical capabilities by splitting your concentration.

5: Document everything

Whew! Four tips already
and none of them about how to actually fix the problem. That’s because you need
to pave the way to fix it with the best possible approach and environment.

Take a moment to
document exactly what happened: what you were doing (if applicable), what
commands were run, and what occurred, including all error messages. I can
guarantee you that after all this excitement fades away, your memory will
resemble Bonnie and Clyde’s car after their infamous fatal run-in with the
police. It’s not enough to just get things up and running, as I’ll discuss
later. Write down your activities on a piece of paper, type them in a computer
text file, or even just speak them aloud to a coworker taking dictation for you.

As you proceed through
the resolution process, document what you find and what you attempt to do to
correct the situation. Did you restart services? Restart the box? Update a
registry key? Plug something in somewhere else? Put it all in there. Make no
mistake: this process could save both you and your company. And back out any
changes that don’t work, so you don’t lay the groundwork for more problems later!
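
If typing into a text file is your method of choice, a few lines of script can take care of the timestamps for you. What follows is only a minimal sketch in Python; the names `format_entry` and `log_action` and the log layout are my own assumptions, not part of any standard tool.

```python
from datetime import datetime, timezone
from typing import Optional


def format_entry(action: str, result: str = "", when: Optional[datetime] = None) -> str:
    """Return one timestamped line for the incident log.

    `when` should be a timezone-aware UTC datetime; it defaults to the current UTC time.
    """
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y-%m-%d %H:%M:%SZ")
    line = f"{stamp} | {action}"
    if result:
        line += f" | result: {result}"
    return line


def log_action(path: str, action: str, result: str = "") -> str:
    """Append one entry to the log file at `path` and return the line written."""
    line = format_entry(action, result)
    # Append mode, so earlier notes are never overwritten.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```

Calling something like `log_action("incident.log", "restarted mail service", "no change")` after each step gives you an ordered record of what you tried and when, which is exactly what you will need once the excitement fades.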

6: Lay out all the facts of the situation

Whether you caused the initial
problem or not, you may find yourself preoccupied with making sure you’re not
the one without a chair when the music stops. It’s understandable when you work
in IT to be concerned that people will point fingers at you after an outage.
“What’d you do?” is usually the first question a user asks after an
emergency. You may be perceived as responsible for “not seeing this
coming.”

Regardless, don’t try to fix the
problem while simultaneously trying to cover anything up. Not only is it
unethical, but it will slow down or confuse the resolution process – maybe even
make it worse! Besides, a third party (or your boss) will quite likely be able
to read between the lines and see what really happened. Systems record events,
keep log files, and in some cases even audit administrator actions. In the end,
something happened to cause OMG to go
down, so laying down a smokescreen helps no one. Every week I hear about someone
who faked academic credentials on their resume, got a prestigious position they
weren’t entitled to, and wound up embarrassed and terminated once the truth
came out, long after they thought they’d pulled it off. Don’t be that guy (or
woman). The truth will catch up with you, too.

7: Don’t push all the buttons at once

I think we all have the tendency
to hit a button again when the first attempt didn’t appear to work. This is why
I see people hammering the button at a crosswalk. It’s also why printers spit
out umpteen copies of the same document once a jam has been cleared; someone
figured if clicking File and then Print didn’t work the first 16 times, 17
might be their magic number.

During a system outage you’ll
want to get things up and running as quickly as you can. However, if there are
four things you can try, don’t try them all at once. Whatever relief you’ll gain
in the short term, if it works, will be replaced by guilt later because you won’t
really know what fixed the problem. Granted, this isn’t the time to conduct a
slow ’n’ lazy case study on a Sunday afternoon, but it isn’t the time to rush willy-nilly
either. You have to identify the smoking gun to truly consider the problem
solved.
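
For the script-minded, the one-button-at-a-time discipline might look something like the sketch below. It is purely illustrative: `try_one_at_a_time`, the `(name, apply, undo)` tuples, and the `is_healthy` check are hypothetical stand-ins for whatever you would actually do by hand.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Each candidate fix is (name, apply, undo). All three are hypothetical
# placeholders for real actions such as restarting a service.
Fix = Tuple[str, Callable[[], None], Callable[[], None]]


def try_one_at_a_time(
    fixes: Sequence[Fix], is_healthy: Callable[[], bool]
) -> Tuple[Optional[str], List[str]]:
    """Apply fixes one by one, re-checking the service after each.

    Returns the name of the fix that restored service (or None if none did),
    plus the names of every fix attempted, in order.
    """
    attempted: List[str] = []
    for name, apply_fix, undo_fix in fixes:
        apply_fix()
        attempted.append(name)
        if is_healthy():
            return name, attempted  # this single change is the smoking gun
        undo_fix()  # back out a change that did not help
    return None, attempted
```

Because only one change is live at any moment, whichever fix brings OMG back is the one you can name in your notes, and the ones that didn’t help are already backed out.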

8: Go straight to the top

When you’re spinning your wheels
and don’t see an easy fix, don’t hesitate to call support. (You DO have support
on your products, right? If not, get it today. Beg if you have to.) It may mean
waiting on hold or sitting by the phone expecting a call back. But get that
support ticket created even if you simply think you might need it.

I’ll admit that in my younger
and more stubborn days of system administration, I assumed I could fix anything
with enough time and determination. “Calling support is just waving the
white flag!” I would say. It may well be the case that you can fix
anything in the long run, but 30 hours of determination is less valuable than
two hours on a paid support call. Case in point: A recent Microsoft support
incident at a client site cost $259 and the issue was resolved within 45
minutes. Frantic Googling and crawling down the rabbit hole by trying
Suggestion A, Suggestion B, Suggestion C (on forums populated with amateurs
trying to be helpful, rather than professionals who know the ins and outs of
the product) could have taken 10 times as long, if the resolution even appeared
at all.

The client would have lost far
more in revenue and productivity if not for investing that $259. It’s not a
matter of pride — it’s a matter of good business decisions, and you’re there
to support the business. The best part is that now your problem is in other
hands alongside your own and you have someone else to share the worry.

9: Perform a post-mortem

Good news! The problem was found
and OMG came back up fine. You notified everyone that the situation was normal
again (or at least as normal as it may be at your workplace). People began
returning to work and the whole experience was just a bad dream, right?

Not so fast. You can’t go have a
beer yet. You need to keep your sleeves rolled up and put together all the
notes you took. Notify all relevant parties of what happened, what you did to fix
it, and how you will ensure that it doesn’t happen again. Set up alerts or
processes for better advance notice of these kinds of issues. The worst
possible time to develop a contingency plan is during an actual outage, so
include what you will do if the issue DOES return and the fix doesn’t work then.
This is what any smart boss wants to see most right now — and so do his or her
bosses.
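
As for alerts, even a crude check beats finding out from a head popping over the cubicle wall. Here is a rough sketch, assuming OMG listens on a TCP port; `port_is_open`, `FailureAlert`, and the three-failures-in-a-row threshold are my own illustrative choices, and a real monitoring product will do far more.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailureAlert:
    """Signal an alert once a service fails `threshold` checks in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one check result; return True exactly when the alert should fire."""
        if ok:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        # Fire once, at the moment the streak reaches the threshold.
        return self.failures == self.threshold
```

Run from a scheduler once a minute, something like `alert.record(port_is_open("mail.example.com", 25))` (the host name is a placeholder) would tell you when to send a page, without crying wolf over a single dropped connection.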

10: Don’t lose faith in yourself

It’s often said that users don’t
call the help desk to report that their computers are working well. In that
same vein, when you work in IT, it’s easy to feel like you’re only as
successful as the events of that day. A system outage can be a huge blow to an
IT pro’s confidence, even if you’ve got a string of amazing successes under
your belt. Try to be philosophical. You just had five hours of email
downtime and you’ve been there five years? Well, an average of an hour per year
of email downtime is a pretty good track record.
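
If you like to see the numbers, the arithmetic is easy to check. This small sketch assumes email is expected to be up around the clock and ignores leap years; the function names are mine.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_per_year(total_downtime_hours: float, years: float) -> float:
    """Average hours of downtime per year over the period."""
    return total_downtime_hours / years


def availability_percent(total_downtime_hours: float, years: float) -> float:
    """Share of the period the service was up, as a percentage (24/7 service assumed)."""
    return 100 * (1 - total_downtime_hours / (years * HOURS_PER_YEAR))
```

Five hours over five years works out to one hour per year, or roughly 99.99 percent availability under those assumptions.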

Stay positive. You got the job
done, after all. When you go home, don’t start questioning your talents and
capabilities (unless you did something really reckless like yanking all the
electrical plugs out of the back of a server rack; such a thing would clearly
indicate that IT is the wrong career for you). As in #9, figure out how you can
make the most out of this episode. Study more so you can learn about the
subject and be prepared for similar events. Build good coworker relationships so you can count on
your fellow employees in an emergency like this.

Be ready…

Always remember that IT
professionals are paid to maintain order in a disordered universe, and it’s up
to you to remain stalwart so you can tackle the next system outage. It’s coming.
Will you be ready?

Other tips?

Have you ever found yourself scrambling to resolve a major
system meltdown? What additional steps would you recommend for keeping the chaos
to a minimum and resolving the problem as quickly as possible?