How to conduct a production outage post-mortem

Production outages can be stressful, but they can also result in valuable lessons. Here are some tips on conducting a post-mortem to prevent repeat occurrences.

Working in IT has many benefits: plenty of employment opportunities, interesting and challenging work, and the ability to get involved with a lot of cool technology.

The flip side can be long nights, maddening problems, and - probably dreaded most of all by every IT pro - a production outage, where critical systems or services are rendered unavailable, either through human action or technical failure.

There's no greater stress in IT than being the one responsible for getting the lights back on, especially when the source of the problem is unclear. Additional worries about one's ongoing employment don't help matters, either.

Resolving the problem is often cause for celebration, and rightfully so, but it's important not to just blithely move on to the next issue. A production outage is a serious condition that merits significant introspection to help safeguard the company and one's career against a recurrence of the problem, or the impact of a similar one.

Here are 10 ways to make the most of a production outage and move forward in a constructive fashion.

1. Gather the information

Use system logs, human testimonials, any available email or instant messaging trail and all other related data to find out as much as possible about the outage. Electronic information is likely to be the most reliable type, especially since it often includes timestamps to help you follow a chronological trail to map out the incident. A centralized logging system such as Splunk can be a huge asset here since it provides a single portal to search aggregated log files.
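As a rough illustration of mapping out that chronological trail, the sketch below merges entries from several sources into one timeline. The host name, the sample entries and the "YYYY-MM-DD HH:MM:SS message" layout are all made up for the example, and it assumes every source records time in the same format and time zone, which real logs often don't.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed layout; adjust per log source

def parse_line(source, line):
    """Split 'YYYY-MM-DD HH:MM:SS message' into (timestamp, source, message)."""
    stamp, message = line[:19], line[20:].strip()
    return datetime.strptime(stamp, STAMP_FORMAT), source, message

def build_timeline(logs):
    """Merge entries from every source into one list sorted by timestamp."""
    entries = [
        parse_line(source, line)
        for source, lines in logs.items()
        for line in lines
        if line.strip()
    ]
    return sorted(entries, key=lambda entry: entry[0])

# Hypothetical sample data standing in for real log files and chat exports.
logs = {
    "mail01": [
        "2017-05-12 02:14:07 log volume 95% full",
        "2017-05-12 02:31:40 information store dismounted",
    ],
    "chat": [
        "2017-05-12 02:20:15 on-call admin paged",
    ],
}

for when, source, message in build_timeline(logs):
    print(f"{when:%H:%M:%S}  {source:<7} {message}")
```

Keeping the source name on every entry makes it easy to see which system or person reported each event once everything is interleaved.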

2. Identify the root cause

It's not enough to just look at the data and say "something crashed." What caused the crash? Was it a human error, a memory leak, a failed hardware component, bad firmware, a faulty patch or some other element? If possible, engage the vendor since they can usually zero in on the cause of such problems much more rapidly than average IT staff who juggle multiple responsibilities and talents.

3. Determine the impact

This should be an easy step. What systems or services were affected by the outage? Was email down? Were multiple file servers unavailable? Did a database fail? Were there any dependencies? How long did the outage last and were there any workarounds or alternatives used (or available) to mitigate the effect upon the company, employees or customers?

Assessing the impact does more than just scope out where "ground zero" was; it will also assist in developing the preventative measures discussed below.
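One simple way to capture the answers to these questions is a small impact table with the downtime worked out per service. The services, times and workarounds below are invented purely for illustration.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"

def outage_minutes(start, end):
    """Return the whole minutes between two 'YYYY-MM-DD HH:MM' strings."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)

# Hypothetical impact record: (service, down from, restored at, workaround).
affected = [
    ("email", "2017-05-12 02:31", "2017-05-12 04:05", "none"),
    ("file shares", "2017-05-12 02:45", "2017-05-12 03:30", "read-only copy on backup server"),
]

for service, start, end, workaround in affected:
    minutes = outage_minutes(start, end)
    print(f"{service}: down {minutes} min, workaround: {workaround}")
```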

4. Assess staff actions

This is trickier than the previous step. It's important to outline the actions staff took before, during and after the outage. Log files can help piece the puzzle together if this is ambiguous territory. The "history" command on a Linux system is a gold mine of information, and the Event Logs in Windows can also be useful.

This is why I recommend that staff keep a written log of the steps they took during these types of incidents - even in something as simple as a Notepad window - along with the timing involved. In times of crisis many IT professionals panic and throw everything at a problem in hopes of a speedy fix. The drawback to this approach, however, is the difficulty in determining what actually fixed the problem.
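If a Notepad window feels too easy to forget about, a tiny helper that stamps each note with the time can take care of the "timing involved" part. This is only a sketch; the function name and the file name in the usage note are placeholders.

```python
from datetime import datetime

def note(path, text, now=None):
    """Append one timestamped line to the incident log and return that line."""
    now = now or datetime.now()  # 'now' can be passed in to backfill earlier steps
    line = f"{now:%Y-%m-%d %H:%M:%S}  {text}"
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line
```

Calling note("incident-notes.txt", "restarted the mail service") after each action leaves a file that reads as a ready-made timeline when the post-mortem starts.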

This step may involve a measure of blame or finger-pointing, particularly if the outage was caused by human error or a failure to prevent the incident despite advance warning. If the outage was deliberately caused by malicious intent (something certainly infrequent and likely difficult to establish) then some measure of discipline should be applied, depending on managerial and HR standards. However, hold off on a rush to judgment until you at least get through step seven.

5. Establish whether existing safeguards failed

In my experience, a common cause of production outages is that safeguards which were put in place to prevent such incidents either didn't work or went ignored.

For example, an Exchange server's log volume fills up, forcing the server to shut down. Emails had been sent to staff for some time alerting them that the disk space was low, but these were being filtered to another folder and went unnoticed. Or, perhaps the alerts were configured to be sent to one individual rather than the group, and that individual is the former email administrator and is no longer with the company. It could be that staff weren't notified via email that a system was dead because the notifications relied on that very same system, and it was a standalone server.

The point here is to look at what might have staved off the outage and what can be done to remedy that for the future.
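The "alerts going to someone who left" scenario is the kind of thing a periodic audit can catch. The sketch below assumes you can export your alert rules and a list of valid recipient addresses from whatever monitoring and directory tools you use; the rule names and addresses are hypothetical.

```python
def audit_alert_rules(rules, valid_recipients):
    """Flag rules that notify stale addresses or fewer than two valid ones."""
    findings = []
    for name, recipients in rules.items():
        stale = sorted(set(recipients) - valid_recipients)
        live = [r for r in recipients if r in valid_recipients]
        if stale:
            findings.append(f"{name}: recipients no longer valid: {', '.join(stale)}")
        if len(live) < 2:
            findings.append(f"{name}: only {len(live)} valid recipient(s)")
    return findings

# Hypothetical exports from a monitoring tool and a staff directory.
rules = {
    "exchange-log-volume": ["former.admin@example.com"],
    "san-health": ["it-oncall@example.com", "ops@example.com"],
}
valid_recipients = {"it-oncall@example.com", "ops@example.com"}

for finding in audit_alert_rules(rules, valid_recipients):
    print(finding)
```

The two-recipient minimum is an arbitrary choice for the example; the point is simply that no alert should depend on a single inbox.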

6. Determine how to improve technological processes

Perhaps you found in the previous step that no safeguards had failed (or there were no safeguards!) but there still weren't sufficient preventative measures. This is where the prior steps will deliver value since you can now determine what needs to be done to keep the company from ending up in the same spot again.

Consider implementing additional monitoring and alerting, such as leveraging text messaging capabilities to contact IT staff immediately when potential problems are detected. Perhaps redundancy can be introduced or improved so that a single server runs in a cluster or an active/passive setup so a server failure won't cause service downtime. Using multiple ISPs with multiple internet gateways can help network traffic keep flowing if there is an ISP outage or an upstream router fails. Even conducting daily physical walk-throughs of a data center can come in handy to spot warning lights or discover alarm bells on a system experiencing problems.

7. Determine how to improve human processes

The technology part is only half of the improvement plan. Better human practices often go hand-in-hand with preventing future outages, especially if this one was caused by human error or misconduct.

Consider whether a "peer approval" system - whereby one person types a command and the other person verifies it is correct before the enter key is pressed - might come in handy. Does change management need to be introduced, whereby proposed changes are described and submitted for approval? Are staff working on systems late at night and subject to fatigue, which causes lapses in attention or judgment, and if so can this work be scheduled for another time? Do staff need additional training to help hone their skills?

Even simple habits such as typing "hostname -f" on a Linux system or "set" on a Windows system (which lists the COMPUTERNAME variable among others) to confirm the host name is correct before taking action on it can serve as a useful safeguard.
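The same habit can be built into maintenance scripts so they refuse to run on the wrong machine. In this sketch the function names and the example host name are placeholders, and socket.getfqdn() is used as a rough stand-in for what "hostname -f" reports; the two can differ depending on how name resolution is set up.

```python
import socket

def on_expected_host(expected, actual=None):
    """Return True only when this machine's name matches the intended target."""
    actual = actual or socket.getfqdn()
    return actual.lower() == expected.lower()

def run_if_expected_host(expected, action, actual=None):
    """Run 'action' only on the intended host; otherwise refuse loudly."""
    if not on_expected_host(expected, actual):
        raise RuntimeError(f"refusing to run: this is not {expected}")
    return action()
```

A cleanup script could then wrap its risky step as run_if_expected_host("mail01.example.com", purge_old_logs), where purge_old_logs is whatever function does the actual work.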

8. Implement and test the improvements

Put your proposed changes in place, document the improvements and notify staff of the details and how to administer them (if applicable) so these will become the new standards going forward.

But don't just blindly trust that this will work and there's no need for further concern. Test the changes during an arranged maintenance window. For instance, with the example of the Exchange server with the full log volume, copy a set of large files to the drive to bring it up to a level which should trigger an alert (75% full, for instance) and confirm the appropriate personnel were contacted accordingly.
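For a sense of what such a check boils down to, here is a minimal sketch of a disk-usage test against the 75% example threshold. It is not how any particular monitoring product works; the function names are made up, and a real setup would hand the result to your alerting system rather than print it.

```python
import shutil

def percent_used(path):
    """Return how full the volume holding 'path' is, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def should_alert(percent, threshold=75.0):
    """Return True once usage reaches the alert threshold."""
    return percent >= threshold

current = percent_used(".")
status = "ALERT" if should_alert(current) else "ok"
print(f"{current:.1f}% used: {status}")
```

Separating the measurement from the threshold decision means the decision logic can be tested with made-up numbers, while the end-to-end test in the maintenance window confirms that the right people actually get contacted.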

9. Decide who to notify

This can be one of the toughest steps listed here. Even though the incident has been resolved and is being properly wrapped up, notifying users or customers of the outage may still be necessary so that they understand what happened and what's being done about it.

It's critical to keep relevant people in the loop to maintain credibility, lay out the ramifications of the outage and discuss what safeguards are being put in place to prevent an outage of this nature from reoccurring, or to facilitate a quicker recovery next time.

Even if nobody noticed the outage in the first place, it's better to inform people after the fact than to risk someone noticing it later, along with your failure to address the issue.

10. Move on and adjust as needed

A production outage can be costly, time-consuming, frustrating and even embarrassing. Many an IT professional has taken a hit to their ego and reputation (or the perception thereof) and found it difficult to let go of such episodes and move on.

It's important to do so, however, for the sake of one's morale and career. Letting such matters eat away at your attention can also lead to further technological problems.

Adjust the improvements put in place here as needed and keep in mind some outages may be inevitable, as every ISP or telephone company can attest, so the question should not be, "Did something bad happen?" but "What did we do to solve the problem?"

About Scott Matteson

Scott Matteson is a senior systems administrator and freelance technical writer who also performs consulting work for small organizations. He resides in the Greater Boston area with his wife and three children.
