Security

Prevent recurring problems with root cause analysis

In this series, we'll step through an easy root cause analysis process that requires no special training -- just a little effort and a lot of common sense.

Documented, management-supported incident response processes -- even processes for which response teams are well trained -- won’t necessarily achieve the ultimate objective of preventing recurrence unless root causes are identified.

In Part 1, we look at constructing a simple root cause diagram for later analysis.

Why worry about root cause?

Many organizations, and even well-trained response teams, fail to prevent recurrence of unwanted events because they treat the symptoms instead of the underlying causes. For example, if payroll is late because a switch failed, many organizations would simply look at how to deal better with switch failure. But switch failure may be the proximate cause, not the root cause. For our purposes, I define proximate cause as the activity that occurred immediately before the incident, in space or in time.

The root cause is often a failed control, a failed process, or a gap in staff skill sets that caused an earlier condition or event. This earlier event set off a series of causes and effects leading to the proximate cause. In Figure 1, for example, the root cause condition or activity occurred at Event 2, well before the proximate cause. The best way to prevent recurrence is to change what happened at Event 2. In other words, making changes to processes or conditions early in the chain of events is usually better than managing proximate causes.

In our switch failure example, the organization might discover that the underlying problem is a missing or broken change management process. Fixing this root cause will not only prevent the switch failure from recurring; it will also help prevent other, unrelated failures.
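To make the distinction concrete, here is a minimal Python sketch of a cause-and-effect chain. It is an illustration only, not part of any formal method: the `Event` class, the `failed_control` flag, and both helper functions are hypothetical names, and the "unreviewed change" link is an assumed intermediate event added to fill out the chain. Only the late payroll, the switch failure, and the change management gap come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Event:
    description: str
    # True when the event reflects a failed control, a failed process, or a skills gap.
    failed_control: bool = False


def proximate_cause(chain):
    """Return the event immediately before the incident (the last item in the chain)."""
    return chain[-2]


def root_cause(chain):
    """Return the earliest event flagged as a failed control, or None if none is flagged."""
    for event in chain[:-1]:
        if event.failed_control:
            return event
    return None


# Ordered earliest to latest; the final entry is the incident itself.
chain = [
    Event("Change management process missing or broken", failed_control=True),
    Event("Unreviewed change made to the switch (hypothetical link)"),
    Event("Switch failed"),
    Event("Payroll ran late"),
]

print("Proximate cause:", proximate_cause(chain).description)
print("Root cause:", root_cause(chain).description)
```

The point of the sketch is that the two functions look at opposite ends of the chain: the proximate cause sits right next to the incident, while the root cause is found by searching from the earliest event forward.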

Figure 1: Root cause conceptual diagram

Building a simple root cause diagram

There are many ways to build a root cause diagram. The approach most root cause trainers favor is the fishbone, or Ishikawa, diagram. A simple fishbone is shown in Figure 2, with a more complicated analysis shown in Figure 3 (childrensmercy.com).

Figure 2: Simple fishbone diagram

Figure 3: Complex fishbone diagram

However, most of us technical types are neither prepared nor inclined to spend time building complex decision/analysis frameworks. We need something more straightforward, something that quickly gets to root cause so we can move on to the next user or system issue that arose while we worked this one. The "8D Five Whys" is my answer to this challenge.

8D problem solving consists of eight steps that lead from an incident to managed resolution, including root cause analysis and recurrence prevention. Step 4 (D4) is root cause analysis, with a very simple approach. Ask why five times and you should be able to identify the fundamental issue, or issues, leading to the primary event. Although many 8D practitioners don’t actually graph their answers, I prefer to do so. As we’ll discuss in Part 2, a picture often makes it easier to "see" the problem.

Let's step through a real-world example, shown in Figure 4. In this incident, a vendor-supplied desktop computer that controlled a critical production system was replaced. The production system immediately failed, interrupting a critical process.

Figure 4: 8D five why root cause diagram

The root cause analysis team was formed by following the company's after-action review (AAR) process, ensuring complete and objective recording of events. To begin, the analysis facilitator asked the first why. Why did the incident happen? Two proximate causes were identified. First, the replacement system was not configured properly. Second, the response to user problem reports was not effective. The team agreed these two causes should be treated separately because they appear to result from different cause-and-effect chains. For our example, we’ll focus only on what caused the bad system configuration.

The facilitator continued by asking the second why. Why was the system configured improperly? This continued for each answer through three more iterations. The assumption is that this is sufficient granularity to identify root cause. But root cause sometimes is not apparent after answering the fifth why. When that happens, the team must step through the process again, looking for activities or conditions which might have been omitted on the first pass.
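For readers who prefer to record the answers in a script rather than on a whiteboard, the questioning above can be sketched as a small tree in Python. This is a rough sketch under stated assumptions, not a standard 8D tool: `Cause`, `build_chain`, `candidate_root_causes`, and `MAX_WHYS` are hypothetical names, and the answers to whys 2 through 5 are placeholders because the team's actual answers appear only in Figure 4.

```python
from dataclasses import dataclass, field

MAX_WHYS = 5  # the Five Whys convention: ask "why?" five times per chain


@dataclass
class Cause:
    statement: str
    # Each entry answers "why did this happen?" for this statement.
    answers: list = field(default_factory=list)


def build_chain(statements):
    """Link statements so each one answers 'why?' for the statement before it."""
    head = Cause(statements[0])
    node = head
    for statement in statements[1:]:
        child = Cause(statement)
        node.answers.append(child)
        node = child
    return head


def candidate_root_causes(node, depth=0):
    """Yield (statement, depth) for the last answer recorded on each chain."""
    if not node.answers:
        yield node.statement, depth
        return
    for answer in node.answers:
        yield from candidate_root_causes(answer, depth + 1)


# First-why answers come from the article; deeper answers are placeholders.
config_branch = build_chain(
    ["Replacement system was not configured properly"]
    + [f"Placeholder answer to why {n}" for n in range(2, MAX_WHYS + 1)]
)
incident = Cause(
    "Production system failed after desktop replacement",
    [config_branch, Cause("Response to user problem reports was not effective")],
)

results = list(candidate_root_causes(incident))
# Chains that stop short of five whys still need more questioning.
unfinished = [statement for statement, depth in results if depth < MAX_WHYS]
print(results)
print("Needs more whys:", unfinished)
```

Keeping each proximate cause as its own branch mirrors the team's decision to treat the two causes separately. A chain that reaches the fifth why is only a candidate: the team still has to judge whether it is a genuine root cause or whether another pass is needed.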

Like any AAR process, root cause analysis must be free from finger-pointing. Every participant must understand his or her participation will not result in disciplinary action or peer ridicule, and management must back up these assertions.

We’ll continue this process in Part 2 by identifying one or more root causes and deciding what to do about them.

About

Tom is a security researcher for the InfoSec Institute and an IT professional with over 30 years of experience. He has written three books, Just Enough Security, Microsoft Virtualization, and Enterprise Security: A Practitioner's Guide (to be publish...
