In this post, we continue with identifying one or more root causes of an actual event documented in a previous article. Let’s begin by reviewing the tasks we’ve already completed.

As part of an overall 8D problem solving approach, we began by identifying the cause and effect chain leading to the direct or proximate causes of the primary event–a system failure caused by a system console replacement. Starting with the primary event, we asked why for each of the effects/causes, up to five iterations. The results were diagrammed as shown in Figure 1.

Figure 1: Result of 8D Five Why Process

The next step is identification of conditions which, when combined with a specific action, contributed to or started the unwanted chain of events.

Identify conditions

As shown in Figure 1, the response to “why” is usually a description of an action taken. It might also describe an event caused by system failure or scheduled processing. In many cases, these activities or events were properly executed and occurred in the right order. What turns them into an unwanted outcome is often a change in conditions or context.

For example, employees near a plant entrance might consistently toss just-used matches into a drum used for that purpose. When nothing else is placed into the drum, the process works. However, if one day someone decides to toss kerosene-soaked rags into the drum, the next deposit of a “hot enough” match might result in a fire. The drum is the same. Placing matches in the drum is normal. However, a condition changed: the presence of combustible material.

So for each of the answers in Figure 1, we need to describe any unusual conditions present at the time. This is what I did in Figure 2. Note that not all actions are associated with a relevant condition. Too often, I’ve seen teams get hung up on defining conditions for each activity. Don’t get caught in that trap.

Figure 2: Adding verifiable conditions

Tell the story and verify observations

Once the draft diagram is complete, built by working from the event backward in time, the next step is to walk the diagram from the bottom up. At each activity/condition pair, stop and make sure you have verifiable evidence to support it. Assumptions are not allowed. If no verifiable evidence (e.g., logs or firsthand observations) is available, remove the unsupported activity or condition from the diagram.
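For teams that keep the diagram in a script or a spreadsheet export, the evidence check can be sketched in a few lines of Python. This is only a minimal sketch: the Node class, its field names, and the sample entries are hypothetical and are not taken from Figure 2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """An activity or condition on the diagram (hypothetical model)."""
    kind: str                                          # "activity" or "condition"
    description: str
    evidence: List[str] = field(default_factory=list)  # logs, firsthand observations


def prune_unverified(nodes: List[Node]) -> List[Node]:
    """Return only the nodes backed by at least one piece of evidence."""
    return [node for node in nodes if node.evidence]


# Illustrative entries only; they are not copied from the figures.
diagram = [
    Node("activity", "Console PC replaced", ["help desk ticket"]),
    Node("condition", "PC assumed to be a standard desktop", []),
    Node("condition", "PC controlled a production system", ["operator observation"]),
]

verified = prune_unverified(diagram)
for node in verified:
    print(f"{node.kind}: {node.description} <- {', '.join(node.evidence)}")
```

Anything that drops out of the list is an assumption, and it should either be backed up with evidence or left off the diagram.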

The result of this walk-through should be a set of actions actually taken within verifiable contexts that, when stepped through, tell a clear and cohesive story about the events leading up to the primary event.

Once the team is satisfied with the cause and effect diagram, it’s time to identify root cause(s).

Identify root cause

I’ve found that removing or changing unwanted conditions is typically more productive than removing actions. In our example, the actions were all correct given a specific equipment replacement context. However, at least one of the conditions listed in Figure 2 indicates a misunderstanding of the process actually needed to replace this specific workstation.

As indicated in Figure 3, the team decided that the root cause was a missing process, or a missing step in an existing process, in which the Help Desk verifies the type of system being replaced. Knowing this PC was controlling a production system, not just acting as a standard end-user device, should have initiated a different method of replacement, including use of the established change management process. Change management requires notification of all affected technical teams and includes formal quality assurance testing. This would probably have stopped the chain of events leading to failure.

Figure 3: Locating root cause in a cause and effect diagram

Again, adjustments made early in the cause and effect chain are usually more effective than those implemented close to the proximate cause. Our identified root cause satisfies this guideline. Another way to confirm we’ve identified an actual root cause is to step through the replacement process again, this time conceptually applying the missing control. If the primary event’s probability of occurrence is reduced to an acceptable level, then this is a root cause.

Sometimes, making a single adjustment doesn’t sufficiently reduce the probability of occurrence. Because of political, financial, or technical constraints, the team is unable to go far enough. When this happens, there are two possible solutions. First, the team can identify another point at which they can improve or implement a control. The goal is to arrive at the desired probability of occurrence with a combination of changes instead of a single root cause remediation.
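As a rough illustration of combining changes, assume each control independently prevents some fraction of occurrences. Under that simplifying assumption (real controls are rarely fully independent), the remaining probability is the baseline multiplied by each control's miss rate. The function name and every number below are hypothetical.

```python
from typing import Iterable


def residual_probability(baseline: float, effectiveness: Iterable[float]) -> float:
    """Probability the event still occurs after applying independent controls.

    baseline: probability of occurrence with no added controls (0 to 1)
    effectiveness: for each control, the fraction of occurrences it prevents (0 to 1)
    """
    p = baseline
    for e in effectiveness:
        if not 0.0 <= e <= 1.0:
            raise ValueError("effectiveness must be between 0 and 1")
        p *= (1.0 - e)
    return p


# Hypothetical numbers for illustration only.
target = 0.01
one_control = residual_probability(0.20, [0.80])          # roughly 0.04
two_controls = residual_probability(0.20, [0.80, 0.80])   # roughly 0.008

print(one_control <= target)    # False: a single control is not enough
print(two_controls <= target)   # True: the combination reaches the target
```

The point is not the arithmetic itself but the habit of asking whether the combined changes actually reach the level the team agreed was acceptable.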

The second solution borrows from the Failure Modes and Effects Analysis (FMEA) methodology. The results of an FMEA are similar to those of a root cause analysis. The primary difference is timing. An FMEA is usually completed BEFORE a process or system is implemented, to reduce or eliminate design issues. The concept I want to take from FMEA is the desire to achieve a balance between probability of occurrence and detectability. When probability of occurrence cannot be sufficiently reduced, improving detectability is critical, because it improves an organization’s ability to respond quickly and mitigate negative impact.
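One common way FMEA practitioners express this balance is the risk priority number: severity, occurrence, and detection are each rated from 1 to 10 and multiplied together, with a higher detection rating meaning the failure is harder to detect. The scoring is not part of the analysis described in this post; the sketch below, with hypothetical ratings, only shows how improving detectability lowers the score when occurrence cannot be reduced.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """Classic FMEA score: each rating runs from 1 (best) to 10 (worst).

    A high detection rating means the failure is hard to detect.
    """
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return severity * occurrence * detection


# Hypothetical ratings: occurrence stays at 5, so detectability is improved instead.
before = risk_priority_number(severity=8, occurrence=5, detection=7)
after = risk_priority_number(severity=8, occurrence=5, detection=2)
print(before, after)  # 280 80
```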

The action plan

The final step in root cause analysis is implementation of the administrative, technical, or physical controls needed to remediate the root cause. The only way to ensure completion is via a formal action plan. The plan should list the tasks necessary to reduce probability of occurrence or increase detectability. For each of these tasks, the plan should include the resource assigned and the expected completion date. And don’t forget: someone has to actually own and manage the plan.
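The plan itself needs very little structure. A minimal sketch, assuming the plan is tracked in code rather than in a document or ticketing tool, might look like the following; the class names, role names, tasks, and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Task:
    description: str
    assigned_to: str      # the resource responsible for the task
    due: date             # expected completion date
    done: bool = False


@dataclass
class ActionPlan:
    owner: str            # the person who owns and manages the plan
    tasks: List[Task]

    def overdue(self, today: date) -> List[Task]:
        """Open tasks whose expected completion date has passed."""
        return [t for t in self.tasks if not t.done and t.due < today]


# Hypothetical plan for illustration only.
plan = ActionPlan(
    owner="Service delivery manager",
    tasks=[
        Task("Add system-type check to replacement procedure", "Help Desk lead", date(2024, 3, 1)),
        Task("Route production consoles through change management", "Change manager", date(2024, 4, 1)),
    ],
)

late = plan.overdue(today=date(2024, 3, 15))
for task in late:
    print(f"OVERDUE: {task.description} ({task.assigned_to}, due {task.due})")
```

Whatever tool holds the plan, the owner should be able to answer the same question this sketch does: which tasks are open and past their date.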

The final word

The only way to prevent recurrence of unwanted events is to eliminate the underlying causes. Treating symptoms is usually easier, because no formal processes are needed; IS teams don’t have to actually talk to each other. But repeatedly treating the sickness when the underlying cause is a compromised immune system will not result in real IT service delivery improvements.