By Harris Kern

Problem management is a continuous process. It encompasses problem detection, documentation of the problem and its resolution, identification and testing of the solution, problem closure, and generation of statistical reports. IT managers often skip or mishandle one or more of these steps, which increases the demands on the staff.

Here’s a breakdown of the steps for managing problems. If you follow them correctly, you’ll reap great benefits for your organization, including increased staff productivity and end-user satisfaction.

Step 1: Define problem management process and practices
The first step in establishing an effective problem management discipline is to publish a plan on how to handle problems. This plan should cover the following:

  • Procedures for handling problems: What is done after a problem is detected and reported, how problem data is captured and stored, and how the problem is managed to resolution
  • Roles and responsibilities of the IT support staff: Who receives the problem, who records all information, who handles problem resolution, and what each entity is supposed to do
  • Measurements for problem resolution: What will be tracked to monitor the efficiency of the problem management discipline
  • Problems to be handled and how to classify them: Severity and priority assignment methodology, and escalation guidelines
  • Bypass procedures: Actions that can be taken to immediately restore system availability in the event of specific events or problems
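
To make the plan concrete, it can help to decide up front what a problem record will hold. The following Python sketch is one possible layout, not a prescribed format: the class name, every field name, the 1-to-4 severity scale, and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ProblemRecord:
    """Hypothetical problem record; all field names are illustrative."""
    problem_id: int
    description: str
    reported_by: str
    opened_at: datetime
    severity: int = 4                  # assumed scale: 1 = most severe
    owner: Optional[str] = None        # who manages the problem to resolution
    bypass_actions: List[str] = field(default_factory=list)

    def record_bypass(self, action: str) -> None:
        # Store bypass activity with the problem so it survives a handoff.
        self.bypass_actions.append(action)


# Made-up example data.
record = ProblemRecord(1, "Mail server not responding", "help desk",
                       datetime(2024, 1, 8, 9, 30))
record.record_bypass("Restarted mail service")
```

Keeping bypass actions inside the record, rather than in a separate log, is a design choice that keeps the full history together when the problem changes hands.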

Step 2: Detect or recognize the problem
In this step, you activate the necessary tools to detect problems. Use all facilities for capturing problem reports, including the help desk. Gather data and record all pertinent information in a location accessible to all support staffers. Notify affected users to help minimize the impact of the problem.

Step 3: Bypass the problem
As soon as the problem is detected, take all possible steps to bypass it or minimize its impact on users. Ideally, you should identify bypass procedures in advance, ensuring that they’ll have no side effects on other systems, applications, or users. Keep in mind that a bypass is not a resolution of the problem. All too often, IT treats a bypass as a permanent fix, only to have the system eventually fail because the bypass was not designed to run forever or because the bypass affected other systems.

In some cases, IT managers use bypass procedures so often that they become the norm for “solving” the problem, when they actually do little to prevent the problem from happening again. Examples include rebooting a server or network router without identifying the source of the failure or pressing [Ctrl][Alt][Del] when a PC hangs instead of finding the failing software application and fixing it.

Record all bypass activities along with the problem information so that when the problem is passed to other support staffers, no relevant information is lost.

Step 4: Analyze the problem
At this stage, identify the true cause of the problem and evaluate, test, and apply possible resolutions. Review past problem records to see whether similar problems have occurred before. Efficient, effective problem analysis can significantly reduce the time it takes for resolution.

Step 5: Manage the problem to resolution
Many times, a single support professional can’t resolve a problem entirely unaided, and the problem must be shared among multiple support staffers, especially if it’s complex or involves multiple systems or applications. It’s important that someone monitor and manage the problem to resolution, making sure it’s resolved within the process performance targets.

Once the problem has been fixed, flag it as temporarily closed for a given period of time, such as one week. After this period lapses, ask the affected users whether the problem has recurred, or whether any unwanted effects were caused by the fix. If not, you can close the problem permanently.
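
The closure rule above can be sketched as a small status check. This is a minimal Python illustration: the status names, the function name, and the sample dates are hypothetical, and reopening the problem when it recurs is an assumption, since the rule only says when a problem may be closed.

```python
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    OPEN = "open"
    TEMPORARILY_CLOSED = "temporarily closed"
    CLOSED = "closed"


OBSERVATION_PERIOD = timedelta(weeks=1)  # the example period of one week


def review_after_fix(fixed_at, now, recurred, side_effects):
    """Return the status a fixed problem should have at time `now`."""
    if recurred or side_effects:
        return Status.OPEN                 # assumption: reopen if the fix failed
    if now - fixed_at < OBSERVATION_PERIOD:
        return Status.TEMPORARILY_CLOSED   # still under observation
    return Status.CLOSED                   # period lapsed, users report no issues


fixed = datetime(2024, 1, 8)               # made-up date
early = review_after_fix(fixed, datetime(2024, 1, 10), False, False)
late = review_after_fix(fixed, datetime(2024, 1, 16), False, False)
print(early.value, "->", late.value)
```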

Step 6: Report on the status and trends of problems
The next step is to gather problem statistics and generate summary reports for identifying trends and implementing preventive measures. These reports may include:

  • Summary of closed problems: Problems that occurred, how long it took to resolve them, and what the solutions were
  • Status of open problems: Existing unresolved problems, when they were opened, and why they remain as unresolved action items
  • Problem trends and statistics: Number and type of problems, areas affected, frequency of occurrence
  • Root cause of problems report: Problems that occurred, why they occurred, what can be done to prevent recurrence
  • Action plan for the next period: Plans to improve on problem trends and resolution times

These reports inform IT management of the current health of the system and offer a way to communicate with users on IT’s support activities.
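
As a rough illustration of the trend and summary reports, the Python sketch below counts closed problems by category and computes the average time to resolve them. The log layout, the category names, and the timestamps are invented for the example; a real report would read from whatever problem database the organization uses.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical closed-problem log: (category, opened, closed)
closed_problems = [
    ("network", datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 13)),
    ("disk", datetime(2024, 1, 3, 10), datetime(2024, 1, 4, 10)),
    ("network", datetime(2024, 1, 5, 8), datetime(2024, 1, 5, 10)),
]


def summarize(problems):
    """Count problems per category and average the hours to resolve them."""
    counts = Counter(category for category, _, _ in problems)
    hours = [(closed - opened).total_seconds() / 3600
             for _, opened, closed in problems]
    return counts, mean(hours)


counts, avg_hours = summarize(closed_problems)
print(counts.most_common())
print(f"average resolution time: {avg_hours:.1f} h")
```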

Step 7: Redefine the problem management process if necessary
The redefinition step provides a way to refine or enhance the existing problem management discipline, based on the measurements collected in the earlier steps. It’s part of the continuing improvement cycle of this and all other systems management disciplines.

The following process factors are critical to the success of problem management.

All problems, big and small, are covered
Small problems lead to (or are symptoms of) bigger problems, so it’s important that you record all problems. The recurring data read error eventually becomes a bad disk problem. The intermittent LAN connection problem sooner or later turns out to be a broken cabling problem. And the nuisance General Protection Fault error in Windows may turn out to be caused by a bad memory component.

Escalation procedures are followed
Many IT support staffers mistakenly believe that escalating a problem is an admission of incompetence, so they violate established escalation guidelines. This situation is dangerous—IT management loses control over the problem, often without even realizing it.

Problems are assigned severity levels and prioritized accordingly
All problems should be covered, but they should not all be treated the same. On the contrary, you need a severity and priority assignment methodology that ensures important problems are handled first. A problem may be considered more severe if one or more of the following conditions are true:

  • Multiple users are affected
  • Critical business function cannot be performed
  • Alternate systems are not available
  • Entire system is not available
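
The four conditions above could feed a simple severity rule. The Python sketch below is one hypothetical scheme: the field names, the 1-to-4 scale, and the equal weighting of the conditions are all assumptions, and a real methodology would likely weight them differently.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Hypothetical flags, one per condition in the list above."""
    multiple_users_affected: bool = False
    critical_function_blocked: bool = False
    no_alternate_system: bool = False
    entire_system_down: bool = False


def severity(impact: Impact) -> int:
    """Map the number of conditions met to an assumed 1-4 scale (1 = worst)."""
    met = sum([impact.multiple_users_affected,
               impact.critical_function_blocked,
               impact.no_alternate_system,
               impact.entire_system_down])
    return max(1, 4 - met)  # each condition met raises severity by one level


outage = Impact(multiple_users_affected=True, entire_system_down=True)
print(severity(outage))
```

Sorting the open-problem queue by this number would put the most severe problems first.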

Users are updated on the status of the problems
Users experience major frustration when they have to wait for IT support staff to update them on the status of a problem, so it’s imperative that the IT staff in charge of managing the problem update affected users regularly. Users appreciate knowing what’s been done, the current status of the problem, and when to expect a resolution.

Problem trends are analyzed and measures are taken to address them
The objective of systems management is to make everyone proactive in resolving problems. The analysis of problem statistics is a valuable tool for achieving this goal, because it helps identify potential problems based on past experience.

With a well-defined problem management process in place, your IT organization will be able to solve repetitive problems, reduce the number and impact of problems, shorten problem resolution time, and improve support staff productivity.

The Harris Kern Enterprise Computing Institute is a consortium of publications (books, reference guides, tools, articles) developed through a unique conglomerate of leading industry experts responsible for the design and implementation of “world-class” IT organizations. For more information, visit the Harris Kern Enterprise Computing Institute Web site.