Columnist Tom Mochal receives dozens of e-mails each week from members with questions about project management problems. Mochal shares member questions and provides answers in a column each month.

Question:
Our group has difficulty debugging some of our complex programs. I’m not talking about the garden-variety bugs; I mean the ones that pop up in programs that have previously run successfully for a long time. We have some simple debugging tools, but they don’t always do the job. Do you have a process that you follow to find these bugs in the shortest amount of time?

–“Buggy” Henderson

Answer:
Buggy,

I guess I should show my age by first stating that I’m old enough to have dealt with the hundreds (or thousands) of pages of paper you could generate when you encountered an error in a mainframe program. The dumps actually allowed you to look up the appropriate memory addresses to see what the program variable values were and to see the last records processed. The best thing about this experience was that it gave me a chance to tell an old war story at the beginning of this column.

Of greater value than reading the dumps was the thought process I went through when trying to track down bugs. The techniques I learned are still useful today, even if you’re a Web developer or use other cutting-edge development technology.

In many cases, when you encounter a bug, you can look at the general characteristics and the error message generated to quickly determine the cause. In other cases, there’s nothing obvious that shows the point of failure. I call this latter variety “subtle bugs.” They’re not easy to find, and they’re generally caused by a combination of events that masks the root cause. Subtle bugs can take a long time to resolve.

Obviously, you want to use any good automated debugging tools that you have. However, you should use the tools in combination with a good set of detective skills. The next time you run into a subtle bug, try using some of these debugging techniques:

  • Always re-create the error first. This piece of advice seems obvious, but it’s the cause of a lot of frustration for inexperienced programmers. For the most part, you cannot solve what you cannot see. If a user tells you that he or she encountered an error, you can look forever trying to find a cause. However, is the user giving you the exact sequence of events? Usually not. You need to carefully re-create the scenario to a point where the error occurs consistently. If you cannot re-create the error, ask the user to stay vigilant and notify you if the problem occurs again. If it does, with luck, you’ll be in a position to capture enough information to re-create the problem.
  • Display the interim values. If you don’t have a tool that allows you to interrogate the code flow, you should display interim variable values as the program executes. Doing so starts to build a visual picture of what is going on and can help you understand the program flaw.
  • See if anything has changed. When stable programs suddenly go bad, the problem is usually caused by events out of the ordinary. Your first step is to see if the program has changed recently—ditto for any interface programs. If they have, then restore the old programs in the test environment and see if you can re-create the error. If you can’t re-create the error, it was probably introduced in the recent changes. If you can re-create it with the old programs, then the faulty code has probably been around for a while.
  • Narrow down the code. Some programs do a lot of processing, which can make it difficult to see what’s going on. To make your debugging task more manageable, comment out large sections of the code and then try to re-create the error. This is a process-of-elimination approach. If you comment out a large block of code and the program runs fine, then the offending code is in the commented block. Next, uncomment subsections or a small number of lines at a time. Each time you do so, run the program and see if the bug occurs. When the bug hits again, the code you just restored is the code causing the error.
    This technique worked for me when I was trying to help a programmer track down a program memory leak. The programmer was convinced that a vendor component was causing the leak. After discussing the possibility and visiting the vendor Web site for possible patches, I suggested that the programmer simply comment out the line of code that utilized the vendor component. He did and reran the program. The memory leak was still there, and the programmer was able to conclude that the problem didn’t lie in that component after all.
  • Narrow down the data. This technique is similar to the previous suggestion, except you start to narrow down the data instead of the code. If the program uses large files or tables, start by cutting the data in half. If the error is still there, then continue to selectively reduce the input. If the error no longer occurs, then work with the data you first eliminated, since the error combination is probably in that set of data. Theoretically, by progressively eliminating chunks of data, you should be able to isolate the error to a single row (or record), or combination of rows, that is causing the problem.
  • Look for patterns. In many cases, errors occur over and over. When this happens, it is important to isolate the pattern. For instance, you may find that an error occurs when processing every other transaction, or an error may occur for certain people but not for others. If you can detect a pattern, you typically have a head start to solving the problem.
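
To make the "narrow down the data" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a tool from my own kit: the names narrow_down, process, and reproduces_error are all hypothetical, and process is a stand-in for whatever program you are actually debugging. The sketch cuts the input in half, keeps whichever half still reproduces the error, and prints the interim sizes as it goes so you can watch the search close in.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def narrow_down(records: Sequence[T],
                fails: Callable[[Sequence[T]], bool]) -> List[T]:
    """Halve the input repeatedly, keeping a half that still reproduces the error.

    `fails` is a caller-supplied check (hypothetical here) that returns True
    when the error occurs for the given records.
    """
    current = list(records)
    if not fails(current):
        raise ValueError("the error does not reproduce on the full input")

    while len(current) > 1:
        mid = len(current) // 2
        first, second = current[:mid], current[mid:]
        # Display the interim values so the search is visible.
        print(f"{len(current)} records still fail; "
              f"trying halves of {len(first)} and {len(second)}")
        if fails(first):
            current = first
        elif fails(second):
            current = second
        else:
            # Neither half fails alone, so the error needs a combination
            # of rows from both halves. Stop and inspect what is left.
            break
    return current


def process(records):
    """Hypothetical stand-in for the real program: unit price per record."""
    return [r["total"] / r["quantity"] for r in records]


def reproduces_error(records) -> bool:
    """Return True if the stand-in program hits its (assumed) failure."""
    try:
        process(records)
    except ZeroDivisionError:
        return True
    return False


if __name__ == "__main__":
    # Made-up sample data with one bad record buried in the middle.
    data = [{"id": i, "total": 100.0, "quantity": 1 + i % 5}
            for i in range(1000)]
    data[637]["quantity"] = 0
    print(narrow_down(data, reproduces_error))
```

Each pass throws away roughly half of the input, so even a file with a million records takes only about twenty passes. When neither half fails on its own, the sketch stops and hands back the remaining records, which is the "combination of rows" case that you then have to examine by hand.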

Conclusion
Subtle bugs can take hours or days to resolve. They can take much more time if you don’t have an overall game plan for determining the cause. Good, sound investigating techniques like the ones I’ve presented here are absolutely vital to resolving subtle bugs as quickly as possible.

