In part one of this series, I discussed the basics of troubleshooting. In part two, I outlined the practice of troubleshooting from the general to the specific. In this third installment, I’m going to discuss the methodologies used to narrow the scope of your analysis. These easy-to-follow methods can save you time and effort when diagnosing a problem. Don’t go into the field without them.
Ever since the first computer failed, the process of troubleshooting has been continuously refined. Today, this process generally follows these steps:
- Gather the symptoms
- Examine the system
- Look for the obvious
- Ask questions
- Examine the outputs
- Play the odds
- Stack the deck in your favor
Gather the symptoms
Have you ever finished fixing a problem and said, “I wish I’d looked at that earlier”? Usually, this is a clear indication that you didn’t look at all of the symptoms. Asking the question, “What is wrong with the system?” is one of the keys to more effective troubleshooting. More specifically, you should ask which of the known outputs of the system have moved outside of their acceptable range. By starting with the big picture and narrowing it down from there, you are less likely to overlook critical symptoms. No matter how simple or complex the system, troubleshooting begins by gathering the problem’s symptoms.
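To make the idea of outputs moving outside their acceptable range concrete, here is a minimal Python sketch. The output names and the limits are assumptions invented for illustration; substitute whatever your own system actually reports.

```python
# Hypothetical acceptable ranges for a few system outputs: (low, high).
ACCEPTABLE = {
    "cpu_temp_c": (20.0, 80.0),
    "fan_rpm": (800.0, 5000.0),
    "disk_free_pct": (10.0, 100.0),
}

def out_of_range(readings, limits=ACCEPTABLE):
    """Return the outputs whose readings fall outside their acceptable range."""
    symptoms = {}
    for name, value in readings.items():
        if name not in limits:
            continue  # no known range, so this output cannot be judged
        low, high = limits[name]
        if not (low <= value <= high):
            symptoms[name] = value
    return symptoms

# Hypothetical readings gathered during the first look at the system.
readings = {"cpu_temp_c": 91.5, "fan_rpm": 0.0, "disk_free_pct": 42.0}
print(out_of_range(readings))
```

Collecting every out-of-range output before touching anything is the point: the list as a whole, not the first entry, is what you troubleshoot from.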
Examine the system
The first step in troubleshooting a problem is to look for the obvious symptoms. Start the process of gathering symptoms by conducting an examination of the system. All your senses should be involved at this stage.
- Do you smell something strange? If you want to strike fear into the heart of a systems administrator, give them a whiff of an overheated computer. When carbon (used to make resistors) gets too hot, it gives off a very distinctive odor.
- Do you hear an unusual sound? Contrary to popular belief, computers do have moving parts. The two biggest noisemakers in a computer are cooling fans and disk drives. When these parts start to wear out, the first symptom you are likely to identify will be an audible one.
- Do you see something out of place? Examining the connectors and checking the LEDs are some of the fastest ways to start gathering symptoms.
- Do you have any loose components? In my experience, loose connections are the number one cause of computer failure.
This is just a cursory examination; you don’t have to look at every leaf in the tree. You are only starting the troubleshooting process, and you still have a long way to go before you reach the fault isolation stage.
The first symptoms of a problem are usually obvious. A common mistake in troubleshooting, however, is to fix the first symptom you find. As a result, you wind up fixing symptoms and never address the root problem. To avoid this pitfall, take your time and follow the steps outlined in this series of articles.
Look for the obvious
I have a coworker named Bob. In the interest of “research” for this article, I removed the power cord from the back of his monitor. The first thing Bob did was the old “three-fingered salute” (Ctrl-Alt-Del) to wake up his monitor. When that didn’t work, he scratched his head for a second or two, then finally pressed the reset button. His computer rebooted, but there was still no video. Finally, he started checking the connections and found the unplugged power cord.
Aside from the entertainment value, the point of this “research” was to illustrate that Bob failed to look for the obvious. We all do it. We just jump right in and never take the time to examine the system before trying to fix it.
Ask questions
During this initial phase of troubleshooting, you should be asking questions. The questions you ask depend on the system you are troubleshooting. Most of you have probably heard or asked these questions many times. The more pertinent your questions, however, the more likely you are to define the problem quickly and accurately. For example, if a user calls up and says that he or she cannot print to a network printer, you should ask the following questions:
- Can the user print to a different printer?
- Can other users print to this printer?
- Does the user receive any error messages?
- When was the last time the user could print?
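The first two questions in that list already cut the problem into quarters. The sketch below shows one way to express that narrowing in Python. The function name and the fault areas it returns are my own assumptions for illustration, not a definitive diagnostic table.

```python
def narrow_printer_fault(user_prints_elsewhere, others_print_here):
    """Map the answers to the first two questions onto a likely fault area.

    The four fault areas are illustrative assumptions, not a complete diagnosis.
    """
    if user_prints_elsewhere and others_print_here:
        return "this user's connection or settings for that one printer"
    if user_prints_elsewhere and not others_print_here:
        return "the printer itself or its print queue"
    if not user_prints_elsewhere and others_print_here:
        return "the user's workstation or account"
    return "something shared, such as the network or the print server"

# The user can print elsewhere, but nobody can print to this printer.
print(narrow_printer_fault(user_prints_elsewhere=True, others_print_here=False))
```

The error message and the date of the last successful print job then help you choose among the suspects inside whichever area remains.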
Examining the outputs
The next step involves figuring out which outputs of the system are outside the realm of “normal” operation. This step is important because most users don’t know whether components and their outputs are functioning correctly.
For example, a user will tell you that they cannot see their “P:” drive. It’s your job to figure out whether the cause is a network problem, a permissions problem, an administrator error, and so on.
We define the problem by examining various known component outputs. Can the components see other network resources? Is the output of the NIC correct? Can other users see the drive? Is the output of the hub correct? Did the logon script change? Is the output of your fellow administrators correct?
Playing the odds
One of the easiest lessons learned in troubleshooting is to play the odds. Certain system components are more likely to fail than others. Usually, these components receive the most interactions. In mechanical terms, that means moving parts wear out. In electronic terms, it means that cables don’t go bad in the middle. The ends of a cable receive more wear and tear, and are therefore more likely to fail.
Think of a computer system in terms of inputs, processes, and outputs. By far, the most common cause of problems in computer systems is user (or administrator) interaction. As the old saying goes, “Garbage in, garbage out.” By looking at the inputs of a system, we are simply playing the odds.
The same idea holds true for non-moving parts: Those with the most interaction have a higher probability of causing problems. If your TCP/IP stack doesn’t stack up, you don’t start by looking at the NIC; you look at the configuration.
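Playing the odds amounts to checking the most likely suspects first. Here is a minimal Python sketch of that ordering. The suspects and their likelihood figures are made-up assumptions; in practice they would come from your own experience with the system.

```python
# Hypothetical suspects with a rough estimate of how often each is at fault.
suspects = [
    ("NIC hardware", 0.05),
    ("TCP/IP configuration", 0.60),
    ("cable ends and connectors", 0.25),
    ("middle of the cable run", 0.02),
]

def check_order(candidates):
    """Return the suspects sorted from most likely to least likely."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)

for name, likelihood in check_order(suspects):
    print(f"{name}: estimated likelihood {likelihood:.0%}")
```

With these assumed numbers, the configuration is checked first and the middle of the cable last, which matches the rule of thumb above.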
Identify the weak links
Not all components are created equal. When you start putting components together to create a system, the lesser components are more likely to fail than their industrial-strength brethren. Experience with specific systems helps us to identify those weak links. Over time, we begin to recognize the symptoms created by them and can easily identify the root problem.
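If you keep even a simple record of past root causes, that experience can be tallied rather than remembered. The Python sketch below assumes a hypothetical list of resolved incidents; the entries are invented for illustration.

```python
from collections import Counter

# Hypothetical log of past root causes, one entry per resolved incident.
past_root_causes = [
    "loose cable", "failed fan", "loose cable", "bad configuration",
    "loose cable", "failed disk", "bad configuration",
]

def weak_links(history, top=3):
    """Return the most frequent root causes, most common first."""
    return Counter(history).most_common(top)

for cause, count in weak_links(past_root_causes):
    print(f"{cause}: {count} incident(s)")
```

The causes at the top of the tally are the weak links worth checking early the next time similar symptoms appear.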
Stacking the deck
Every year, casinos rake in obscene amounts of money by stacking the odds against you. The members of the Super Geek Club are no different. They don’t win the IT game by blind luck; they play the odds. In order to win the troubleshooting game consistently, they have developed processes and methodologies like the ones we’ve discussed to stack the odds in their favor.
Handling the tough problems
Anybody can fix the easy problems. How do you handle the tough ones? In part four of this series, I will reveal the secret mathematical equation used by super geeks to solve even the most difficult of problems in record time. Stay tuned!
Mike Sullivan is a senior systems engineer with Merge Computer Group, Inc., a Richmond, VA, consulting firm. His credentials include MCSE+I, MCDBA, MCT, and 19 years of IT experience. He is a full-time consultant who occasionally takes time off from his clients to teach.