I like to think of network troubleshooting as a funnel. At the top, the widest part, are the symptoms of the problem, along with the vast number of potential causes and related issues. At the bottom is the specific solution to that particular problem. Troubleshooting is essentially the process of filtering information and matching symptoms to solutions. Here are some pointers for getting to the bottom of the funnel with the least amount of effort.
Keep an open mind
Each of us has an area of networking where we feel most comfortable working. If we love hardware, we may spend too much time focusing on cables and switches when the real problem is user permissions. If we're software junkies, we could be embarrassed to discover the loose cable after two hours of pointing and clicking and issuing commands. Although it may be a stretch, try to conduct a broadly focused assessment of the problems you encounter to make sure you don't overlook the culprit.
Target the time when things went wrong
Don’t forget that in most cases, there was a time not too long ago when everything was working properly. Pinpointing the time or event just before the failure took place allows you to rule out symptoms that aren’t relevant to the problem and refocus on the ones that are.
Go to the board
A large whiteboard is very helpful for troubleshooting tasks. You can start by writing down the problems and symptoms, and as you narrow your focus, you can erase or cross out extraneous data so that you have only the relevant issues in front of you. The board also helps illustrate for your associates the problem at hand.
Recently, I was in the middle of noting on our whiteboard that pinging outside the network had failed. After reading this note, one of my associates came in to tell me that the DNS server in our home office was experiencing intermittent failures. Working out a problem on a whiteboard facilitates communication and ensures that everyone involved knows what the problem is. In this case, once the problem was correctly defined, the troubleshooting was finished.
Always check hardware first
When you begin to investigate a problem, the first step is to sift through the symptoms to decide whether they are primarily hardware- or software-related. Many of us ignore the cardinal rule of checking hardware first for some of the following reasons:
- It seems too simple.
- Hardware is often hidden in places we don’t feel like exploring.
- It’s much easier to start pointing and clicking or issuing commands than it is to get down on your hands and knees or to start climbing around and fiddling with machines.
- A hardware issue may involve higher costs and take more time to fix and therefore may be something you want to consider as a last resort.
Nevertheless, hardware is the easiest source to rule out, so network troubleshooting should begin with the following steps:
- Check cables and their connections to devices.
- Run hardware diagnostic tests to see if they pinpoint any failures.
- Try rebooting the server, router, or hardware device, if possible. A reboot solves a myriad of problems, from locked keyboards to routing issues.
- Try to reproduce the problem on another machine or on a test network.
Troubleshoot the software
If all the hardware looks good, it’s time to delve into the software. When you’re talking about networking, software problems can usually be narrowed down into three categories. Let's look at each one.
If a network application is locking up or malfunctioning, you must determine what task is causing the failure. If it is an application you’ve recently upgraded (or placed on an upgraded OS), you may have to surf to the application vendor’s Web site to see if there is a patch.
When a user is having trouble with the browser, an application, or network connectivity, attempting to reproduce the problem with another user or machine will narrow the problem. This applies equally to desktops and servers when dealing with network issues.
You also have to be on the lookout for configuration errors. For example, Windows 2000 has very powerful local security policy and group policy implementations that can turn into a nightmare when misconfigured by a novice. It is also important to determine whether the problem is user- or computer-related.
If a Windows NT4 or Windows 2000 domain is in place, checking logon issues with the domain controller will be a critical step in ruling out logon problems. The Windows Event Viewer is a great tool for finding out exactly what the generated errors mean. If you have an Event ID number, you can go to Microsoft’s Knowledge Base and type the Event ID into the search engine to view a list of symptoms, causes, and solutions related to that problem. This is more than theory: it is an excellent problem-solving tool that can make a network administrator very effective. I use the Knowledge Base at least once a month. If you’re working with other operating systems, check the log files manually or use a log file reader for the platform.
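For platforms without a dedicated log reader, checking a log file manually can be partly automated. The following is a minimal sketch, not a standard tool: the keyword pattern and the function name `suspicious_lines` are assumptions chosen for illustration, and real logs will likely need a pattern tuned to the services involved.

```python
import re

# Assumed keyword list; adjust it to the messages your services emit.
PATTERN = re.compile(r"\b(error\w*|fail\w*|denied|timeout)\b", re.IGNORECASE)


def suspicious_lines(lines, pattern=PATTERN):
    """Return (line_number, text) pairs for lines that match the pattern.

    `lines` can be any iterable of strings, such as an open text file.
    Line numbers start at 1 so they match what a text editor shows.
    """
    matches = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            matches.append((number, line.rstrip("\n")))
    return matches
```

To scan a real file, pass an open file object, for example `suspicious_lines(open(path))`, where `path` is whatever log location your platform uses.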
When there is a connectivity failure, the first thing you should try is the ipconfig /all command on a Windows machine (ifconfig, if you’re on a Linux box). If your machine has a proper IP address, subnet mask, and gateway, move on to the ping process. First, ping localhost or 127.0.0.1 (at the command prompt). Next, ping the IP address of the problem machine to ensure that your network adapter card is functional.
If neither of these pings results in a connection, you need not go any further with the ping process until you make sure your TCP/IP configurations are bound to your network adapter card and that your network adapter is working properly.
But if these two pings were positive, you should ping your default gateway to confirm connectivity to it. If that works, ping an IP address beyond the gateway (somewhere on your WAN or on the Internet). If that works, double-check connectivity by pinging the FQDN of a server on the Internet, such as yahoo.com, to make sure that DNS is working.
If all of these pings work, and you still can’t get your Internet or WAN connection, it’s time to check software configurations.
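The ping sequence above can be sketched as a short script that stops at the first failing step. This is an illustration under stated assumptions rather than a finished tool: the addresses in `STEPS` are hypothetical placeholders, the hint texts are my own summaries, and the script relies on the system `ping` command's exit code, which is not equally reliable on every platform.

```python
import platform
import subprocess


def ping(host, timeout_s=5):
    """Send one echo request with the system ping command; True on exit code 0."""
    # Windows ping takes -n for the request count; Linux and macOS take -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ping is missing or did not finish in time: treat as a failure.
        return False
    return result.returncode == 0


def first_failure(steps, probe=ping):
    """Run the checks in order and return (name, hint) for the first failure.

    Returns None when every check succeeds.
    """
    for name, host, hint in steps:
        if not probe(host):
            return name, hint
    return None


# Hypothetical targets: replace the example addresses with your own.
STEPS = [
    ("loopback", "127.0.0.1", "check that TCP/IP is installed and bound"),
    ("machine address", "192.168.1.20", "check the network adapter"),
    ("default gateway", "192.168.1.1", "check the local link to the gateway"),
    ("beyond gateway", "203.0.113.10", "check routing past the gateway"),
    ("name lookup", "yahoo.com", "check DNS settings"),
]
```

Calling `first_failure(STEPS)` returns `None` when every ping succeeds, which is the point where the text suggests moving on to software configuration. The `probe` parameter exists so the ordering logic can be exercised without touching the network.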
Sometimes you run into problems that you have no local control over. Leased lines and WAN connections go down. Remote offices go offline because of router errors at that site. If you can't find any problems with your hardware and software, and you think that there may be a WAN issue, you’re probably going to have to call your telecom vendor (if it’s an Internet connectivity issue) or a counterpart at another office (if it’s a WAN issue with a remote office).
Here are some additional suggestions:
- Get the problem narrowed down to a manageable size and scope.
- Share your ideas and thought processes with others in your department.
- Try not to troubleshoot in front of an employee or client. If possible, go into an office, close the door, and turn on the answering machine or voice mail. It is much easier to focus on the problem without an audience to distract you.
- Ask for help! Whether you use the Web or call another technician, you may find someone who has had the same problem. This may be the perfect time to solicit some advice.
- Walk away from the problem once in a while and sort out everything in your head. This gives you a chance to review the basics and explore other possible solutions that you may have overlooked initially.