“When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”
Your network was working, but it's not working now, right? I'm asking because you need to make sure you're in the right place, column-wise. This Daily Drill Down is about troubleshooting, not configuration. If your network never worked—not ever—then you're probably looking at a network configuration problem, which isn't the subject of this Daily Drill Down.
So what's troubleshooting? Troubleshooting is what you do when a successfully functioning network stops working. For example, suppose you've set up an NFS server to export directories to a client system. Due to a power outage, the server and the client lose power and, when power is restored, both machines reboot. After you log in to the client and attempt to access your remote directory, you find that it's not available. What happened? It used to work. It's not working now!
To solve the problem, you'll need to do some troubleshooting—a term that refers generally to the process of defining, isolating, and solving a technical problem. Here's a troubleshooting formula that's based on long, sometimes painful experience:
- Check the connections. Almost all network problems can be traced, ultimately, to connection problems. It makes sense to start by checking to see whether everything's plugged in properly.
- Try restarting the affected systems. If the problem is attributable to a software glitch, this may fix it. You'll never know why, exactly, but what the heck—the network's working again.
- Piece together what happened and where the problem occurred. Perhaps some sort of identifiable event separates the horrible now, when the network does not work, from the wonderful then, when it did. What's the difference? For example, in the NFS problem just mentioned, both machines were restarted. Hmmm! Could that have something to do with this?
- Get your black belt in ping. Of all the network utilities you can use to figure out what's wrong, ping is by far the most useful. Once you learn how to use ping systematically, you'll almost always be able to track down the problem and solve it.
A caveat before proceeding: In this Daily Drill Down, I’ll be talking about relatively simple networks that don't have internal bridges, routers, or subnet divisions. More complex networks, such as those with multiple subnets, may experience subtle, hard-to-tackle problems related to routing issues, which are beyond this Daily Drill Down’s scope.
Troubleshooting requires a Zen-like frame of mind, characterized by calmness and orderliness of thought—which is too bad, considering that network failures are much more likely to cause pandemonium than peaceful reflection. Users are yelling at you. The boss is glowering. Deadlines are looming. Panic is beginning to set in.
But you must pull yourself together. Just remember: You will be able to solve the problem. After all, the network did work before. The question is, why isn't it working now? There is a reason. You will find it. Take heart.
Whatever brought your network down is likely to happen again. Will you remember what happened and how you fixed it? Keep a log of the symptoms you're seeing, the error messages you receive, the tests you perform, and what you do to cure the problem.
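One lightweight way to keep such a log is a plain text file with one timestamped line per observation. The following is only a sketch of that idea in Python; the function names, the four entry kinds, and the line layout are my own assumptions for illustration, not a standard format.

```python
from datetime import datetime

# Hypothetical entry kinds, mirroring the four things worth recording.
ENTRY_KINDS = ("symptom", "error", "test", "fix")

def format_log_entry(when, kind, text):
    """Return one log line: timestamp, entry kind, free-form text."""
    if kind not in ENTRY_KINDS:
        raise ValueError(f"kind must be one of {ENTRY_KINDS}")
    return f"{when:%Y-%m-%d %H:%M} [{kind}] {text}"

def append_entry(path, kind, text, when=None):
    """Append one entry to the log file at `path` (created if missing)."""
    line = format_log_entry(when or datetime.now(), kind, text)
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line
```

A paper notebook works just as well; what matters is that the record exists the next time the same failure shows up.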
OK, ready? Remember: Most problems that pop up in a previously functioning network are physical in nature. Take a few minutes and check out these likely possibilities:
- Are your network interface cards properly seated in their slots? Upward pressure from the network cable could eventually pry the card slightly out of its slot, especially if you neglected to attach the card firmly to the computer's case with the provided screw.
- Check the LED lights on your network cards. A green light on your Ethernet card usually means that it is properly connected to your computer and the rest of your network. If it flashes when you send or receive data on your network, that’s another good sign. Make the same checks on your hub, cable modem, or DSL adapter. While you're at it, make sure these gizmos are plugged in and getting the power they need.
- Swap out cables to make sure they're working. On occasion, I've seen an Ethernet cable go bad out of the blue, and I couldn't restore network functionality until I replaced it. Bear in mind, too, that you can see the little green LED lights on both ends of the connection even if one or more of the wires is no longer conducting current. Be especially suspicious of cables that negotiate sharp turns or bends.
- Is someone running a hefty electrical motor right next to an Ethernet cable? The electromagnetic field could play havoc with network signals.
- Devices such as hubs and ISDN or DSL gateway devices often come with AC power adapters. Are they plugged in? Are the adapters functioning?
- If you're having a problem connecting to the Internet, make sure the problem is really on your end. Call your ISP; perhaps they're having a service outage at the backbone level. A couple of weeks ago, the T3 line running into my hometown went down and nobody could get to any Internet site outside the local area. If you connect via modem, make sure there's a dial tone. Once again, check all the physical connections: Is the phone cord plugged in securely?
Try a good, healthy reboot
Can't find anything wrong with the physical connections? Perhaps there's a software glitch. Try restarting all the affected systems. Often, this cures the problem.
Figure out what just happened
If the network's still down after you've checked the physical connections and restarted the affected systems, try to piece together what has changed since the network last worked properly. Did you install a new workstation? Did you restart the server? Was there a power outage? Any of these events could precipitate a network failure in some way.
Talk to users. Did they change their system configuration in any way? Install new software? They didn't pick a new IP address out of the blue, did they, because the new one sounds “luckier”? If the IP address they picked duplicates one that's already in use on your network, you've found the problem.
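If you keep a list of which machine is supposed to use which address, spotting a duplicate is easy to automate. Here is a minimal sketch, assuming you have collected (hostname, IP address) pairs by hand, for example by running ifconfig on each machine. The function name and the sample hostnames and addresses are hypothetical.

```python
from collections import defaultdict

def find_duplicate_ips(assignments):
    """Return {ip: [hostnames]} for every IP claimed by more than one host.

    `assignments` is an iterable of (hostname, ip_address) pairs.
    """
    hosts_by_ip = defaultdict(list)
    for hostname, ip in assignments:
        hosts_by_ip[ip.strip()].append(hostname)
    return {ip: hosts for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}

if __name__ == "__main__":
    # Made-up inventory: "gamma" has picked the address "alpha" already uses.
    inventory = [
        ("alpha", "192.168.1.10"),
        ("beta", "192.168.1.11"),
        ("gamma", "192.168.1.10"),
    ]
    print(find_duplicate_ips(inventory))  # {'192.168.1.10': ['alpha', 'gamma']}
```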
As you're considering what might have happened, think outside the technical box. Human behavior might be involved. For example, my son had his friends over. They brought their Ethernet-equipped laptops for something they call a “LAN party.” There weren't enough ports, so they unplugged my office subnet without telling me (“C'mon, Dad, it was 2:00 A.M. You were asleep!”). I’m shocked to tell you that they lacked the consideration to plug me back in after they were finished, and the next morning, the network was down. Of course, I assumed that the problem was some deeply mysterious, technical problem that would require a Ph.D. in network engineering to solve—until I checked the connections.
Figure out where it happened
To try to determine where the problem happened, you'll need to ask these questions:
- Does the problem occur with just one application, or does it affect all network applications on this system? If it's just one application, the problem probably lies in a misconfigured server or client.
- Does the problem occur with just one machine, or are all the computers on the network affected? If it's just one computer, the problem almost certainly lies with that machine (and not the network). Try restarting the system and see if the problem persists.
- What are the relevant error messages? To see the kernel's message log, switch to superuser if necessary, type dmesg | less, and press [Enter]. Do this for all the affected systems.
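The output of dmesg can run to hundreds of lines, most of them unrelated to networking. The sketch below narrows it down by keeping only the lines that mention a keyword. The default keyword list is my own guess at useful terms, not an official set, so adjust it to match your hardware and services; the function names are hypothetical too.

```python
import subprocess

def network_lines(dmesg_text, keywords=("eth", "link", "nfs", "network")):
    """Return the lines of `dmesg_text` that mention any keyword.

    Matching is a case-insensitive substring test, so "eth" also matches
    interface names such as eth0.
    """
    wanted = [k.lower() for k in keywords]
    return [line for line in dmesg_text.splitlines()
            if any(k in line.lower() for k in wanted)]

def read_dmesg():
    """Run dmesg and return its output as text (may require superuser)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `network_lines(read_dmesg())` on each affected system gives you a short list to compare from machine to machine.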
Consider the NFS example mentioned at the beginning of this Daily Drill Down:
- What just happened? A power outage that forced all the systems to restart. The problem had something to do with all the systems restarting at the same time.
- Which applications are affected? Only NFS is affected. Everything else works fine. The problem is limited to NFS, obviously.
- Where does the problem occur? The problem occurs with all the NFS clients. So there must be something haywire with the server. But it's running!
- What are the error messages? When the clients start, they complain that they can't find the NFS server and can't mount the server's exported directories. The server's not available, clearly, when the clients start.
What's wrong with this darned NFS server? Aha, light dawns. I'm serving NFS using an old, 166-MHz Pentium that I've equipped with a monstrous, inexpensive hard disk (26 GB). So the system on which the NFS server is running starts a lot slower than the speedy Pentium II clients. The clients come up lickety-split. They try to mount the directories that the server is exporting, but the pokey ol' NFS server hasn't come up yet. When it does, it starts running happily, but the clients have given up in disgust and gone on to better things.
Don't forget to verify your theory. I did so by restarting the server first. Then I restarted the clients. And you know what? Everything worked fine. Bingo! Problem successfully identified. The permanent solution? Increase the clients' LILO loader delay so that they don't start booting until the NFS server is available. End of problem!
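A fixed boot delay works, but it depends on guessing how long the server needs. A different technique, and not the one I used, is to have each client keep retrying until the server answers. The sketch below shows only the retry logic. The name `wait_until` and its parameters are hypothetical, and `check` stands for whatever test you trust, such as a successful ping of the server or a successful mount.

```python
import time

def wait_until(check, attempts=10, delay_seconds=30, sleep=time.sleep):
    """Call `check()` up to `attempts` times, pausing between tries.

    Returns the 1-based number of the attempt that succeeded, or 0 if
    every attempt failed. `check` is any zero-argument callable that
    returns True once the server is ready.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            # No point in sleeping after the final failed attempt.
            sleep(delay_seconds)
    return 0
```

The `sleep` parameter exists only so that the pause can be replaced during testing; in normal use, leave it at its default.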
Get your black belt in ping
Still stuck? You've checked the physical connections. Everything's fine. You've asked what may have happened since the network was last functioning perfectly. Nothing has happened—at least, nothing has happened that you know of.
It's time to get serious with ping. Among network analysis tools, this one is the heavyweight. As you probably already know, ping can tell you whether you can reach another TCP/IP-enabled connection on your network, including other computers, your gateway device, and network-capable printers. What you probably don't know is that it's possible to use this utility in a systematic way—and once you do, you can almost always isolate the source of a network problem. Once you've isolated the source, you're on your way to a solution.
Understanding ping's messages
Before you get started with ping, take a moment to learn what the utility's messages mean:
- Network unreachable: The local system cannot find a route to the system with which you're trying to connect. Most of the time, this problem is due to a faulty connection.
- Unknown host: This message means that ping was unable to resolve the domain name you typed into an IP address. Something's haywire with your domain name server (DNS [Domain Name System] server).
- 100% packet loss: This message indicates that ping was able to resolve the domain name you typed. What's more, it was able to find a route to the remote system. However, the remote system isn't responding.
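If you capture ping's output in a script, the three messages above can be turned into a quick diagnosis. This is a rough sketch: the exact wording of ping's messages varies between versions, so the substrings tested here are assumptions based on the messages as described in this Daily Drill Down, and the function name is hypothetical.

```python
NO_ROUTE = "no route to the remote system: check the connections"
NO_NAME = "name could not be resolved: check DNS or /etc/hosts"
NO_REPLY = "route found, but the remote system is not responding"
UNKNOWN = "no recognized error message"

def diagnose_ping_output(output):
    """Map captured ping output to one of the diagnoses above."""
    text = output.lower()
    if "network is unreachable" in text or "network unreachable" in text:
        return NO_ROUTE
    if "unknown host" in text:
        return NO_NAME
    if "100% packet loss" in text:
        return NO_REPLY
    return UNKNOWN
```

The order of the tests follows the order in which things can go wrong: first the route, then the name lookup, then the reply.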
Using ping systematically
Now that you understand ping's error messages, use the following step-by-step procedure on each of the affected systems:
- ping 127.0.0.1: The 127.0.0.1 address is known as the loopback address. It provides a way for a TCP/IP-enabled computer to send messages to itself. If you can ping your loopback address, TCP/IP is installed, but it's not necessarily configured correctly. If you can't ping the loopback address, it's bad news. Somehow, your TCP/IP configuration is messed up, or the underlying TCP/IP software has crashed. Try restarting the system and launch this command again.
- ping your_ip_address: If you don't know the current system's IP address, use ifconfig to find out (open a terminal window, switch to superuser, type ifconfig, and press [Enter]). If this ping is successful, your IP address is properly associated with your network card. If it isn't, there's something wrong with this system's Ethernet interface. Again, try restarting the system; if this doesn't fix the problem, type ifconfig again, and check to make sure that the Ethernet interface is up and running (you should see a message that includes these words). If not, reconfigure the Ethernet interface.
- ping your_computer_name: If you don't know your computer hostname, check your /etc/hosts or /etc/HOSTNAME file (open a terminal window, type cat /etc/hosts or cat /etc/HOSTNAME, and press [Enter]). If this command fails, there's a problem with your domain name server (DNS server) configuration. Most small networks rely on an ISP's DNS server, which means that DNS services aren't available when the network isn't connected to the Internet. To fix this problem, make sure that all of the networked systems have an /etc/hosts file that lists all available IP addresses, hostnames, and aliases that are used on the network. Make sure that these files contain no errors! If you're running a DNS server on your network, make sure that it's running correctly; restart it, and try this command again.
- ping other_ip_address: Now try reaching other systems on your network by typing their numerical IP addresses, one by one. Are you able to access some computers, but not others? The problem could be caused by a poor connection. Another possibility: There are two computers on your network that are trying to use the same IP address. Check the configuration of the computer you can't access, and make sure it's using a unique IP address.
- ping other_computer_name: Repeat the previous step, but use hostnames instead of numerical IP addresses. If there’s a problem here, make sure that the current system's /etc/hosts file matches the hostnames and IP addresses of the other computers on your network. While you're at it, check all the other /etc/hosts files to make sure they're all perfectly identical. If you have lots of computers on your network, consider running a DNS server so that you won’t have to maintain /etc/hosts files manually.
- ping gateway_device_address: Use the numerical IP address of the gateway device that's used to connect your network to the Internet. If you can't access this device, then that's where the problem lies. Try restarting the device, and check the connections again.
- ping dns_server_ip_address: Use the IP address of your ISP's DNS server. You should have recorded this in your /etc/resolv.conf file. If you cannot access this server by means of its numerical IP address, report this fact to your ISP. If a DNS server is down, the ISP's other customers will also have problems navigating to Web sites on the Internet. Chances are that you'll be told that they're working on the problem.
- ping internet_website: Substitute the name of some popular Internet site such as www.yahoo.com (type the hostname only, without the http:// prefix). Since you've already pinged your gateway device and your ISP's DNS server by IP address, this request should reach the DNS server for name resolution and then travel on to the Web site. If you can access the site by means of an IP address but not the domain name, the problem lies with your ISP's DNS server (or your local DNS server, if you're running one).
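The procedure above lends itself to a small script that walks the targets in order and stops at the first one that fails. This is a minimal sketch rather than a finished tool: it assumes a Linux-style ping that accepts -c (count) and -W (reply timeout in seconds), and every address and hostname in `EXAMPLE_STEPS` is a placeholder that you must replace with values from your own network.

```python
import subprocess

def ping_once(target, timeout_seconds=5):
    """Return True if one echo request to `target` gets a reply.

    Assumes a Linux-style ping supporting -c and -W.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_seconds), target],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except OSError:
        # The ping program itself could not be started.
        return False
    return result.returncode == 0

def first_failure(steps, probe=ping_once):
    """Probe each (label, target) pair in order.

    Returns the label of the first target that fails, or None if every
    target answers.
    """
    for label, target in steps:
        if not probe(target):
            return label
    return None

# Placeholder targets only; substitute your own addresses and names.
EXAMPLE_STEPS = [
    ("loopback", "127.0.0.1"),
    ("own IP address", "192.168.1.10"),
    ("own hostname", "alpha"),
    ("other system by IP", "192.168.1.11"),
    ("other system by name", "beta"),
    ("gateway device", "192.168.1.1"),
]
```

Because the steps run from the local system outward, the label returned by `first_failure(EXAMPLE_STEPS)` tells you roughly where to start looking, and everything before it in the list has already been ruled out.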
Still having problems?
The problem-solving strategies outlined in this Daily Drill Down should suffice to cure most network problems, but not all of them. If you're still having trouble, consider the following:
- Hardware failure: Network interface cards, Ethernet hubs, AC power adapters, gateway devices, and other critical components can and do fail sometimes. Try to isolate the problem using ping and, if necessary, swap out the device with one that's known to work.
- Tampering: Did someone decide to try out some newfound (and unpolished) network configuration skills without telling you?
- Unauthorized access: This one isn't fun to think about, but it's a very real possibility, especially if you notice that your network seems to be logging on to the Internet when no one appears to be using it, and some of the systems seem to be running at a high percentage of CPU load even though they're unattended. In a future Daily Drill Down, you'll learn how to discern the telltale signs of such intrusions.
Chances are very good that you'll have identified and solved the problem by now, and that's great. Take a moment to reflect on the experience and draw some lessons from it. In particular, remember that the only real defense against system and network failure is a sound, well-administered data backup system that's run regularly and systematically. If you haven't gotten started with your backup system, do it now, before the next glitch brings your network down.
Bryan Pfaffenberger, a UNIX user since 1985, is a University of Virginia professor, an author, and a passionate advocate of Linux and open source software. A Linux Journal columnist, his recent Linux-related books include Linux Clearly Explained (Morgan-Kaufmann) and Mastering Gnome (Sybex; in press). His hobbies include messing around with his home LAN and sailing the southern Chesapeake Bay. He lives in Charlottesville, VA. If you’d like to contact Bryan, send him an e-mail.
The authors and editors have taken care in preparation of the content contained herein but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for any damages. Always have a verified backup before making any changes.