“The server is down.” Those four words can strike fear and despair in the hearts of users and network administrators alike. For the former, it means they won’t be able to perform network tasks—run mission-critical network applications, access files, and send jokes to everyone on their e-mail list. For you, it means a deluge of phone calls from management, working through lunch (and maybe dinner too), pulling your hair out, and wondering why you went into this line of work in the first place.
Take heart. Before you sit down to type your resignation, read this Daily Drill Down. It’ll provide a strategic plan for troubleshooting server problems and getting your server up and running as quickly as possible.
Is it your fault?
You know the story: One day the network is humming along, no problems, everyone has connectivity, and then bam! You come to work one morning and it’s all gone haywire. One or more critical servers are not accessible. What happened? Why do good servers go bad? Sometimes it’s simple: “parental” neglect. Do you ignore your server when it’s functioning properly, rationalizing that your time is better spent on other things? Just as good parents keep an eye on their kids even when there are no apparent problems, good admins continuously monitor their servers to catch potential problems early. That means performing routine maintenance such as defragmenting disks, applying operating system hot fixes and service packs, and regularly reviewing event logs for error or information messages.
Classifying server problems
Server problems can generally be broken down into a few categories:
- Hardware problems
- Operating system configuration problems
- Application/services-related problems
It’s important to determine which category your problem fits into; otherwise, you can waste hours of troubleshooting time searching for a needle in the wrong haystack. What seems like an OS problem (STOP errors or unexpected reboots) may actually be hardware problems caused by faulty memory or a dying power supply.
Ten troubleshooting tips
Depending on the source of your problem, getting your server back up and running may or may not be quick and simple. These ten troubleshooting tips will give you a starting point for tracking down the culprit responsible for your server’s demise, and address several of the most common server problems.
1. Develop a standardized troubleshooting routine
To identify and correct server problems most efficiently, you should have a standardized troubleshooting procedure that you follow each time something goes wrong. Troubleshooting server problems is generally more complicated than troubleshooting problems with client machines, because the operating system itself is more complex and because of all the services that run on a server machine.
Use checklists and forms (see sample here) to guide the troubleshooting process. This will prevent you from leaving out essential steps or overlooking the obvious.
Download our TPG server troubleshooting checklist
Plaster this checklist all over your server room so your entire staff has a step-by-step process to guide them through the server troubleshooting process.
2. Start at the bottom: Troubleshooting physical layer problems
The first step in troubleshooting server problems is to determine whether the physical layer is functioning properly. This includes the server computer hardware, any attached peripherals, and the cabling. Switching out network cards or cables will help you identify whether the hardware is at fault. For testing long runs of cable that can’t be easily switched, use a cable tester or multimeter.
If you have recently added new hardware to the server, make sure that it’s compatible with the operating system.
Tip
If you’re running a Microsoft server operating system, check the hardware compatibility list on the Microsoft Web site.
You should also make sure that you have the correct drivers installed for your new hardware. Check the hardware vendor’s Web site for updated drivers.
3. Traffic control: Troubleshooting addressing and routing problems
Once you’ve eliminated hardware as a suspect, you should check the operating system configuration. The typical business network runs on TCP/IP, and incorrect TCP/IP settings can result in addressing and routing problems.
Make sure the server’s TCP/IP settings are correct. TCP/IP settings have been known to mysteriously change, for example, when you install an application. The most common situation is that the server’s TCP/IP configuration is reset to make the computer a DHCP client. In most cases, servers should have static IP addresses. You may not notice the problem with a file or print server, but if your DNS server or domain controller’s configuration is changed, your clients will experience problems.
Figure A illustrates the TCP/IP settings on a Windows .NET server.
Figure A: Make sure that all TCP/IP settings on the server are correct when you have a network connectivity problem.
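One way to catch settings that have "mysteriously changed" is to compare the server's current configuration against the values you documented when you built it. The following is a minimal sketch of that comparison, not a finished tool. The `BASELINE` dictionary, its keys, and all of the addresses are hypothetical, and gathering the current values from the server is left to you.

```python
# Hypothetical documented baseline for one server; every value is illustrative.
BASELINE = {
    "dhcp_enabled": False,
    "ip_address": "192.168.1.10",
    "subnet_mask": "255.255.255.0",
    "default_gateway": "192.168.1.1",
    "dns_servers": ("192.168.1.2", "192.168.1.3"),
}


def find_drift(baseline, current):
    """Return {setting: (expected, actual)} for every setting that differs."""
    drift = {}
    for key, expected in baseline.items():
        actual = current.get(key)
        if actual != expected:
            drift[key] = (expected, actual)
    return drift


if __name__ == "__main__":
    # Simulates the common failure: the server was switched to DHCP
    # and picked up a leased address instead of its static one.
    current = dict(BASELINE, dhcp_enabled=True, ip_address="192.168.1.147")
    for setting, (expected, actual) in find_drift(BASELINE, current).items():
        print(f"{setting}: expected {expected!r}, found {actual!r}")
```

An empty result means the server still matches its documentation. Anything else tells you exactly which setting to fix.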
4. Identification required: Troubleshooting name resolution problems
What’s in a name? A great deal, when computers communicate using names, which must be resolved to IP addresses. If you have an apparent connectivity problem, always try pinging another system by both name and IP address. If your server is “out of touch” when attempting to communicate by host name but is able to connect to other computers by IP address, check its DNS settings, WINS settings, and/or HOSTS and LMHOSTS files.
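The name-versus-address comparison can be sketched in a few lines of Python. This is only an illustration of the reasoning: `resolve` asks the operating system's resolver for an address, and `classify` is a hypothetical helper that combines that answer with the result of your own ping by IP address. The address you expect the name to map to is something you supply from your documentation.

```python
import socket


def resolve(hostname):
    """Return the IPv4 address the local resolver gives for hostname, or None."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def classify(resolved_ip, expected_ip, reachable_by_ip):
    """Suggest where to look next, given the name lookup and the ping-by-IP result."""
    if not reachable_by_ip:
        return "unreachable by IP: check hardware, cabling, and TCP/IP settings first"
    if resolved_ip is None:
        return "name does not resolve: check DNS, WINS, HOSTS, and LMHOSTS"
    if resolved_ip != expected_ip:
        return "name resolves to a different address: check DNS records and HOSTS entries"
    return "name resolution looks fine"
```

For example, `classify(resolve("fileserver"), "192.168.1.10", True)` points you at the name resolution settings when the hypothetical host `fileserver` answers by address but not by name.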
5. Trouble at the top: Troubleshooting application problems
The applications that get installed on servers are different from the productivity applications that are typically installed on workstations. Server applications tend to be those that allow you to manage, maintain, or monitor the server (such as disk utilities and network monitoring or “sniffer” programs), or those that add another server service, such as proxy server software installed on top of the server operating system.
Compatibility with the underlying operating system is essential. Programs such as Microsoft’s Certified for Windows 2000 program give you a way to evaluate the compatibility of server applications.
On a Windows server, check the application log (Start | Programs | Administrative Tools | Event Viewer) for application-related errors.
Tip
The event logs are always a good place to start in troubleshooting server problems. In addition to the application logs, Windows NT/2000/.NET servers provide the system and security logs and additional logs (such as DNS and Directory Services) when specific server services are installed. The Error, Warning, and Information messages in the logs can be your best clues in determining what went wrong.
6. Paper chase: Troubleshooting print server problems
The source of printing problems can be at any one of the following levels:
- The printing device attached to the print server computer
- The configuration of the print server
- The physical connections between printer and server or server and network
- The configuration of the client
Your first step in troubleshooting print server problems is to make sure that the printing device is working and all connections are secure. Make sure the correct printer drivers are installed. Check that the print spooler service is running and that there’s plenty of disk space on the server for spooling. Check the permissions set on the printer. If your network is a Windows 2000 or .NET domain, check group policy settings for printers.
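The disk space check is easy to automate. Here is a minimal sketch using Python's standard `shutil.disk_usage`. The 500 MB threshold is an arbitrary assumption, so pick a figure that suits your print volume, and pass in the path of whatever volume actually holds your spool folder.

```python
import shutil


def spool_space_ok(spool_path, min_free_mb=500):
    """Return (ok, free_mb) for the volume that holds spool_path.

    min_free_mb is an illustrative threshold, not a recommended value.
    """
    usage = shutil.disk_usage(spool_path)
    free_mb = usage.free / (1024 * 1024)
    return free_mb >= min_free_mb, free_mb


if __name__ == "__main__":
    # "." stands in for the spool folder; on Windows the default location
    # is %SystemRoot%\System32\spool\PRINTERS.
    ok, free_mb = spool_space_ok(".")
    print(f"{free_mb:.0f} MB free:", "OK" if ok else "LOW")
```

Run on a schedule, a check like this warns you before large print jobs start failing for lack of room.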
7. It’s in the mail: Troubleshooting e-mail server problems
E-mail is the most used network application today. Problems with the mail server can be due to the same connectivity problems I’ve already discussed (hardware problems, TCP/IP settings). In addition, check the following:
- Make sure that the Mail Exchange (MX) resource records in your DNS entries are correct. For more information about MX records, see this fine site.
- Make sure the mail server has plenty of disk space for user mailboxes. You may need to impose limits on mailbox sizes if you have many high-volume users.
- If you want your mail server to receive mail from other mail servers, make sure that your mail server is configured to enable relay; otherwise, these requests will be blocked.
Tip
Be aware that opening your server to relays can leave it vulnerable to being used to relay spam. Make sure the server is configured to accept relay messages only for your own mail domains.
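When you review the MX records mentioned above, two points are worth checking: sending servers try the record with the lowest preference value first, and each record should name a host, not a bare IP address. The sketch below applies both checks to records you have already looked up. The `check_mx_records` function and the `(preference, exchange)` tuple layout are assumptions made for illustration, and the code does not query DNS itself.

```python
import ipaddress


def check_mx_records(records):
    """Sort MX records by preference and flag obvious mistakes.

    records is a list of (preference, exchange) tuples, for example
    [(10, "mail.example.com")]. Returns (ordered_records, problems).
    """
    problems = []
    if not records:
        problems.append("no MX records found for the domain")
    for preference, exchange in records:
        try:
            ipaddress.ip_address(exchange.rstrip("."))
        except ValueError:
            continue  # not an IP literal, which is what an MX record should hold
        problems.append(
            f"MX {preference} points to IP address {exchange}; it should name a host"
        )
    # Lowest preference value first: that is the order senders try them in.
    ordered = sorted(records, key=lambda record: record[0])
    return ordered, problems
```

Given `[(20, "backup.example.com"), (10, "mail.example.com")]`, the function lists `mail.example.com` first, which should be the server you expect to receive most of your mail.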
8. Terminal condition: Troubleshooting Terminal Services problems
The “thin client” solution is becoming more and more popular now that Microsoft includes Terminal Services built into its Windows 2000 and .NET server operating systems. For problems with your terminal server, check the following:
- In a Windows 2000/.NET domain, check group policy and individual users’ account properties in Active Directory if users are unable to connect to the terminal server or are unexpectedly disconnected.
- Check the Terminal Services configuration settings (Start | Programs | Administrative Tools | Terminal Services Configuration). Make sure that the terminal server is running in Application Server mode; in Remote Administration mode, only administrators can connect, and you can have only two active sessions at a time.
- Make sure that you have set up a Terminal Services license server and that you have sufficient licenses.
Figure B shows the Windows 2000 Terminal Services Configuration tool.
Figure B: Make sure the terminal server configuration is set to the proper mode.
9. Can’t call home: Troubleshooting dial-up/remote access server problems
A remote access server allows clients to dial in to connect to your network or establish a connection through a VPN over the Internet. If clients are unable to connect to your remote access server, check the general connectivity issues mentioned earlier, then check the following:
- Make sure that remote access services are installed and configured on your server. Make sure the service is started.
- Make sure that your dial-in, PPTP, and/or L2TP ports are enabled to accept inbound remote access calls.
- Make sure that the remote access server is configured to allow connections on the protocol(s) that are being used by the remote clients (IP, IPX, AppleTalk, or NetBEUI).
Tip
To view and configure properties of the Windows 2000 Remote Access Server, select Start | Programs | Administrative Tools | Routing And Remote Access, right-click the server name, and select Properties.
Figure C shows you how to access the Remote Access Server’s Properties.
Figure C: On a Windows 2000/.NET server, open the Routing And Remote Access console and right-click the server name to view or configure its properties.
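For VPN clients, a quick test from outside the server can tell you whether the PPTP port is accepting connections at all. The sketch below tries a plain TCP connection to port 1723, which PPTP uses for its control channel. Treat it as a rough first check under some assumptions: the host name in the usage note is hypothetical, an open port does not prove that the GRE traffic PPTP also needs is getting through, and L2TP runs over UDP port 1701, so it can't be tested this way.

```python
import socket

PPTP_CONTROL_PORT = 1723  # TCP port used by the PPTP control channel


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name lookup failed
        return False
```

Calling `tcp_port_open("ras.example.com", PPTP_CONTROL_PORT)` from a client's network returns `False` if the service is stopped, the port is disabled, or a firewall sits in the way, which narrows down where to look next.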
10. Listen for zebras: What else could the problem be?
A good philosophy for troubleshooting server problems is the old adage: “When you hear hoof beats, expect horses, not zebras.” This means you should consider the more common sources of problems: hardware failure, misconfigured network settings, etc., rather than the exotic ones.
But what happens after you’ve checked out all the usual suspects and still can’t connect to your server? Then you may want to consider some less commonly discussed possibilities:
- Check your server’s security settings, as well as any site- or domain-wide security policies that may be preventing connectivity.
- Check client licenses and licensing settings. If your server is configured to use per-server licensing and is set for 100 licenses, the 101st client may not be able to connect even though you’ve purchased additional licenses, if you haven’t changed this setting.
- Always consider the possibility that the server itself is not at fault. Check the routers and the client computers to make sure that the real problem doesn’t lie elsewhere.
Rehabilitating your “bad” server
Sometimes the worst happens: Your “bad” server is beyond help, so you end up having to reinstall the operating system and start over. This can be an annoyance, requiring several hours of work, or it can be a disaster, resulting in precious data being lost forever, depending on how well you’ve prepared for disaster.
If getting your server back up and running means wiping its disk and starting clean, you’ll be glad you practiced preventative maintenance.
Preventative maintenance: The importance of backing up
You know how important it is to always have a current backup of your mission-critical data, but do you actually practice what we all preach? Not only should you back up important data every day, but you should also regularly do a test restoration to make sure that your backup software and hardware are in good working order. If they aren’t, you want to find out before you actually need the backup.
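A test restoration is only useful if you confirm that what came back matches what went in. One simple way to do that, sketched here with Python's standard `hashlib`, is to compare checksums of an original file and its restored copy. The function names are made up for this example, and the approach assumes the original hasn't changed since the backup was taken.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """Return True if the restored file is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

Restore a handful of files to a scratch folder, run `restore_matches` on each pair, and treat any `False` as a reason to look hard at your backup software, media, or drive.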
Power to the server: The importance of a good UPS
Power surges or outages can be responsible for all manner of mysterious “glitches,” even if your server appears to have survived intact. It’s crucial to protect your server with a good Uninterruptible Power Supply (UPS) to prevent this.
Good to go: Using VM software to mirror your server configuration
The ultimate in disaster protection is server clustering—creating an exact duplicate of your server on a second machine, which can instantly take over the duties of your failed server so that your users experience little or no interruption in network services.
Clustering solutions range from expensive to very expensive. If your budget doesn’t allow for this ideal solution, consider the poor man’s version of clustering: Create a mirror of your server configuration on a virtual machine using VMware or similar software on another computer on your network. This virtual server might not stand up to long-term or heavy use, but it can take over your server’s duties for a short time while you re-create your server, and it can be used as a guide in that re-creation process if you haven’t documented all of your server’s settings as you should have.
Conclusion
Any way you look at it, a server failure is no fun, but you can make the experience a lot less traumatic, both for you and for your users, by following the tips in this article to get your server back up and running as quickly and painlessly as possible. In this Daily Drill Down, I’ve provided guidelines for isolating the source of the trouble and for troubleshooting some of the most common server problems.