Hardware problems can really disrupt the way your company functions. If a hardware problem occurs on a server, then many users may not be able to access critical resources. In this Daily Drill Down, I’ll explain some ways you can get Windows 2000 back up and running when hardware problems occur.
Before I begin
Before I get started, it’s important to point out that there is an almost endless variety of hardware problems, and no single method is going to pinpoint every one of them. Therefore, the purpose of this Daily Drill Down is to teach you methods you can use to diagnose some of the more common types of hardware problems.
Because no one method or tool will diagnose every hardware problem, it’s important to know what troubleshooting and diagnostic tools are available to you. Only by using a wide range of tools and techniques can you hope to correct the extremely diverse variety of problems that can occur.
The log files
When it comes to tracking down a hardware problem, the best place to start is with the log files. The log files will often tell you exactly what’s wrong with the system. If you’ve come from a Windows NT background, it may sound very strange to start the process by looking at the log files. After all, in Windows NT, if there was a hardware problem, many times the system wouldn’t boot, and therefore, you couldn’t look at the log files. If the system did manage to boot, the hardware problem probably wasn’t too severe to begin with and was easy to diagnose on your own.
However, let’s think about this logic in the context of Windows 2000. Windows 2000 offers a couple of ways to boot the system and access the log files during all but the worst hardware failures. Even though you may have to boot the system into a crippled state such as Safe Mode, you can still access the log files.
Now, what about the idea that if the system is bootable, the log files won’t be much help? In my experience, problems that have nothing to do with the hardware itself can often mimic a hardware failure. For example, a service or a device driver file might become corrupted. Naturally, the hardware device won’t work if its device driver has been corrupted, but you may never spot such a problem without reviewing the log files.
So which log files should you look at when failures occur? If you can boot the system into Safe Mode or into Normal Mode, I recommend checking the system log to see if any warnings or errors have been reported. Unfortunately, if you have to resort to using the Recovery Console, using the system log isn’t an option.
If the system won’t boot into Normal Mode, the next thing you should check is the boot log. To do so, boot the system, and when you see the screen that asks which operating system you want to boot, press the [F8] key. When you do, you’ll see a more extensive boot menu. From this menu, select the Boot Logging option. This option will attempt the boot process but will add an entry to the log file for each driver it tries to load along the way. This means you’ll have a log of everything that happened up to the point of failure. Since the log is a text file, you can view it through Safe Mode or through the Recovery Console with no problem. The file’s name is Ntbtlog.txt, and the file is located in your Windows directory (\WINNT on a default Windows 2000 installation).
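If the boot log is long, a short script can pull out the drivers that failed to load. The following is only a minimal Python sketch. It assumes each entry begins with either "Loaded driver" or "Did not load driver" followed by the driver's path, and the sample paths and driver names are made up for illustration. The function name split_boot_log is my own, not part of any Windows tool.

```python
# Sample boot-log text. The entries below are invented for illustration;
# a real Ntbtlog.txt will list the drivers on your own system.
SAMPLE_LOG = """\
Loaded driver \\WINNT\\System32\\ntoskrnl.exe
Loaded driver \\WINNT\\System32\\hal.dll
Did not load driver \\SystemRoot\\System32\\Drivers\\example1.sys
Loaded driver \\SystemRoot\\System32\\Drivers\\tcpip.sys
Did not load driver \\SystemRoot\\System32\\Drivers\\example2.sys
"""

LOADED_PREFIX = "Loaded driver"
FAILED_PREFIX = "Did not load driver"


def split_boot_log(text):
    """Return (loaded, failed) lists of driver paths found in boot-log text."""
    loaded, failed = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(FAILED_PREFIX):
            failed.append(line[len(FAILED_PREFIX):].strip())
        elif line.startswith(LOADED_PREFIX):
            loaded.append(line[len(LOADED_PREFIX):].strip())
        # Anything else (headers, blank lines) is ignored.
    return loaded, failed


loaded, failed = split_boot_log(SAMPLE_LOG)
for path in failed:
    print("Not loaded:", path)
```

To run this against a real log, copy the file to a working machine and pass its contents to split_boot_log; you may need to specify an encoding when opening the file if it was saved as Unicode. Keep in mind that a "did not load" entry isn't automatically a fault, since some drivers are skipped on purpose, so treat the list as a set of leads rather than a verdict.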
The Safe Mode method
So far, I’ve mentioned Safe Mode quite a bit. The Windows 2000 Safe Mode is very similar to the Safe Mode option in Windows 9x. The idea behind Safe Mode is that Windows will load with a minimal set of drivers and services. Therefore, if the failure isn’t occurring on a device that’s absolutely critical to the system, Windows 2000 will be able to load because it won’t be loading drivers and services related to the damaged device.
To access Safe Mode, boot the system and press the [F8] key when you see the menu that asks which operating system you want to load. When you do, you’ll see a more extensive boot menu. Simply select the Safe Mode option, and Windows will boot into Safe Mode.
Once you’ve managed to boot Windows into Safe Mode, you’re well on your way to correcting the problem. If you’ve ever tried to repair a hardware problem in Windows NT, you know how difficult it can be to fix a problem from outside the operating system. Booting into Safe Mode gives you access to the GUI, which means you’ll have tools available that you wouldn’t have from outside it.
So now the big question is once you’re in Safe Mode, what do you do? With any luck, you’ll have a clue as to what the problem is and you can go ahead and fix it. However, if you don’t know what the problem is, I recommend going into Device Manager and disabling every device that isn’t critical to the system. Remember that this is exactly what Windows did, and it was able to boot into Safe Mode. This means that if you were able to access Safe Mode, all of the critical drivers are working. Therefore, disable everything that isn’t critical and try booting the system into Normal Mode. If the system boots into Normal Mode, you can be sure that one or more of the devices you disabled was causing your problem.
The next step of the process is to determine which device was to blame. To do so, reenable one device and reboot the system. If the system boots, then the device you enabled was okay. If the system fails to boot, go back into Safe Mode, disable that device, and enable a different device. The idea is to enable one device at a time, rebooting between each device, until you’ve determined which device or devices are causing the problem. Once you have that information, you can begin taking steps to correct the problem.
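The enable-one-and-reboot routine described above is really a simple elimination loop. Here is a minimal Python sketch of the logic, run against a simulated machine. The boots_normally callback is hypothetical: in real life, that step is you rebooting the server and watching what happens. The device names are also made up.

```python
def find_faulty_devices(devices, boots_normally):
    """Re-enable devices one at a time; return those that break the boot.

    devices: names of the noncritical devices you disabled in Safe Mode.
    boots_normally: hypothetical callback that receives the set of
        currently enabled devices and returns True if a Normal Mode
        boot succeeds with that set.
    """
    enabled = set()
    faulty = []
    for device in devices:
        enabled.add(device)           # reenable exactly one more device
        if boots_normally(enabled):   # reboot and see what happens
            continue                  # this device is okay; leave it on
        enabled.discard(device)       # back to Safe Mode, disable it again
        faulty.append(device)
    return faulty


# Simulated machine: booting fails whenever a bad device is enabled.
BAD_DEVICES = {"sound card"}


def simulated_boot(enabled):
    return not (enabled & BAD_DEVICES)


print(find_faulty_devices(["modem", "sound card", "second NIC"], simulated_boot))
```

Because a failing device is disabled again before the next one is tried, the loop keeps going after the first hit and will also catch the case where more than one device is to blame.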
The Recovery Console
The Recovery Console is new to Windows 2000. As you may recall, one of the biggest concerns with correcting problems on a Windows NT system was that if the system wouldn’t boot and if the hard drive was formatted in the NTFS format, there was no way of accessing the hard disk to repair the problem (short of using a hacker tool). It didn’t take Microsoft long to realize that this was a drawback, so it included something called the Recovery Console in Windows 2000. The Recovery Console is a command-prompt environment that grants the administrator full read and write access to all the partitions on your system. It also offers other capabilities, such as the ability to enable or disable services from a command-prompt environment. The Recovery Console isn’t as powerful a tool as Winternals Software’s ERD Commander 2000, but it will get the job done in a pinch.
Unfortunately, the Recovery Console isn’t installed by default, because it consumes roughly 7 MB of hard-disk space. If you have the disk space to spare, I recommend installing the Recovery Console on your servers before a crash occurs. However, if a crash has already happened or if you can’t spare the disk space, you can access the Recovery Console through the Windows 2000 boot disks.
The Blue Screen of Death
Most of the time when Windows 2000 won’t boot, the boot process will begin but then abruptly end at the Blue Screen of Death. When you get a Blue Screen of Death, it’s often tempting to ignore the hieroglyphics on this screen and move on to the process of trying to boot the system into Safe Mode. As you may recall, when I discussed Safe Mode earlier, I said that you could go through Device Manager and start disabling devices. As you saw, the process of disabling all the devices and reenabling them one by one is tedious. You may not really want to go through this process, but you may not have a clue as to the cause of the problem. However, the Blue Screen of Death may contain the answers you need.
The section at the top of the Blue Screen of Death is known as the bug check section. You can see a sample of this section here:
All of the information in the bug check section means something. The first thing you’ll want to look at is the message portion of the section. In this particular case, the message section reads:
There are dozens of possible messages you could see in the bug check section. Unfortunately, space doesn’t allow me to explain what all of these mean, but you can find this information on the TechRepublic Web site in “Understanding the Windows 2000 Blue Screen of Death, part 1.”
The other section that’s important to look at is the bottom line. Many times (but not always), this line provides the memory address and the file that caused the error. You can use this information to determine the cause of the problem. For example, if the filename in the Blue Screen error is TCPIP.sys, then this is a good indication that the problem may lie with your network card or the Windows 2000 network drivers. Rather than disabling every device in the Windows 2000 Device Manager, you can simply disable the network card to see if that solves the problem. If that doesn’t take care of things, you might try reinstalling the individual networking components.
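If you jot down that last line, picking out the filename and matching it to a likely suspect is easy to automate. The Python sketch below is only an illustration. The sample line, its address and date stamp, and its layout are assumptions on my part rather than a copy of a real screen, and the lookup table holds just the one TCPIP.sys example from this article. The function names are hypothetical.

```python
import re

# Hypothetical last line of a Stop screen. The address, date stamp, and
# layout are illustrative only; copy the real line from your own screen.
SAMPLE_LINE = "*** Address F9A5E1B6 base at F9A5D000, DateStamp 36B072A3 - TCPIP.sys"

# Illustrative lookup table, keyed by lowercase filename. Only the
# TCPIP.sys entry comes from the article; add your own as you learn them.
SUSPECTS = {
    "tcpip.sys": "network card or Windows 2000 network drivers",
}

# Match a filename ending in .sys, .dll, or .exe at the end of the line.
FILE_PATTERN = re.compile(r"([\w.-]+\.(?:sys|dll|exe))\s*$", re.IGNORECASE)


def file_from_stop_line(line):
    """Return the filename at the end of the line, or None if there isn't one."""
    match = FILE_PATTERN.search(line)
    return match.group(1) if match else None


def suspect_for(line):
    """Return a likely culprit for the file named in the line, if known."""
    name = file_from_stop_line(line)
    if name is None:
        return None
    return SUSPECTS.get(name.lower())


print(file_from_stop_line(SAMPLE_LINE), "->", suspect_for(SAMPLE_LINE))
```

Remember that the line doesn’t always name a file, which is why both functions return None rather than guessing.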
Performance Monitor
One final tool that’s very useful for tracking down hardware problems is Performance Monitor. Performance Monitor allows you to see exactly how individual hardware components are working. You can use Performance Monitor to tell whether a component is failing or if it’s just overwhelmed with work. For example, if your network connection wasn’t working correctly, you could use Performance Monitor to see if any traffic was flowing through the network card, if the card was generating network errors, or if the card was generating an excessive number of retries.
Conclusion
In this Daily Drill Down, I’ve explained that although Windows 2000 provides you with a wide variety of diagnostic tools, no one tool is able to diagnose every hardware problem. I’ve also explained some techniques you can use to diagnose hardware problems and get your machine back online.