Have you ever had that 3:00 AM phone call from someone saying that one of your servers is displaying the Blue Screen of Death? Your first instinct would probably be to tell the caller to reboot the server and let you go back to sleep. However, as you’ve probably already found out, rebooting isn’t always the magic cure-all. Staring at an incomprehensible blue screen full of numbers and codes can be gut-wrenching, but the experience doesn’t have to be so traumatic. The secret is knowing how to read the Blue Screen of Death. In this Daily Drill Down, we’ll show you how. We’ll also discuss some of the errors that you’re likely to encounter and provide some techniques for eliminating them.
Anatomy of a blue screen
There are four basic sections that you should be aware of on a Blue Screen of Death. The first section lists the actual error message. The second section lists the Windows NT modules that are already loaded into memory. The third section lists the modules that were about to be loaded had the error not occurred. Finally, the fourth section lists the current status of the Kernel Debugger. We’ll cover each of these sections in detail.
The error message
The section highlighted in Figure A shows the actual error message. This message contains an error code number, the addresses where the error occurred, and a text code indicating the type of error. Below, we’ve listed some of the more common error codes and their causes.
|This is the actual error message.|
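If you copy the error line down by hand, it can help to break it into its three parts before looking anything up. The following is a minimal Python sketch, not a tool that ships with Windows NT. It assumes the stop line follows the layout shown in the comment, and the sample values and the parse_stop_line name are ours, purely for illustration.

```python
import re

# Assumed layout of the stop line (sample values are hypothetical):
# *** STOP: 0x0000000A (0x00000000,0x0000001A,0x00000000,0x00000000) IRQL_NOT_LESS_OR_EQUAL
STOP_LINE = re.compile(r"\*\*\* STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)\s*(\w+)?")


def parse_stop_line(line):
    """Split a stop line into its error code, parameters, and text code."""
    match = STOP_LINE.search(line)
    if match is None:
        return None  # not a stop line in the assumed layout
    code, params, name = match.groups()
    return {
        "code": int(code, 16),
        "parameters": [int(p, 16) for p in params.split(",") if p.strip()],
        "name": name,  # None if the screen showed no text code
    }


sample = ("*** STOP: 0x0000000A (0x00000000,0x0000001A,0x00000000,0x00000000) "
          "IRQL_NOT_LESS_OR_EQUAL")
parsed = parse_stop_line(sample)
```

With the sample line, parsed["name"] holds the text code, which is usually the quickest thing to match against the descriptions that follow.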
This error is caused by an application trying to divide by zero. If you receive this error and don’t know which application caused it, you might try examining the memory dump.
The IRQL_NOT_LESS_OR_EQUAL error is caused by a buggy device driver or an actual hardware conflict. If you’ve recently added new hardware to your system, try removing it and see if the error goes away. Likewise, if you’ve recently loaded a new device driver, you might try using ERD Commander Professional Edition, by Winternals Software, to temporarily disable the new driver and see if the problem goes away.
An incorrectly configured device driver usually causes this type of error. As we’ll explain later, you can use another section of the blue screen to figure out which driver is causing the problem.
Such an error indicates a catastrophic failure in the system’s registry. However, this error can sometimes be caused by failure to read the registry from the hard disk rather than because the registry itself is corrupt. Most of the time though, if you get this error, you’ll have to restore from backup.
Just as the name implies, this error indicates that Windows NT is having trouble reading from the hard disk. This error can be caused by a faulty device driver or a bad SCSI terminator. If you’ve checked for these problems, but are still receiving the error, check to make sure that a virus hasn’t destroyed your boot sector.
This error message is almost always caused by your computer’s memory. If you receive this error, check to make sure that all of your SIMMs are the same type and speed. You should also check to make sure that your computer’s CMOS is set for the correct amount of RAM. If all of these suggestions check out, try replacing the memory in the computer.
This is, perhaps, the most obscure error message. In most cases, if you receive this error, it’s related to the most recent change you’ve made on your system. Try undoing the change to get rid of the error.
An NTFS_FILE_SYSTEM error indicates hard disk corruption. If your system is bootable, run CHKDSK /F on all of your partitions immediately. If your system isn’t bootable, try installing a new copy of Windows NT in a different directory. You can use that copy to run the CHKDSK program. When you’re done with the second copy, you can edit your BOOT.INI file to make your computer start your original copy of Windows NT.
This error indicates that Windows NT wasn’t able to read a page of kernel data from the page file. Bad memory, a bad processor, incorrectly terminated SCSI devices, or a corrupt PAGEFILE.SYS file may cause this situation. The first step in correcting such an error is to recreate the PAGEFILE.SYS file and see if you can bring your system back online.
This is a generic error message in which the hardware abstraction layer can’t report on the true cause of the error. In such a situation, Microsoft recommends calling the hardware vendor. This error can sometimes be caused by mixing parity and non-parity SIMMs or by bad SIMMs.
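If you keep your own troubleshooting notes, a small lookup table is one way to get from a text code to a first step at 3:00 AM. The Python sketch below covers only the text codes that this article names explicitly. The advice strings paraphrase the article, and the FIRST_CHECKS table and first_check helper are hypothetical names of our own.

```python
# First checks for the text codes named in this article.
# The table and helper are illustrative, not part of any Windows NT tool.
FIRST_CHECKS = {
    "IRQL_NOT_LESS_OR_EQUAL":
        "Remove recently added hardware or disable the newest device driver.",
    "KMODE_EXCEPTION_NOT_HANDLED":
        "Check the modules that were about to load for a suspect driver.",
    "NTFS_FILE_SYSTEM":
        "Run CHKDSK /F on all partitions.",
}


def first_check(text_code):
    """Return the first step for a text code, or a fallback if it is not listed."""
    key = text_code.strip().upper()
    return FIRST_CHECKS.get(key, "No entry; read the full error description.")


advice = first_check("ntfs_file_system")
```

The lookup ignores case and surrounding whitespace, so a hastily typed code still finds its entry.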
Modules that have loaded
The section that we’ve highlighted in Figure B shows the modules that Windows NT has already loaded into memory. Because these modules loaded before the error occurred, you can be somewhat confident that none of them is causing your problem.
|These are the modules that NT has already loaded into memory.|
Modules that were about to load
The section that we’ve highlighted in Figure C shows which modules were about to load when the error occurred. Many times, this section can give you an idea of which module is causing your problem. This is especially true if you’re receiving a KMODE_EXCEPTION_NOT_HANDLED error. For example, suppose that the next module on the stack to load was tcpip.sys. In such a situation, it’s likely that an incorrect network card driver is causing your problem. If you happen to own ERD Commander Professional Edition, by Winternals Software, you could disable the network card driver and try booting your system again. If the system boots, you can then correct the driver problem.
|These are the modules that were next to load, had the error not occurred.|
The section highlighted in Figure D indicates the current status of the kernel debugger. The kernel debugger allows you to link two computers running Windows NT via a RAS connection or a null modem cable. When a Blue Screen of Death occurs, the crash dump information is sent to the functional computer for diagnosis.
|This section lists the status of the Kernel debugger.|
To use the kernel debugger, both computers must be running the same version of Windows NT and have the symbol set installed. You must also install the debugging software from the \SUPPORT\DEBUG\I386 directory on your Windows NT CD-ROM.
Next, you must add environment variables to both computers, as shown in Table A:
|Variable|Value|
|_NT_DEBUG_PORT|COM1 or COM2|
|_NT_SYMBOL_PATH|location of symbol files|
At this point, you need to modify the BOOT.INI file on the computer that you plan to use to examine the crash dump information. To do so, add /CRASHDEBUG to the end of the line that you plan to use to boot Windows NT. Reboot NT before continuing.
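If you’d rather script the BOOT.INI change than edit the file by hand, the following Python sketch shows the idea. The sample BOOT.INI contents, the ARC paths, and the add_boot_switch name are assumptions for illustration only; your own file will differ, and BOOT.INI is normally read-only, so work on a copy first.

```python
# Hypothetical BOOT.INI contents, used only to demonstrate the edit.
SAMPLE = r'''[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Server Version 4.00"
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Server Version 4.00 [VGA mode]" /basevideo /sos
'''


def add_boot_switch(boot_ini_text, entry_label, switch="/CRASHDEBUG"):
    """Append a switch to [operating systems] lines that contain entry_label."""
    updated = []
    in_os_section = False
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track whether we are inside the [operating systems] section.
            in_os_section = stripped.lower() == "[operating systems]"
        elif (in_os_section and entry_label in line
              and switch.lower() not in line.lower()):
            # Add the switch once, at the end of the matching boot line.
            line = line.rstrip() + " " + switch
        updated.append(line)
    return "\n".join(updated)


# The trailing quote in the label keeps the [VGA mode] entry from matching.
patched = add_boot_switch(SAMPLE, 'Version 4.00"')
```

Running the function a second time leaves the text unchanged, because it skips any line that already carries the switch.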
When both machines are set up, you must run the REMOTE program before triggering the blue screen. On the PC having the problem, type the following command:
REMOTE /s "I386KD -v" DEBUG
In this command, the /s indicates that this computer will act as a server and send the crash dump file to the client. The -v indicates verbose logging mode.
On the computer that you plan to use to examine the crash dump, type the following command:
REMOTE /C computername DEBUG
In this command, the /C indicates that this computer will function as a client and receive the crash dump file from the server. The computername is the name of the computer having problems.
An easier way
As you can tell, setting up the kernel debugger can be complicated. If you don’t want to go through all of this, there are a couple of other things you can try first.
If your computer is bootable, you can set Windows NT to create a memory dump file when a Blue Screen of Death occurs. To do so, open the System Properties dialog box from Control Panel and go to the Startup/Shutdown tab. Next, set the options shown in Figure E. Keep in mind that the partition where you store the memory dump file must have at least enough free space to store your page file, plus your physical RAM space, plus 1 MB. For example, if your machine has 128 MB of RAM, the partition must have enough free space for the page file, plus an extra 129 MB.
|Use these options to create a memory dump file.|
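The free-space rule above is simple arithmetic, and a short Python sketch makes it easy to check before you turn the option on. The function name and the 139-MB page file size are hypothetical; only the 128 MB of RAM comes from the example in this article.

```python
def min_free_space_mb(page_file_mb, ram_mb):
    """Free space needed on the dump partition: page file + physical RAM + 1 MB."""
    return page_file_mb + ram_mb + 1


# 128 MB of RAM, with an assumed 139-MB page file.
needed = min_free_space_mb(page_file_mb=139, ram_mb=128)
```

Whatever the page file size, the RAM portion of the requirement for a 128-MB machine always works out to the extra 129 MB mentioned above.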
Once you’ve created a memory dump file, you can use the DUMPEXAM.EXE program in the \SUPPORT\DEBUG\I386 directory of your Windows NT CD-ROM to create a report of the crash. You can see an example of such a report in Figure F.
|You can use the DUMPEXAM.EXE program to create a report similar to this one.|
Last known good configuration
You have undoubtedly heard the phrase, “If it ain’t broke, don’t fix it.” In the world of Windows NT, this can be especially true. Blue screens don’t occur without reason. If you have a blue screen that you can’t seem to figure out and you’ve ruled out a hardware failure, chances are that it may be related to a change that you or someone else has recently made. In such a situation, you could try using the Last Known Good Configuration as a last resort. Using this option will sometimes bring your system back to life, but will undo the changes that you’ve made since the last time the system was rebooted.
In this Daily Drill Down, we’ve discussed the various pieces of information displayed on the notorious Blue Screen of Death. Along the way, we’ve explained what each of these items means and described several steps you can take to correct the error.
Brien M. Posey is an MCSE who works as a freelance writer. He also works as a systems engineer for the United States Department of Defense. You can contact him at Brien_Posey@xpressions.com. Because of the high volume of e-mail that he receives, it’s impossible for him to respond to each message, although he does read them all.
The authors and editors have taken care in preparation of the content contained herein, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for any damages. Always have a verified backup before making any changes.