One of the most difficult areas to troubleshoot is the hard drive. So, wouldn't you kill to have a set of troubleshooting steps to guide you? Read on as Faithe Wempen reveals her seven methods for tracking down hard drive problems.
When a floppy or CD-ROM drive doesn’t work, it’s an annoying but not particularly scary problem to fix. However, when a hard disk fails, the computer doesn’t boot—as in the case of a boot drive failure—and the frenzy to save important company data ensues. When faced with such a problem, don’t panic. Just remember these simple troubleshooting tips for hard drives.
Seven steps to my troubleshooting process
Here’s a quick look at the process I follow when troubleshooting a hard disk. I’ll elaborate on each of these points in the following sections. With each point, ask yourself the question(s) that follow.
- Physical connectivity—Is the drive receiving power? Is it plugged into the PC by a correctly connected ribbon cable? For IDE drives, are its jumpers set correctly? Or with SCSI drives, are its SCSI termination and ID set correctly?
- BIOS setup—Does the BIOS see the drive?
- Viruses—Does the drive contain any boot sector viruses I need to remove before continuing?
- Partitioning—Does FDISK find a valid partition on the drive? Is it active?
- Formatting—Is the drive formatted using a file system that the OS can recognize?
- Drive errors—Is a physical or logical drive error causing read/write problems on the drive?
- Operating system—Does your OS have a feature that checks the status of each drive on your system? If so, what is that status?
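The order of the list above matters: each check assumes the ones before it have passed. As a rough illustration, the Python sketch below walks the seven steps in order and reports the first one that fails. The function name and the shape of the `results` dictionary are hypothetical, chosen only for this example.

```python
from typing import Dict, Optional

# The seven checks, in the order the article walks through them.
STEPS = [
    "Physical connectivity",
    "BIOS setup",
    "Viruses",
    "Partitioning",
    "Formatting",
    "Drive errors",
    "Operating system",
]


def first_failing_step(results: Dict[str, bool]) -> Optional[str]:
    """Return the first step that failed or has not been checked yet.

    `results` maps a step name to True (passed) or False (failed).
    Steps missing from `results` are treated as not yet checked, so the
    walk stops there as well. Returns None when every step has passed.
    """
    for step in STEPS:
        if not results.get(step, False):
            return step
    return None
```

Stopping at the first failure mirrors the way the process is meant to be used: there is little point in running FDISK, for example, until the BIOS can see the drive.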
Checking physical connectivity
To work properly, a hard drive needs power and a connection via a ribbon cable to the PC. If a drive doesn’t work after moving it to a new PC, after physically moving the PC, or after the cover has been taken off, start your troubleshooting by checking the physical connectivity. It’s possible for plugs to jiggle loose when moving a PC, and it’s easy to uproot a ribbon cable connection when pulling circuit boards or performing other maintenance tasks inside the case.
A hard disk works with any Molex connector from the PC’s power supply. Make sure the plug is fully inserted. Molex connectors require a lot of pressure to fully insert, and even more pressure to remove, so don’t be afraid to push hard or pull, as the case may be. Just make sure you handle the plastic connector, and do not try to push or pull the wires.
As the PC starts up, place the palm of your hand on the flat part of the hard disk. If you can detect any vibration, the drive probably has power. If there’s no movement at all, either the drive’s physical mechanism is shot or the Molex connector you have selected is faulty. Try using a different connector before assuming the drive has a problem.
Systems like the AT/LPX have a small connector that runs from the front of the case to the hard disk. On ATX systems, it runs from the motherboard to the hard disk. This enables the LED on the case to illuminate when the hard disk is in use. Don’t rely on that LED as a positive indicator as to whether or not the hard disk is receiving power, though. The light could be burned out, the wire disconnected, or the drive might be receiving power but not be connected correctly to the PC.
The other physical requirement for a drive is a data connection to the PC itself. If it’s an IDE model, the drive should be connected via a ribbon cable to the IDE connector on the motherboard. Connections can also be made through a SCSI or proprietary expansion card. Seat both ends of the ribbon cable firmly and make sure each connector covers all of the pins. On systems where the pins are bare instead of surrounded by a plastic ridge, it’s easy to offset the connector by a row or two on the pins. If the drive is getting power but the BIOS can’t find it, try a different ribbon cable; the one in use might have a broken wire or other flaw.
There are two types of hard disk ribbon cables: 40-wire and 80-wire. UltraDMA 66 and above requires the 80-wire cable. If you use the 40-wire type, the drive will be limited to UltraDMA 33 performance. See my article “All About IDE” for further information.
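The cable rule above amounts to a simple cap on transfer mode. The following Python sketch is only an illustration of that rule; the function name is made up, and the speeds are the nominal UltraDMA ratings in MB/s.

```python
def effective_udma_rate(cable_wires: int, drive_max_rate: int) -> int:
    """Return the nominal UltraDMA rate (MB/s) the drive can run at.

    cable_wires    -- 40 or 80, the two ribbon cable types
    drive_max_rate -- the fastest mode the drive supports, e.g. 33, 66, 100
    """
    if cable_wires not in (40, 80):
        raise ValueError("IDE ribbon cables are either 40-wire or 80-wire")
    if cable_wires == 40:
        # A 40-wire cable limits the drive to UltraDMA 33 at best.
        return min(drive_max_rate, 33)
    return drive_max_rate
```

For example, an UltraDMA 100 drive on a 40-wire cable would be reported at 33, which is a useful clue when a drive works but seems slower than expected.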
The red stripe on the ribbon cable must match up with Pin 1 on both the drive and the motherboard or expansion card. Sometimes, though, it’s not easy to locate Pin 1. Look for tiny numbers at one end of the connector. If you see a 1 or 2, that’s the end with which the red stripe should be matched. Some connectors are notched on one side, and the ribbon cable has a tab that fits into the notch, but this isn’t always the case. With floppy drives, the drive light stays on constantly when the ribbon cable is backward; hard drives give no such simple sign. Without the notched connectors, your only choice is trial and error.
Don’t mount the drive in the computer case until you’re sure it works. Sometimes those little screws can be hard to reach, so you only want to mount the drive once. For testing purposes, the hard disk can temporarily sit at an angle, unmounted, but propped up in some way. Don’t allow the drive to hang suspended by the ribbon cable or power cable; this puts stress on the cable and can cause broken wires or dislocated connectors.
Checking jumper settings
On an IDE hard disk, one or more jumpers on the drive must be set to determine its Master/Slave status. This setting isn’t usually an issue in an existing hard disk installation that suddenly doesn’t work anymore, but it can cause problems when you move a drive from one PC to another.
Depending on the drive, the following jumper settings may be available.
- Single—Use this setting when the drive is the only one on that IDE subsystem; that is, the only one on that ribbon cable. Not all drives have a Single setting; if there is none, use the Master setting instead.
- Master (MS)—When there are two drives on the IDE subsystem and the other drive’s jumpers are set to Slave, or if this is the only drive on the subsystem and it doesn’t have a separate Single setting, use this setting.
- Slave (SL)—Use this setting when there are two drives on the IDE subsystem and the other drive’s jumpers are set to Master.
- Cable Select (CS)—If you are using a cable that relies on the device positioning to determine its Slave/Master status, use this setting. This setting is uncommon.
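The jumper rules in the list above can be summarized as a small decision function. This Python sketch is illustrative only; the parameter names are hypothetical, and it assumes at most two drives per ribbon cable, as described above.

```python
def jumper_setting(drives_on_cable: int,
                   is_first_drive: bool = True,
                   has_single_setting: bool = False,
                   cable_select_cable: bool = False) -> str:
    """Pick the jumper setting for one IDE drive.

    drives_on_cable    -- 1 or 2 drives sharing the ribbon cable
    is_first_drive     -- True for the drive meant to be Master
    has_single_setting -- True if the drive offers a separate Single setting
    cable_select_cable -- True if the cable assigns roles by position
    """
    if drives_on_cable not in (1, 2):
        raise ValueError("an IDE cable carries one or two drives")
    if cable_select_cable:
        return "Cable Select"
    if drives_on_cable == 1:
        # Not all drives have Single; fall back to Master when it is missing.
        return "Single" if has_single_setting else "Master"
    return "Master" if is_first_drive else "Slave"
```

The main case to watch for when moving a drive between PCs is a drive still jumpered as Slave that ends up alone on a cable, or two drives on one cable both set to Master.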
In Figure A, Master has been selected. Jumpers are set with the jumper running up-and-down; setting them side-to-side would be the same as using no jumpers at all.
Depending on the drive, the jumper positions may or may not be clearly labeled. There should be a chart or sticker on the drive showing the positions. If you see neither, try visiting the drive manufacturer’s Web site to see if a diagram is available.
Checking SCSI termination
If the machine uses a SCSI drive, there are two factors with which to be concerned: termination and ID. These settings are not an issue when troubleshooting a drive that has suddenly gone bad in an existing system, but if you are moving a drive from one system to another and it doesn’t work in the new system, improper SCSI settings may be the culprit.
If this is the last SCSI device in the chain, it must be terminated. Termination methods vary. On some devices, you set termination with an extra jumper; on others, you use a cap or plug over a connector. On most hard disks, you terminate using a jumper setting.
SCSI drives usually have jumpers just like IDE drives, but instead of setting the Master/Slave status, they assign a SCSI ID number to the device. Some SCSI devices have a wheel or button instead of jumpers, with a little window indicating the setting, but this is uncommon on a hard disk.
There can be up to seven SCSI devices on a single narrow SCSI bus, and up to 15 devices on a wide SCSI bus. There are either eight or 16 addresses in total, depending on your system. The host adapter takes one of those addresses, leaving seven or 15 for the remaining drives. Usually, the host adapter claims the highest number for itself.
The SCSI ID comes from a binary representation of the jumpers. For example, on a device with three SCSI ID jumper positions and no jumpers installed on any of them, the ID would be 000b (b stands for binary here), or 0. An ID of 001b would be 1; 010b would be 2; and so on.
The problem lies in the fact that some manufacturers set the jumpers to read from left-to-right, while others use right-to-left. So on one drive, the leftmost jumper set would be 1, while on some other drive, the rightmost jumper set would be 1. Check the drive’s label for information about which way the drive works. If all else fails, try the manufacturer’s Web site.
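To make the binary encoding and the left-versus-right ambiguity concrete, here is a minimal Python sketch. The function names and the `leftmost_is_low_bit` flag are assumptions made for this example; the second helper assumes the host adapter takes the highest ID, which is the usual arrangement but not a guarantee.

```python
from typing import List, Set


def scsi_id_from_jumpers(jumpers: List[bool],
                         leftmost_is_low_bit: bool = False) -> int:
    """Decode a SCSI ID from jumper positions.

    jumpers -- positions listed left to right as they appear on the
               drive; True means a jumper is installed.
    leftmost_is_low_bit -- set True for drives that read the other way.
    """
    # Arrange the positions so that index 0 is the least significant bit.
    bits = jumpers if leftmost_is_low_bit else list(reversed(jumpers))
    return sum(1 << i for i, installed in enumerate(bits) if installed)


def free_device_ids(wide_bus: bool, used_ids: Set[int]) -> List[int]:
    """List the IDs still available for devices on the bus.

    Assumes the host adapter claims the highest ID (7 on a narrow bus,
    15 on a wide bus).
    """
    total = 16 if wide_bus else 8
    host_adapter_id = total - 1
    return [i for i in range(total)
            if i != host_adapter_id and i not in used_ids]
```

The same physical jumper can therefore mean ID 1 on one drive and ID 4 on another: `scsi_id_from_jumpers([True, False, False])` decodes as 4 when the rightmost position is the low bit, and as 1 when `leftmost_is_low_bit=True`.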
Checking BIOS setup (IDE only)
In most modern systems, the BIOS can automatically detect your hard disk, so no special BIOS setup is required. However, if you are working with an older or quirky BIOS, you might need to enter the BIOS setup program and change the drive’s IDE channel—i.e., Primary Master, Primary Slave, etc.—from None to Auto so the BIOS will attempt to find and identify the drive.
On an old BIOS, you occasionally may need to select User as the drive type and manually enter the drive’s settings. Automatic detection of IDE devices was part of the ATA-3 standard, released more than 10 years ago, though, so having to do this would be rare.
To enter the BIOS setup program, watch the screen at startup. It should list the key you need to press to enter Setup. The most common ones are [Delete], [F1], or [F2].
Some BIOSs also have a separate Detect IDE Devices utility built in. If the BIOS contains such a utility, you can use it to prompt the BIOS to detect the new hard disk. This comes in handy when you aren’t sure whether or not the drive is working, because you can get an answer immediately rather than rebooting and waiting to see whether the BIOS finds the drive on startup.
Checking for viruses
If you’ve come this far in the troubleshooting process and the drive still isn’t working, check for viruses. A drive containing a boot-sector virus will not only malfunction, it can spread the virus to the disk you boot from, such as your emergency startup disk (DOS or Windows 9x/Me).
On a system that you know is good and that has an antivirus program installed, update the virus definitions, and then make a virus-checking boot disk. Write-protect it, and then use it to start the system containing the nonworking hard disk and scan that disk for viruses. If the drive is not partitioned and formatted, the boot disk might not be able to check the data area of the drive. That’s okay for now; just let it get as far as it can before moving on to the next step, checking the partition.
Checking for a valid partition
If the BIOS can see the drive but the drive isn’t working, make sure the drive is partitioned. Use FDISK, a command-line utility you’ll find on a Windows 9x/Me startup disk, to check. Boot from the write-protected startup disk and type FDISK. When asked whether or not you want large disk support, type Y.
If you choose N when asked about enabling large disk support, any partitions you create will later be formatted as FAT rather than FAT32.
From the FDISK main screen (Figure B), type 4 to view the existing partitions.
If the active partition’s type is FAT, FAT32, or NTFS, it should be recognized by the operating system (Figure C). One exception would be if you put an NTFS drive into a Windows 9x/Me system. The OS wouldn’t recognize the NTFS because it doesn’t support NTFS, not because it was partitioned incorrectly.
If FDISK reports that it’s a Non-DOS partition, the drive’s partition information has a problem. The most likely cause is a virus.
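To show where this partition information lives, here is a minimal Python sketch that decodes the partition table from a copy of the disk’s first sector. It assumes the classic MBR layout: four 16-byte entries starting at byte 446 and a 0x55AA signature at byte 510. The function name and the small type-code table are illustrative and are not part of FDISK.

```python
import struct
from typing import Dict, List

# Common MBR partition type codes. Anything not listed is reported as unknown.
TYPE_NAMES = {
    0x01: "FAT12",
    0x04: "FAT16", 0x06: "FAT16", 0x0E: "FAT16",
    0x0B: "FAT32", 0x0C: "FAT32",
    0x07: "NTFS/HPFS",
    0x05: "Extended", 0x0F: "Extended",
}


def read_partition_table(sector: bytes) -> List[Dict]:
    """Decode the four primary partition entries of an MBR sector."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector (0x55AA signature missing)")
    entries = []
    for slot in range(4):
        raw = sector[446 + 16 * slot: 446 + 16 * (slot + 1)]
        boot_flag, type_code = raw[0], raw[4]
        if type_code == 0x00:
            continue  # unused slot
        # Starting sector and length are little-endian 32-bit values.
        start_lba, num_sectors = struct.unpack_from("<II", raw, 8)
        entries.append({
            "slot": slot + 1,
            "active": boot_flag == 0x80,
            "type": TYPE_NAMES.get(type_code, f"Unknown (0x{type_code:02X})"),
            "start_lba": start_lba,
            "sectors": num_sectors,
        })
    return entries
```

An empty result, an unknown type code, or no entry marked active corresponds to the situations described above: no valid partition, an unrecognized partition, or a partition that is not set active. Reading the first sector of a real disk usually requires administrator privileges, so the sketch takes the bytes as input instead of opening a device.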
If it is a partition problem, you have two choices: Try to recover the data using a disk recovery program, or give up on the data, delete the partition, and re-create it in FDISK. If you want to try recovery first, see the section below on Advanced Data Recovery Options.
If you want to delete the partition and re-create it, press [Esc] to return to the FDISK main screen and delete the partition (option 3 on the screen). Then return to the main screen again and create a partition (option 1 on the screen). After using FDISK to create or delete partitions, you must reboot the machine before doing anything else.
Checking drive formatting
If FDISK recognizes the drive and it has a valid partition type, you should be able to view the drive’s content from a command prompt via your startup disk, or from the Recovery Console in Windows 2000 or XP. Change to that drive by typing its drive letter followed by a colon and pressing [Enter]. Then, display a list of files on the drive with the DIR command.
If you see a message about an invalid media type, the drive is probably not formatted using a file system that your OS recognizes. You can either try a data recovery program, or you can give up on the drive’s data and reformat it with the FORMAT command.
If you booted from a Windows 9x/Me boot disk, but your system ordinarily runs Windows NT, 2000, or XP, the disk might be formatted with NTFS. The fact that the boot disk’s OS cannot read it does not necessarily mean there is a problem with the formatting. For those OSs, try booting to the Windows Recovery Console to see whether or not you can access the disk from there.
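The distinction between “badly formatted” and “formatted with something this boot disk cannot read” can be sketched as a simple lookup. The table below encodes only what this article states about the two environments; the names are hypothetical, and real support varies by exact OS version.

```python
# File systems each boot environment can read, as described in the article.
READABLE = {
    "Windows 9x/Me startup disk": {"FAT", "FAT32"},
    "Windows 2000/XP Recovery Console": {"FAT", "FAT32", "NTFS"},
}


def interpret_unreadable_drive(file_system: str, boot_environment: str) -> str:
    """Suggest what an unreadable drive most likely means."""
    if file_system in READABLE[boot_environment]:
        # The environment should be able to read it, so suspect the format.
        return "formatting problem likely"
    return "boot environment cannot read this file system; try another one"
```

An NTFS drive that fails under a Windows 9x/Me startup disk falls into the second case, so it is worth trying the Recovery Console before reformatting anything.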
Fixing physical and logical drive errors
Let’s assume at this point that your OS finds the drive and can read some files on it, but not all of them. Maybe you’re receiving read or write errors, or certain programs aren’t working right. The problem is likely a physical or logical disk error.
A physical disk error is a bad spot on the drive. It can result from physical trauma to the computer, like knocking it off of a table while it’s running.
For many years, hard disks have been self-parking; when you shut down the PC, the read/write head on the drive moves to the parking area of the disk, where no data is ever stored. Then, if the computer gets bumped or jostled while it’s off and the read/write head bounces against the platter, no data will be lost. However, while a computer is running, damage can occur from physical trauma.
A logical disk error is a discrepancy between the two copies of the file allocation table (FAT) on the disk, or a discrepancy between the FAT’s version of what clusters are stored on the drive and the reality of actual storage. Such errors are typically caused by improperly shutting down the PC or abnormal program termination.
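Both kinds of logical error can be shown with a toy model of the file allocation table. In this Python sketch a FAT is a plain list in which each entry holds the number of the next cluster in a file’s chain; the markers `FREE` and `END` and the function names are simplifications invented for the example, not the on-disk FAT16 or FAT32 encoding.

```python
from typing import List

FREE = 0   # toy marker: cluster is unallocated
END = -1   # toy marker: last cluster of a file


def fat_mismatches(fat1: List[int], fat2: List[int]) -> List[int]:
    """Return the cluster numbers where the two FAT copies disagree."""
    if len(fat1) != len(fat2):
        raise ValueError("the two FAT copies must be the same length")
    return [i for i, (a, b) in enumerate(zip(fat1, fat2)) if a != b]


def lost_clusters(fat: List[int], file_starts: List[int]) -> List[int]:
    """Return clusters marked as in use that no file's chain reaches.

    Clusters 0 and 1 are treated as reserved, as on a real FAT volume.
    """
    reachable = set()
    for start in file_starts:
        cluster = start
        # Follow the chain until it ends, breaks, or loops back on itself.
        while 2 <= cluster < len(fat) and cluster not in reachable:
            reachable.add(cluster)
            cluster = fat[cluster]
    return [i for i, value in enumerate(fat)
            if i >= 2 and value != FREE and i not in reachable]
```

With `fat = [0, 0, 3, END, 5, END, 0]` and a single file starting at cluster 2, clusters 4 and 5 are allocated but belong to no file, which is the kind of discrepancy an improper shutdown tends to leave behind.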
A message about a data error while reading or writing the drive is probably a physical error. Logical errors are manifested in many different ways, not always directly attributable to the disk itself. For example, certain programs might fail to run or might lock up after starting. Such a problem could mean a memory parity error or even a bad cooling fan; you never know until you check the system and eliminate the possibilities.
It’s best to try the simplest solution first, so run a disk-checking program. Windows 9x/Me comes with ScanDisk, which will check for both physical and logical errors. Windows 2000 and XP come with a similar utility called Check Disk, which you access from the Tools tab of the drive’s Properties sheet. In early versions of DOS, a command-line utility called CHKDSK does the same thing. Use it with the /F switch to fix any errors it finds.
A physical and logical check takes much longer than a logical check alone, so I typically do not perform a physical check unless I have reason to suspect a physical disk error. Today’s hard disks are more physically robust than in earlier years, so physical errors are rare.
Checking and reactivating disks in the Windows 2000/XP OSs
Windows 2000 and Windows XP both have a Disk Management feature that checks the status of each drive on your system. This utility allows you to convert to dynamic disks, change space allocation, and much more. See my articles “Managing disks in Windows 2000, part 1” and “Managing disks in Windows 2000, part 2” for more on this feature.
With Disk Management, the most important thing to check is the status of each drive. For example, in Figure D, you can see there are two hard disks: one FAT32 and one NTFS. Both are reported to be Healthy. If a drive is reported as offline or shows any status other than Healthy, right-click it and choose Reactivate Disk.
Windows XP reports that both of these hard disks are healthy.
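If you have noted the status text for several disks, the rule above is easy to apply in bulk. This Python sketch assumes you have already copied the statuses into a dictionary by hand; it does not query Windows, and the function name is hypothetical. It also assumes that a status beginning with “Healthy”, such as “Healthy (System)”, counts as healthy.

```python
from typing import Dict, List


def disks_needing_attention(statuses: Dict[str, str]) -> List[str]:
    """Return the disks whose reported status is anything but Healthy."""
    return [disk for disk, status in statuses.items()
            if not status.strip().lower().startswith("healthy")]
```

Any disk returned by this function is a candidate for the Reactivate Disk command described above.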
Advanced data recovery options
There are several good data recovery programs on the market today that can help you retrieve files from a hard disk that has suffered some type of disaster. The cheapest solution I’ve used is the Lost & Found program by PowerQuest. I recently picked up a copy of this for only $15 on eBay and was able to use it to recover the entire contents of a 20-GB drive that had been wiped out by a virus. However, Lost & Found is no longer manufactured or distributed by PowerQuest, so you won’t find any reference to it on their Web site except in the support archives. It’s also not a very friendly program to use, and it has trouble writing the recovered files to anything except a FAT16 drive.
At the other end of the spectrum is EasyRecovery by Ontrack. This program offers a very user-friendly interface and flexibility in its support of various drive partitions. The Personal Edition Lite allows you to recover up to 25 files and costs about $30. For recovery of unlimited files from a single PC, the Personal edition will run you $180. For around $500, there’s a version that lets you recover unlimited files from unlimited PCs.
There are many other brands of recovery software on the market; a search for data recovery software in any search engine will turn up several. You can also seek out a company that does data recovery, rather than buying the software yourself.
Because so much is stored on hard disks these days, knowing how to revive a failed hard drive is an important skill for support techs. Having an effective guide to the recovery process might mean the difference between a total loss and full recovery. With my seven-step process, you’ll be ready to tackle nearly any type of hard disk error that presents itself.