Anyone who has worked with RAID (redundant array of independent disks) has heard the term “parity.” While most IT pros understand the general concept behind the word, many would be hard-pressed to define exactly what parity is or how to fix problems associated with it when they occur. Parity is a form of error correction commonly used in certain levels of RAID and works to reconstruct data on a drive that has failed in an array. In this article, I will be focusing on parity problems commonly associated with RAID levels 3, 4, 5, and 6. The remaining RAID levels either do not use parity or are not as commercially viable as levels 3 through 6.
First, a lesson on the different RAID levels
Different levels of RAID make use of physical disks in diverse ways. Each RAID level that supports error correction (parity) uses the capability in different ways as well. Table A explains these differences, as well as what can happen when a drive or drives in a RAID array fail.
In RAID 3 (Figure A), data is broken up into small pieces of identical size (RAID 3 stripes at the byte level), which are then spread across the data disks in the array. The size of each piece depends on the number of data disks in the array. With RAID 3, there is also a separate disk devoted to parity.
Under RAID 4 (Figure B), an entire block of data is written to a disk before writing the next block to the next disk. This results in a file being written across multiple disks but not necessarily evenly. Like RAID 3, RAID 4 uses a separate parity disk.
Like RAID 4, RAID 5 (Figure C) writes an entire block of data to one disk before moving on to the next, so one disk may end up storing a larger share of a given file than another. Unlike RAID 4, however, RAID 5 does not use a dedicated parity disk; the parity is striped across the disks along with the data. To achieve its level of resiliency, RAID 5 requires the overhead equivalent of one of the disks in the array for parity. The more disks that are added to the array, the lower the percentage of overhead. For example, with three disks, one-third of the space is dedicated to parity. With six disks, only one-sixth is used.
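To make the difference between dedicated and striped parity concrete, here is a minimal Python sketch that shows which disk holds the parity block for each stripe. The function names, the choice of the last disk as the RAID 4 parity disk, and the rotation order used for RAID 5 are all assumptions made for illustration; real controllers use a variety of layouts.

```python
def parity_disk(stripe: int, num_disks: int, level: int) -> int:
    """Return the index of the disk that holds parity for the given stripe."""
    if level == 4:
        # RAID 4: a single dedicated parity disk (assumed here to be the last one).
        return num_disks - 1
    if level == 5:
        # RAID 5: parity moves to a different disk on each stripe.
        # This rotation order is an assumption; controllers differ.
        return (num_disks - 1 - stripe) % num_disks
    raise ValueError("only RAID 4 and RAID 5 are modeled in this sketch")


def layout(num_stripes: int, num_disks: int, level: int) -> list:
    """Return one row per stripe, marking each disk as data (D) or parity (P)."""
    rows = []
    for stripe in range(num_stripes):
        p = parity_disk(stripe, num_disks, level)
        rows.append(["P" if disk == p else "D" for disk in range(num_disks)])
    return rows


if __name__ == "__main__":
    for level in (4, 5):
        print(f"RAID {level}")
        for row in layout(4, 4, level):
            print(" ".join(row))
```

With four disks, the RAID 4 rows all place P in the last column, while the RAID 5 rows move P one column to the left on each stripe, so no single disk carries all of the parity.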
RAID 6 (Figure D) works almost identically to RAID 5. In RAID 6, the parity is also striped across all of the disks in the array, but two independent sets of parity are stored for each stripe, which allows the array to survive the failure of up to two disks. Unfortunately, it also requires twice as much overhead as RAID 5.
Why use parity at all?
It looks like parity can make things more complicated with a RAID array, so why not just stick with something like RAID 0 or RAID 1, and leave parity out of the equation? For starters, RAID 0 gives no fault tolerance, so it is not suitable for high-availability environments. RAID 1 does not use parity, but it is very inefficient in its use of disk space: because the data is simply mirrored, a full 50 percent of the raw storage goes to the duplicate copy. Using parity and RAID 3, 4, or 5, a disk array can be created that is highly available and that can tolerate the loss of one of the disks. This is because the data can be rebuilt using the parity information stored in the array, and these RAID levels make much more efficient use of the available disk space.
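The disk-space comparison can be worked out with a few lines of Python. This is only a rough sketch: the function name `usable_fraction` is made up for this example, and it assumes equal-size disks, a two-way mirror for RAID 1, and the usual minimum disk counts for each level.

```python
from fractions import Fraction


def usable_fraction(level: int, num_disks: int) -> Fraction:
    """Fraction of raw capacity left for data, assuming equal-size disks."""
    if level == 0:
        if num_disks < 2:
            raise ValueError("RAID 0 needs at least two disks")
        return Fraction(1)  # no redundancy, so nothing is given up
    if level == 1:
        if num_disks != 2:
            raise ValueError("this sketch models a two-disk mirror only")
        return Fraction(1, 2)  # every block is stored twice
    if level in (3, 4, 5):
        if num_disks < 3:
            raise ValueError("RAID 3, 4, and 5 need at least three disks")
        return Fraction(num_disks - 1, num_disks)  # one disk's worth of parity
    if level == 6:
        if num_disks < 4:
            raise ValueError("RAID 6 needs at least four disks")
        return Fraction(num_disks - 2, num_disks)  # two disks' worth of parity
    raise ValueError(f"RAID level {level} is not modeled in this sketch")


if __name__ == "__main__":
    for level, disks in [(0, 2), (1, 2), (5, 3), (5, 6), (6, 6)]:
        print(f"RAID {level} with {disks} disks: {usable_fraction(level, disks)} usable")
```

For a six-disk array, the sketch gives five-sixths of the raw space to data under RAID 5 and two-thirds under RAID 6, compared with one-half for a mirror.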
What happens when parity goes bad?
With a single drive failure under any of RAID levels 1, 3, 4, 5, or 6, the failed drive can be replaced. The RAID array controller will automatically regenerate the data on the new drive, using the parity information from the other drives (or, in the case of RAID 1, the surviving mirror copy), and restore fault tolerance to the entire array.
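To see why a replacement drive can be regenerated from the others, it helps to look at the arithmetic. Single-parity RAID levels commonly compute parity as a bytewise XOR of the data blocks in a stripe, so XORing the surviving blocks with the parity yields the missing block. The sketch below illustrates that idea on in-memory byte strings; the function names are hypothetical, and a real controller works on disk sectors in hardware or firmware.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) groups the bytes at the same offset from every block.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recreate the one missing block from the survivors (data plus parity)."""
    return xor_blocks(surviving_blocks)


if __name__ == "__main__":
    data = [b"RAID", b"five", b"demo"]   # one stripe spread over three data disks
    parity = xor_blocks(data)            # what the array stores as parity

    # Pretend the second disk failed: rebuild it from the other two plus parity.
    recovered = rebuild_missing([data[0], data[2], parity])
    print(recovered == data[1])
```

The same function serves both purposes because XOR is its own inverse. That is also why single parity covers only one missing block per stripe: with two blocks gone, there is not enough information left to separate them.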
Although RAID provides an extra level of protection in the event of drive failure, parity errors can crop up. When you encounter a parity error, it basically means there is bad data on the drive. If the data cannot be corrected, it may be time to load the data off to a backup tape. How will you know if the data cannot be corrected? When you open a file or run an application that attempts to read that particular portion of the disk, the file will not open, or the application will either crash or not run at all. In many instances, you will be notified via an error message that there was a problem reading from the disk. Often, the problem will become evident during the system backup, when all of the data on the disk is read in one sweep. In a RAID array, when a parity error is detected, the source data is reread to try to get it right.
With or without RAID, parity errors can be generated due to a number of factors other than a failed disk. For example, if the drive cables are not properly connected or shielded, or the wrong type of cable is being used to connect the disks to the controller, parity errors may occur. If you’re noticing a significant number of parity errors, try swapping the cables and testing the controller card to make sure it has not gone bad. Check the SCSI terminators as well to see if one may have come loose. Most RAID controllers come with diagnostic programs that can do some of the troubleshooting, so be sure to make good use of these packages as well.
You should also investigate the physical connections to your SCSI devices to determine if they’re the source of the parity problems. First, make sure that you are using the right SCSI cable. Ram Electronics has pictures of many common SCSI connectors as well as the SCSI Trade Association (STA)-endorsed terms and specifications for each type of connector. Most internal SCSI cables are of the ribbon variety, with any number of individual wires running through the ribbon. If even one of those wires is exposed, shorting out, cut, or not fully attached to the connector on the end, it may create data transfer problems. Finally, make sure that the SCSI cable is properly connected to both the controller card and the drive, and that the pins on the devices line up with the pins on the SCSI connector.
Testing a controller card is a little more difficult. The easiest way is to use the diagnostic program that comes with many SCSI and RAID adapters. During system installation for certain servers, such as those from Dell and Compaq, utilities are written to a small partition on a disk array. Among these utilities are programs that can test the array controller, and you can run these programs at system boot time by pressing a key combination on the keyboard. This key combination interrupts the boot process and instead runs the system utilities. Newer systems also include Windows-based array utilities that can perform many of the same functions. Dell, for example, includes its Array Manager product for servers shipping with an array controller that you can install with the rest of the system management suite.
A second controller testing method involves moving the controller to another machine and testing it with different hardware. This is definitely not preferable, as it could result in more downtime and assumes that you have spare hardware lying about that you can use to test this theory.
How does the parity become corrupted?
There are a number of possible causes for the corruption of parity on a disk:
- System crashes: When a system crashes, any data that was not written to disk is lost. In the event that data was being written to a RAID array, it is possible that either the data or the parity was written to disk, but not both. In a situation such as this, you can’t rely on the parity to reconstruct the data on the disk. Reducing the number of system crashes by making use of UPS units, redundant power supplies, and so on will help to protect against this type of parity corruption.
- Uncorrectable bit errors: A hard disk in an array is nothing more than a bunch of magnetic bits that gradually lose the ability to hold data over time. Eventually, bit errors are detected when an attempt is made to read data back from the drive. Many RAID arrays now make use of embedded software that monitors the individual disks and informs an operator when it predicts that a disk is about to fail. When I am informed that there is an impending disk failure, I generally run a diagnostic on the RAID array to make sure that the controller is working properly and to verify that the warning was indeed correct. If the diagnostic reports a problem, I either replace the RAID card, which is rarely necessary, or replace any drives that the diagnostic identifies as bad.
- A disk failure: Like a system crash, a disk failure can have a negative impact on parity. Disks can fail for a variety of reasons: age, overuse, excessive powering up and down, or power surges. When a disk in an array fails, replace it immediately and run a diagnostic on the array. A single disk failure is an indication that there may be more to come.
- Other possible causes: If the array checks out okay and the cables have been tested, the power supply in the system may be delivering the wrong voltage to a disk in the array, causing parity problems. This can be tested with a voltmeter (be careful, as electric shock is always a possibility when working with a voltmeter). First, disconnect the system from the power source and insert the probes of the voltmeter into the wall socket. Next, verify the output against the local standard (110 to 120 volts in North America). Once you plug the system back into the wall, you can disconnect the drive array from the power supply and use the voltmeter to test the individual power leads in the same way. Exact power specifications for the leads can be found in the system guide or on the manufacturer’s Web site.
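The system-crash case in the list above, where either the data or the parity reaches the disk but not both, can be illustrated with a short Python sketch of a consistency check. It assumes XOR parity and represents each stripe as a hypothetical pair of data blocks and stored parity; the names are made up for this example and do not correspond to any controller's utilities.

```python
from functools import reduce


def xor_parity(blocks):
    """Compute bytewise XOR parity over equal-length data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def find_inconsistent_stripes(stripes):
    """Return the indexes of stripes whose stored parity does not match the data.

    Each stripe is a (data_blocks, stored_parity) pair.
    """
    return [
        index
        for index, (data_blocks, stored_parity) in enumerate(stripes)
        if xor_parity(data_blocks) != stored_parity
    ]


if __name__ == "__main__":
    healthy = ([b"\x01\x02", b"\x03\x04"], b"\x02\x06")

    # Simulated crash: the first data block was rewritten, but the system
    # went down before the matching parity update reached the disk.
    interrupted = ([b"\xff\x02", b"\x03\x04"], b"\x02\x06")

    print(find_inconsistent_stripes([healthy, interrupted, healthy]))
```

Note that a check like this can only report that the data and parity disagree. It cannot tell which of the two is stale, which is why parity from an interrupted write cannot be trusted for reconstruction.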
Luckily, most of today’s RAID and SCSI controllers are very good about making sure parity errors are not introduced onto the disk. However, if this does happen, follow the suggestions above to minimize the risk of data corruption and failure. If you are not using a parity-enabled RAID scheme on a mission-critical system, do a cost/benefit analysis and get RAID installed, as it will be worth much more than the cost of a disk failure. An excellent discussion of RAID advantages and disadvantages can be found at Advanced Computer & Network Corporation’s RAID.edu Web site.