RAID for mass storage reliability

Need to know the difference between hardware and software RAID? Need to know the fundamentals of implementing RAID? If so, you've come to the right place. Robert McIntire lays the foundations underlying RAID.

Since the dawn of computing, long-term mass storage has been a primary factor in the design of systems. At issue are speed, storage density, and, of course, fault recovery. In the beginning of the PC era, hard drives were small in capacity, slow to access, and expensive. Over the years these issues have been addressed and improved again and again.

Hard drive manufacturers continue to improve storage density and access times, and controller manufacturers continue to improve throughput as well as ease of installation and configuration (not to mention cost per megabyte, which has improved by at least an order of magnitude). Today, storage subsystems are extremely fast and very dense compared to their predecessors of the 1980s and 1990s. This is all well and good, but the factor we have not yet mentioned is reliability. The fastest, densest storage system is worth little if its reliability is low. The unfortunate fact is that mass storage, due to its mechanical nature, is still one of the most fallible components in the computing environment.

This raises the question, “How do we implement some level of reliability for mass storage in the computing environment?” It has been addressed in many ways over the years, but one standard, redundant array of independent disks (or RAID), has evolved and risen to the top. The RAID standard encompasses varying levels of redundancy by using arrays of multiple drives. My aim is to give you an in-depth look at a couple of different options for RAID, whether implemented in software at the OS level or in hardware.

Hardware vs. software
For the purpose of this article, I’ll use the term “software RAID” to refer to those RAID options that are implemented within the network operating system (NOS) software itself. The term “hardware RAID” will be used to refer to the use of specialized hardware and software, which can include specialized controllers, custom drive assemblies, and special housings. Modern NOS vendors, such as Novell and Microsoft, began including disk redundancy features with their NOS software many moons ago.

Back when RAID-capable controllers were relatively expensive, configuring the NOS to perform this function was an attractive option. The NOS vendors started out by offering mirroring and duplexing, which are now considered RAID level 1. NOS support for RAID continued to grow, and eventually some vendors offered RAID level 5 support integrated with the OS, which stripes data across the disks in blocks and distributes parity (error correction) information for each stripe among them. The common denominator with NOS software implementations is that they generally provide only limited RAID functionality. On the other hand, hardware RAID opened a whole new door of reliability and performance for the server environment. In addition to this, hardware RAID servers are for the most part easy to set up and configure. With modern array controllers, RAID is almost completely transparent to the user. After a simple initial setup, the array drivers handle passing information to and from the OS in a seamless fashion. The user only sees a partition, volume, or drive letter as the lower layer RAID drivers handle the abstraction.

Software RAID in the NOS
For the software perspective, I’ll use Windows NT 4 Server as my example. There are several terms that one needs to be aware of before starting: disk mirroring, striping, and striping with parity. Disk mirroring is implemented with disks in pairs and provides redundancy by keeping an identical copy of the data on each disk. Fortunately, you can mirror the system and boot partitions for redundancy. Another option is striping, which requires at least two disks and simply spreads the data evenly across all disks in what is called a stripe set. This method corresponds to RAID level 0. The advantage of using this method is that disk reads and writes are very fast, since the OS is accessing more than one disk simultaneously. The disadvantage is that it provides no redundancy, which was our goal from the outset.
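To make the idea of a stripe set concrete, here is a minimal Python sketch of how consecutive blocks of data might be dealt out across the disks. It is an illustration only, not how Windows NT does it internally: the function names are hypothetical, and it assumes one block per disk per stripe, whereas real implementations stripe in larger chunks.

```python
def locate_block(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk)."""
    if num_disks < 2:
        raise ValueError("a stripe set needs at least two disks")
    if logical_block < 0:
        raise ValueError("block numbers start at 0")
    # Blocks are dealt out round-robin: 0 -> disk 0, 1 -> disk 1, and so on.
    return logical_block % num_disks, logical_block // num_disks


def stripe(blocks: list, num_disks: int) -> list[list]:
    """Distribute a sequence of blocks across the disks of a stripe set."""
    disks = [[] for _ in range(num_disks)]
    for number, block in enumerate(blocks):
        disk, _offset = locate_block(number, num_disks)
        disks[disk].append(block)
    return disks


if __name__ == "__main__":
    # Five blocks over two disks: neighbouring blocks land on different disks,
    # which is why both disks can be kept busy at the same time.
    print(stripe(["A", "B", "C", "D", "E"], 2))
```

Notice that no block is stored twice. Lose either disk and half of every file goes with it, which is exactly why striping alone gives no redundancy.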

The next option is striping with parity. Striping with parity requires at least three disks and provides redundancy by calculating parity information for the data stored on the array and distributing it across disks in the parity stripe set. In this way, if any single disk fails, the system can regenerate the missing data from the data and parity information stored on the surviving disks. However, this comes with a cost: while a disk is down, disk access slows considerably because the processor must regenerate the missing data on the fly. If you add to that the continuous processor overhead for calculating parity information (even when the system is functioning normally), you begin to see how this method can cause a drain on system resources. So, with this option, you have redundancy, but at a price. Not only that, but after a disk failure, several tasks need to be performed to complete the repair process. You will need to shut down Windows NT and replace the disk. After the server is brought back up, you will need to instruct Windows NT to regenerate the stripe set. Once you have completed these tasks, the system will begin background operations to rebuild the data and write it to the new drive.
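The parity calculation itself is simple. RAID 5 parity is the bitwise XOR of the data blocks in a stripe, and XOR-ing the surviving blocks together yields whatever was on the missing disk. The Python sketch below shows this for a single stripe. The function names are hypothetical, and the parity rotation it uses is just one plausible layout, assumed for illustration; the layout Windows NT actually uses is not covered here.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    if any(len(block) != len(blocks[0]) for block in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk_for(stripe_index: int, num_disks: int) -> int:
    """Pick the disk that holds parity for a stripe (assumed rotation scheme)."""
    return (num_disks - 1 - stripe_index) % num_disks


def build_stripe(data_blocks: list[bytes], stripe_index: int) -> list[bytes]:
    """Return one block per disk: the data blocks plus their parity block."""
    num_disks = len(data_blocks) + 1
    if num_disks < 3:
        raise ValueError("striping with parity needs at least three disks")
    row = list(data_blocks)
    row.insert(parity_disk_for(stripe_index, num_disks), xor_blocks(data_blocks))
    return row


def regenerate(row: list[bytes], failed_disk: int) -> bytes:
    """Rebuild the block on the failed disk from every surviving block."""
    survivors = [block for disk, block in enumerate(row) if disk != failed_disk]
    return xor_blocks(survivors)


if __name__ == "__main__":
    row = build_stripe([b"\x0f\xf0", b"\x33\x55"], stripe_index=0)
    for disk in range(len(row)):
        # Whichever single disk is lost, its contents can be recomputed.
        assert regenerate(row, disk) == row[disk]
```

This also shows where the processor overhead comes from: every write means recomputing a parity block, and every read from a failed disk means touching all of the surviving disks.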

Since these features are bundled in with the OS software, no additional hardware is needed other than the additional drives. However, these features will have limitations when they are implemented in software. You can’t expand or contract a stripe set when implementing either method of striping using the built-in features of Windows NT. Neither can you install the OS system or boot partition upon a RAID level 5 stripe set. In other words, if you want to implement RAID level 5, you need at least one other disk for the Windows NT system and/or boot partition. This way, you would install Windows NT on your server, and then add the stripe set after the fact.

Implementing software RAID
Now that you have a better understanding of how the different levels of RAID redundancy are implemented in the Windows NT 4 OS, let’s implement them. I’ll assume a typical server configuration with separate volumes for the NOS and the data, with redundancy for both. I’ll consider the initial server installation, redundancy configuration, and how one would go about modifying and maintaining the RAID configuration. In doing so, I’ll skip the gory details that aren’t relevant to the core topic, mass storage redundancy.

First, I’ll use high-speed SCSI drives with a high-speed SCSI controller. Since the SCSI hardware specifics are not relevant to our discussion of RAID redundancy, I’ll save those for another time. For this, use a single SCSI controller with a dual bus. On one bus you’ll need to configure a mirror set and install both the system and boot partitions on one volume. On the other bus, you’ll install three drives configured for RAID 5 (stripe set with parity) as the data volume.
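Before buying drives, it helps to work out how much usable space such a layout leaves you. The Python sketch below applies the standard arithmetic: a mirror gives you one drive's worth, a stripe set gives you all of it, and a stripe set with parity gives up one drive's worth to parity. The function name and the 9-GB drive size are hypothetical, and the sketch assumes every member contributes only as much space as the smallest drive in the set.

```python
def usable_capacity(level: str, drive_sizes_gb: list[float]) -> float:
    """Estimate usable space in GB for a mirror, stripe set, or parity stripe set."""
    if not drive_sizes_gb:
        raise ValueError("need at least one drive")
    count = len(drive_sizes_gb)
    smallest = min(drive_sizes_gb)  # assumed: members are limited to the smallest drive

    if level == "mirror":
        if count != 2:
            raise ValueError("a mirror set uses exactly two drives")
        return smallest  # the second drive holds a copy, not extra space
    if level == "stripe":
        if count < 2:
            raise ValueError("a stripe set needs at least two drives")
        return smallest * count  # no redundancy, so nothing is given up
    if level == "stripe_with_parity":
        if count < 3:
            raise ValueError("a stripe set with parity needs at least three drives")
        return smallest * (count - 1)  # one drive's worth goes to parity
    raise ValueError(f"unknown level: {level}")


if __name__ == "__main__":
    # Hypothetical 9-GB drives in the layout described above.
    print("boot/system mirror:", usable_capacity("mirror", [9, 9]), "GB")
    print("data volume (RAID 5):", usable_capacity("stripe_with_parity", [9, 9, 9]), "GB")
```

In other words, the five drives in this configuration would yield roughly three drives' worth of usable space, with the other two paying for redundancy.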

Let’s proceed with the following step-by-step approach for the server install.
  1. Install the SCSI controller and hard drive for the boot volume and system volume.
  2. Install the Windows NT OS.
  3. After the initial server installation, install a second SCSI drive for mirror of the boot/system volume.
  4. Use the Windows NT Disk Admin utility to mirror the boot/system volume.
  5. Install three more SCSI drives on the secondary SCSI bus for the data volume.
  6. Use the Disk Admin utility to create a RAID 5 stripe set with parity, and restart the server again.

For anyone who’s ever installed Windows NT, it should be apparent by now that I’ve abbreviated the task list somewhat. You could have taken a different route to complete the install, but I chose an iterative method to demonstrate the steps. This is in an attempt to focus only on that which gives us redundancy. What should become obvious is the number of steps required to set up a server using this particular software RAID configuration, and this is a fairly simple configuration.

Maintaining your RAID
Now that I’ve gone through a basic RAID software setup in Windows NT with you, we should examine maintenance issues. For instance, what happens if you experience a single drive failure on the stripe set? After a lot of processor churning and nail biting (not to mention user complaints), you check the event log and find that there are so many error messages about disk failure that you’re going to have to expand the log size. At this point, you scrape up another drive and schedule after-hours maintenance. And at midnight, you go to work in the dark with a full coffeepot. To repair the stripe set, take the following steps.
  1. Shut down the server and replace the defective drive.
  2. After the server is restarted, use Disk Admin to regenerate the stripe set and restart again.
  3. The OS will now begin rebuilding the array in the background.

Make certain you get a full system backup before starting repairs.

Now that doesn’t seem like a lot of hassle, but I haven’t yet compared it to the alternative method in the RAID hardware configuration. Putting that thought aside for a moment, let’s consider what happens when more than one drive fails. (Got backup? Enough said.) We’ve looked at installation and repair, but what if you need to expand or contract your RAID 5 array? To do that, you’d need to complete the following steps.
  1. Perform a full backup of the entire array using your favorite backup utility.
  2. Use Disk Admin to delete the array and restart the server.
  3. Add or remove drives from the array as desired.
  4. Use Disk Admin to create a new stripe set and restart.
  5. Restore all data to the stripe set.

Obviously this is not a very attractive option, since you’ll need to schedule a large window of downtime for this operation. There is no allowance in the NT implementation to support resizing the array.

Implementing hardware RAID
Now, let’s approach it again, this time from a hardware perspective. The components involved are usually a RAID controller (often called an array controller), the hard drives, and optionally some form of RAID cage or drive housing. (The housing can be internal or external to the server.) In this scenario I’ll assume an initial server installation with Windows NT 4 on a typical Compaq server with a Compaq array controller. Many Compaq servers are equipped from the factory with an internal housing that will accommodate up to five hot-pluggable drives.

In Windows NT with an initial install of a hardware RAID level 5 system, one would need to provide the driver for the array controller during setup. So, you would proceed as follows:
  1. Install the RAID controller and three drives.
  2. Begin Windows NT install, providing the array driver disk when prompted.
  3. During installation, create a partition to contain the boot/system volume.
  4. At the end of the installation process, configure the remaining array space for the data volume.

Again, I’ve abbreviated things a bit, while outlining the fundamental concepts. At this point, we’ve got a single hardware RAID 5 array consisting of three drives and two partitions. One partition is for the OS and the other for the data volume. Compared to the software RAID installation, this seems easier and more straightforward.

In terms of redundancy, a hardware implementation of RAID provides for additional and very valuable features. Namely, many drive configurations are hot-pluggable, which means you can run your server nonstop. You simply pull out a defective drive while the server is up and replace it on the fly without the need to shut down the server. The system will then begin a background operation of rebuilding data from parity information and writing it to the new drive. Taking redundancy a step further, you can also install what is commonly referred to as a hot spare in your RAID 5 array. Basically, this additional drive acts as a live backup in case a drive in your array does fail. If one did fail, the system would immediately begin the background rebuilding operation to this spare drive. I won’t elaborate much on the steps required to repair a hardware-based array in the case of drive failure. Since the drives are hot-pluggable, simply pull the defective drive and insert the replacement drive while the server is up and running. This is certainly better than the midnight maintenance operation.
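The way a hot spare changes the failure picture can be sketched as a small state machine. The Python below is a toy model, not any vendor's controller logic: the class, method, and state names are hypothetical, and it assumes a single RAID 5 array that survives one missing member at a time.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Raid5Array:
    members: List[str]
    spares: List[str] = field(default_factory=list)
    state: str = "optimal"  # optimal, degraded, rebuilding, or failed
    rebuild_target: Optional[str] = None

    def _start_rebuild_or_degrade(self) -> None:
        if self.spares:
            # A hot spare takes over at once; no operator is needed.
            self.rebuild_target = self.spares.pop(0)
            self.members.append(self.rebuild_target)
            self.state = "rebuilding"
        else:
            self.state = "degraded"

    def fail_drive(self, drive: str) -> str:
        if self.state == "failed":
            return self.state
        if drive in self.spares:
            self.spares.remove(drive)  # losing an idle spare costs no data
            return self.state
        if drive not in self.members:
            raise ValueError(f"unknown drive: {drive}")
        self.members.remove(drive)
        lost_rebuild_target = drive == self.rebuild_target
        self.rebuild_target = None
        if self.state != "optimal" and not lost_rebuild_target:
            # A second data-bearing drive died before redundancy was restored.
            self.state = "failed"
        else:
            self._start_rebuild_or_degrade()
        return self.state

    def insert_replacement(self, drive: str) -> str:
        if self.state == "degraded":
            self.members.append(drive)
            self.rebuild_target = drive
            self.state = "rebuilding"
        elif self.state != "failed":
            self.spares.append(drive)  # nothing to rebuild, so it becomes a spare
        return self.state

    def finish_rebuild(self) -> str:
        if self.state == "rebuilding":
            self.rebuild_target = None
            self.state = "optimal"
        return self.state


if __name__ == "__main__":
    array = Raid5Array(["d0", "d1", "d2"], spares=["spare"])
    print(array.fail_drive("d1"))   # rebuilding
    print(array.finish_rebuild())   # optimal
    print(array.fail_drive("d0"))   # degraded
```

The model makes one limit plain: the spare only protects against a second failure if the rebuild has finished first. Two drives lost in quick succession still means reaching for the backup tapes.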

Expanding hardware arrays
How do you expand the hardware array?
  1. Install an additional drive in the RAID cage. Then, use the Compaq Array Configuration utility to add that new drive space to the existing array.
  2. Open Disk Admin in Windows NT, add the additional drive space to the data volume as if you were creating a Windows NT volume set, and restart.

This can be a little confusing since the NT software implementation of a volume set is not redundant. At this point, you have to deal with how Windows NT handles this particular RAID expansion in conjunction with the Compaq-provided array driver, which is not exactly what one would expect at this point. Rest assured that you still have a RAID 5 hardware array in operation. The great thing is that it’s actually possible to expand the array with a hardware implementation. Keep in mind that different manufacturers may handle this any number of ways. You may want to check with your hardware RAID vendor to find out how they accommodate this process. After the server restarts, it may pause longer while loading. The key here is to not panic when the startup blue screen hangs longer than usual. This is simply Windows NT checking the newly expanded array/volume set. I was involved in one such expansion when it took approximately a half hour to advance to a login prompt. Naturally, you’ll want to complete a full backup before performing any such operation.

Array controller software
The array controller software provided by RAID hardware vendors has features that can vary, but at a basic level these features are used to define and configure the arrays attached to the RAID controller. Generally, the software consists of three basic components: the firmware BIOS running on the hardware RAID controller; the array configuration utility (ACU), which is usually a GUI program that you run under Windows NT to define and configure arrays; and the array driver itself, which provides the interface to the OS.

In the case of Compaq, there are certain versions of the array configuration utility that are compatible with certain minimum revisions of firmware BIOS on the actual array controller board itself. As vendors add features and abilities to their hardware, oftentimes it’s necessary to update the BIOS on your controller and install the latest array configuration utility and array drivers to utilize the latest features. Sometimes these updates contain patches and fixes for known problems. You’ll want to regularly check with your RAID controller vendor to keep your systems updated.

Cost vs. performance
In the final analysis, it can be difficult to determine a general rule about which way would be most effective in any given case. I have tried as much as possible to compare apples to apples in my examples, but there are a number of variations that could have been used. As much as I may try to perform a fair and unbiased comparison between hardware and software RAID, they are two very different animals in terms of performance and reliability. As I’m sure most experts would agree, from a cost-for-performance view, hardware RAID 5 is the hands-down winner. But, given the fact that many small IT shops run a tight budget, I wouldn’t expect the NOS vendors to stop offering their software versions any time soon. When it comes down to dollars, it’s good to know that the NOS still supports some level of built-in redundancy. But, for all-out reliability and performance, look for the RAID hardware offerings. As time goes by, RAID hardware manufacturers will continue to add features that will further enhance performance, reliability, and ease of use. As they do so, the case for choosing hardware over software will become a given.

Hints and tips
Update BIOS: Hardware RAID controllers, like SCSI controllers, have their own firmware BIOS, which you may want to consider updating before the OS installation.
Faster processor: If you intend to use the software RAID 5 features included with your NOS, consider a faster processor or a multiprocessor server to handle the additional load.
Hot spare: With hardware RAID implementations, a hot spare in your array will not only shorten the time the system runs degraded after a single drive failure, it should also let the array weather a second drive failure, provided the rebuild onto the spare has finished before the second drive fails.

