In the first part of this series, I discussed how to maximize the performance of Windows Vista running on an Intel® Matrix RAID array. In this document, I discuss what happens to an Intel Matrix RAID array and system performance when a number of unplanned and unwanted events occur. I will also show the results of some failure tests and ways to minimize the performance degradation that occurs following one of these unexpected failures.
When things go wrong... and they will
Before you can decide if RAID belongs on your desktop PC, you need to understand what happens after an improper Windows Vista shutdown or some other unexpected failure. No one usually worries about these events on a non-RAID PC, but when you are running a RAID array with one or more volumes that have fault tolerance, you need to be aware of the consequences.
There are three basic types of unplanned events that can greatly impact the performance of a PC running RAID.
Improper or dirty shutdown
An unmanaged shutdown can occur due to a power interruption or failure, a BSOD, or the pressing of the system reset or power button. An unstable system can also lead to an improper Windows shutdown. Some of the causes of an unstable system are:
- System memory errors due to bad or incompatible memory
- Insufficient wattage from the power supply
- Unstable BIOS
- Bad system components, e.g. motherboard, CPU, PSU, modems, video cards
- Bad or improper drivers
After an improper shutdown occurs and the computer is restarted the RAID controller will recognize that an improper shutdown has occurred. A special recovery process may be required to verify that no data corruption has occurred and data redundancy has been properly maintained.
The recovery processes that can occur following an improper shutdown and the time in minutes for the recovery process to complete. All RAID volumes are 97.2GB.
The status of a recovery process can be displayed from the Intel Matrix Storage Manager (IMSM) Console (Figure A). To do this, open the IMSM and select View from the menu and switch to Advanced mode. Right-click on the volume being Verified, Verified and Repaired or Initialized and click Show Verification Progress, Show Verification and Repair Progress, or Show Initialization Progress.
The IMSM status shows more than two hours left to verify and repair a 997.2GB RAID 5 volume following an improper shutdown. Note that one error was found and repaired in this example.
Replacing a failed member drive or replacing a removed member drive
A drive can fail at any time and often without warning. If Windows Vista is installed on a RAID 0 volume and one drive fails, then your performance falls to exactly zero. The OS and any data contained on the RAID 0 volume are gone, and for most real-world situations, they cannot be recovered.
If Windows Vista is installed on a volume with fault tolerance, then your performance will suffer slightly when a member drive fails, but you will still be able to use the OS. Until a replacement drive can be found and installed, the RAID array is more vulnerable to catastrophic failure.
Replacing a failed member drive or replacing a missing member drive requires a rebuild to restore data redundancy for the RAID levels with fault tolerance.
Rebuild times are expressed in minutes for a 97.2GB volume.
While a rebuild is in progress, system performance will be poor. Times will vary from system to system, but as a rule of thumb you can multiply the numbers in Table B by 5 for a 500GB volume, 10 for a 1TB volume, etc. For example, a 500GB RAID 1 volume rebuild could take more than 1 hour and 40 minutes to complete. In comparison, a 500GB RAID 5 volume rebuild could take more than 12 hours to complete!
Sometimes you get a warning that a drive is failing before total failure (Figure B). My refurbished Maxtor drive was kind enough to start failing after the abusive failure testing so I could report the results here. Oh the sacrifices that my hardware makes for you, patient reader! The system hung while downloading a file to the RAID 0 volume. Figures C, D, and E show what the IMSM reported.
The IMSM Console shows the RAID 1 Docs and Media volume status as Degraded. The RAID 0 OS and Apps volume status was Normal.
The IMSM Console shows a red 'X' for the RAID array member drive on Port 3 and shows a status of Error Occurred.
Upon Vista startup this notification balloon popped up with a warning to back up data immediately.
Hovering over the RAID array icon in the notification area displays a similar message.
Once the error occurred on the RAID 0 volume and the system was restarted, the offending drive was marked as Error Occurred. I ran the SeaTools diagnostic tests, and the drive failed the long test. Interestingly enough, a S.M.A.R.T. event was not triggered.
In this situation, the system booted into Windows Vista normally, and all the data on the RAID 1 volume was still available.
When a drive is failing, performance can slowly grind to a halt or you can experience a system lockup like I did. And then there is the impact on personal performance as you scramble to back up your work -- at least for those who have yet to learn the importance of routine backups.
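The rebuild-time rule of thumb from earlier in this section is simple linear scaling from the 97.2GB test volume. Here is a minimal Python sketch of that arithmetic. The function name and the 20-minute sample input are placeholders for illustration, not values taken from the rebuild table, and real rebuild times will vary from system to system.

```python
TEST_VOLUME_GB = 97.2  # size of the volumes used in the failure tests


def estimate_rebuild_minutes(measured_minutes, target_gb, test_gb=TEST_VOLUME_GB):
    """Scale a rebuild time measured on the test volume to a larger volume.

    Assumes rebuild time grows linearly with volume size, which is only
    a rule of thumb.
    """
    if measured_minutes < 0 or target_gb <= 0 or test_gb <= 0:
        raise ValueError("times must be non-negative and sizes positive")
    return measured_minutes * (target_gb / test_gb)


if __name__ == "__main__":
    sample_minutes = 20  # hypothetical measurement, not a figure from the article
    for size_gb in (500, 1000):
        minutes = estimate_rebuild_minutes(sample_minutes, size_gb)
        print(f"{size_gb}GB volume: about {minutes / 60:.1f} hours")
```

Plug in the time you measure on your own system; the point is only that doubling the volume size roughly doubles the wait.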
RAID controller failure
Intel Matrix RAID uses firmware located on the Southbridge (I/O Controller Hub) chip and RAID drivers installed in Windows to implement RAID. If this chip or your motherboard fails, you will have to replace the motherboard/computer or find another computer to temporarily install the hard drives to recover your data.
The bad news is that your computer is down until a replacement can be found and installed.
The good news is that this type of failure is relatively rare. More good news is that the chipsets are usually backward compatible with RAID arrays created in previous firmware/driver versions. For example, I successfully moved my two Maxtor SATA drives configured in an Intel Matrix RAID array from a motherboard with an ICH7R chipset to a motherboard with an ICH10R chipset. You should have no problem finding a compatible replacement motherboard.
While a volume recovery process or a volume rebuild is occurring, system responsiveness can be greatly reduced.
This performance degradation can vary widely from system to system. I have two computers. Most of the components in one are more than three years old. The components in the other are new and some of the fastest I could find and afford at the time.
Performance degradation is barely noticeable in the new PC. Performance degradation in the old PC is crippling. It can take more than 20 minutes for Vista to load. When the Vista desktop does finally appear, Aero has been disabled and Vista is essentially unusable until the RAID volume is returned to Normal status.
RAID implemented on a server in a data center leads a very sheltered life. There is a steady supply of electrons, and data center managers work hard to keep the wrong kind of hands away from the servers.
The desktop world is a much different story. In order to determine if Intel Matrix RAID is up to the task on a desktop PC, I set off to perform a number of abusive failure tests.
The purpose of the tests is fairly simple -- simulate a power outage by turning the system off from the front panel power button. For each RAID level I performed five system shutdowns for three different system states:
- When the system was idle
- When playing an MP3 file from a RAID 1 or RAID 5 volume in Windows Media Player
- When ripping an MP3 file to a RAID 1 or RAID 5 volume in Windows Media Player
Volume write-back cache was turned off for all tests. I then restarted the system and recorded the status of the RAID volumes.
I stopped the tests when I realized that there was a problem with the testing procedures. Before I can explain what those problems were I need to explain how data redundancy works on a system running RAID and why it can be a problem.
The problem with data redundancy
RAID 1 and RAID 5 share the same challenge for the RAID controller -- how to ensure that data redundancy is maintained on all member drives without error, byte by byte.
One of the RAID controller's primary tasks is to direct data to two or more member drives as data is written to the RAID volume. For a RAID level like RAID 1, the data is duplicated on the two member drives providing data redundancy. When a power failure occurs, there is no way for the RAID controller to determine that fault tolerance was properly maintained.
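To make the problem concrete, here is a toy Python model of a two-drive mirror whose write is interrupted by a power loss. It is a generic illustration of why the whole volume has to be checked afterward, not a description of how the Intel Matrix Storage Manager actually performs its Verify and Repair. The block contents, the function name, and the choice to treat the first drive as authoritative are all assumptions.

```python
def verify_and_repair(primary, mirror):
    """Compare every block on both drives and resync any that differ.

    Assumption: the primary copy is treated as correct. A real controller
    has to pick some policy too, because neither copy is provably right.
    Returns the indexes of the blocks that were repaired.
    """
    repaired = []
    for index, (a, b) in enumerate(zip(primary, mirror)):
        if a != b:
            mirror[index] = a
            repaired.append(index)
    return repaired


# Two member drives holding identical blocks.
drive0 = [b"doc1", b"doc2", b"doc3", b"doc4"]
drive1 = list(drive0)

# A write reaches the first drive, then the power fails before the
# second drive receives the same block.
drive0[2] = b"DOC3"

# After the restart nothing marks block 2 as suspect, so every block is checked.
fixed = verify_and_repair(drive0, drive1)
print("blocks repaired:", fixed)
print("mirror in sync:", drive0 == drive1)
```

The loop visits every block no matter how few of them differ, which is why the recovery time depends on the size of the volume rather than on how much data was being written when the power went out.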
The failure testing procedure
I expected and got a lot of RAID volume recovery events. I began my testing by canceling the recovery process after each restart so I could complete the tests in one day. After six shutdown and restart cycles I let the recovery process complete. The process failed. I had apparently already abused the RAID volume beyond its limits (Figure F).
Canceling a volume recovery process can lead to a Failed volume.
I had to rethink my test procedures. Lesson learned -- it is not a good idea to cancel the Verification and Repair or Initialization process for RAID volumes with data redundancy, i.e. RAID 1 and RAID 5. I would have to let the recovery process complete before performing the next test. I reduced the volume sizes to 97.2GB to speed up the testing process. The results are displayed in the following tables (Tables C through F). A status of Normal denotes that the RAID volume started without errors and required no recovery processing. The errors noted in the table are Verification Error(s) / ECC Error(s).
Table C: RAID 0 / RAID 1 (2 drives)
This is the RAID 0 and RAID 1 volume status following an improper shutdown with Vista running on a two drive 1202.7GB RAID 0 volume. The RAID 1 volume status was Verifying and Repairing when ripping an MP3 file to a RAID 1 volume.
Table D: RAID 1 (2 drives)
This is the RAID 1 volume status following an improper shutdown with Vista running on a 97.2GB RAID 1 volume.
Test number five, process Idle, yielded some interesting results. I received a Windows Failed to Load because the System Registry File Is Missing or Corrupt error at startup. I was able to successfully repair the installation by inserting the Vista RTM DVD and selecting the Repair option during the Vista installation process. The system was reverted to a previous restore point. Vista then booted successfully on the next restart and the Verifying and Repairing process started.
Table E: RAID 0 / RAID 5 (three drives)
This is the RAID 0 and RAID 5 volume status following an improper shutdown with Vista running on a three drive 997.2GB RAID 0 volume. The RAID 5 volume status was Verifying and Repairing when ripping an MP3 file to a RAID 5 volume and once when playing an MP3 file from the RAID 5 volume.
Table F: RAID 5 (three drives)
This is the RAID 5 volume status following an improper shutdown with Vista running on a 97.2GB RAID 5 volume.
- Vista running on RAID 0 does not typically trigger a verification of the RAID 0 volume.
- Vista running on RAID 0 always triggers a verification and repair of a RAID 1 or RAID 5 volume when an application is writing to a RAID 1 or RAID 5 volume.
- Vista running on RAID 1 always triggers a verification and repair of the RAID 1 volume.
- Vista running on RAID 5 always triggers an initialization or verification and repair of the RAID 5 volume.
- Vista running on a RAID 1 volume produced the most errors and ECC errors were consistently reported when playing MP3 files located on a RAID 1 volume.
Canceling volume Verifying, Initializing or Verifying and Repairing
When your system is shut down improperly, the Intel Matrix Storage Manager might report that a Verifying, Initializing (RAID 5) or Verifying and Repairing process is in progress. You can cancel this process from the IMSM Console, but understand the consequences before doing so.
RAID 0
A volume verification can occur on a RAID 0 volume. Canceling this process (Figure G) will not put your data at risk, but you will not be notified if data corruption has occurred on the RAID 0 volume. It will, however, return the system performance to normal.
Right-click on the RAID 0 volume under Advanced Mode View to access the Cancel Volume Verification option.
RAID 1 and RAID 5
A volume Initialization (RAID 5) or Verification and Repairing process can occur on RAID volumes that support fault tolerance. If you have corrupt or unsynchronized data on a volume due to an improper shutdown and you do not allow the volume recovery process to complete, your data is more vulnerable. As I discovered during my testing, another improper shutdown can lead to a Failed volume or possibly even to the loss of the data on the volume.
It is your data -- cancel a volume Verifying and Repairing or Initializing process at your own risk!
Right-size member disks and RAID fault tolerant volumes
As you can see from the failure testing and the recovery process times listed in Table A, an improper shutdown can often lead to a lengthy and disk I/O intensive procedure before the volume can be marked as Normal. When a recovery process is running, system performance can be drastically reduced. Because of this, you don't want to buy three 1TB drives, configure them in a RAID array, and only use 10 percent of the available space in a RAID level with data redundancy.
During my latest RAID implementation, I wanted to avoid one terabyte drives for this very reason. Having two or three terabytes of disk storage in your system can be really cool, but do you really need that much disk space? I didn't. I went with three 750GB drives -- and even that was too much for my needs.
There are some simple formulas that you can use to determine how much disk space will be available for each RAID level type. For reasons that I won't get into here, the Samsung 750GB drives I was using for testing appear in Windows as 698.6GB drives. That is why I am using such a strange number in Table G.
These are the RAID volume capacity calculations for a simple single volume RAID array.
That is the simple example of only one volume in the array. Things become a lot more complicated when creating two volumes as you can with Intel Matrix RAID. You have to assign a value to one of the volumes in advance to perform the calculation. For simplicity, I will select the volume capacity (vc) for the RAID 1 or RAID 5 volume first and use 500GB for both in the examples in Table H.
These are the RAID volume capacity calculations for a more complicated two volume RAID array; 'vc' stands for the volume capacity.
The RAID 0 and RAID 5 calculation is more complex so let me break it down for you:
RAID 0 volume capacity = (# of drives * capacity of smallest drive) - ((# of drives * RAID 5 vc) ÷ (# of drives - 1))
RAID 0 volume capacity = (3 * 698.6GB) - ((3 * 500GB) ÷ (3 - 1))
RAID 0 volume capacity = 2095.8GB - ((1500GB) ÷ (2))
RAID 0 volume capacity = 2095.8GB - 750GB
RAID 0 volume capacity = 1345.8GB
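The same arithmetic can be written as a short Python sketch. The function names are illustrative only. The two-volume function follows the RAID 0 and RAID 5 formula worked through above, and the single-volume function uses the standard capacity rules for each RAID level, which is an assumption about what the capacity table contains.

```python
def single_volume_capacity(level, drives, smallest_drive_gb):
    """Usable capacity when one volume fills the whole array."""
    if level == 0:
        return drives * smallest_drive_gb
    if level == 1:
        if drives != 2:
            raise ValueError("RAID 1 here means a two-drive mirror")
        return smallest_drive_gb
    if level == 5:
        if drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return (drives - 1) * smallest_drive_gb
    raise ValueError(f"unsupported RAID level: {level}")


def raid0_capacity_beside_raid5(drives, smallest_drive_gb, raid5_vc_gb):
    """RAID 0 space left after reserving a RAID 5 volume of raid5_vc_gb."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    raw_total = drives * smallest_drive_gb
    # A RAID 5 volume consumes vc * n / (n - 1) of raw space because of parity.
    raw_used_by_raid5 = (drives * raid5_vc_gb) / (drives - 1)
    remaining = raw_total - raw_used_by_raid5
    if remaining < 0:
        raise ValueError("RAID 5 volume does not fit on these drives")
    return remaining


if __name__ == "__main__":
    # Three 698.6GB drives with a 500GB RAID 5 volume, as in the worked example.
    print(round(raid0_capacity_beside_raid5(3, 698.6, 500), 1))  # 1345.8
```

Trying a few values for the RAID 5 volume capacity is a quick way to see how much RAID 0 space each choice leaves before you commit to a layout.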
Determine in advance what your storage requirements are. You should probably add 50 percent or more to allow for future needs. You don't need to get scientific about this exercise, but you do want to avoid a large amount of unused space on a RAID 1 or RAID 5 volume.
Which RAID level is best for Windows Vista?
Now let's add the results from the failure tests to determine which RAID level is best for running Windows Vista.
Windows Vista on RAID 0
In addition to better performance, Vista when installed on a RAID 0 volume avoids the performance degradation that occurs after an improper shutdown.
If you aren't comfortable doing a complete clean Windows reinstall, then you shouldn't put the OS on a RAID 0 volume.
If you aren't disciplined enough to keep your personal files off the default system folders and the RAID 0 volume in general, then you shouldn't put the OS on a RAID 0 volume.
Windows Vista on RAID 1
Each improper system shutdown requires a Verify and Repair.
If you want to avoid performance degradation during a recovery process, don't put the operating system on a RAID 1 volume.
Windows Vista on RAID 5
Like RAID 1, RAID 5 requires a lot of work to maintain accurate data redundancy following an improper shutdown. Table A shows that the Initializing process takes longer than the Verifying and Repairing process required for RAID 1 since it has to write the parity information. The RAID 5 Verifying and Repairing process, however, takes less time to complete than the Verifying and Repairing process required for RAID 1.
Watch out, though, if one of your member drives in a RAID 5 array fails. For my system the rebuild time for RAID 5 is more than seven times the rebuild time for RAID 1. You definitely don't want to make the mistake I made. Do not forget to reattach the power and data cables to the drives when you work on the motherboard!
RAID 5 is a really bad place for the OS if you want to avoid the poorer performance that occurs during a recovery process. A UPS can help, but they are hard to justify for most desktop users. If you must choose RAID 5, there are some common mistakes that should be avoided when a recovery process is required.
Windows Vista on Intel Matrix RAID
I have been using Intel Matrix RAID configured as RAID 0 and RAID 1 since the fall of 2006. Windows Vista has always been running on the RAID 0 volume, and I have yet to experience any data corruption. Aside from a few unimportant e-mails, I have lost no personal data. In fact, I survived a drive failure without any important data loss. And it looks like I will now survive a second drive failure without the loss of any of my important files.
There is the occasional Verify and Repair -- perhaps 2-4 times per year. I personally like the idea that a power failure doesn't always require my hard drives to churn away for more than an hour while the RAID Volume is being restored to Normal status.
When I did have a drive fail, the RAID array worked exactly as expected. I added a replacement drive to the array, recreated the RAID 0 volume, and reloaded Windows Vista onto the RAID 0 volume. The RAID 1 volume was rebuilt without problems. For more information about my RAID experience, please read "Want Speed and Data Safety? Consider RAID," although I do cringe a bit now that I reread my somewhat naïve words these many years later.
I also strongly suggest that you have a third drive available to you that can be used in case of a RAID array member drive failure. Use the non-RAID drive to temporarily run Vista. It is a bad idea to load Windows onto a degraded RAID 1 volume because a degraded RAID array is more vulnerable to catastrophic data failure.
If you choose RAID 0 for Windows Vista and RAID 1 or RAID 5 for any critical data you can't afford to lose, Intel Matrix RAID is an excellent way to achieve the performance you are craving and the data redundancy that can help you sleep better at night.
Before committing to a RAID solution, you should be aware of the following issues:
RAID volumes greater than 2TB must be initialized as GPT instead of MBR. GPT partitions are only bootable in EFI-based systems. Only Windows XP x64, Server 2003 SP1, or follow-on versions of Windows have GPT support.
If you are doing a lot of beta testing of new versions of Windows you should be aware that new Intel Matrix RAID drivers may not be immediately available. You will have to rely on the RAID drivers that are included with Windows, assuming that they will continue to be included with new Windows releases.
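The 2TB figure in the first issue above comes from how MBR addresses a disk: sector numbers are 32 bits wide, and with the common 512-byte sector that caps a volume at 2^32 x 512 bytes. The Python sketch below checks a volume size against that cap. It assumes 512-byte sectors and that the sizes quoted in this article are the binary gigabytes that Windows reports; the function name is illustrative.

```python
MBR_MAX_SECTORS = 2 ** 32  # MBR stores sector addresses in 32 bits
BYTES_PER_GIB = 2 ** 30    # one gigabyte as Windows counts it


def needs_gpt(volume_gib, sector_bytes=512):
    """Return True if a volume of this size is too large for MBR."""
    mbr_limit_bytes = MBR_MAX_SECTORS * sector_bytes
    return volume_gib * BYTES_PER_GIB > mbr_limit_bytes


if __name__ == "__main__":
    # A 1345.8GB volume, and a RAID 0 volume spanning three 698.6GB drives.
    for size in (1345.8, 3 * 698.6):
        layout = "GPT" if needs_gpt(size) else "MBR"
        print(f"{size:.1f}GB volume: {layout}")
```

Under these assumptions, a single RAID 0 volume spanning three of the 750GB drives lands just over the limit, while carving the array into two smaller volumes keeps each one under it.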
The RAID levels Performance and Recovery Scorecard
And now let's add the results from the failure testing (Table I) to the scorecard to give an overall rating:
These are the rankings from 1 (best) to 5 (worst) for the performance of the four RAID levels with volume write-back cache enabled/disabled, the improper shutdown recovery time, and the improper shutdown recovery occurrences for idle, play MP3, and Rip MP3.
There is an additional cost to pay for fault tolerance when the OS is running on a RAID level with data redundancy, and that cost is reduced system performance following an improper shutdown.
Which RAID level is best for your desktop PC? The somewhat amusing and perhaps best answer for most might be none. RAID is not for the uninitiated or for those who don't understand the risks. If you only browse the Internet, read your e-mail, create Word documents, or perform a variety of other common business-related tasks, then you will be hard-pressed to measure any productivity gains from RAID.
If you are an IT professional, then you aren't just a good candidate for desktop RAID, you are the perfect candidate. You should be able to create, configure, and maintain a RAID array. You likely have higher PC resource requirements and can benefit the most from the increased performance that RAID can offer.
Computer enthusiasts looking for faster performance can also benefit from RAID. However, enthusiasts are also more likely to experiment with overclocking. If you are an enthusiast and decide RAID is for you, you will have to curtail system overclocking. RAID and overclocking don't mix.
Anyone who has to work with very large data files can also benefit from the performance benefits of RAID. If you edit large video files, for example, these files can be edited on a RAID 0 volume and saved to a RAID 1 volume when the editing process is complete.
You won't see 90 percent plus performance gains from RAID 0, but if you save only ten minutes per business day, then it is easy to justify the cost of a second hard drive for use in a RAID array.
I have done some sample ROI calculations and found that for an employee making $20.00 USD, the cost of a second disk drive can be recovered in about one third of a year -- even after accounting for the lost time for array creation and maintenance and lost performance due to one of those pesky volume recovery processes.
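A break-even calculation of this kind can be sketched in a few lines of Python. This is not the calculation behind the figure above. It assumes the $20.00 is an hourly wage, and the drive price and the hours lost to setup and maintenance are made-up inputs chosen only for illustration.

```python
import math


def break_even_business_days(hourly_wage, minutes_saved_per_day,
                             drive_cost, overhead_hours):
    """Business days until the time saved pays for the drive and the overhead.

    overhead_hours is the working time lost to array creation, maintenance,
    and recovery processes, valued at the same hourly wage.
    """
    if hourly_wage <= 0 or minutes_saved_per_day <= 0:
        raise ValueError("wage and minutes saved must be positive")
    total_cost = drive_cost + overhead_hours * hourly_wage
    # One day's saving is worth wage * minutes / 60; the 60 is moved to the
    # numerator so whole-number inputs divide cleanly.
    return math.ceil(total_cost * 60 / (hourly_wage * minutes_saved_per_day))


if __name__ == "__main__":
    # Hypothetical inputs: $100 drive, 8 hours of setup and maintenance.
    days = break_even_business_days(20, 10, drive_cost=100, overhead_hours=8)
    print(f"break-even after about {days} business days")
```

Swap in your own wage, drive price, and overhead estimate; the overhead hours usually dominate the cost of the drive itself.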
I want to thank:
- Roger Bradford and Intel for their help and Intel Matrix RAID expertise.
- My parents for the gift that made the purchase of three new hard drives and RAID 5 possible.
Alan Norton began using PCs in 1981, when they were called microcomputers. He has worked at companies like Hughes Aircraft and CSC, where he developed client/server-based applications. Alan is currently semi-retired and starting a new career as a writer for TechRepublic.