Designing a storage solution isn't a trivial undertaking; there are many moving parts, many decisions to be made, and just as many mistakes to be made along the way. Here are seven mistakes that can get you in trouble.
1. Not taking RAID storage overhead into consideration.
Unfortunately, I've actually seen this happen. Any responsible storage implementation will probably use RAID to protect against the loss of one or more disks. With the exception of RAID 0, which stripes data across a set of disks to create a larger storage pool but stores no redundant data, all RAID implementations give up some raw capacity to mirror or parity information. That overhead can be substantial. For example, in a RAID 1 implementation, 50% of the total disk space is used to copy the information to the mirrored set of drives. RAID 10, an extension of RAID 1 that stripes data across multiple RAID 1 sets to improve performance, exacts the same 50% space toll but is frequently used because of its significant performance benefits. Don't forget to factor in RAID overhead when deciding how much storage you need to buy.
RAID storage penalty for common RAID levels:
- RAID 0: No storage penalty, but no protection either.
- RAID 1: 50% storage penalty (mirrored disks).
- RAID 5: 1/n storage penalty where n is the number of disks that make up the array.
- RAID 6: 2/n storage penalty where n is the number of disks that make up the array.
- RAID 10: 50% storage penalty (striped sets of mirrored disks).
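To put numbers on the penalties listed above, here is a minimal Python sketch. The function names and the 8 x 2 TB example are my own illustrations, not part of any vendor tool, and the sketch assumes two-way mirrors for RAID 1 and RAID 10 and ignores hot spares and formatting overhead.

```python
def usable_fraction(level: str, n_disks: int) -> float:
    """Fraction of raw capacity left after RAID overhead (hypothetical helper)."""
    if level == "0":
        return 1.0                      # no penalty, no protection
    if level in ("1", "10"):
        return 0.5                      # assumes two-way mirroring
    if level == "5":
        if n_disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return 1 - 1 / n_disks          # one disk's worth of parity
    if level == "6":
        if n_disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return 1 - 2 / n_disks          # two disks' worth of parity
    raise ValueError(f"unsupported RAID level: {level}")


def usable_capacity_tb(level: str, n_disks: int, disk_tb: float) -> float:
    """Usable capacity in TB for n_disks identical disks of disk_tb each."""
    return n_disks * disk_tb * usable_fraction(level, n_disks)


# Example: eight 2 TB disks (16 TB raw)
print(usable_capacity_tb("5", 8, 2.0))   # 14.0
print(usable_capacity_tb("6", 8, 2.0))   # 12.0
print(usable_capacity_tb("10", 8, 2.0))  # 8.0
```

Run this against the capacity you actually need, not the raw capacity on the quote, and the gap between the two becomes hard to miss.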
More information about RAID levels:
- Understand 'single digit' RAID levels
- Get the basics on multilevel RAID sets
- Build Your Skills: Know the differences between RAID levels
2. Not taking RAID performance overhead into consideration.
RAID exacts more than just a storage penalty; in addition to reducing the amount of usable disk space, different RAID levels also affect the overall performance of the storage system. Different applications require different storage performance characteristics, and different RAID levels are best suited to different kinds of applications. For example, RAID 5 and RAID 6 must recalculate and rewrite parity on every write, so those RAID levels are not always suitable for write-intensive tasks such as SQL Server log files. A RAID level that is poorly matched to your application's I/O pattern will hold back performance.
In general, here are some pointers:
- RAID 1: Read: Good, Write: Good
- RAID 5: Read: Good, Write: Mediocre
- RAID 6: Read: Good, Write: Poor (double parity calculation and storage)
- RAID 10: Read: Very Good, Write: Very Good
Don't take this list to the bank, though; performance needs and characteristics vary wildly between applications, so do your homework!
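One way to see why the parity levels rate lower on writes is the commonly cited "write penalty" rule of thumb: the number of back-end disk operations generated by each front-end write. The figures in the sketch below (2 for RAID 1 and RAID 10, 4 for RAID 5, 6 for RAID 6) are that rule of thumb, not measurements, and they ignore controller cache and full-stripe writes. The function name is hypothetical.

```python
# Rule-of-thumb back-end I/Os per front-end write. These are assumptions for
# illustration; real arrays with write cache can behave very differently.
WRITE_PENALTY = {"1": 2, "10": 2, "5": 4, "6": 6}


def backend_iops(frontend_iops: float, write_fraction: float, level: str) -> float:
    """Estimate the disk-level IOPS an application workload generates."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    reads = frontend_iops * (1 - write_fraction)    # reads cost one I/O each
    writes = frontend_iops * write_fraction
    return reads + writes * WRITE_PENALTY[level]


# Example: 1,000 application IOPS, half of them writes
for level in ("10", "5", "6"):
    print(f"RAID {level}: {backend_iops(1000, 0.5, level):.0f} back-end IOPS")
```

The same application load asks noticeably more of the disks on RAID 5 or RAID 6 than on RAID 10, which is the tradeoff the ratings above are summarizing.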
More information:
- RAID storage explained
- Comprehending the Tradeoffs Between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations
- EMC CLARiion RAID 6 Technology: A Detailed Overview
- RAID 1+0 is the Cadillac of RAID
- Comprehensive RAID performance report
3. Not implementing a solution with enough spindles.
IOPS (Input/Output Operations Per Second) is a standard measure of storage performance. While a lot of elements go into figuring out the total input/output capacity of a storage infrastructure, the number of spindles (a common way to refer to the number of disks in a storage solution) is one of the most important factors you can design in. In general, the more spindles you throw at a solution, the better the overall performance will be. Many people assume that the transport mechanism (iSCSI, Fibre Channel, etc.) is the primary limiting factor from a performance standpoint, but this is often not the case. Each individual disk in your storage system is capable of a maximum number of IOPS. This maximum number is multiplied by the number of usable disks in your RAID configuration to arrive at a theoretical maximum IOPS value.
For some applications, you can figure out the number of IOPS that you need, but for other applications, you need to work with the vendor to arrive at a reasonable calculation. Without enough spindles to support your load, the rest of the storage design simply won't matter.
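The multiplication described above is simple enough to sketch in a few lines of Python. The per-disk figure of 175 IOPS and the 2,500 IOPS target are placeholders I made up for the example; use the numbers from your disk vendor and your application.

```python
import math


def theoretical_max_iops(per_disk_iops: float, n_disks: int) -> float:
    """Per-disk maximum multiplied by the number of usable disks."""
    return per_disk_iops * n_disks


def spindles_needed(required_iops: float, per_disk_iops: float) -> int:
    """Smallest whole number of disks that meets the required IOPS."""
    if per_disk_iops <= 0:
        raise ValueError("per_disk_iops must be positive")
    return math.ceil(required_iops / per_disk_iops)


# Placeholder numbers, not vendor specifications
print(theoretical_max_iops(175, 12))   # 2100
print(spindles_needed(2500, 175))      # 15
```

Treat the result as a floor: it is a theoretical maximum, and it does not account for RAID write overhead or for growth.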
4. Choosing a RAID level that leaves your organization at risk.
For some, RAID has long been considered the gold standard when it comes to data protection; however, when used incorrectly, that protection might only be an illusion. Besides storage and performance needs, your choice of RAID level needs to take into account the level of protection you want to maintain in the environment. RAID 5 is, by far, the most common level of RAID out there and, when used correctly, will provide organizations with a degree of protection. However, as drive sizes get larger, the risk of data loss increases pretty quickly: a rebuild has to read more data and takes longer, which leaves more time for a second disk to fail or for an unrecoverable read error to turn up. Since RAID 5 can tolerate the loss of only a single disk, a second failure before the rebuild completes is a recipe for disaster.
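To get a feel for how drive size feeds into that risk, here is a rough Python sketch of a simplified model: it assumes read errors are independent and occur at exactly the rate quoted on a spec sheet, which real disks do not strictly obey. The rates of one error per 10^14 and 10^15 bits are commonly quoted spec-sheet figures used here as assumptions, a TB is taken as 10^12 bytes, and the function name is hypothetical.

```python
import math


def p_ure_during_rebuild(surviving_disks: int, disk_tb: float,
                         ure_rate_bits: float) -> float:
    """Probability of at least one unrecoverable read error while reading
    every surviving disk in full, under an independent-error assumption."""
    bits_to_read = surviving_disks * disk_tb * 1e12 * 8
    # 1 - (1 - 1/rate) ** bits, computed in a numerically stable way
    return -math.expm1(bits_to_read * math.log1p(-1.0 / ure_rate_bits))


# Example: an 8-disk RAID 5 set of 2 TB disks loses one disk,
# so 7 surviving disks must be read end to end.
for label, rate in (("1 per 1e14 bits", 1e14), ("1 per 1e15 bits", 1e15)):
    p = p_ure_during_rebuild(7, 2.0, rate)
    print(f"{label}: {p:.0%} chance of a read error during rebuild")
```

The exact percentages matter less than the trend: double the disk size and the amount of data that must be read cleanly doubles with it.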
For more information:
- There are some people that truly hate RAID 5... the group is named BAARF.
- How to protect yourself from RAID-related Unrecoverable Read Errors (UREs)
- RAID 5 Is A Cruel Mistress
- Why RAID 5 stops working in 2009
5. Using the wrong kind of disk.
I already indicated that you need to make sure you have enough spindles to support the needs of your application environment. Along with that spindle count, make sure you get the right kind of disks. Not all disks are created equal, from either a reliability perspective or an IOPS perspective. First, SATA disks can be one or two orders of magnitude less reliable than SAS disks and create a much higher risk for data loss (read my URE article). Second, most SATA disks spin at slower rates than their SAS counterparts. Although there are enterprise-grade SATA disks that spin at 10K RPM, SAS disks almost always have a 10K RPM minimum speed and can spin as fast as 15K RPM. The faster the disk spins, the less time it spends waiting for data to rotate under the head and, hence, the higher the IOPS value.
Note that there are tricks (such as short-stroking) that you can use to force more IOPS from a disk, but I'm not going to get into those here.
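The link between spindle speed and IOPS can be approximated with a standard back-of-the-envelope formula: average rotational latency is the time for half a revolution, and random IOPS is roughly one divided by the sum of average seek time and rotational latency. In this Python sketch, the seek times are illustrative assumptions, not figures for any specific drive, and transfer time is ignored.

```python
def rotational_latency_ms(rpm: int) -> float:
    """Average rotational latency: the time for half a revolution."""
    return 0.5 * 60_000 / rpm


def estimated_iops(rpm: int, avg_seek_ms: float) -> float:
    """Rough random IOPS for one disk, ignoring transfer time and caching."""
    return 1000 / (avg_seek_ms + rotational_latency_ms(rpm))


# Seek times below are made-up but plausible values for illustration
for rpm, seek_ms in ((7_200, 8.5), (10_000, 4.5), (15_000, 3.5)):
    print(f"{rpm} RPM: about {estimated_iops(rpm, seek_ms):.0f} IOPS")
```

Even with generous assumptions, a single spinning disk delivers a few hundred IOPS at most, which is why spindle count and spindle speed both matter.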
6. Not configuring a hot spare.
A hot spare is a critical part of a redundant storage system and provides the system with a way to immediately begin recovering from the loss of a disk due to hardware failure or some other catastrophe. The sooner an array begins to rebuild after a failure, the less likely it is that the array will suffer another disk fault that could result in the loss of data from the entire RAID volume.
Using a hot spare results in the immediate loss of that disk as usable space in the array. With many people creating multiple RAID sets on an array, you might be concerned about losing a hot spare per RAID set. Many arrays will allow you to configure a global hot spare that can automatically take the place of any drive in any RAID set across the entire array, so you can minimize your hot spare overhead while continuing to meet availability needs.
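The difference between per-set and global hot spares is easy to quantify. This is a small Python sketch with a hypothetical function name and a made-up 24-disk array; how many global spares are appropriate for a given array is a judgment call that the sketch does not make for you.

```python
from typing import Optional


def disks_left_for_data(total_disks: int, raid_sets: int,
                        global_spares: Optional[int] = None) -> int:
    """Disks remaining for RAID sets after hot spares are set aside.

    With global_spares=None, one dedicated spare is reserved per RAID set.
    Otherwise the given number of global spares is shared by every set.
    """
    spares = raid_sets if global_spares is None else global_spares
    if spares > total_disks:
        raise ValueError("more spares than disks")
    return total_disks - spares


# Example: a 24-disk array carved into 3 RAID sets
print(disks_left_for_data(24, 3))                    # 21 with dedicated spares
print(disks_left_for_data(24, 3, global_spares=1))   # 23 with one global spare
```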
7. Not implementing enough redundancy.
Depending on the way that your storage environment will be used, you will implement different levels of redundancy. For primary, high-need storage, make sure that you implement enough redundancy in the environment to meet business needs. That may mean dual controllers, dual UPSs, redundant data paths to the storage, redundant replicated arrays, and much more.
When designing your storage, draw every component on paper. Then, in turn, place an X over each component and determine the impact if that particular component were to fail. For each component, decide if you need an additional level of redundancy. For example, at Westminster, we use a dual-controller EMC AX4 iSCSI array. The whole storage infrastructure is redundant, from the controllers to the Ethernet switches that service the storage network. For each server that connects to the storage, we use multiple NICs and provide two connections to storage; neither connection uses a common NIC in the server. For example, we use one motherboard NIC connection and one add-in Ethernet adapter connection in order to protect against the failure of a single NIC.
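The "place an X over each component" exercise can also be done in code once the drawing is expressed as a list of connections. The Python sketch below removes one component at a time and checks whether the server can still reach the array. The topology and component names are hypothetical, loosely modeled on the two-path design described above, and are not an exact map of the Westminster environment.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping one removed component."""
    graph = {}
    for a, b in links:
        if removed in (a, b):
            continue                      # the X'd-out component carries no traffic
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start, goal):
    """Components whose individual failure cuts start off from goal."""
    components = {c for link in links for c in link} - {start, goal}
    return sorted(c for c in components
                  if not reachable(links, start, goal, removed=c))


# Hypothetical two-path design: each path has its own NIC, switch, and controller
path_a = [("server", "nic_onboard"), ("nic_onboard", "switch_a"),
          ("switch_a", "controller_a"), ("controller_a", "array")]
path_b = [("server", "nic_addin"), ("nic_addin", "switch_b"),
          ("switch_b", "controller_b"), ("controller_b", "array")]

print(single_points_of_failure(path_a + path_b, "server", "array"))
print(single_points_of_failure(path_a, "server", "array"))
```

With both paths in place the first list comes back empty; with only one path, every component on it shows up as a single point of failure. Note that the sketch treats the server and the array themselves as out of scope, so protecting those (clustering, replicated arrays) is a separate decision.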
More information:
- A look at an iSCSI-based highly available architecture
- A look at some more AX4/iSCSI availability diagrams
- EMC AX4 - A failover update
Since 1994, Scott Lowe has been providing technology solutions to a variety of organizations. After spending 10 years in multiple CIO roles, Scott is now an independent consultant, blogger, author, owner of The 1610 Group, and a Senior IT Executive with CampusWorks, Inc. Scott is available for consulting, writing, and speaking engagements and can be reached at firstname.lastname@example.org.