Using RAID 60
Using RAID 60 over RAID 6 (IMHO) is not just about raw resilience, but also about recoverability, write performance and scale.
I'll try to roughly quantify these. We will assume that disk failures are independent (so environmental events that would plausibly destroy the entire array are out of scope, and are provided for by a genuinely independent backup), and that the chance of a failure during rebuild is uniform over the data read (ignoring the contribution from the MTTF for a random whole-drive failure, which also increases risk with wide RAID6 sets).
1) Recoverability: By halving (or more) the width of each RAID set, the number of bits you have to read to rebuild a drive is much reduced, hence the risk of further failures during rebuild is reduced. For example, consider 48 2TB drives, at a UER (unrecoverable error rate) of 1E-15 per bit read (e.g. WD RE4 or Seagate Constellation ES). With six 8-disk RAID6 sets, the array provides a total of 6*(8-2)*2 = 72TB. If a single drive fails, then all 12TB of data in its RAID6 set must be read to rebuild it. This is 12*8*1024*1024*1024*1024 bits (about 1E14). Treating each bit read as an independent trial, the probability of two (or more) unrecoverable errors is thus:
1 - (1-1E-15)^(1E14) - (1E14 C 1)*(1-1E-15)^(1E14-1)*1E-15
as it is a binomial.
This is 1 - (1-1E-15)^(1E14) -(1-1E-15)^(1E14-1)/10
Approximating (as 1/((1-1E-15)*10) is very close to 0.1, so the last term is about 0.1*(1-1E-15)^(1E14)):
1 - 1.1*((1-1E-15)^(1E14))
We can expand this as a Taylor series into:
1 - 1.1 * (1 - 0.1 + (1E14 C 2)*1E-30 - ...)
(ignoring further terms, which are rapidly extremely small - each at most 1/10 the previous, and with alternating signs)
1 - 1.1 * (0.9 + (1E14! / ((1E14-2)!*2!))*1E-30)
Approximate (1E14*(1E14-1)) as 1E28
1 - 1.1 * (0.9 + 1E28/2*1E-30)
= 1 - 1.1 * (0.9 + 0.005)
= 0.0045
hence a 0.45% chance of failure during rebuild.
If we do the same for a single wide RAID6 volume (using only 38 disks, to provide the same capacity):
The number of bits becomes 72*8*(1024)^4 - about 6E14
Substituting this in:
This is 1 - (1-1E-15)^(6E14) - 0.6*(1-1E-15)^(6E14-1)
(the factor is now 6E14*1E-15 = 0.6 rather than 0.1). Approximating and expanding as before:
1 - 1.6 * (1 - 0.6 + (6E14 C 2)*1E-30 - ...)
~= 1 - 1.6 * (0.4 + (6E14! / ((6E14-2)!*2!))*1E-30)
~= 1 - 1.6 * (0.4 + 3.6E29/2*1E-30)
= 1 - 1.6 * (0.4 + 0.18)
= 0.072
hence a 7.2% chance of failure during rebuild.
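These are only approximate, as we have truncated the series. For the narrow sets this barely matters (carrying all the terms gives about 0.47%). For the wide volume the dropped terms are not negligible: evaluating 1 - 1.6*e^(-0.6) instead gives roughly 12% rather than 7.2%. So the truncation understates the difference rather than closing it.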
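If you would rather not trust the hand expansion, the same binomial expression can be evaluated numerically. This is only a sketch under the assumptions already stated above (independent errors, a flat per-bit UER, failure defined as two or more errors anywhere in the data read); the function name and the TB_BITS constant are mine, not anything standard.

```python
import math

def p_two_or_more_errors(bits_read, uer=1e-15):
    """P(>= 2 unrecoverable read errors) when reading `bits_read` bits.

    Same binomial model as the post: 1 - P(0 errors) - P(exactly 1 error).
    log1p keeps log(1 - uer) accurate even though uer is tiny.
    """
    log_q = math.log1p(-uer)                       # log(1 - uer)
    p_zero = math.exp(bits_read * log_q)           # (1 - uer)^n
    p_one = bits_read * uer * math.exp((bits_read - 1) * log_q)
    return 1.0 - p_zero - p_one

TB_BITS = 8 * 1024**4   # bits per "TB" as counted in the post (binary TB)

narrow = p_two_or_more_errors(12 * TB_BITS)   # one 8-disk RAID6 set
wide = p_two_or_more_errors(72 * TB_BITS)     # single 38-disk RAID6 volume

print(f"8-disk set rebuild:  {narrow:.2%}")
print(f"38-disk set rebuild: {wide:.2%}")
print(f"ratio wide/narrow:   {wide / narrow:.1f}x")
```

Because this uses the exact bit counts rather than the rounded 1E14 and 6E14, the percentages come out a little higher than the hand calculation, but the ratio between the two layouts is what matters.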
2) Write performance:
Parity RAID in general suffers a performance overhead when writing less data than the total stripe width, because existing data or parity has to be read back before the new parity can be calculated. Reads, by contrast, see substantial improvements from being able to satisfy a request from a handful of physical disks.
The wider the stripe, the worse the write overhead becomes, and the fewer partial stripes can be held in the controller cache to speed parity calculation. By splitting the array into smaller RAID6 sections, each write (hence parity calculation) depends on fewer drives, and involves less unwanted data. This is a bit vague - your mileage will vary - but the distinction is non-trivial in highly transactional workloads, as each sub-volume is essentially independent.
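To put rough numbers on the partial-stripe point, here is a deliberately simplified I/O-count model. It is an illustration, not a benchmark: it assumes nothing is cached, that the write falls within one stripe, and that the controller picks the cheaper of the two textbook strategies (read-modify-write or reconstruct-write). Real controllers differ, and the function name is hypothetical.

```python
def raid6_partial_write_ios(data_disks, chunks_written):
    """Disk I/Os for one write touching `chunks_written` chunks of one stripe.

    Simplified model (an assumption, not a measurement): no cache hits, and
    the controller uses whichever strategy needs fewer I/Os.
    """
    if not 1 <= chunks_written <= data_disks:
        raise ValueError("chunks_written must be between 1 and data_disks")
    # Read-modify-write: read the old data chunks plus P and Q,
    # then write the new data chunks plus P and Q.
    read_modify_write = 2 * (chunks_written + 2)
    # Reconstruct-write: read the untouched data chunks,
    # then write the new data chunks plus P and Q.
    reconstruct_write = (data_disks - chunks_written) + (chunks_written + 2)
    return min(read_modify_write, reconstruct_write)

for chunks in (1, 4, 6):
    narrow = raid6_partial_write_ios(6, chunks)    # 8-disk RAID6 set
    wide = raid6_partial_write_ios(36, chunks)     # 38-disk RAID6 set
    print(f"{chunks} chunk(s): narrow={narrow} I/Os, wide={wide} I/Os")
```

In this model a single-chunk write costs the same on either layout. The narrow set pulls ahead once a write covers a large share of its much shorter stripe, since a full-stripe write needs no reads at all.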
3) Scale:
It is perfectly feasible to implement the RAID0 at a higher level than the RAID6. In this case, multiple hardware controllers can be harnessed for their individual bandwidth and calculation capacity (and cache), but all contribute towards a single storage pool.
Some controllers can, I think, interact along the PCIe bus to achieve much the same thing.
SAS expanders (and SATA-II support for expanders) make this somewhat less useful, as a single controller can potentially connect to many more disks than it has native ports for, but the limitations on total throughput and parity engine/cache capacity still apply.
Posted by DomBenson
26th May 2010



