After successfully setting up a Software RAID mirror, it’s important to recognise a drive failure and know how to recover from it.
Most often a failing drive will fill your /var/log/messages file with SCSI errors like these:
kernel: SCSI disk I/O error: dev 08:02, sector 1790260
kernel: SCSI disk error : host 0 channel 0 id 0 lun 1 return code = 15000002
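If you suspect a disk is on its way out, a quick grep over the log (assuming your syslog writes to /var/log/messages, as above) pulls the relevant lines together:

# grep -iE 'I/O error|scsi.*error' /var/log/messages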
You can check the RAID array status by looking at /proc/mdstat, or get a more detailed report on each RAID block device using mdadm --detail /dev/md(n), where (n) is the set number:
# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Fri Nov 2 15:52:45 2007
Raid Level : raid1
Array Size : 14651136 (13.97 GiB 15.00 GB)
Device Size : 14651136 (13.97 GiB 15.00 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sat Nov 10 14:54:36 2007
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 644f52f2:66ea4422:1ae79331:d593083b
Events : 0.48810
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
Here you can see that the devices are active and the state of the RAID set is ‘clean.’
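For comparison, a healthy mirror in /proc/mdstat looks roughly like this (device names and block counts will obviously differ on your system):

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
14651136 blocks [2/2] [UU]

unused devices: <none>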
Before putting a server into production I always like to check that the RAID array is working as it should. With hardware RAID and hot-swap disks this is easy: pull a disk, check that the system reacts as expected, then re-insert the disk and allow it to rebuild. If you want to be more cautious, shut the server down before removing the disk, then boot back up. Software RAID tends to be used with cheap SATA/PATA disks, which are rarely hot-swap, so I would definitely not recommend pulling a disk out while it’s live.
Disk failure can be simulated via software with the mdadm tool:
# mdadm --manage --set-faulty /dev/md0 /dev/sda1
Now ‘mdadm --detail /dev/md0’ will show the array state as ‘degraded’ and the disk /dev/sda1 as faulty:
/dev/md0:
Version : 00.90.03
Creation Time : Fri Nov 2 15:52:45 2007
Raid Level : raid1
Array Size : 14651136 (13.97 GiB 15.00 GB)
Device Size : 14651136 (13.97 GiB 15.00 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sat Nov 10 15:11:48 2007
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
UUID : 644f52f2:66ea4422:1ae79331:d593083b
Events : 0.48895
Number Major Minor RaidDevice State
0 0 0 - removed
1 8 17 1 active sync /dev/sdb1
2 8 1 - faulty /dev/sda1
Looking at /proc/mdstat will show (F) next to the failed disk:
md0 : active raid1 sda1[2](F) sdb1[1]
14651136 blocks [2/1] [_U]
You should not have noticed any loss of service. Hopefully everything worked as expected.
Now to rebuild the array we need to remove the failed disk and add a new one. This is very easy to do, again using the mdadm tool:
# mdadm /dev/md0 -r /dev/sda1
mdadm: hot removed /dev/sda1
Check that the disk has been removed:
# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Fri Nov 2 15:52:45 2007
Raid Level : raid1
Array Size : 14651136 (13.97 GiB 15.00 GB)
Device Size : 14651136 (13.97 GiB 15.00 GB)
Raid Devices : 2
Total Devices : 1
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sat Nov 10 15:19:54 2007
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
UUID : 644f52f2:66ea4422:1ae79331:d593083b
Events : 0.48944
Number Major Minor RaidDevice State
0 0 0 - removed
1 8 17 1 active sync /dev/sdb1
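In a real failure you would physically swap in a replacement disk at this point, and it needs a partition table matching the surviving disk before it can rejoin the mirror. One way to do that (a sketch, assuming the survivor is /dev/sdb, the replacement is /dev/sda, and you are happy to overwrite whatever is on it) is to clone the partition table with sfdisk:

# sfdisk -d /dev/sdb | sfdisk /dev/sda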
Finally add a new device to the RAID set and allow it to reconstruct:
# mdadm /dev/md0 -a /dev/sda1
mdadm: hot added /dev/sda1
A quick look at /proc/mdstat will verify that the array is recovering and give an ETA for the reconstruction to complete; mdadm will give a more detailed view of the array status:
# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Fri Nov 2 15:52:45 2007
Raid Level : raid1
Array Size : 14651136 (13.97 GiB 15.00 GB)
Device Size : 14651136 (13.97 GiB 15.00 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sat Nov 10 15:25:21 2007
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Rebuild Status : 40% complete
UUID : 644f52f2:66ea4422:1ae79331:d593083b
Events : 0.48999
Number Major Minor RaidDevice State
0 0 0 - removed
1 8 17 1 active sync /dev/sdb1
2 8 1 0 spare rebuilding /dev/sda1
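For reference, the same rebuild seen through /proc/mdstat looks roughly like this (the progress, finish and speed figures here are illustrative):

md0 : active raid1 sda1[2] sdb1[1]
14651136 blocks [2/1] [_U]
[========>............] recovery = 40.0% (5860454/14651136) finish=4.9min speed=30002K/sec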
In this example, I don’t have any hot-spare disks in the RAID set; if you do then the spare should start rebuilding the array as soon as one of the live disks fails.
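If you want one, adding an extra device to an already-complete mirror leaves it marked as a spare, and the kernel pulls it in automatically when a member fails. A minimal sketch, assuming a third partition /dev/sdc1 of at least the same size:

# mdadm /dev/md0 -a /dev/sdc1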
Overall I’m quite happy with multi-disk RAID arrays under Linux. While hardware RAID with SCSI or SAS disks would always be my first choice, I think the performance of SATA disks with software mirroring is quite adequate for hosting non-essential services. I have yet to properly configure RAID monitoring with mdadm’s --monitor option (I tried to use it, but for some reason it doesn’t fire off e-mail). If anybody has run into a similar problem, please let me know how you solved it!
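For reference, the setup I’ve been trying looks something like this (the address is a placeholder, and since mdadm hands mail off to the local sendmail the box also needs a working MTA):

# echo 'MAILADDR admin@example.com' >> /etc/mdadm.conf
# mdadm --monitor --scan --daemonise --delay=300 --test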