On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice
Modern storage systems stripe redundant data across multiple disks to provide availability guarantees against disk failures. One form of data redundancy is XOR-based erasure coding, which uses only XOR operations for encoding and decoding. In addition to tolerating failures, a storage system must recover from them quickly to avoid data unavailability. The authors consider the problem of speeding up the recovery of a single-disk failure for arbitrary XOR-based erasure codes, and address it from both theoretical and practical perspectives. They propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution so that the search itself completes within a short time.
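To make the notion of XOR-only encoding and decoding concrete, the following is a minimal sketch of the simplest possible case: a single XOR parity block over one stripe, with one lost data block rebuilt from the survivors. It is an illustration only, not the replace recovery algorithm and not any specific code studied by the authors; the function names (`xor_blocks`, `encode`, `recover`) and the sample block contents are assumptions made for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Compute one parity block for a stripe (hypothetical single-parity layout)."""
    return xor_blocks(data_blocks)


def recover(surviving_blocks):
    """Rebuild the single missing block from all surviving blocks, parity included."""
    return xor_blocks(surviving_blocks)


# Three data "disks", one stripe each (illustrative contents, all 8 bytes long).
data = [b"disk0abc", b"disk1def", b"disk2ghi"]
parity = encode(data)

# Simulate the failure of disk 1 and rebuild it from the other disks plus parity.
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index] + [parity]
recovered = recover(survivors)
```

In this single-parity layout every surviving block must be read to rebuild the lost one. Codes with several parity sets generally admit more than one way to rebuild the same lost data, which is why searching for a fast recovery solution is a meaningful problem.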