On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems
Modern storage systems stripe redundant data across multiple nodes to provide availability guarantees against node failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to tolerating failures, a storage system must also recover from them quickly to reduce the window of vulnerability. This paper addresses the problem of speeding up the recovery of a single-node failure for general XOR-based erasure codes. The authors propose a replace recovery algorithm that uses a hill-climbing technique to search for a fast recovery solution and completes that search within a short time.
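To illustrate what "only XOR operations for encoding and decoding" means, the following is a minimal sketch of the simplest XOR-based scheme: a single parity block computed over a stripe. It is not the replace recovery algorithm described above and involves no hill climbing. The helper `xor_blocks`, the three sample data blocks, and the choice of which node is lost are all assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks, each stored on a different node.
data = [b"node", b"fail", b"safe"]

# Encoding: the parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Single-node failure: assume the node holding data[1] is lost.
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index]

# Decoding: XOR the surviving data blocks with the parity block.
recovered = xor_blocks(survivors + [parity])
```

In this single-parity case every surviving block must be read to rebuild the lost one. Codes with several parity blocks generally admit more than one set of blocks from which a lost block can be rebuilt, and choosing among those sets is the kind of search the abstract refers to.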