Efficient and Safe Data Backup With Arrow
Source: University of California
Content-addressable storage, error detection and correction, and deduplication are all interesting topics in the field of archival data storage. Arrow is a data backup system that combines collision-resistant hash functions, rolling checksums, and error-correction codes to provide deduplicated, versioned, error-recoverable archival storage. It stores files as lists of checksums and uses a fast checksum search to determine which parts of a file have changed, which both shortens the time needed to store a new version and reduces the amount of physical storage used. These checksums are also used to identify stored data and verify its integrity, and error-correction codes allow small storage errors to be corrected. Rsync is a popular free software program for synchronizing similar files between computers connected by a network. Arrow borrows heavily from rsync: it uses a similar algorithm to search for duplicate chunks in files being backed up, and it uses the same rolling checksum function. In that sense, Arrow is a new application of existing ideas. Distributing the storage across multiple storage systems is difficult, but an archival system cannot rely on a single piece of hardware or a single software system. To be fully usable as an archival storage system, Arrow still needs support for, and optimization of, the verification and correction processes, as well as a usable method for restoring backed-up data.
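The duplicate-chunk search described above can be illustrated with a minimal sketch of the rsync-style rolling checksum. This is not Arrow's actual code: the function names, the fixed block size, and the use of SHA-256 as the collision-resistant hash are assumptions made for illustration. The weak checksum slides over the new file one byte at a time in constant time per step, and a strong hash is computed only when the weak checksum matches a previously stored block.

```python
import hashlib

M = 1 << 16  # both checksum halves are kept modulo 2^16, as in rsync


def weak_checksum(block: bytes) -> tuple[int, int]:
    """Compute the two halves (a, b) of the rolling checksum from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte has weight n, the last byte has weight 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple[int, int]:
    """Slide a window of length n one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def find_matches(old: bytes, new: bytes, block_size: int) -> list[tuple[int, int]]:
    """Return (offset_in_new, offset_in_old) for blocks of `old` found in `new`.

    Hypothetical helper: SHA-256 stands in for whichever strong hash
    the real system uses.
    """
    # Index every aligned, full-size block of the old data by weak checksum.
    index: dict[tuple[int, int], list[tuple[bytes, int]]] = {}
    for start in range(0, len(old) - block_size + 1, block_size):
        block = old[start:start + block_size]
        index.setdefault(weak_checksum(block), []).append(
            (hashlib.sha256(block).digest(), start)
        )

    matches: list[tuple[int, int]] = []
    if len(new) < block_size:
        return matches

    pos = 0
    a, b = weak_checksum(new[:block_size])
    while True:
        hit = None
        candidates = index.get((a, b))
        if candidates:
            # A weak match may be a false positive; confirm with the strong hash.
            strong = hashlib.sha256(new[pos:pos + block_size]).digest()
            for digest, start in candidates:
                if digest == strong:
                    hit = start
                    break
        if hit is not None:
            matches.append((pos, hit))
            pos += block_size  # skip past the matched block
            if pos + block_size > len(new):
                break
            a, b = weak_checksum(new[pos:pos + block_size])
        else:
            if pos + block_size >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + block_size], block_size)
            pos += 1
    return matches
```

Calling `find_matches(old, new, block_size)` on two versions of a file yields the regions of the new version that are already stored; only the bytes outside those regions would need to be written as new chunks. The two-level check is the design point: the cheap checksum filters almost all positions, so the expensive hash runs rarely.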