Probabilistic Failure Detection for Efficient Distributed Storage Maintenance

Source: Microsoft

Distributed storage systems often use data replication to mask failures and guarantee high data availability. Node failures can be transient or permanent. The system must generate new replicas to replace those lost to permanent failures, but it can save significant replication cost by not re-replicating after transient failures. Given the unpredictability of network dynamics, however, distinguishing permanent from transient failures is extremely difficult. Traditional timeout-based approaches are difficult to tune and can trigger unnecessary replication.
Format: PDF
Size: 743.80
Date: Jul 2008