Checkpointing Vs. Migration for Post-Petascale Supercomputers

Provided by: INRIA
Topic: Big Data
Format: PDF
An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. The authors develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. They also develop an analytical model of the performance of a standard periodic checkpoint fault-tolerant approach. They instantiate these models for platform scenarios representative of current and future technology trends. They find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing.
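For readers unfamiliar with periodic checkpoint models, the sketch below illustrates the kind of trade-off such a model captures. It is not the authors' model: it uses Young's classical first-order approximation, in which the fraction of time lost is roughly C/T + T/(2·MTBF) for checkpoint cost C and checkpoint period T, ignoring downtime and recovery costs. The function names and the example parameter values are hypothetical, chosen only for illustration.

```python
import math


def young_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the best checkpoint period.

    Assumption: this is the textbook formula T = sqrt(2 * C * MTBF),
    not the model developed in the paper.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of platform time lost to fault tolerance.

    C/T is the overhead of writing checkpoints; T/(2*MTBF) is the
    expected work lost per failure (half a period on average),
    amortized over the mean time between failures.
    Downtime and recovery time are deliberately ignored here.
    """
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical platform: 10-minute checkpoints, one failure per day.
    cost, mtbf = 600.0, 86400.0
    best = young_period(cost, mtbf)
    for label, period in [("half", best / 2), ("best", best), ("double", best * 2)]:
        print(f"{label:>6}: T = {period:9.1f} s, waste = {waste_fraction(period, cost, mtbf):.4f}")
```

Checkpointing too often pays for too many checkpoints, and checkpointing too rarely loses too much work per failure; the approximation above balances the two. A preventive scheme of the kind described in the abstract changes this balance by acting on predicted faults, which this simple sketch does not attempt to model.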
