Replication-Based Fault-Tolerance for MPI Applications

Executive Summary

As computational clusters grow in size, their mean time to failure drops sharply. Checkpointing is typically used to limit the computation lost to a failure. Most checkpointing techniques, however, require central storage for the checkpoints. This creates a bottleneck that severely limits the scalability of checkpointing, while dedicated checkpointing networks and storage systems prove too expensive. The authors propose a scalable replication-based MPI checkpointing facility. The reference implementation is based on LAM/MPI; however, the approach is directly applicable to any MPI implementation. They extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage.

  • Format: PDF
  • Size: 651.5 KB