Application-Transparent Checkpoint/Restart for MPI Programs Over InfiniBand

Provided by: The Ohio Society of CPAs
Topic: Big Data
Format: PDF
Ultra-scale computer clusters with high-speed interconnects, such as InfiniBand, are being widely deployed for their excellent performance and cost-effectiveness. However, the failure rate of these clusters rises as their component count grows, so fault tolerance support becomes critical for such systems. In this paper, the authors present the design and implementation of a checkpoint/restart framework for MPI programs running over InfiniBand clusters. Their design enables low-overhead, application-transparent checkpointing.