Ultra-scale computer clusters with high-speed interconnects, such as InfiniBand, are being widely deployed for their excellent performance and cost effectiveness. However, the failure rate of these clusters rises as their number of components grows, so it becomes critical for such systems to be equipped with fault tolerance support. In this paper, the authors present the design and implementation of a checkpoint/restart framework for MPI programs running over InfiniBand clusters. Their design enables low-overhead, application-transparent checkpointing.