Hybrid Checkpointing for MPI Jobs in HPC Environments
As the core count in high-performance computing systems keeps increasing, faults are becoming common place. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes between checkpoints. The authors have designed a hybrid checkpointing technique for MPI tasks of high-performance applications. This technique alternates between full and incremental checkpoints: At incremental checkpoints, only data changed since the last checkpoint is captured. The implementation integrates new BLCR and LAM/MPI features that complement traditional full checkpoints.