Towards Scalable Application Checkpointing With Parallel File System Delegation
The ever-increasing scale of modern High-Performance Computing (HPC) systems presents a variety of challenges to the Parallel File System (PFS) based storage in these systems. The scalability of application checkpointing is a particularly important challenge because checkpointing is critical to the reliability of computing and often dominates the I/O in an HPC system. When a large number of parallel processes checkpoint simultaneously, the PFS metadata servers can become a serious bottleneck due to the large volume of concurrent metadata operations. This paper addresses this PFS metadata management issue in order to support scalable application checkpointing in large HPC systems.