Association for Computing Machinery
The scalability of future Massively Parallel Processing (MPP) systems is severely challenged by high failure rates. Current Hard Disk Drive (HDD) checkpointing incurs an overhead of 25% or more at the petascale. Because the required checkpoint frequency rises with node count, novel techniques that can take more frequent checkpoints with minimal overhead are critical to implementing a reliable exascale system. In this paper, the authors leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism.
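The link between node count and checkpoint frequency can be illustrated with a rough back-of-the-envelope sketch. It is not the authors' model: it assumes independent node failures (so the system mean time between failures shrinks as node MTBF divided by node count) and uses Young's first-order approximation for the checkpoint interval. All function names and any numeric inputs are hypothetical.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF, assuming independent and identically reliable nodes."""
    return node_mtbf_hours / node_count


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def checkpoint_overhead_fraction(checkpoint_cost_hours: float,
                                 interval_hours: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints (ignores rework)."""
    return checkpoint_cost_hours / (interval_hours + checkpoint_cost_hours)


if __name__ == "__main__":
    # Hypothetical inputs: 100,000-hour node MTBF, 0.5-hour checkpoint to disk.
    for nodes in (1_000, 4_000, 16_000):
        mtbf = system_mtbf_hours(100_000.0, nodes)
        interval = young_interval_hours(0.5, mtbf)
        overhead = checkpoint_overhead_fraction(0.5, interval)
        print(f"nodes={nodes} mtbf={mtbf:.2f}h "
              f"interval={interval:.2f}h overhead={overhead:.1%}")
```

Under these assumptions, quadrupling the node count halves the suggested interval, so the cost of each checkpoint weighs more heavily on total run time. This is the pressure that motivates faster checkpoint storage.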