Lightweight Checkpoint Mechanism and Modeling in GPGPU Environment
While High Performance Computing (HPC) systems continue to scale in both the number of computing elements and overall computing power, the performance/cost benefit of these systems depends on their ability to provide high reliability, availability, and transparency in utilizing the underlying computing resources. The move toward GPU-based systems at this scale is evidenced by a recent announcement from Oak Ridge National Laboratory that its forthcoming machine, soon to be the world's fastest computer, will be a GPU cluster with millions of cores. As a result, fault tolerance has become a major concern in HPC, including general-purpose computing on GPUs (GPGPU). In this paper, the authors propose a novel fault tolerance mechanism for GPUs and study the benefits of implementing such a mechanism in an HPC environment.