Lightweight Checkpoint Mechanism and Modeling in GPGPU Environment


Executive Summary

While High Performance Computing (HPC) systems continue to scale in both the number of computing elements and overall computing power, the performance/cost benefit of these systems depends on their ability to provide high reliability, availability, and transparency in utilizing the underlying computing resources. The scaling trend is evidenced by a recent announcement from Oak Ridge National Laboratory that its forthcoming machine, soon to be the world's fastest computer, will be a GPU cluster deployed across millions of cores. At that scale, fault tolerance has become a major concern in HPC, including in GPGPU computing. In this paper, the authors propose a novel fault tolerance mechanism for GPUs and study the benefits of implementing such a mechanism in an HPC environment.

  • Format: PDF
  • Size: 477.4 KB