Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm

Provided by: University of Illinois at Urbana-Champaign
Topic: Big Data
Format: PDF
The HPC community has seen a steady increase in the number of components in every generation of supercomputers. Assembling a large number of components into a single cluster makes a machine more powerful, but also much more prone to failures. Fault tolerance has therefore become a major concern in HPC. To deal with node crashes in large systems, checkpoint/restart is by far the preferred method. Checkpoints are typically implemented with a blocking algorithm, which suspends the execution of the application until the checkpoint is safely stored. One limitation of the blocking algorithm is that it saturates the network bandwidth while the checkpoint is being taken.