Building Algorithmically Nonstop Fault Tolerant MPI Programs
As High-Performance Computing (HPC) systems grow in scale, faults are becoming the norm rather than the exception. HPC applications typically tolerate fail-stop failures under a stop-and-wait scheme: even if only one processor fails, the whole system has to stop and wait for the corrupted data to be recovered. It is now widely accepted that the stop-and-wait scheme will not scale to the next generation of HPC systems. Inspired by the earlier stop-and-wait Algorithm-Based Fault Tolerance (ABFT) recovery technique, the authors propose a nonstop fault tolerance scheme at the application level and describe its implementation.