Failure Tolerance in Petascale Computers
Source: Carnegie Mellon University
Three of the most difficult and growing problems facing future High-Performance Computing (HPC) installations will be avoiding, coping with, and recovering from failures. The coming PetaFLOPS clusters will require the simultaneous use and control of hundreds of thousands or even millions of processing, storage, and networking elements. With so many elements involved, element failure will be frequent, making it increasingly difficult for applications to make forward progress. The success of petascale computing will depend on the ability to provide reliability and availability at scale. While researchers and practitioners have spent decades investigating approaches for avoiding, coping with, and recovering from various modes of computer failure, progress in this area has been hindered by the lack of publicly available, detailed failure data from real large-scale systems.
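The claim that failures become frequent at scale can be made concrete with a back-of-the-envelope calculation. This is a sketch not taken from the source: it assumes components fail independently with exponentially distributed lifetimes, so the system-level mean time between failures (MTBF) shrinks in proportion to the component count. The per-node MTBF figure used below is hypothetical.

```python
def system_mtbf_hours(component_mtbf_hours: float, num_components: int) -> float:
    """MTBF of the whole system when any single component failure
    interrupts the application (independent, exponential lifetimes)."""
    return component_mtbf_hours / num_components

# Hypothetical figure: a 5-year MTBF per element.
component_mtbf = 5 * 365 * 24  # 43,800 hours

for n in (1_000, 100_000, 1_000_000):
    mtbf = system_mtbf_hours(component_mtbf, n)
    print(f"{n:>9} elements -> system MTBF {mtbf:8.3f} hours")
```

Under these assumptions, a machine with 100,000 such elements would see a failure roughly every half hour, which illustrates why applications cannot make forward progress without fault-tolerance mechanisms.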