Enhancing Application Robustness Through Adaptive Fault Tolerance
Source: Illinois Institute of Technology
As High Performance Computing (HPC) systems continue to grow in scale, application fault resilience becomes crucial. To address this problem, the authors are working on the design of an adaptive fault tolerance system for HPC applications. The system aims to enable parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. Both prior and ongoing work are summarized in this paper.

Over the past decades, the insatiable demand for more computational power in science and engineering has driven the development of ever-larger supercomputers.