Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems

Source: Association for Computing Machinery

Reliability is an increasing challenge for High-Performance Computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean time to failure can be addressed by adding a fault-tolerance layer that reconfigures the working nodes so that communication and computation can continue to progress. However, existing approaches fall short in providing both scalability and low reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfiguring the communication infrastructure after node failures. The authors propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults.
Format: PDF
Size: 213.30
Date: Aug 2006