A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI

Most predictions of exascale machines picture billion-way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. Even under extremely optimistic assumptions about advances in hardware reliability, probabilistic amplification means that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintaining future scientific productivity. Two major problems hinder the ubiquitous adoption of fault tolerance techniques: traditional checkpoint-based approaches incur a steep overhead on failure-free operations, and the dominant programming paradigm for parallel applications (the MPI standard) offers extremely limited support for software-level fault tolerance approaches.

Provided by: University of Tehran Topic: Data Centers Date Added: May 2012 Format: PDF

