An Extended Reduction Based Parallel Programming Paradigm with Low Overhead Fault-Tolerance Support

It is widely accepted that existing MPI-based fault-tolerance solutions will not be applicable in the exascale era: with growing levels of concurrency and relatively lower I/O bandwidth, the time required to complete a checkpoint can exceed the Mean Time To Failure (MTTF). In this paper, the authors show that designing a programming model that explicitly and succinctly exposes an application's underlying communication pattern can greatly simplify fault-tolerance support, reducing checkpointing overheads by at least an order of magnitude compared with current solutions. The communication patterns they consider are similar to the notion of dwarfs in the Berkeley view on parallel processing.

Provided by: Ohio State University Topic: Networking Date Added: Oct 2012 Format: PDF
