Multi-Criteria Checkpointing Strategies: Optimizing Response-Time Versus Resource Utilization

Provided by: University of Tehran
Topic: Hardware
Format: PDF
Failures increasingly threaten the efficiency of HPC systems, and current projections for exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its limits at such scales. One reason for this unnerving situation is the focus that has been placed on per-application completion time rather than on platform efficiency. In this paper, the authors discuss the case of uncoordinated rollback recovery in which the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue.