A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment

Provided by: Springer Healthcare
Topic: Data Centers
Format: PDF
Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multi-petaflop and, eventually, exascale platforms, failures will become much more common. This has spurred fault-tolerance and resilience research for High-Performance Computing (HPC) systems, including log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault-tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques.
