Evaluating Application Resilience with XRay

Provided by: Louisiana State University
Topic: Hardware
Format: PDF
The rising count and shrinking feature size of transistors in modern computers are making them increasingly vulnerable to various types of soft faults. This problem is especially acute in High-Performance Computing (HPC) systems used for scientific computing, because these systems include many thousands of compute cores and nodes, all of which may be used in a single large-scale run. The increasing vulnerability of HPC applications to errors induced by soft faults is motivating extensive work on techniques to make these applications more resilient, ranging from generic techniques such as replication or checkpoint/restart to algorithm-specific error detection and tolerance techniques.