The authors present a new approach to fault tolerance for high-performance computing systems. Their approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance (ABFT) technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. The result is a strongly scalable fault-tolerance mechanism that can also detect and correct errors (bit flips) on the fly, while a computation is running. To assess the viability of their approach, they have developed a fault-tolerant matrix-matrix multiplication subroutine, and they propose models to predict its running time.
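To make the checksum idea behind ABFT concrete, here is a minimal single-process sketch of the classic Huang and Abraham encoding for matrix multiplication. It is not the authors' parallel subroutine: the function names are hypothetical, it uses exact integer arithmetic (floating-point data would need a tolerance in the comparisons), and it only repairs a single corrupted data entry of the product.

```python
def with_column_checksums(a):
    """Return a copy of a with an extra row holding each column's sum."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksums(b):
    """Return a copy of b with an extra column holding each row's sum."""
    return [row[:] + [sum(row)] for row in b]


def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def detect_and_correct(c):
    """Check a full-checksum product c (last row and column are checksums).

    Returns None if all checksums agree, or the (row, col) of the single
    data entry that was repaired in place. Raises ValueError for any
    error pattern this sketch cannot locate.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row checksum tells us by how much the entry is off.
        c[i][j] += c[i][m] - sum(c[i][:m])
        return (i, j)
    raise ValueError("error pattern not correctable by this sketch")


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = matmul(with_column_checksums(a), with_row_checksums(b))
    c[0][1] ^= 4  # simulate a bit flip in one entry of the product
    print(detect_and_correct(c))
    print([row[:2] for row in c[:2]])
```

The key property is that multiplying a column-checksum matrix by a row-checksum matrix yields a product whose last row and last column are themselves checksums of the result, so the encoding survives the multiplication and a faulty entry can be located at the intersection of the one inconsistent row and the one inconsistent column.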
- Format: PDF
- Size: 314.55 KB