University of Illinois
The increasing size and complexity of High-Performance Computing systems is making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low energy mode. Previous techniques for Algorithm - Based Fault Tolerance (ABFT) have been proposed for detecting errors in dense linear operations, but have high overhead in the context of sparse problems. In this paper, the authors propose a set of algorithmic techniques that minimize the overhead of fault detection for sparse problems. The techniques are based on two insights. First, many sparse problems are well structured (e.g., diagonal, banded diagonal, block diagonal), which allows for sampling techniques to produce good approximations of the checks used for fault detection.