Date Added: Apr 2010
As transistor technology scales ever further, hardware reliability is becoming harder to manage. The effects of soft errors, variability, wear-out, and yield are intensifying to the point where it becomes difficult to harness the benefits of deeper scaling without mechanisms for hardware fault detection and correction. The authors observe that the combination of emerging applications and emerging many-core architectures makes software recovery a viable and interesting alternative to traditional, hardware-based fault recovery. Emerging applications tend to have few I/O and memory side-effects, which limits the amount of information that needs checkpointing, and they allow discarding individual sub-computations with typically minimal qualitative impact. Software recovery can harness these properties in ways that hardware recovery cannot.