Institute of Electrical & Electronic Engineers
Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point towards multi-core designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents Process-Level Redundancy (PLR), a software technique for transient fault tolerance which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process, and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources.