StealthWorks: Emulating Memory Errors

Date Added: Aug 2010
Format: PDF

A study of Google's data center revealed that main memory errors occur surprisingly often. These errors can corrupt applications and the system, reducing reliability. The high error rate indicates that new resiliency techniques will be vital in future memories. Developing such techniques requires a framework for conducting flexible and repeatable experiments. This paper describes such a framework, StealthWorks, which facilitates research on software resilience by behaviorally emulating memory errors in a live system. The authors illustrate its use in studying program tolerance to random errors and in developing a new software technique that continuously tests memory for errors.
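
To make the two use cases concrete, the sketch below shows the general idea of injecting a random single-bit error and then detecting it with a pattern scan. It is only an illustration and not the StealthWorks implementation: the abstract does not describe how the framework emulates errors, so the buffer standing in for memory, the fill pattern, and the helper names `flip_random_bit` and `scan_for_errors` are all assumptions made for this example.

```python
import random

# Hypothetical stand-in for a region of main memory, filled with a known pattern.
PATTERN = 0xAA


def flip_random_bit(memory: bytearray, rng: random.Random) -> tuple[int, int]:
    """Flip one randomly chosen bit in place; return (byte_offset, bit_index)."""
    offset = rng.randrange(len(memory))
    bit = rng.randrange(8)
    memory[offset] ^= 1 << bit
    return offset, bit


def scan_for_errors(memory: bytearray, pattern: int) -> list[int]:
    """Return the offsets whose contents no longer match the fill pattern."""
    return [i for i, value in enumerate(memory) if value != pattern]


memory = bytearray([PATTERN] * 4096)
rng = random.Random(2010)  # fixed seed so the experiment is repeatable

injected_offset, injected_bit = flip_random_bit(memory, rng)
detected = scan_for_errors(memory, PATTERN)

print(f"injected error at byte {injected_offset}, bit {injected_bit}")
print(f"scan reported corrupted offsets: {detected}")
```

Seeding the random generator mirrors the abstract's emphasis on repeatable experiments: the same seed reproduces the same injected error. A real memory tester would run such a scan continuously over memory not in use, which this toy buffer does not attempt to model.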