A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility
Memory hardware reliability is an indispensable part of whole-system dependability. This paper presents realistic memory hardware error traces, covering both transient and non-transient errors, collected over roughly nine months from production computer systems with more than 800 GB of memory. Detailed information on the error addresses makes it possible to identify patterns of single-bit, row, column, and whole-chip memory errors. Based on the collected traces, the paper explores the implications of different hardware ECC protection schemes in order to identify the most common error causes and to approximate the error rates exposed to the software level.
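
To make the idea of address-based pattern identification concrete, the following is a minimal sketch and not the paper's actual method. It assumes each logged error has already been decoded into a hypothetical (chip, bank, row, column) record that identifies one faulty bit, and it groups the distinct faulty cells on each chip into the four patterns named above. The record layout, the function names, and the grouping rule are all illustrative assumptions. In particular, the "whole-chip" label here is a coarse catch-all for errors spanning several rows and several columns; a real analysis would likely need thresholds to separate it from a few unrelated single-bit faults.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ErrorRecord(NamedTuple):
    """One logged error, already decoded. The field layout is hypothetical."""
    chip: int
    bank: int
    row: int
    column: int


def classify_chip(records: Iterable[ErrorRecord]) -> str:
    """Label the error pattern on one chip from its distinct faulty cells."""
    # Repeated errors at the same cell collapse into one entry.
    cells = {(r.bank, r.row, r.column) for r in records}
    if not cells:
        return "none"
    if len(cells) == 1:
        return "single-bit"
    rows = {(bank, row) for bank, row, _ in cells}
    columns = {(bank, column) for bank, _, column in cells}
    if len(rows) == 1:
        return "row"      # several columns, all within one row of one bank
    if len(columns) == 1:
        return "column"   # several rows, all within one column of one bank
    # Assumption: anything spanning multiple rows and columns is "whole-chip".
    return "whole-chip"


def classify_trace(records: Iterable[ErrorRecord]) -> Dict[int, str]:
    """Group a trace by chip and classify each chip independently."""
    by_chip: Dict[int, List[ErrorRecord]] = defaultdict(list)
    for record in records:
        by_chip[record.chip].append(record)
    return {chip: classify_chip(recs) for chip, recs in by_chip.items()}


if __name__ == "__main__":
    # Made-up records for illustration only; not data from the paper.
    trace = [
        ErrorRecord(chip=0, bank=0, row=5, column=7),
        ErrorRecord(chip=0, bank=0, row=5, column=7),
        ErrorRecord(chip=1, bank=0, row=3, column=1),
        ErrorRecord(chip=1, bank=0, row=3, column=9),
    ]
    for chip, label in sorted(classify_trace(trace).items()):
        print(chip, label)
```

Classifying on distinct cells rather than raw error counts means that a single faulty bit which is hit repeatedly, as a non-transient error would be, is still reported as one single-bit fault instead of being mistaken for a wider pattern.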