DRAM Errors in the Wild: A Large-Scale Field Study

Executive Summary

Errors in Dynamic Random Access Memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly in terms of both hardware replacement and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, the authors analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days.

  • Format: PDF
  • Size: 284 KB