Date Added: Jan 2012
As silicon technologies move into the nanometer regime, transistor reliability is expected to decline as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean times to failure. In this paper, the authors examine the challenges of designing complex computing systems in the presence of transient and permanent faults. They select one small component of a typical Chip MultiProcessor (CMP) system to study in detail: a single CMP router switch. To start, they develop a unified model of faults, based on the time-tested bathtub curve.
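For readers unfamiliar with the bathtub curve, the following is a minimal sketch of its general shape, not the authors' model (the abstract does not give their formulation). It assumes the common textbook construction in which the failure rate is the sum of a decreasing Weibull hazard (early-life failures), a constant rate (random failures, such as transient errors), and an increasing Weibull hazard (wear-out). The function names, parameter values, and time unit are all hypothetical and chosen only for illustration.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1).

    shape < 1 gives a decreasing rate, shape == 1 a constant rate,
    and shape > 1 an increasing rate. Requires t > 0.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t,
                   infant=(0.5, 1000.0),     # hypothetical (shape, scale), shape < 1
                   random_rate=1e-4,         # hypothetical constant failure rate
                   wearout=(5.0, 50000.0)):  # hypothetical (shape, scale), shape > 1
    """Illustrative bathtub-shaped failure rate at time t > 0 (arbitrary units)."""
    return (weibull_hazard(t, *infant)
            + random_rate
            + weibull_hazard(t, *wearout))


if __name__ == "__main__":
    # Sample the curve early in life, in mid-life, and late in life.
    for t in (1.0, 10_000.0, 100_000.0):
        print(f"t = {t:>9.0f}  failure rate = {bathtub_hazard(t):.3e}")
```

With these illustrative parameters the rate is high at first, falls to a flat floor dominated by the constant term, and rises again as the wear-out term takes over, which is the shape that gives the curve its name.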