Failure Data-Driven Selective Node-Level Duplication to Improve MTTF in High Performance Computing Systems

Executive Summary

This paper analyzes the failure behavior of large-scale systems using failure logs collected by Los Alamos National Laboratory on 22 of its computing clusters. The authors note that the nodes in these systems do not all show similar failure behavior. Their objective, therefore, is to order the nodes for incremental (one-by-one) duplication so that a target system MTTF (mean time to failure) is reached by duplicating the fewest nodes. They derive a model of the fault coverage provided by duplicating each node and order the nodes by that coverage.
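The selection idea can be illustrated with a minimal sketch. It is not the authors' coverage model, which this summary does not reproduce. It assumes, purely for illustration, that system MTTF is the observation time divided by the number of unmasked failures, and that duplicating a node masks every failure logged for that node. The function name, parameters, and sample counts are hypothetical.

```python
import math


def order_nodes_for_duplication(failure_counts, observation_hours, target_mttf_hours):
    """Greedily pick nodes to duplicate until the target MTTF is met.

    ASSUMPTIONS (illustrative only, not taken from the paper):
      * system MTTF = observation_hours / (failures not masked by duplication)
      * duplicating a node masks all failures logged for that node
    """
    remaining_failures = sum(failure_counts.values())

    # Rank nodes by the number of failures that duplication would mask
    # (a stand-in for "coverage"), highest first.
    ranked = sorted(failure_counts.items(), key=lambda item: item[1], reverse=True)

    def current_mttf():
        if remaining_failures == 0:
            return math.inf
        return observation_hours / remaining_failures

    chosen = []
    for node, count in ranked:
        if current_mttf() >= target_mttf_hours:
            break  # target reached; stop duplicating
        chosen.append(node)
        remaining_failures -= count

    return chosen, current_mttf()


if __name__ == "__main__":
    # Hypothetical failure counts per node over 10,000 observed hours.
    counts = {"n0": 40, "n1": 10, "n2": 30, "n3": 20}
    nodes, mttf = order_nodes_for_duplication(counts, 10_000, 200)
    print(nodes, round(mttf, 1))
```

With these made-up counts the undup­licated system has an MTTF of 100 hours, and duplicating the two most failure-prone nodes (n0, then n2) is enough to pass the 200-hour target. Under this simplified model, ranking by failure count is what keeps the number of duplicated nodes small; a real coverage model would replace the raw counts.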

  • Format: PDF
  • Size: 873.06 KB