Date Added: Jun 2009
Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked in isolation, and information about faults is rarely shared among them. This lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. This paper proposes a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with each other and adapt to faults in a holistic manner.