Date Added: Aug 2009
The authors present in this paper a study on fault management in a grid middleware. The middleware is their home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors.