
Fault-Management in P2P-MPI


Executive Summary

The authors present a study of fault management in a grid middleware: P2P-MPI, software they developed in-house. The framework is MPJ-compliant and lets users execute message-passing parallel programs; its objective is to support environments built from commodity hardware. Program execution in such environments is failure-prone, so particular attention must be paid to fault management. Fault management covers two issues: fault tolerance and fault detection. Fault tolerance concerns program execution: P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Fault detection concerns the system's monitoring of program execution, which is carried out by a distributed set of modules called failure detectors.
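
To make the two ideas in the summary concrete, the sketch below pairs a simple failure detector with a choice among replicas of one computation. It is only an illustration under stated assumptions: the summary does not say which detection protocol P2P-MPI uses, so a plain heartbeat timeout is assumed here, and every name (`HeartbeatFailureDetector`, `pick_live_replica`, the host names, the 3-second timeout) is hypothetical and not part of the P2P-MPI or MPJ APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class HeartbeatFailureDetector:
    """Timeout-based detector (an assumed scheme, not P2P-MPI's actual protocol)."""

    timeout: float  # seconds of silence after which a peer is suspected
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the time of the most recent heartbeat received from `peer`.
        self.last_seen[peer] = now

    def suspected(self, now: float) -> Set[str]:
        # A peer is suspected once it has been silent for longer than `timeout`.
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


def pick_live_replica(replicas: Iterable[str], suspected: Set[str]) -> Optional[str]:
    """Return the first replica that is not suspected, or None if all are."""
    for replica in replicas:
        if replica not in suspected:
            return replica
    return None


# Two hypothetical hosts run replicas of the same process.
detector = HeartbeatFailureDetector(timeout=3.0)
detector.heartbeat("host-a", now=0.0)
detector.heartbeat("host-b", now=0.0)
detector.heartbeat("host-b", now=4.0)  # host-a stays silent after t = 0

suspects = detector.suspected(now=5.0)
survivor = pick_live_replica(["host-a", "host-b"], suspects)
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic. In this run `host-a` has been silent for 5 seconds and is suspected, so the computation would continue on the replica at `host-b`.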

  • Format: PDF
  • Size: 622.5 KB