Fault Tolerance Management for a Hierarchical GridRPC Middleware
Source: French National Institute for Research in Computer Science and Control
GridRPC middleware usually handles failures with three mechanisms: a failure detector provided by TCP or another link- or network-layer protocol, automatic checkpointing of sequential jobs, and a centralized stable agent that performs scheduling. Recent developments have introduced new mechanisms: the optimal Chandra, Toueg, and Aguilera failure detector; optimized checkpoint routines, which most numerical libraries now provide themselves; and GridRPC architectures with distributed scheduling. This paper adapts to these developments by providing the first implementation and evaluation of the optimal failure detector in a grid system, a novel and simple checkpoint API that handles both service-provided and automatic checkpoints, and a recovery algorithm for the scheduling hierarchy that tolerates several simultaneous failures.
Date: Feb 2008



