Dynamic Adaptation of Checkpoints and Rescheduling in Grid Computing
Grid computing is a form of distributed computing used mainly to virtualize and utilize geographically distributed idle resources. A grid is a distributed computational and storage environment, often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often causing executing jobs to be lost or delayed. To ensure good performance, fault tolerance must be taken into account. Here the authors address fault tolerance with respect to resource failure. A commonly used technique for achieving fault tolerance is periodic checkpointing, which saves the job's state at regular intervals.
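The following sketch shows the basic idea of periodic checkpointing. It is not the authors' method, and it rests on several hypothetical choices:

- The job's "state" is a running sum.
- An in-memory `dict` stands in for stable checkpoint storage.
- The names `Checkpoint`, `ResourceFailure` and `run_job` are illustrative only.

The job saves its state every `interval` steps. When the resource fails, the job restarts from the last checkpoint, so only the work since that checkpoint is lost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    step: int   # last completed step when the checkpoint was taken
    total: int  # saved job state (a running sum, standing in for real state)


class ResourceFailure(Exception):
    """Hypothetical signal that the grid resource running the job was lost."""


def run_job(n_steps: int, interval: int, store: dict,
            fail_at: Optional[int] = None) -> int:
    """Run steps 1..n_steps, checkpointing every `interval` steps.

    Resumes from store["ckpt"] if one exists. If `fail_at` is given, raises
    ResourceFailure just before that step would execute (simulated failure).
    """
    ckpt = store.get("ckpt", Checkpoint(step=0, total=0))
    step, total = ckpt.step, ckpt.total          # restore saved state
    while step < n_steps:
        if fail_at is not None and step + 1 == fail_at:
            raise ResourceFailure(f"resource lost before step {fail_at}")
        step += 1
        total += step                            # the "work" of this step
        if step % interval == 0:
            store["ckpt"] = Checkpoint(step, total)  # periodic checkpoint
    return total


if __name__ == "__main__":
    store: dict = {}
    try:
        run_job(10, interval=3, store=store, fail_at=8)  # fails after step 7
    except ResourceFailure:
        pass
    print(store["ckpt"].step)                # 6 (step 7's work was lost)
    print(run_job(10, interval=3, store=store))  # 55, resumed from step 6
```

Checkpoints are taken after steps 3 and 6, so the failure before step 8 costs only step 7. On restart the job resumes from step 6 rather than step 0.

The fixed `interval` is the parameter that controls the trade-off:

- A smaller `interval` loses less work per failure but pays the checkpointing cost more often.
- A larger `interval` checkpoints less often but loses more work when a failure occurs.

Adapting this parameter to changing resource availability is the kind of tuning a dynamic scheme would perform.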