Date Added: Jan 2011
This paper proposes a unified, lightweight error recovery scheme for distributed shared memory clusters, based on coordinated checkpointing and rollback. The scheme maintains multiple globally consistent checkpoints of the cluster state and, after a fault, recovers to a checkpoint taken before the fault occurred. The paper also describes and evaluates the coordinated checkpointing protocol. The protocol neither exchanges coordination messages nor adds information to process messages, and it accesses stable storage only when checkpoints are saved. Each process saves its state independently of the others, using checkpoint timers that are set at the individual processes. Performance evaluation shows that the proposed scheme outperforms previously proposed checkpoint and recovery schemes for distributed shared memory clusters.
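The timer-driven idea in the abstract can be illustrated with a minimal sketch. It is not the paper's protocol: the names `Process`, `recover`, and `simulate`, the shared checkpoint interval, and the simulated clock are all assumptions made for illustration. The sketch also ignores in-flight messages and clock drift, which a real scheme must handle to keep checkpoints globally consistent.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical cluster process with its own local checkpoint timer."""
    pid: int
    interval: int                      # assumed timer period, in simulated ticks
    state: int = 0
    checkpoints: dict = field(default_factory=dict)  # checkpoint index -> saved state

    def tick(self, now):
        self.state += self.pid + 1     # stand-in for application work
        if now % self.interval == 0:
            # Local timer expired: save state independently, with no
            # coordination messages to other processes.
            self.checkpoints[now // self.interval] = self.state

    def rollback(self, index):
        self.state = self.checkpoints[index]
        # Discard checkpoints taken after the recovery point.
        self.checkpoints = {i: s for i, s in self.checkpoints.items() if i <= index}


def recover(processes, fault_time, interval):
    """Roll every process back to the latest checkpoint taken before the fault."""
    index = (fault_time - 1) // interval
    if index < 1:
        raise RuntimeError("no checkpoint predates the fault")
    for p in processes:
        p.rollback(index)
    return index


def simulate(n=3, interval=5, steps=12, fault_time=8):
    procs = [Process(pid, interval) for pid in range(n)]
    for now in range(1, steps + 1):
        for p in procs:
            p.tick(now)
    # The fault is assumed to have occurred at fault_time and to be
    # detected only after the last step.
    index = recover(procs, fault_time, interval)
    return index, [p.state for p in procs]


if __name__ == "__main__":
    print(simulate())
```

Keeping several checkpoints per process matters here: the fault at tick 8 is detected only after the checkpoint at tick 10 has been saved, so recovery has to skip that later checkpoint and return to the one from tick 5.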