An Overview of Checkpointing Techniques for Fault Tolerance in Distributed Computing Systems
Checkpointing is an important feature of distributed computing systems. It provides fault tolerance without requiring additional effort from the programmer. Checkpointing has been widely used to provide fault tolerance in distributed systems, and much research has been carried out to reduce the overhead of checkpoint coordination. A checkpoint is a snapshot of the current state of a process. It saves enough information in non-volatile stable storage that, if the contents of volatile storage are lost due to a process failure, the process state can be reconstructed from the saved information.
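The definition of a checkpoint given above can be illustrated with a minimal single-process sketch in Python. It is an assumption-laden illustration rather than a technique taken from this overview: the function names (save_checkpoint, load_checkpoint, run), the JSON file format, and the use of a local file as "stable storage" are all hypothetical, and no coordination between processes is attempted.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write `state` to `path` so a crash never leaves a partial checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes out of volatile buffers
        os.replace(tmp_path, path)  # atomically replace the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, crash_at=None):
    """Hypothetical workload: sum 0..total_steps-1, checkpointing each step."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            # Simulated failure: the in-memory (volatile) state is lost here.
            raise RuntimeError("simulated process failure")
        state["acc"] += state["step"]
        state["step"] += 1
        save_checkpoint(state, path)
    return state["acc"]
```

If run is interrupted partway and then called again with the same path, it resumes from the last saved step instead of starting over. Writing to a temporary file and renaming it is one common way to avoid a half-written checkpoint; a real distributed system would additionally need to keep the checkpoints of different processes mutually consistent.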