Date Added: Dec 2011
Large clusters, high-availability clusters and grid deployments often suffer from network, node or operating system faults and thus require fault-tolerant programming models. Distributed systems are ubiquitous today and enable many applications, including client-server systems, transaction processing, the World Wide Web, and scientific computing. The vast computing potential of these systems is often hampered by their susceptibility to failures, so many techniques have been developed to add reliability and high availability to distributed systems. This paper presents two such techniques, checkpoint-based rollback recovery and log-based rollback recovery, which allow efficient recovery in dynamic heterogeneous systems as well as in multithreaded applications.
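
To make the difference between the two techniques concrete, here is a minimal single-process sketch. It is not the paper's implementation. The `Process` class and its method names are hypothetical, the checkpoint and log are held in memory as stand-ins for stable storage, and replay is assumed to be deterministic.

```python
import copy


class Process:
    """Toy process whose whole state is one running total (hypothetical)."""

    def __init__(self):
        self.state = {"total": 0}
        # Stand-ins for stable storage; a real system would write to disk
        # or to a remote node so that the data survives the failure.
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []  # inputs applied since the last checkpoint

    def apply(self, value):
        # Record the input before applying it so it can be replayed later.
        self.log.append(value)
        self.state["total"] += value

    def take_checkpoint(self):
        # Save a snapshot and drop log entries that the snapshot now covers.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def crash(self):
        # Simulate losing the volatile state.
        self.state = None

    def recover_from_checkpoint(self):
        # Checkpoint-based rollback: restart from the last snapshot.
        # Work done after the snapshot is lost.
        self.state = copy.deepcopy(self.checkpoint)

    def recover_from_log(self):
        # Log-based rollback: restore the snapshot, then replay logged
        # inputs. Assumes replaying the same inputs gives the same state.
        self.state = copy.deepcopy(self.checkpoint)
        for value in self.log:
            self.state["total"] += value
```

In this sketch, applying 5, taking a checkpoint, applying 3 and then crashing leaves a total of 5 after checkpoint-only recovery and 8 after log replay. The trade-off is that logging adds overhead to every input, while checkpointing alone loses whatever was done since the last snapshot.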