International Journal of Computer Science and Mobile Computing (IJCSMC)
Large applications executing on grid or cluster architectures consisting of hundreds or thousands of computational nodes raise reliability problems. These problems stem from node failures and from the need for dynamic configuration over long runtimes. This paper presents two fault-tolerance mechanisms, theft-induced checkpointing and systematic event logging. Both are transparent protocols capable of overcoming problems associated with benign faults (i.e., crash faults) and with node or subnet volatility.