On the Combination of Silent Error Detection and Checkpointing
In this paper, the authors revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, so a mechanism to detect them must be provided. The authors consider two models: (i) errors are detected after a delay that follows a probability distribution (typically an Exponential distribution); (ii) errors are detected through a verification mechanism. In both cases, they compute the optimal checkpointing period that minimizes the waste, i.e., the fraction of time during which nodes do not perform useful computations.
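
To make the notion of an optimal period concrete, the following Python sketch minimizes a generic first-order waste approximation. It is an illustration under stated assumptions, not the paper's exact model: the waste is taken to be `overhead / period + loss_fraction * period / mtbf`, which is reasonable only when the overhead is much smaller than the period and the period is much smaller than the mean time between errors. With a checkpoint cost as the overhead and `loss_fraction = 0.5`, this reduces to the classical fail-stop approximation, whose minimizer is the square root of twice the checkpoint cost times the MTBF. The verified variant assumes that a verification runs just before each checkpoint and that an error is only caught at that point, so the whole period is lost (`loss_fraction = 1.0`). All function names and parameter values are hypothetical.

```python
import math


def first_order_waste(period, overhead, mtbf, loss_fraction):
    """Approximate waste: error-free overhead per period plus expected re-execution.

    Assumed first-order model (not taken from the paper):
        waste = overhead / period + loss_fraction * period / mtbf
    """
    return overhead / period + loss_fraction * period / mtbf


def optimal_period(overhead, mtbf, loss_fraction):
    """Closed-form minimizer of first_order_waste with respect to the period."""
    return math.sqrt(overhead * mtbf / loss_fraction)


# Hypothetical parameters, all in seconds.
C = 60.0      # checkpoint cost
V = 30.0      # verification cost
MU = 86400.0  # mean time between errors

# Fail-stop baseline: on average half a period is lost per failure.
t_failstop = optimal_period(C, MU, 0.5)

# Silent errors with a verification before each checkpoint (assumption):
# the error is only detected at the end of the period, so all of it is lost.
t_verified = optimal_period(C + V, MU, 1.0)

print(f"fail-stop period: {t_failstop:.1f} s, "
      f"waste: {first_order_waste(t_failstop, C, MU, 0.5):.4f}")
print(f"verified period:  {t_verified:.1f} s, "
      f"waste: {first_order_waste(t_verified, C + V, MU, 1.0):.4f}")
```

Under these assumptions, the verified scenario yields a shorter period and a higher waste than the fail-stop baseline, because each period carries the extra verification cost and an error costs the entire period rather than half of it on average.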