Online Diagnosis and Recovery: On the Choice and Impact of Tuning Parameters
A sequenced process of Fault Detection followed by the erroneous node's Isolation and system Reconfiguration (node exclusion or recovery), that is, the FDIR process, characterizes the sustained operation of a fault-tolerant system. For distributed systems that rely on message passing, a number of diagnostic (and associated FDIR) approaches, including the authors' prior algorithms, exist in the literature and in practice. Invariably, the focus is on proving completeness and correctness (all and only the faulty nodes are isolated) for the chosen fault model, without explicitly distinguishing permanently faulty nodes from transiently faulty ones. To capture diagnostic issues related to the persistence of errors (transient, intermittent, and permanent), the authors advocate integrating count-and-threshold mechanisms into the FDIR framework.
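To make the idea of a count-and-threshold mechanism concrete, the following is a minimal sketch of a generic penalty counter, not the authors' algorithm. The class name `PenaltyCounter`, the helper `first_isolation_round`, and the parameter values (`increment`, `decay`, `threshold`) are all assumptions chosen for illustration; they stand in for the kind of tuning parameters the title refers to.

```python
from dataclasses import dataclass


@dataclass
class PenaltyCounter:
    """Hypothetical count-and-threshold filter for one monitored node."""
    increment: float = 1.0   # penalty added per round judged faulty (assumed value)
    decay: float = 0.5       # multiplicative decay per fault-free round (assumed value)
    threshold: float = 3.0   # isolate once the penalty reaches this value (assumed value)
    penalty: float = 0.0

    def observe(self, faulty: bool) -> bool:
        """Fold in one diagnostic round; return True if the node should be isolated."""
        if faulty:
            self.penalty += self.increment
        else:
            self.penalty *= self.decay
        return self.penalty >= self.threshold


def first_isolation_round(verdicts, **params):
    """Return the 1-based round at which the node is isolated, or None if never."""
    counter = PenaltyCounter(**params)
    for round_no, faulty in enumerate(verdicts, start=1):
        if counter.observe(faulty):
            return round_no
    return None


# Illustrative per-round verdicts (True = node judged faulty in that round).
permanent = [True] * 5
transient = [True, False, False, False, True, False]
intermittent = [True, False, True, False, True, True, True]

for name, verdicts in [("permanent", permanent),
                       ("transient", transient),
                       ("intermittent", intermittent)]:
    print(name, first_isolation_round(verdicts))
```

Under these assumed settings, a node that errs in every round is isolated quickly, an isolated transient error decays away without triggering isolation, and an intermittent pattern is isolated only once its errors recur often enough. Raising `threshold` or lowering `decay` trades slower isolation of genuinely faulty nodes against fewer exclusions of healthy nodes hit by transients.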