Date Added: Nov 2010
In designing and building distributed systems, it is common engineering practice to separate steady-state ("normal") operation from abnormal events such as recovery from failure. This way, the normal case can be optimized extensively while the cost of recovery can be amortized. However, integrating the recovery procedure with the steady-state protocol is often far from obvious and can present subtle difficulties. This issue comes to the forefront in modern data centers, where applications are often implemented as elastic sets of replicas that must reconfigure while continuing to provide service, and where it may be necessary to install new versions of active services as bugs are fixed or new functionality is introduced.