University of New Orleans Fund
A large number of cloud application failures happen during sporadic operations such as upgrade, deployment reconfiguration, migration and scaling-out/in. Most of these failures are caused by operator and process errors. From a cloud consumer's perspective, recovery from these failures has to work within the limited control and visibility that cloud providers offer. In addition, a large-scale system often has multiple operation processes running simultaneously, which complicates error diagnosis and recovery. Existing built-in or infrastructure-based recovery mechanisms often assume random component failures and rely on checkpoint-based rollback, compensation actions, redundancy and rejuvenation to recover.