China University of Petroleum
Conducting system operations (such as upgrade, reconfiguration, and deployment) for large-scale systems in the cloud is complex and error-prone. These operations rely heavily on unreliable cloud infrastructure APIs to complete. The inherent uncertainties and inevitable errors cause a long tail in the distribution of operation completion times. In this paper, the authors propose mechanisms and deployment architecture tactics to tolerate the long tail. They wrap cloud provisioning API calls and implement deployment tactics at the architecture level for system operations.
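The abstract does not spell out how the provisioning API calls are wrapped. As a rough illustration only, one common way to bound the long tail is to give each call a deadline and retry when it is exceeded. The sketch below assumes that approach; the names `call_with_deadline` and `FlakyProvisioner` are hypothetical and do not come from the paper or from any cloud SDK.

```python
import concurrent.futures as cf
import threading
import time


def call_with_deadline(api_call, timeout_s, max_attempts=3):
    """Run api_call; if an attempt fails or exceeds timeout_s, abandon it and retry.

    Returns (result, attempt_number). Hypothetical helper, not the paper's wrapper.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        executor = cf.ThreadPoolExecutor(max_workers=1)
        future = executor.submit(api_call)
        try:
            return future.result(timeout=timeout_s), attempt
        except cf.TimeoutError:
            # The slow attempt keeps running in its thread; we just stop waiting.
            last_error = TimeoutError(f"attempt {attempt} exceeded {timeout_s}s")
        except Exception as exc:  # the API call itself raised an error
            last_error = exc
        finally:
            executor.shutdown(wait=False)  # do not block on the abandoned attempt
    raise last_error


class FlakyProvisioner:
    """Stand-in for a cloud API (an assumption): each call sleeps for a preset delay."""

    def __init__(self, delays):
        self.delays = list(delays)
        self.calls = 0
        self.lock = threading.Lock()

    def launch_instance(self):
        with self.lock:
            delay = self.delays[min(self.calls, len(self.delays) - 1)]
            self.calls += 1
        time.sleep(delay)
        return "instance-ready"


if __name__ == "__main__":
    provisioner = FlakyProvisioner([1.0, 0.0])  # first call is slow, second is fast
    print(call_with_deadline(provisioner.launch_instance, timeout_s=0.2))
```

Because an abandoned attempt may still complete in the background, a wrapper of this kind is only safe when the wrapped call is idempotent or when stray resources are cleaned up afterwards.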