Yale Systems, Inc.
Large-scale systems, common in cloud computing, rely on redundancy for reliability and availability. Modern clouds have become increasingly complex and diverse, creating large, tangled systems that experience long outages when failures occur. While significant effort has gone into resolving faults after they occur, the authors propose a novel approach that untangles this mess before failures happen by auditing the underlying structure of a cloud, using a system they call the cloud Structural Reliability Auditor (SRA). SRA audits a cloud in three steps: collecting comprehensive information about components and their dependencies, using this data to construct a system-wide fault tree, and applying fault tree analysis algorithms to identify and rank sets of components by their likelihood of causing a cloud service outage.
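To make the fault tree step concrete, here is a minimal sketch of how sets of components might be derived from a fault tree and ranked. It is a toy illustration under stated assumptions, not the authors' implementation: the tree encoding, the function names (`cut_sets`, `minimal`, `rank`), the component names, and the failure probabilities are all hypothetical, and component failures are assumed to be independent.

```python
from itertools import product

# A node is either a component name (str) or a gate: ("AND" | "OR", [children]).
# This encoding is an assumption made for illustration only.

def cut_sets(node):
    """Return every set of components whose joint failure fails `node`."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any one child failing is enough to fail the gate.
        return set().union(*child_sets)
    if gate == "AND":
        # Every child must fail: merge one cut set from each child.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")

def minimal(sets):
    """Drop any cut set that strictly contains another cut set."""
    return {s for s in sets if not any(t < s for t in sets)}

def rank(node, fail_prob):
    """Rank minimal cut sets by joint failure probability (independence assumed)."""
    scored = []
    for s in minimal(cut_sets(node)):
        p = 1.0
        for component in s:
            p *= fail_prob[component]
        scored.append((sorted(s), p))
    # Most likely outage cause first; ties broken by component names.
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# Hypothetical service replicated on two servers that share one switch.
tree = ("AND", [("OR", ["server1", "switch"]),
                ("OR", ["server2", "switch"])])
fail_prob = {"server1": 0.1, "server2": 0.1, "switch": 0.05}  # made-up values

ranked = rank(tree, fail_prob)
for components, probability in ranked:
    print(components, probability)
```

In this example the shared switch ranks above the pair of servers: it can take the service down on its own, which is exactly the kind of hidden common dependency that undermines redundancy.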