For complex systems, the question of component failure is not one of if, but of when. While testing individual components and hardening them against failure is an important task, it does not provide a complete picture of how other components react when a single component fails.
In Chaos Engineering, the testing methodology developed by Netflix, the interrelationships between components in a system are tested by simulating failures in components and studying what knock-on effect occurs throughout the system.
TechRepublic’s Chaos Engineering cheat sheet is an introduction to the methodology. We will update this guide as more information about Chaos Engineering becomes available.
What is Chaos Engineering?
Chaos Engineering is the principle of finding weaknesses in distributed systems by testing real-world outage scenarios on production systems, or as close to production as is possible. In a more abstract sense, Chaos Engineering is a strategy to learn about how your system behaves by conducting experiments to test for a reaction.
Chaos Engineering is more than fault injection, such as introducing latency or errors. While fault injection is an important component when conducting experiments, it is limited in scope to testing one condition. Chaos Engineering can also cover other situations that could lead to service outages, such as large traffic spikes, Byzantine failures, race conditions, and other unpredictable circumstances. The goal of Chaos Engineering is to generate new information about how systems as a whole react when individual components fail.
In the O’Reilly book Chaos Engineering: Building Confidence in System Behavior through Experiments, the architects of Chaos Engineering (Netflix team members Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri) suggest the following inputs for Chaos experiments:
- Simulating the failure of an entire region or datacenter.
- Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.
- Injecting latency between services for a select percentage of traffic over a predetermined period of time.
- Function-based chaos (runtime injection): Randomly causing functions to throw exceptions.
- Code insertion: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions.
- Time travel: Forcing system clocks out of sync with each other.
- Executing a routine in driver code emulating I/O errors.
- Maxing out CPU cores on an Elasticsearch cluster.
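Two of the inputs above, injecting latency for a percentage of traffic and randomly causing functions to throw exceptions, can be illustrated with a small wrapper. This is a minimal sketch rather than a tool from the book: the names chaos, InjectedFault, and fetch_recommendations, and all of the parameters, are assumptions made for illustration.

```python
import random
import time
from functools import wraps


class InjectedFault(RuntimeError):
    """Raised by the wrapper in place of a real failure (hypothetical name)."""


def chaos(failure_rate=0.0, latency_rate=0.0, latency_seconds=0.0, rng=None):
    """Wrap a function so that a fraction of calls fail or are delayed.

    failure_rate and latency_rate are probabilities between 0 and 1.
    All parameter names here are illustrative assumptions.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Function-based chaos: throw an exception for some calls.
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            # Latency injection: delay a select percentage of calls.
            if rng.random() < latency_rate:
                time.sleep(latency_seconds)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@chaos(failure_rate=0.1, latency_rate=0.2, latency_seconds=0.05)
def fetch_recommendations(user_id):
    # Hypothetical service call used only to show the wrapper in place.
    return [f"title-{user_id}-{n}" for n in range(3)]
```

Callers of fetch_recommendations would then see roughly one call in ten fail and one in five slow down, which is enough to observe whether retries, timeouts, and fallbacks behave as intended.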
Additional resources
- AWS re:Invent 2017: Netflix Senior Software Engineer Nora Jones Discusses Performing Chaos at Netflix Scale (AWS YouTube channel)
- Serious about cloud? Follow Netflix’s lead and get into chaos engineering (TechRepublic)
- Quick glossary: Hybrid cloud (Tech Pro Research)
- Download: Cloud computing policy (Tech Pro Research)
When was Chaos Engineering developed?
The ideas that formed Chaos Engineering developed at Netflix as the subscription streaming service began transitioning from its own data centers to the public cloud in 2008. As engineers at Netflix were becoming accustomed to the differences present in cloud architectures, they identified a need to create services with higher resiliency.
In that effort, Chaos Monkey, an automated Chaos testing tool that randomly disables running virtual machine instances in production, was created in 2010 and subsequently released as open-source software in 2012. The rationale behind Chaos Monkey, according to former VP of Product Engineering at Netflix John Ciancutti, is that “If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”
Additional resources
- Crash your cloud, before it crashes itself: Netflix shares tool to help find unknown bugs (ZDNet)
- AWS infrastructure is now behind three main streaming media providers (ZDNet)
- Netflix to raise technology, marketing, content spending in 2018 (ZDNet)
- Amazon Web Services: A cheat sheet (TechRepublic)
Why does Chaos Engineering matter?
Chaos Engineering is helpful in identifying whether individual services in a system can fail gracefully without rendering the entire system inoperable to the user. Because the circumstances that cause system failure cannot be completely eliminated, the most useful plan of action is to ensure that a given system is as resilient as possible. To do so, it is necessary to conduct engineering experiments that test the resiliency of your system.
The principles in Chaos Engineering can also be applied in other aspects of development and operations. Consider the concept of “canary analysis”: When new code is deployed, you can measure performance on a limited number of systems before deploying it more widely. In effect, canary analysis is a sanity check applied to staged software rollouts, with the benefit of performance logging. If the canary deployment does not move your metrics away from the predetermined steady state, the new code can be deployed widely. If the canary deployment exceeds your predetermined “error budget,” it is withdrawn for further refinement to protect the integrity of the service.
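The canary decision described above can be sketched as a comparison of error rates between the existing fleet and the canary. This is a simplified illustration under stated assumptions: GroupStats, canary_verdict, and the 1% default error budget are hypothetical names and values, and real canary analysis typically compares many metrics rather than one.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Request and error counts for one group of servers (hypothetical type)."""
    requests: int
    errors: int

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline, canary, error_budget=0.01):
    """Return "withdraw" if the canary's error rate exceeds the baseline's
    by more than the error budget, otherwise "promote"."""
    excess = canary.error_rate - baseline.error_rate
    return "withdraw" if excess > error_budget else "promote"


baseline = GroupStats(requests=10_000, errors=50)   # 0.5% errors
healthy_canary = GroupStats(requests=500, errors=3)  # 0.6% errors
broken_canary = GroupStats(requests=500, errors=20)  # 4.0% errors

print(canary_verdict(baseline, healthy_canary))  # promote
print(canary_verdict(baseline, broken_canary))   # withdraw
```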
Additional resources
- AWS outage: How Netflix weathered the storm by preparing for the worst (TechRepublic)
- From vulnerability to exploit in 96 minutes, or why software fire drills are necessary (TechRepublic)
- Microservices and containers in service meshes mean less chaos, more agility (ZDNet)
- Cloud computing: Here’s how much a huge outage could cost you (ZDNet)
How do I implement Chaos Engineering in my organization?
At a basic conceptual level, there are four steps to implementing Chaos Engineering in your organization.
Identify a measurable output that indicates behavior, define “steady state”
In systems theory, “steady state” is attained if variables that define the behavior of a system do not change in time. Put more simply for this purpose, steady state is attained if recently observed behavior continues into the future. The variable needed in this case is a real-time positive indicator that the service is working as designed for the intended purpose. In the case of Netflix, the company uses the rate at which customers press the play button on a video streaming device as its steady-state metric. Netflix calls this SPS, or “(stream) starts per second.”
Of note, steady state does not mean a constant value. Users generally do not use the service evenly throughout the day; as an example, subscribers are more likely to use the service in the evening than in the morning. In this case, steady state is the expected usage pattern over the course of the day, not an unchanging value from second to second.
For the purpose of Chaos Engineering, a business metric is a more functionally useful measure than a system metric. While many existing frameworks easily enable developers to observe system states such as CPU utilization, using a business metric provides more insight into the health of the system.
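Because steady state follows a daily pattern, one way to check it is to compare the current value of the business metric against past values for the same hour of the day. The following is a minimal sketch of that idea, not Netflix’s method: the function name, the shape of the history data, and the three-standard-deviation tolerance are all assumptions.

```python
from statistics import mean, stdev


def within_steady_state(history, hour, observed, tolerance=3.0):
    """Check an observed metric value against past values for the same hour.

    history maps an hour of the day (0-23) to a list of at least two past
    readings of the business metric for that hour. The value counts as
    steady state if it lies within `tolerance` standard deviations of the mean.
    """
    samples = history[hour]
    center = mean(samples)
    spread = stdev(samples)
    return abs(observed - center) <= tolerance * spread


# Hypothetical stream-start rates recorded on previous days.
history = {
    8: [190, 200, 210, 205, 195],      # morning: low usage is normal
    20: [980, 1000, 1020, 1010, 990],  # evening: high usage is normal
}

print(within_steady_state(history, hour=8, observed=200))   # True
print(within_steady_state(history, hour=20, observed=200))  # False
```

The same reading of 200 is normal in the morning and a clear deviation in the evening, which is why a single fixed threshold would be a poor definition of steady state here.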
Form a hypothesis
For any experiment, a testable hypothesis is necessary to determine if the experiment is a success or failure, which aids in drawing conclusions when conducting Chaos experiments. Given that the purpose of Chaos Engineering is to ensure the reliability (or the graceful degradation) of systems, the hypothesis for your tests should resemble the statement “the events we will inject into the system will not result in a change from the steady state.”
If you have any reason to believe that an experiment will result in a change from the steady state, or otherwise break things in the production environment, do not conduct the experiment. You should first work to strengthen the reliability of your system before attempting to break it.
Simulate real-world events
Testing events that may result in a loss of availability, from the likely to the implausible, is important in developing an understanding of the resiliency of your system. Testing for hardware failure, state transmission errors, resource overload, network latency and failure, functional bugs, significant fluctuations in input, retry storms, race conditions, dependency failures, Byzantine failures, and unusual or unpredictable combinations of communication between services can increase confidence in the reliability of your system.
Disprove your hypothesis
For the time your experiment was running, was there a difference in the steady state between the experimental group and the control group? Using logging and metrics to see if adverse effects happened as a result of the test is key to identifying structural problems in your system.
Accordingly, the harder it is to deviate from the steady state, the more confidence can be placed in the design of your system.
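The comparison between the control group and the experimental group can be sketched as a check on how far the experimental group’s metric moved relative to the control. This is an illustrative simplification: hypothesis_holds and the 5% threshold are assumptions, and a production analysis would normally apply a proper statistical test.

```python
from statistics import mean


def hypothesis_holds(control, experiment, max_relative_change=0.05):
    """Return True if the experimental group's mean metric stays within
    max_relative_change of the control group's mean.

    control and experiment are lists of readings of the same steady-state
    metric, taken over the same time window.
    """
    baseline = mean(control)
    change = abs(mean(experiment) - baseline) / baseline
    return change <= max_relative_change


control = [100, 102, 98, 101]
unaffected = [99, 101, 100, 97]
degraded = [80, 82, 79, 81]

print(hypothesis_holds(control, unaffected))  # True: no structural problem found
print(hypothesis_holds(control, degraded))    # False: hypothesis disproved
```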
Additional resources
- 10 ways to survive a critical system outage (TechRepublic)
- How to conduct a production outage post-mortem (TechRepublic)
- Software bugs? Avoid these 10 costly programming mistakes (ZDNet)
- Systems downtime expense calculator (Tech Pro Research)
What tools can I use to get started with Chaos Engineering?
A variety of open-source tools exist to assist in the practice of Chaos Engineering in your organization. Foremost among these is Simian Army, which was developed by Netflix to test the reliability and security of its services running on AWS. Simian Army includes Chaos Monkey, which can be used to find services in production and randomly disable them, as well as Chaos Gorilla, which disables an entire availability zone. Finally, Chaos Kong disables an entire AWS region.
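The core idea behind Chaos Monkey, randomly choosing running instances to disable, can be sketched in a few lines. This is not Chaos Monkey’s actual code or configuration: pick_victims, the group and instance names, and the probability parameter are hypothetical, and the sketch only selects instances rather than calling any cloud API.

```python
import random


def pick_victims(groups, probability=1.0, rng=None):
    """Choose at most one instance per group to terminate.

    groups maps a group name to a list of instance IDs. Each non-empty group
    is selected with the given probability, and one of its instances is then
    chosen at random. Returns a list of (group, instance) pairs.
    """
    rng = rng or random.Random()
    victims = []
    for name, instances in sorted(groups.items()):
        if instances and rng.random() < probability:
            victims.append((name, rng.choice(instances)))
    return victims


groups = {
    "api": ["i-api-1", "i-api-2", "i-api-3"],
    "search": ["i-search-1", "i-search-2"],
}

for group, instance in pick_victims(groups, probability=0.5):
    # A real tool would call the cloud provider's API here.
    print(f"would terminate {instance} in group {group}")
```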
Other tools include Pumba, Blockade, and Tugbot, three options for Chaos testing in Docker; Chaos Dingo for Microsoft Azure; Monkey-Ops for OpenShift; and Chaos Lemur for BOSH-managed environments. In addition, Chaos HTTP Proxy introduces failures into HTTP requests via a proxy server, and Chaos Lambda randomly terminates instances in AWS Auto Scaling groups.
For effective implementations of Chaos Engineering, automation of tests is necessary. Given that real-world scenarios that result in downtime are often the result of unexpected circumstances, simulating downtime in similarly unexpected circumstances is necessary to measure a genuine response of your operating environment. To facilitate automation, Netflix developed ChAP, the Chaos Automation Platform. Much like canary analysis, ChAP is designed to end experiments if the results exceed a predetermined “error budget” in an effort to prevent catastrophic damage during an experiment.
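The idea of ending an experiment automatically once an error budget is exceeded can be sketched as a monitoring loop around the fault injection. This is not how ChAP itself is implemented; run_experiment and every parameter below are assumptions chosen to show the pattern.

```python
import time


def run_experiment(inject, revert, read_error_rate,
                   error_budget=0.02, max_checks=10, interval_seconds=0.0):
    """Inject a fault, watch the error rate, and stop early if it exceeds
    the error budget. The fault is always reverted, even on abort.

    inject, revert, and read_error_rate are caller-supplied callables.
    """
    inject()
    try:
        for check in range(1, max_checks + 1):
            if read_error_rate() > error_budget:
                return f"aborted after check {check}"
            time.sleep(interval_seconds)
        return "completed"
    finally:
        # Runs on both the normal and the abort path.
        revert()


# Hypothetical readings: the error rate climbs past the 2% budget.
readings = iter([0.001, 0.005, 0.08])
result = run_experiment(
    inject=lambda: print("fault injected"),
    revert=lambda: print("fault reverted"),
    read_error_rate=lambda: next(readings),
)
print(result)  # aborted after check 3
```

Placing the revert step in a finally block means the injected fault is removed whether the experiment completes, aborts, or raises an unexpected exception.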
The Chaos Community provides additional resources, including discussion groups on Google Groups and LinkedIn, as well as meetup groups for the San Francisco Bay area, Raleigh, NC, Hamburg, Germany, and Paris, France.
Additional resources
- Cloud computing: Three strategies for making the most of on-demand (ZDNet)
- Ex-Facebook engineers launch Honeycomb, a new tool for your debugging nightmares (TechRepublic)
Who uses Chaos Engineering, and why do some people consider it risky?
Chaos Engineering is used across a variety of industries. Some examples are:
- Technology giants including Google, Amazon, Microsoft, Dropbox, and Yahoo;
- Educational and research facilities including North Carolina State University, University of California, and Sandia National Labs;
- Finance companies Fidelity Investments and Visa; and
- Programmer-facing resources and consulting firms GitHub, O’Reilly Media, Pivotal, DevJam, Thoughtworks, Cognitect, Cake Solutions, SendGrid, Wallaroo Labs, New Relic, and Gremlin.
Some managers may be hesitant to implement Chaos Engineering in their organization because the risks of failure are higher than those at Netflix. In the event something goes wrong with Netflix’s network, the customer is merely inconvenienced by not having a video play; in a field such as healthcare, the consequences of an outage can be far more serious. In response to engineers with these concerns, the authors of the Chaos Engineering book cite medical trials as the foundation on which the discipline of Chaos Engineering is built:
“We remind these engineers that many of the principles of western science that inspired our formalization of Chaos Engineering originated in medicine. Clinical trials are the highest standard for medical research. Without trivializing the potential ramifications of introducing chaos into a healthcare-related system, we remind them that there is a well-respected precedent for experimentation where lives literally are at stake.”
Additional resources
- How to become a developer: A cheat sheet (TechRepublic)
- 5 habits of highly successful developers (TechRepublic)
- 15 books every programmer should read (free PDF) (TechRepublic)
- Essential reading for IT leaders: 10 books on cloud computing (free PDF) (TechRepublic)
- IT infrastructure spending shifting toward cloud deployments (ZDNet)
- Meet Kripa Krishnan, Google’s queen of chaos (Business Insider)