Cloud-native applications aren’t the monoliths of old, fitting neatly into client-server or three-tier categories. They’re now a conglomeration of services, mixing your code with platform tools, designed to tolerate failure and to scale around the world.
That’s wonderful for our users: they get applications that are fast and responsive, and that they can access from anywhere on any device. But it makes life harder for developers and operations teams, who must manage complex webs of services that are hard to test at scale. We may design for failure, building redundancy into our systems, but that adds complexity to architectures, with new servers and additional service instances.
Testing complex systems by making them fail
More complexity demands more testing, and that can be an issue when we’re testing what happens when a service fails under load. How do transactions fail when a shopping cart backend needs to switch databases in the middle of a purchase? How will a restaurant delivery tracker respond if its main messaging platform has an outage?
We need a testing model that looks at running systems and then starts to fail elements, allowing us to track system behaviors. The idea is to inject little bits of failure into running systems, monitoring how they respond against a set of target conditions. It’s a technique known as chaos engineering, pioneered inside Netflix with its Chaos Monkey tool, which randomly terminated service instances, aiming to unveil failure modes that hadn’t been considered and that DevOps teams weren’t prepared for.
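The Chaos Monkey idea can be reduced to a few lines: pick a random victim from a pool of instances, terminate it, and check whether the service still has enough healthy capacity. A minimal Python sketch of the concept (the instance names and threshold here are invented for illustration):

```python
import random

def chaos_monkey(instances, healthy_threshold, rng=random):
    """Terminate one randomly chosen instance, then report whether
    the surviving pool still meets the minimum healthy count."""
    victim = rng.choice(sorted(instances))  # sorted() so choice() gets a sequence
    instances.discard(victim)
    return victim, len(instances) >= healthy_threshold

# A toy pool of four service instances; the service needs at least two.
pool = {"web-1", "web-2", "web-3", "web-4"}
killed, still_healthy = chaos_monkey(pool, healthy_threshold=2)
```

The point isn’t the randomness itself; it’s that the check after the fault tells you whether the system degraded gracefully or fell over.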
The intent of chaos engineering isn’t to explore how systems fail, though that can be a beneficial side effect; instead, it aims to show how resilient they are. Netflix needed to deliver a rock-solid customer experience at all times, ensuring that users saw their movies and shows, no matter what was going on in the background.
It’s not surprising that those techniques have been picked up by other platforms, especially in hyperscale cloud providers like Microsoft Azure. If your applications are running on Azure, you want to be sure that even if a Microsoft server fails, your application will continue running. Microsoft’s own chaos engineering team regularly explores how failures affect the platform, with the aim of ensuring that the services your applications depend on will deal with failures gracefully.
Building your own chaos
But can you use the same techniques in your own applications, making sure that your code is as resilient as the services it uses? There’s no reason why not. While Microsoft may have its own teams of Site Reliability Engineers tasked with keeping Azure up and running, once your code is running at scale you need your own SREs, who are familiar both with your software and with the services it uses.
If you’re running at scale, then you’re going to need to implement some form of chaos engineering to ensure that your applications are resilient. Microsoft provides guidance on how to think about using these techniques as part of its Azure documentation, with much of its thinking derived from the Netflix experience. Chaos, it says, is a process.
That’s not surprising. We may think of chaos as randomness, but when we’re using it to test resilience it needs to be planned, treating it much like security. Microsoft’s model talks in terms of attackers and defenders. Attackers are one side of the equation, injecting faults into a system with the aim of breaking it. On the other side, the defenders assess the effects of attacks, analyzing results and planning mitigations.
Tests need to be treated like scientific experiments. You need to start with a hypothesis, something like “the application will continue to operate if it loses a single backend database instance.” That then defines the fault that’s injected, here shutting down a database on a running application. Finally, you have an expected result: the application continuing to run. Your chaos engineering platform needs to manage all three steps, providing a way of starting and stopping tests and accessing test results.
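Those three steps — hypothesis, injected fault, expected result — fit naturally into a small experiment loop. A hedged Python sketch, with a toy replica pool standing in for a real database backend:

```python
def run_experiment(hypothesis, inject_fault, observe, rollback):
    """Minimal chaos-experiment loop: state a hypothesis, inject a
    fault, observe the system, and always roll the fault back."""
    try:
        inject_fault()
        outcome = observe()
        return {"hypothesis": hypothesis, "passed": outcome}
    finally:
        rollback()  # runs whether the observation passed or raised

# Toy system: a pool of database instances behind the application.
replicas = {"db-primary", "db-replica"}

result = run_experiment(
    hypothesis="the app keeps serving if one database instance is lost",
    inject_fault=lambda: replicas.discard("db-replica"),
    observe=lambda: len(replicas) >= 1,  # app needs one live database
    rollback=lambda: replicas.add("db-replica"),
)
```

A real platform replaces the lambdas with infrastructure operations and monitoring queries, but the structure — and the guarantee that the fault is reverted — stays the same.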
One important aspect of chaos testing is remembering that tests have a blast radius. They’re deliberately destructive, so you need to be aware that they can go wrong. That means being able to pull the plug on a test at any time, reverting to normal operations as quickly as possible. Any chaos injection needs a way to roll back, preferably with a single button to automate the entire process.
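In code terms, that single-button rollback guarantee is what a context manager’s cleanup step gives you: however the experiment ends, even with an unhandled error, the injected fault is reverted. A sketch, assuming hypothetical stop/start operations on a toy service pool:

```python
from contextlib import contextmanager

@contextmanager
def injected_fault(service, stop, start):
    """Inject a fault by stopping a service; guarantee it is restarted
    on exit, even if the experiment blows up mid-test."""
    stop(service)
    try:
        yield service
    finally:
        start(service)  # the rollback runs no matter what happened above

running = {"cart", "payments"}

try:
    with injected_fault("payments", running.discard, running.add):
        # Simulate the experiment itself going wrong.
        raise RuntimeError("experiment went wrong mid-test")
except RuntimeError:
    pass
```

Even though the test failed partway through, the `finally` block has restored the service, which is exactly the blast-radius containment the article describes.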
Third-party tools for Azure DevOps show there’s interest in using these techniques as part of testing your applications. Proofdock’s tooling links chaos engineering’s turbulence with modern development concepts, working with observability tools to deliver what it calls “continuous verification,” running everything inside a familiar portal.
Introducing Azure Chaos Studio
Microsoft is currently previewing a set of chaos engineering tools for Azure applications with a selection of customers, based on its own internal tooling. Demonstrated by Azure CTO Mark Russinovich at Microsoft’s Spring virtual Ignite, it’s a mix of an Azure test management portal and a JSON-based test scripting language.
There are two elements to Azure Chaos Studio’s tests: an agent running on your virtual servers or embedded in your code, and direct access to Azure’s own services. These are controlled by JSON experiment descriptions, for example testing failover of an application’s Cosmos DB backend by simulating a failure in one of the application’s regions. Alternatively, an experiment could use an agent to shut down a service host on a server running a Node.js application or some .NET code, testing the resilience of your own application.
Experiments are made up of a series of steps, each of which has actions. Microsoft has developed a domain-specific declarative language for working with application infrastructures, which shares some similarity with its Bicep resource description language. You’ll be able to build experiments inside Visual Studio Code, saving them into Azure, where they’re listed in the Chaos Studio portal. From the portal, you select the experiments you want to run, using other elements of Azure’s developer tools to monitor application operations: either application monitoring built into your code or Azure’s own service tooling.
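Microsoft hasn’t published the full preview schema, but the shape described here — an experiment containing steps, each containing actions — can be sketched as a plain data structure. All field names and fault identifiers below are illustrative assumptions, not the real Chaos Studio language:

```python
# Illustrative only: field names and the fault id are assumptions,
# not the actual Azure Chaos Studio experiment schema.
experiment = {
    "name": "cosmos-region-failover",
    "steps": [
        {
            "name": "fail one region",
            "actions": [
                {
                    "type": "continuous",
                    "fault": "cosmosdb:regionFailover",  # hypothetical fault id
                    "duration": "PT10M",  # ISO 8601: run the fault for 10 minutes
                },
            ],
        },
    ],
}
```

The value of a declarative description like this is that the same experiment can be versioned alongside your code and replayed on every release.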
If you’re using Azure DevOps or another continuous integration/continuous delivery tool, like GitHub Actions, Azure Chaos Studio provides a REST API so you can use it as part of a set of integration tests when you build a new version of your code. Running Chaos Studio early in the application lifecycle makes sense, as it allows you to build resilience testing into your release process.
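The preview API details may change, but starting an experiment follows Azure’s standard management-plane resource pattern under the Microsoft.Chaos resource provider. A sketch that builds the start-experiment URL a CI pipeline would call (the subscription, resource group, experiment name, and api-version shown are placeholder assumptions):

```python
def start_experiment_url(subscription, resource_group, experiment,
                         api_version="2021-09-15-preview"):
    """Build the ARM management-plane URL that starts a Chaos Studio
    experiment. The api_version default is an assumption from the
    preview era and may differ in your tenant."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Chaos"
        f"/experiments/{experiment}/start"
        f"?api-version={api_version}"
    )

url = start_experiment_url("0000-sub", "rg-chaos", "cart-failover")
```

In a pipeline, this URL would be called with a POST request carrying an Azure AD bearer token, with the build gated on the experiment’s result.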
As cloud-native development matures, the way we build applications is becoming more and more the way big cloud platforms and services build their code. Techniques that used to be needed only by companies like Netflix or inside Azure are now necessary for everyone. The arrival of Chaos Studio in Azure goes a long way toward turning what used to be custom tooling into a platform anyone can use, delivering on the promise of resilient systems.