At a virtual summit on cybersecurity, Jamie Dicken, manager of applied security at Cardinal Health, said that security chaos engineering is similar to software testing.
Image: Jamie Dicken

Chaos engineering is a way for security teams to replace continuous firefighting with continuous learning, according to two industry experts. At the RSA 365 Virtual Summit this week, Aaron Rinehart, CTO and co-founder of Verica, and Jamie Dicken, manager of applied security at Cardinal Health, explained how this approach to IT security works. Instead of waiting for daily operations or attacks from outsiders to highlight security flaws, security teams can get ahead of these problems by experimenting to see how a system really works.

During the presentation, “Navigating the Unknowable: Resilience through Security Chaos Engineering,” Dicken said one reason security teams are constantly running from one security incident to the next is the traditional design-oriented mindset.

“What we’re doing is fighting battle after battle when what we need is a new, radical way to secure and stabilize our systems if we ever want to get ahead of this,” she said.

Dicken used advice from management expert Dave Snowden to explain why security chaos engineering works: The only way to understand a complex system is to interact with it. She said part of the problem is that no network engineer or security expert really understands the complex networks they manage due in part to incomplete documentation.

“If your process of evaluating a system depends on an outdated or incorrect documentation of a system, your evaluation of that system is going to fail,” she said.


Rinehart said chaos engineering tests assumptions about a system and how it works.

“We are proactively introducing problems into a distributed system to try to determine the conditions under which the system will fail before it actually does,” he said.

This method of testing security systems reduces uncertainty and increases the chance of spotting a weakness before an attacker does.

“We’re not simulating attacks, we’re just injecting small points of failure,” Rinehart said.

Rinehart listed these use cases for security chaos engineering:

  • Incident response
  • Security control validation
  • Security observability
  • Compliance monitoring

“Every chaos engineering experiment has compliance value because you are verifying that the system worked the way you had it documented,” he said.
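The experiment loop Rinehart describes — document an expectation, inject a small failure, observe whether the controls respond — can be sketched in a few lines. The sketch below is purely illustrative; the class names and the port-drift scenario are hypothetical, not part of any real chaos engineering tool.

```python
# Minimal sketch of a security chaos experiment: state a hypothesis about a
# control, inject a small failure, and check whether the control caught it.
# All names here are illustrative, not from any real tool.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    name: str
    hypothesis: str              # the documented expectation being tested
    inject: Callable[[], None]   # introduce the small point of failure
    detect: Callable[[], bool]   # did the control respond as documented?

    def run(self) -> bool:
        self.inject()
        caught = self.detect()
        status = "held" if caught else "FAILED"
        print(f"[{self.name}] hypothesis {status}: {self.hypothesis}")
        return caught

# Toy scenario: simulate port drift and a control that watches the baseline.
open_ports = {443}

def open_unexpected_port() -> None:
    open_ports.add(8080)  # the injected misconfiguration

def control_detects_drift() -> bool:
    allowed = {443}
    return bool(open_ports - allowed)  # flag any port outside the baseline

exp = Experiment(
    name="port-drift",
    hypothesis="port changes outside the baseline are detected",
    inject=open_unexpected_port,
    detect=control_detects_drift,
)
exp.run()
```

Note that a passing run doubles as the compliance evidence Rinehart mentions: each experiment verifies that the system behaved the way the documentation says it does.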

In December, O’Reilly published a report about security chaos engineering by Rinehart and Kelly Shortridge.

In the RSA session, Dicken explained how she introduced chaos engineering to her team, and Rinehart described how he built an open source tool, ChaoSlinger, to implement this approach.

Getting started with chaos engineering

Dicken said that her team at Cardinal Health started using this approach to security in the summer of 2020. Part of her inspiration for making this shift was the IBM Security 2020 Cost of a Data Breach Report. Dicken said many people focused on the number of attacks that came from outsiders: 52%.

“What I take away is that 48% of data breaches are caused by mistakes and accidents or preventable failure,” she said.

Her team started with security control validation as the focus of its experimentation. The plan was to make sure existing security controls had not degraded over time and to establish a new process for building controls going forward.

Dicken said she built a multi-disciplinary team of engineers covering network security, network architecture, systems engineering, and risk and privacy.

The goals for the team were:

  • Identify indisputable, critical security gaps
  • Illustrate the “big picture” of security gaps
  • Ensure security gaps weren’t re-opened later

She said the team realized that front line security engineers didn’t always have enough information about a particular system to describe where the risk was and drive it to remediation.

Having a multidisciplinary team was one way to address that information gap. Dicken said the team developed a five-step process to guide each chaos engineering experiment:

  1. Select a control to validate
  2. Define the standards
  3. Build automation to validate the standards
  4. Create a dashboard and use analytics to tell the story
  5. Alert on non-compliance to technical standards
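Steps 2 through 5 of the process above can be sketched as a small validation script: define the technical standards, check the observed configuration against them, and alert on any gap. The standards and the stubbed configuration below are hypothetical examples, not Cardinal Health's actual controls.

```python
# Hedged sketch of the team's process: define standards (step 2), automate
# validation (step 3), and alert on non-compliance (step 5). A dashboard
# (step 4) would aggregate the findings this produces.

STANDARDS = {
    "tls_min_version": 1.2,
    "mfa_required": True,
}

def fetch_observed_config() -> dict:
    # In practice this would query the live system; stubbed for illustration.
    return {"tls_min_version": 1.0, "mfa_required": True}

def validate(standards: dict, observed: dict) -> list[str]:
    """Return non-compliance findings; an empty list means compliant."""
    return [
        f"{key}: expected {expected}, observed {observed.get(key)}"
        for key, expected in standards.items()
        if observed.get(key) != expected
    ]

def alert(findings: list[str]) -> None:
    for finding in findings:
        print(f"NON-COMPLIANT: {finding}")  # step 5: surface the gap

findings = validate(STANDARDS, fetch_observed_config())
alert(findings)
```

Run continuously, a check like this catches the control degradation Dicken's team was worried about before an attacker or an outage does.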

Dicken said it’s easy to start with small chaos experiments such as testing a single alert function or one component of the incident response process.

“Think about how you can simulate a security incident in the context of your organization,” she said.

Her long-term goal is to use security chaos experimentation to move to test-driven development.

“We want to partner with our security architecture team so that as we develop new standards, we’re writing our tests up front and as those new controls are built, we see our tests start to pass,” she said.

Verica CTO Aaron Rinehart built this open source tool to conduct security chaos engineering tests and determine whether his network defenses would operate as expected.
Image: Aaron Rinehart

Tools to use for chaos engineering experiments

Once a security team has decided to try this approach, the next step is to find a tool to use. Rinehart explained during the presentation how he developed the open source tool ChaoSlinger when he was the chief security architect at UnitedHealth Group. He built the tool to verify and validate cloud security measures.

He used the tool to see what would happen with the injection of a misconfigured port change, a common issue in cloud operations. He learned four specific things about his network from this experiment:

  1. There was a drift issue between commercial and non-commercial environments
  2. A cloud native configuration management tool caught and blocked this kind of change every time
  3. A home-grown solution was alerting the security team about these changes as expected
  4. A SOC analyst couldn’t determine what AWS account a particular alert was coming from

Rinehart said the solution to the fourth finding was simply adding metadata to the alert. Tracking down the account information manually was straightforward but time-consuming.

“Had this actually been an outage or an incident, spending 30 minutes to an hour could have been very costly,” he said.
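The fix Rinehart describes — attaching account context to the alert so an analyst doesn't have to hunt for it — amounts to enriching the alert payload at creation time. The sketch below uses hypothetical field names and a placeholder account ID; it is not ChaoSlinger's actual alert format.

```python
# Sketch of the metadata fix: build the alert with its source account and
# region attached, so triage starts with the context the SOC analyst needs.
# Field names are illustrative, not from ChaoSlinger itself.

def build_alert(finding: str, account_id: str, region: str) -> dict:
    return {
        "finding": finding,
        "source": {                     # the metadata that makes triage fast
            "aws_account_id": account_id,
            "region": region,
        },
    }

alert = build_alert(
    "unauthorized port change detected",
    "123456789012",   # placeholder account ID
    "us-east-1",
)
print(alert["source"]["aws_account_id"])
```

A one-line lookup like this is the difference between the 30-to-60-minute hunt Rinehart describes and an immediate answer.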

Dicken suggested looking at tools from attack emulation providers or other tools that Red Teams often use.

“You want to build in the idea of continuous validation as opposed to the one-and-done engagement,” she said.