The phrase ‘built to fail’ doesn’t inspire confidence, suggesting a product moments away from imploding.

But firms hosting distributed systems on top of cloud infrastructure should prepare for the worse.

That’s the argument of online video giant Netflix, which relies on Amazon Web Services to stream movies and TV shows to more than 50 million homes worldwide.

To harden itself against catastrophic failure, Netflix employs its Simian Army, software that deliberately attempts to wreak havoc on its systems. Simian Army attacks Netflix infrastructure on many fronts – Chaos Monkey randomly disables production instances, Latency Monkey induces delays in client-server communications, and the big boy, Chaos Gorilla, simulates the outage of an entire Amazon availability zone.

Netflix uses these virtual vandals to test whether its automated systems can cope with real-life failures without impacting its customers. The firm likens it to poking a hole in your car tyres once a week to see if you’re able to replace them.

But why does using cloud infrastructure require such an approach? According to Netflix, it’s the lack of control over the underlying hardware, the inability to configure it to try to ensure 100 percent uptime.

Netflix has had several opportunities to see how robust this intentional breakage has left its systems – from the AWS Elastic Load Balancer outage in 2012 to AWS recently rebooting machines running one in 10 EC2 instances.

This reboot gave Netflix a chance to find out whether its databases could operate in the same fault-tolerant manner as its other systems.

“Databases have long been the pampered and spoiled princes of the application world,” Netflix engineering manager for cloud database engineering Christos Kalantzis and engineering manager for chaos engineering Bruce Wong said in a blog post.

“They received the best hardware, copious amounts of personalised attention and no one would ever dream of purposely mucking around with them. In the world of democratised public clouds, this is no longer possible. Node failures are not just probable, they are expected. This requires database technology that can withstand failure and continue to perform.”

Netflix uses Apache Cassandra, an open-source NoSQL distributed database. Distributed systems offer a trade-off between consistency of data held on each node in the system, availability of the system and partition tolerance – the ability of a system to continue operating after a subset becomes unavailable.

“By trading away C (Consistency), we’ve made a conscious decision to design our applications with eventual consistency in mind,” Kalantzis and Wong said.

“Our expectation is that Cassandra would live up to its side of the bargain and provide strong availability and partition tolerance.”

So how did Cassandra fare during the EC2 reboot? According to Kalantzis and Wong, out of more than 2,700 production Cassandra nodes, 218 were rebooted. Of those nodes, 22 were on hardware that did not reboot successfully.

Netflix’s automation software detected the failed nodes and replaced them, “with minimal human intervention”. Overall Netflix experienced no downtime during the weekend of the reboots, they said.

Building that resilience into the database layer took a lot of chaos testing of systems by Neflix engineers, according to Kalantzis and Wong, with the firm having to expose the workings of its Cassandra clusters, build reliable monitoring to track those workings for failures, and construct software to create and set up replacement nodes automatically.

“Repeatedly and regularly exercising failure, even in the persistence layer, should be part of every company’s resilience planning. If it wasn’t for Cassandra’s participation in Chaos Monkey, this story would have ended much differently.”

AWS recently recommended firms using its infrastructure test their resilience by using Chaos Monkey to induce failures. Here are some of Netflix’s tips for what it calls chaos engineering.

Set up virtuous chaos cycles

After disruption to its infrastructure, Netflix holds ‘blameless post-mortems’ to establish how to prevent a recurrence.

Alongside developing resilience patches and work to prevent a repeat, it builds new chaos tools to test resilience regularly and systematically, to detect regressions or new conditions.

Use reliability design patterns

Use design patterns that improve reliability in a distributed environment hosting loosely-coupled services.

Netflix praises Hystrix as “a fantastic example of a reliability design pattern that helps to create consistency in our micro-services ecosystem”.

Anticipate unseen failures

Netflix works to develop deep understanding of distributed systems and apply that understanding to anticipate failures it has yet to experience.

Doing so allows it to “anticipate failure modes, determine ways to inject these conditions in a controlled manner and evolve our reliability design patterns”.