
The major lesson IT can learn from Netflix's high availability testing methodology

High availability events are more likely to be triggered than disaster recovery events, but they often aren't tested as thoroughly. Here's what tech leaders can take away from Netflix's approach to the problem.


I heard a quote a few days ago that I couldn't help but relate to system high availability. It went something to the effect of, "The destiny of glass is to break."

The reason organizations implement technical disaster recovery (DR) solutions and high availability (HA) systems is to deal with situations where computer systems break. Mature IT organizations invest untold resources in testing for the rare DR event; however, how much time do organizations invest in testing for the far more likely HA event?

Disaster recovery vs. high availability

It's important to establish a baseline of what's considered a DR event vs. an HA event. DR is a subcomponent of the broader business continuity (BC) plan, which covers far more than IT: BC planning includes human resources, business processes, and facilities in addition to technology. Activation of a technology DR plan is usually part of a broader decision to execute the BC plan.

From a technology perspective, DR solutions frequently focus on recovering from the loss of an entire physical facility. DR planning uses terms such as recovery point objective (RPO), which defines how much data loss is acceptable, and recovery time objective (RTO), which defines how long the service can remain unavailable.
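As a concrete illustration, the achieved RPO is the age of the most recent recoverable copy of the data at the moment of the outage, and the achieved RTO is the elapsed time until the service comes back. The minimal Python sketch below uses made-up timestamps purely to show the arithmetic:

from datetime import datetime

# Hypothetical timestamps for illustration only.
last_good_replica = datetime(2016, 5, 10, 9, 45)   # last data copy shipped off-site
outage_start      = datetime(2016, 5, 10, 10, 0)   # facility becomes unavailable
service_restored  = datetime(2016, 5, 10, 14, 0)   # application back online at the DR site

achieved_rpo = outage_start - last_good_replica    # data lost: 15 minutes
achieved_rto = service_restored - outage_start     # downtime: 4 hours

print(f"Achieved RPO: {achieved_rpo}")  # 0:15:00
print(f"Achieved RTO: {achieved_rto}")  # 4:00:00

If those achieved values exceed the objectives the business agreed to, the DR design or the replication schedule needs to change.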

HA typically addresses the availability of individual system components. RAID is an example of an HA feature. Combining multiple HA technologies, such as RAID, OS clustering, and redundant network paths, contributes to the overall availability of an application or service. Because HA protects against common subsystem failures, the events that trigger HA are typically far more frequent than those that trigger DR.

Best practices

DR testing helps ensure that organization operations continue in the case of an extraordinarily rare set of events. Applying similar logic, shouldn't a team test HA even more frequently, since an HA trigger is more likely than a DR trigger?

With the movement toward building highly available systems on microservices, HA system complexity grows. An excellent example of best practices for HA testing is Netflix.

Netflix leverages its Chaos Monkey system to randomly disrupt individual components to test HA. The output allows Netflix to validate or improve its HA designs. While most enterprises aren't ready for the randomness of a system like Chaos Monkey, there are takeaways from the concept that apply to traditional infrastructures.
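Chaos Monkey itself targets instances in Netflix's cloud environment, but the underlying idea, kill a randomly chosen component and verify the service stays reachable, can be sketched in a few lines. The component names, stop_component() function, and health-check URL below are placeholders for illustration, not Netflix's actual tooling:

import random
import urllib.request

# Hypothetical inventory of redundant components behind a single service.
COMPONENTS = ["app-node-1", "app-node-2", "app-node-3"]
HEALTH_URL = "http://service.example.internal/health"  # placeholder endpoint

def stop_component(name: str) -> None:
    """Placeholder: in a real test this would stop a VM, container, or service."""
    print(f"Stopping {name} ...")

def service_healthy() -> bool:
    """Return True if the service's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

if __name__ == "__main__":
    victim = random.choice(COMPONENTS)   # pick one component at random
    stop_component(victim)               # simulate its failure
    if service_healthy():
        print(f"HA worked: service stayed up without {victim}")
    else:
        print(f"HA gap found: service unavailable after losing {victim}")

In a traditional infrastructure, the same pattern can run on a schedule during a planned maintenance window rather than at random, which preserves the learning value while keeping the impact predictable.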

SEE: Crash your cloud, before it crashes itself: Netflix shares tool to help find unknown bugs (ZDNet)

The primary principle is deliberately testing individual system outages. The cost to the business is the risk that HA will fail during the test. Technology managers must communicate the potential impact as well as the benefit.

The trade-off is straightforward. If a test fails, the time required to remediate the failure impacts the business, but that time is measured as planned downtime. In the best case, users aren't affected because HA works as expected. In the worst case, users experience the pre-planned downtime communicated as part of the test plan.

The primary benefit is a reduced risk of unplanned downtime. Practice makes perfect, and HA is no exception: regular testing helps ensure systems and processes work as designed.

Your thoughts

Does your organization have a detailed method and schedule for testing HA capability? Share your thoughts in the comments below.


About Keith Townsend

Keith Townsend is a technology management consultant with more than 15 years of related experience designing, implementing, and managing data center technologies. His areas of expertise include virtualization, networking, and storage solutions for Fo...

