Amazon has just experienced its second major outage within a year, bringing down popular services that include Netflix and Echo. One immediate reaction on social media was that the outage presented an opportunity for hybrid cloud vendors. I question whether this is truly the case. Does the outage expose the public cloud’s vulnerability, or are service outages an accepted cost of doing business on the public cloud?

Designing for failure is hard

A common design theme of applications running on AWS-like cloud infrastructures is to assume the underlying infrastructure will fail. But whether you are designing infrastructure that doesn’t fail or applications that can survive an infrastructure failure, designing for failure is difficult. Designing applications for unreliable infrastructure is a foreign concept for most enterprise developers, because enterprise infrastructure organizations have spent decades maturing redundant network, storage, and compute.
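To make "assume the infrastructure will fail" concrete, here is a minimal sketch of one common application-level pattern: retry a call a few times, then fail over to a replica. The function names (`call_with_failover`, `primary_region`, `secondary_region`) are hypothetical stand-ins for illustration, not part of any AWS API, and the backoff delay defaults to zero only to keep the example fast.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every endpoint has been tried and none succeeded."""


def call_with_failover(
    endpoints: Sequence[Callable[[], T]],
    retries_per_endpoint: int = 2,
    base_delay: float = 0.0,
) -> T:
    """Try each endpoint in order, retrying with exponential backoff."""
    failures = 0
    for endpoint in endpoints:
        for attempt in range(retries_per_endpoint):
            try:
                return endpoint()
            except ConnectionError:
                failures += 1
                # Wait longer after each failed attempt before retrying.
                time.sleep(base_delay * (2 ** attempt))
    raise AllEndpointsFailed(f"{failures} attempts failed")


def primary_region() -> str:
    # Hypothetical stand-in for a call to a region that is currently down.
    raise ConnectionError("primary region unavailable")


def secondary_region() -> str:
    # Hypothetical stand-in for a healthy replica in another region.
    return "response from secondary region"


if __name__ == "__main__":
    print(call_with_failover([primary_region, secondary_region]))
```

Even this small sketch shows where the difficulty moves: the developer now owns decisions about retry counts, delays, and what to do when every replica is down, all of which redundant enterprise infrastructure used to hide.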

The redundancy of the infrastructure has allowed application developers to focus on an application’s features and security. In a blog post, eBay’s Chief Cloud Engineer Subbu Allamaraju documented some of the challenges of programming for cloud infrastructures. Just as it took enterprise infrastructure decades to reach today’s level of resiliency, it will take application developers time to adjust to unreliable cloud infrastructure. Organizations have to make a business decision on which attributes of the cloud are of most value.

The cloud value proposition

I’ve long maintained that the value of cloud, especially AWS, isn’t cost savings. If an organization’s primary rationale for using cloud is to save infrastructure costs, that organization is headed for failure.

I believe the strongest value proposition for the cloud is agility. The cloud model puts control of the infrastructure in developers’ hands. That control allows developers to move ideas quickly from the whiteboard to running code.

AWS has excelled in offering a frictionless experience, but that experience comes at the cost of infrastructure availability. Netflix had to invest in its chaos framework, which allows it to build resiliency at the application level rather than depending on the infrastructure.
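The idea behind chaos-style testing can be sketched in a few lines: deliberately inject failures into a dependency and check that the application still returns something useful. This is an illustrative toy, not Netflix's actual tooling; every name here (`inject_faults`, `fetch_recommendations`, `resilient_fetch`, `run_experiment`) is an assumption made up for the example.

```python
import random
from typing import Callable, Dict


def inject_faults(
    func: Callable[[], str], failure_rate: float, rng: random.Random
) -> Callable[[], str]:
    """Wrap a dependency so that a fraction of calls fail on purpose."""

    def wrapper() -> str:
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func()

    return wrapper


def fetch_recommendations() -> str:
    # Hypothetical dependency that works whenever the infrastructure is healthy.
    return "personalized list"


def resilient_fetch(dependency: Callable[[], str]) -> str:
    # Application-level resiliency: degrade gracefully instead of failing.
    try:
        return dependency()
    except ConnectionError:
        return "default list"


def run_experiment(trials: int, failure_rate: float, seed: int = 0) -> Dict[str, int]:
    """Count how often the application served each kind of response."""
    rng = random.Random(seed)
    flaky = inject_faults(fetch_recommendations, failure_rate, rng)
    counts = {"personalized list": 0, "default list": 0}
    for _ in range(trials):
        counts[resilient_fetch(flaky)] += 1
    return counts


if __name__ == "__main__":
    print(run_experiment(trials=1000, failure_rate=0.3))
```

Every trial yields a response even though a share of the dependency calls fail, which is the property such an experiment is meant to confirm. The cost is that someone has to design and maintain the degraded path.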

Data center managers shouldn’t underestimate the power of friction. For example, Facebook has spent years developing a Dislike button to remove friction. Commenting on a post takes more effort than clicking a Dislike button on posts that have a negative social impact, such as tragic news. Enterprise IT consumption is similar in the sense that developers want the least amount of friction when consuming infrastructure.

It’s my opinion that agility and frictionless experiences come at a cost. There’s no magic way to reduce the complexity of highly redundant infrastructure and avoid sacrificing some reliability. The classic metaphor of squeezing each end of the balloon applies; the complexity moves from one end of enterprise IT to another. eBay discovered this on its cloud journey, and the recent AWS outage demonstrates the same challenge with the public cloud.

The hybrid cloud compromise

Enter the hybrid cloud model to save the day. One idea from vendors promoting hybrid cloud is to build highly redundant infrastructure and place a cloud wrapper around it (e.g., VMware’s vCloud Air and EMC’s Hybrid Cloud).

By putting a cloud management platform in front of their highly available infrastructure and the public cloud, organizations get the best of both worlds. At least, that’s the theory behind many hybrid cloud vendors’ marketing. Does the hybrid approach offer the frictionless experience and agility of public cloud while providing the reliability of traditional infrastructure? I don’t have an answer for this question yet.

I’m a fan of the theory of hybrid cloud. I want to believe that it’s possible to have the best of both worlds. I’m also an infrastructure-focused architect, so I’m biased. One challenge I see with the hybrid cloud model is scale. If you have a huge application the size of Netflix, hybrid cloud isn’t a realistic option. Building an infrastructure that could handle the failover load of an AWS failure is a massive undertaking.

The other challenge is even more practical. Enterprise hybrid clouds bring all the baggage of the highly redundant infrastructure and add the complexity of cloud management. Going back to my earlier statement, designing redundant systems is hard work. So, designing redundant systems that offer frictionless consumption is that much more difficult. I question whether most organizations have the technical resources to undertake the challenge. It took Netflix years to master the public cloud, and it still experienced several hours of outage.


Hybrid cloud has many advantages on paper: It provides the luxury of a highly available infrastructure that is consumable directly by developers; it also brings the promise of the seemingly infinite scale of the public cloud. However, this is a simplistic look at a very challenging technical problem, and it doesn’t address the problem of scale for the largest public cloud applications.

I want to hear from you

Do you consider hybrid cloud the best of both worlds, or do you take the public cloud at face value — a service designed for applications that can afford frequent downtime? Share your thoughts in the comments.

Note: ZDNet and TechRepublic are CBS Interactive properties.