To say that it was difficult to miss the recent issues with certain Amazon Web Services (AWS) cloud-based services would be an understatement. Starting April 21, a number of AWS services suffered a series of service interruptions and performance issues. The most obvious indicator was that a number of social media services, which leverage cloud technology, were interrupted.
Before we jump to any conclusions, let’s zero in on what happened. Early on April 21, reports of issues with a number of services started showing up on the AWS status console. The affected services were Amazon CloudWatch (N. Virginia), Amazon Elastic Compute Cloud (EC2) (N. Virginia), Amazon Elastic MapReduce (EMR) (N. Virginia), Amazon Relational Database Service (RDS) (N. Virginia), AWS CloudFormation (N. Virginia), and AWS Elastic Beanstalk. As I write this blog (late Sunday night, April 24), all services are back online except for a “limited number of customers,” and each service is back to a green status with only a few notes.
The hidden issue, however, lies with the Elastic Block Store (EBS) volumes. Many of the status updates for the affected services mention EBS volumes, yet EBS itself doesn’t have its own entry on the status page. The primary use case for EBS volumes is to be provisioned directly to EC2 instances as block disk resources. The EC2 instances are effectively virtual machines on the Internet hosted by AWS.
This compound series of service interruptions tells us a few things. First of all, failures happen to both big and small datacenters, because in the end, the AWS services are run in datacenters. Chances are, they are nothing like the datacenters you and I have worked in, but they are datacenters nonetheless. The second thing this tells us is that if we design a service for the cloud, we need to be ready to accommodate an outage. This should sound eerily familiar, because it is what we have always done in the datacenter: Architect around domains of failure.
Is the lack of a true cloud standard the issue?
Not necessarily. Federated clouds sound good, but in practice the public clouds are different point solutions, and I rarely see real-world use cases that leverage two public clouds for one solution. A more realistic approach would be to leverage the same public cloud, such as AWS, for multiple independent cloud infrastructures using regions (discussed in a bit). The fact is that AWS is still the most refined offering of public cloud services, and it is successful in spite of this incident. Further, I think that it will recover from this incident and continue to be the most refined offering.
The good news for Amazon is that all of the affected services, with the exception of Elastic Beanstalk, are available in other regions such as Northern California, Ireland, Singapore and Tokyo. Further, within each of the regions, there are specific availability zones. The Northern Virginia AWS cloud, for example, has four availability zones for EC2. The fact that AWS is distributed is probably the best thing going for it. For this specific incident, a number of availability zones were impacted in the Northern Virginia region, also referred to as US-EAST-1 in the status reports. This means that if a cloud solution was split across regions and did not require Elastic Beanstalk, it may not have been impacted. Cluster a la cloud, if you will.
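To make the idea of splitting a solution across regions concrete, here is a minimal Python sketch that checks which parts of a deployment would have nothing left running if one region failed. The deployment map, the component names and the function name are all hypothetical, invented for illustration; this is not an AWS API, just a way to reason about domains of failure on paper.

```python
# Hypothetical deployment map (an assumption for illustration):
# component name -> list of (region, availability zone) placements.
DEPLOYMENT = {
    "web": [("us-east-1", "us-east-1a"), ("us-west-1", "us-west-1a")],
    "database": [("us-east-1", "us-east-1a"), ("us-east-1", "us-east-1d")],
}


def components_lost(deployment, failed_region):
    """Return the components that have no placement outside the failed region."""
    lost = []
    for name, placements in deployment.items():
        # Keep only the placements that live in some other region.
        surviving = [p for p in placements if p[0] != failed_region]
        if not surviving:
            lost.append(name)
    return sorted(lost)


# Which components would go dark if the whole Northern Virginia region failed?
print(components_lost(DEPLOYMENT, "us-east-1"))
```

In this made-up example the database is spread across two availability zones but only one region, so it is the component that a region-wide problem would take down.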
If we are to architect cloud solutions around multiple domains of failure, then the best approach would be to leverage two AWS regions. This sounds easy in theory, but in reality it may be quite complicated. First of all, the pricing differs for each region. Secondly, any transfers to other regions incur a bandwidth cost, while transfers within a region are free. So, transferring data from US-EAST-1A to US-EAST-1D costs nothing, but transferring that same data from Northern Virginia to Northern California would incur a transfer cost. Keeping in mind that the data and systems in the cloud are ultimately our own, we do need to take it upon ourselves to plan for these types of things if we don’t want to endure an outage.
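The bandwidth cost of a second region is easy to estimate before committing to the design. The Python sketch below follows the rule described above (no charge within a region, a per-gigabyte charge between regions). The function name, the daily volume and the price per gigabyte are placeholder assumptions, not actual AWS rates; check the current price list before relying on any figure.

```python
def replication_cost(gb_per_day, source_region, target_region, price_per_gb, days=30):
    """Estimate the transfer cost of copying data between two regions.

    price_per_gb is an assumed, illustrative rate supplied by the caller;
    it is not taken from any AWS price list.
    """
    if source_region == target_region:
        # Per the rule described above, transfers within a region are free.
        return 0.0
    return gb_per_day * days * price_per_gb


# Hypothetical workload: 50 GB replicated per day at an assumed $0.10 per GB.
same_region = replication_cost(50, "us-east-1", "us-east-1", price_per_gb=0.10)
cross_region = replication_cost(50, "us-east-1", "us-west-1", price_per_gb=0.10)
print(f"same region: ${same_region:.2f}, cross region: ${cross_region:.2f}")
```

Even a rough number like this helps decide whether a full copy in a second region is affordable or whether only the critical data should be replicated.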
For the naysayers: You told us so, right?
Surely there are bloggers and opinionated individuals enjoying the incident and shouting, “I told you so!” The fact is, if we don’t architect for domains of failure properly in our own datacenter, how are we going to do it in the public cloud?
What we have learned from this incident is that failures happen; how we change our behavior is a testament to how well we learn from our mistakes, even if we weren’t impacted by this incident. What do you make of the AWS incident? Share your comments below.