Ever since people started paying real attention to cloud computing, it seems like we can’t go a full year without a major outage at Amazon. In April 2011, a major outage took down several popular sites. Earlier this year, there were problems caused by a power outage. Now, this past Monday, October 22nd, all signs point to a failure in the servers or the underlying platform itself.

The issues were confined to a single one of their data centers, the one in Northern Virginia. This corresponds to their US-EAST-1 Region, the same region affected earlier this year. Looking at Amazon’s Service Health Dashboard (if you are an Amazon customer, you definitely should be using it!), we can get a step-by-step overview of what happened. All times are PDT (UTC -7).

  • At 10:38 AM, Amazon first reports that it is investigating degraded performance for some instances of its Elastic Block Store (EBS) service;
  • At 11:11 AM, they confirm the performance degradation for a “small number of volumes”, informing customers that instances that rely on affected volumes will also suffer performance degradation;
  • At 11:26 AM, they send out a message warning of degraded performance for EBS volumes in that availability zone; this is no longer a “small number”, but volumes in general.
  • Around this time, other Amazon services in the same data center are either failing or about to fail:
    • At 11:06 AM, API failures and delays are reported on the Elastic Beanstalk service
    • At 11:39 AM, connectivity issues are reported on the ElastiCache service
    • At 11:45 AM, Amazon reports connectivity, performance, and latency issues on the Relational Database Service (RDS)
    • At 12:03 PM, delays in metrics for the CloudWatch service are reported
    • At 12:07 PM, we see elevated error rates for the AWS Management Console
    • At 12:25 PM, they report they’re investigating issues on the Cloud Search Service

The timeframe clearly shows that the issues are related to each other. Whatever affected the Elastic Block Store, which is Amazon’s virtual storage service, ended up bringing down all other services that either rely on or are somehow related to it. Issues continued for the next couple of hours, only starting to be resolved around 3:00 PM:

  • At 3:15 PM, CloudWatch metrics are back to normal operation;
  • At 3:45 PM, the AWS Management Console is back to normal;
  • At 3:46 PM, the connectivity issues on the ElastiCache service are reported resolved.
  • Major services, however, wouldn’t be fully back online until much later:
    • By 11:47 PM, Elastic Beanstalk is reported at normal operation;
    • By 12:32 AM of October 23rd, the Cloud Search service was back to normal;
    • By 11:08 AM of October 23rd, the EBS and EC2 services were back to normal operation;
    • Finally, by 2:53 PM of October 23rd, the RDS service was fully restored.

For the two services that were affected the most, the Elastic Block Store and the Relational Database Service, the outage, as reported on the AWS Status Dashboard, lasted 24 hours and 20 minutes and 27 hours and 50 minutes respectively, or 0.28% and 0.32% of the year. Either figure alone breaks the 99.95% availability promised in the SLA for the running year.
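The percentages above are easy to verify. Here is a quick back-of-the-envelope sketch (assuming a 365-day year; the function names are ours, not Amazon’s) that turns the reported outage durations into a fraction of the year and checks them against the 99.95% availability promise:

```python
# Back-of-the-envelope check of the downtime figures above.
# Assumes a 365-day (8,760-hour) year and the outage durations
# reported on the AWS Status Dashboard.

HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_percent(hours, minutes):
    """Return an outage duration as a percentage of the year."""
    return (hours + minutes / 60.0) * 100.0 / HOURS_PER_YEAR

def meets_sla(downtime_pct, promised=99.95):
    """True if yearly uptime still meets the promised availability."""
    return (100.0 - downtime_pct) >= promised

ebs = downtime_percent(24, 20)   # ~0.28% of the year
rds = downtime_percent(27, 50)   # ~0.32% of the year

print(f"EBS: {ebs:.2f}% of the year, SLA met: {meets_sla(ebs)}")
print(f"RDS: {rds:.2f}% of the year, SLA met: {meets_sla(rds)}")
```

Running this confirms that both outages, on their own, push yearly uptime below 99.95%.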

The aftermath

Right now, only Amazon knows exactly what caused all these failures. They have promised to update their status dashboard with further information as soon as the root cause analysis is done. The sheer number of services affected and the way the errors propagated seem to indicate something related to their storage service (either at the hardware or software level), but the underlying complexity of these systems is such that the true cause of the problem may be masked by other issues.

What doesn’t make sense is why several major sites were once again taken down. All the trouble happened in a single availability zone, which means that either the affected sites hadn’t set up properly replicated environments, or their replicated environments couldn’t handle the load alone. That, again, makes little sense, since the whole point of the cloud is easy scalability. If a service was taken out by this outage, it has no-one to blame but itself. Regardless of the cloud provider you use, you should be replicating your environment across multiple locations to safeguard against exactly this kind of failure.
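At its core, that kind of failover is not complicated. A minimal sketch (provider-agnostic; the endpoint names and health-check callable are hypothetical) of routing traffic to whichever replicated location is still healthy:

```python
# A minimal, provider-agnostic illustration of cross-location failover:
# keep replicas in several locations and route requests to the first one
# whose health check passes. All endpoint names here are hypothetical.

def pick_healthy_endpoint(replicas, is_healthy):
    """Return the first replica that passes its health check.

    replicas   -- endpoints ordered by preference (primary location first)
    is_healthy -- callable that probes one endpoint, returning True/False
    """
    for endpoint in replicas:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy replica available in any location")

# Example: the primary zone is down, so traffic falls over to the replica.
replicas = ["db.us-east-1a.example.com", "db.us-west-2a.example.com"]
down = {"db.us-east-1a.example.com"}
print(pick_healthy_endpoint(replicas, lambda ep: ep not in down))
# -> db.us-west-2a.example.com
```

In a real deployment the health check would be an actual probe (a TCP connect or an HTTP ping) and the failover would live in your load balancer or DNS layer, but the principle is the same: if one location goes dark, another one serves.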

Finally, all Amazon customers should check whether they are eligible to receive service credits. This outage alone may result in a dip in uptime numbers that leads to eligibility, so customers should make sure to put in their claims within the next 30 days. Here is the link to Amazon’s SLA page; the “Credit Request and Payment Procedures” section details how to request these credits.