Amazon reveals reason for last week's major AWS outage

The trigger for the service interruption was a small addition of capacity to Amazon Kinesis, but the problem snowballed from there.

Last week's huge AWS outage that clobbered a host of Internet of Things (IoT) devices and online services was caused by a snafu with an Amazon service called Kinesis. Tasked with collecting and analyzing real-time streaming data on AWS, Kinesis hiccuped after Amazon added a small amount of capacity to it. Though the outage affected a wide range of devices and services, the disruption occurred specifically in Amazon's US-East-1 (Northern Virginia) region.

In its long and complex explanation, Amazon described the small addition of capacity to Kinesis as the trigger for the problem, but not its root cause. Describing the service's architecture, the company said that Kinesis uses an array of "back-end" cell clusters that process streams.

These streams are spread across the back end through a sharding mechanism owned by a "front-end" fleet of servers. The job of the front end is to handle authentication, throttling, and request routing to the correct stream shards on the back-end clusters. The additional capacity was added to this front-end fleet.
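
The front end's routing role can be illustrated with a minimal sketch. The names and the shard map below are hypothetical; Amazon has not published its implementation, only the general design of hashing records to shards that back-end clusters own:

```python
import hashlib

# Hypothetical shard map: shard id -> owning back-end cluster.
# In the real service this mapping is built from membership data
# cached on each front-end server.
SHARD_MAP = {0: "cluster-a", 1: "cluster-b", 2: "cluster-a", 3: "cluster-c"}

def route(stream_name: str, partition_key: str) -> str:
    """Hash the record to a shard and return the back-end cluster that owns it."""
    digest = hashlib.md5(f"{stream_name}:{partition_key}".encode()).hexdigest()
    shard_id = int(digest, 16) % len(SHARD_MAP)
    return SHARD_MAP[shard_id]

print(route("orders", "customer-42"))
```

The key point for the outage narrative: if a front-end server cannot build this shard map from its cache, it has no way to route any request, regardless of how healthy the back-end clusters are.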

Each server in the front-end fleet caches certain data, including membership details and shard ownership for the back-end clusters. Every front-end server creates operating system threads for each of the other servers in the front-end fleet. When new capacity is added, the servers that are already part of the fleet take as long as an hour to learn about any new participants.
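
Because every front-end server holds one OS thread per peer, per-server thread usage grows linearly with fleet size, and fleet-wide usage grows quadratically. The fleet sizes below are hypothetical (Amazon has not disclosed the real numbers); the arithmetic is what matters:

```python
def threads_per_server(fleet_size: int) -> int:
    # Each server keeps one OS thread for every *other* front-end server.
    return fleet_size - 1

def fleet_total_threads(fleet_size: int) -> int:
    # Across the whole fleet: N servers, each with N - 1 peer threads.
    return fleet_size * (fleet_size - 1)

for n in (1000, 2000, 2100):  # hypothetical fleet sizes
    print(n, threads_per_server(n), fleet_total_threads(n))
```

This is why adding even a small number of servers can matter: every addition raises the thread requirement on every existing server in the fleet, not just on the new machines.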

The first alarm bells went off at 5:15 am PST on the day before Thanksgiving, Nov. 25, indicating errors with Kinesis records. The new capacity was a likely suspect for the glitches, prompting Amazon to begin removing it while continuing to investigate other potential causes.

By 9:39 am PST, Amazon had nailed its culprit, discovering that the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. With the limit exceeded, cache construction kept failing, so the front-end servers were unable to route requests to the back-end clusters.
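
The failure mode Amazon describes, peer threads breaching an OS limit, suggests a simple defensive check before adding capacity. This is a sketch under assumed numbers (the `thread_limit` of 4096 and the `other_threads` allowance are hypothetical; Amazon has not detailed the actual configuration):

```python
def can_add_servers(current_fleet: int, new_servers: int,
                    thread_limit: int, other_threads: int = 200) -> bool:
    """Return False if the post-addition peer-thread count would breach
    the per-server OS thread limit.

    other_threads is a hypothetical allowance for each server's non-peer
    threads (request handlers, housekeeping, etc.).
    """
    peers_after = current_fleet + new_servers - 1  # peer threads each server needs
    return peers_after + other_threads <= thread_limit

# With a hypothetical limit of 4096 threads per server:
print(can_add_servers(current_fleet=3800, new_servers=50, thread_limit=4096))   # True
print(can_add_servers(current_fleet=3800, new_servers=200, thread_limit=4096))  # False
```

A pre-flight check like this is essentially what Amazon's planned "fine-grained alarming for thread consumption" aims to approximate from the monitoring side.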

To resolve the problem, Amazon engineers removed the additional capacity that had started the collapse and then restarted the front-end servers. The first group of servers came back online at 10:07 am PST. From there, Amazon slowly added servers, only a few hundred per hour. As traffic gradually returned, the error rate dropped steadily, and Kinesis was fully restored to normal operation at 10:23 pm PST.
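
That throttled recovery can be sketched as a simple batching loop. The rate of 300 servers per hour is a hypothetical figure standing in for the "few hundred per hour" Amazon describes:

```python
import math

def rollout_batches(total_servers: int, per_hour: int = 300):
    """Yield (hour, batch_size) pairs for a throttled re-addition of servers,
    mirroring the gradual recovery described in Amazon's post-mortem."""
    hours = math.ceil(total_servers / per_hour)
    for hour in range(hours):
        yield hour, min(per_hour, total_servers - hour * per_hour)

print(list(rollout_batches(1000, per_hour=300)))
# -> [(0, 300), (1, 300), (2, 300), (3, 100)]
```

Spreading the re-addition over hours gave each restarted server time to rebuild its caches and relearn fleet membership before the next batch arrived, avoiding a repeat of the thundering-herd effect that caused the outage.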

As with any type of service disruption, there are lessons learned and fixes to be made. Apologizing for the impact to its AWS customers, Amazon described several measures to ensure that this type of event won't happen again.

First, the company will move Kinesis to larger CPU and memory servers to reduce the total number of servers required. Second, the company said it's adding fine-grained alarming for thread consumption in the service. Third, Amazon plans to increase the thread count limits upon completion of the necessary testing. Fourth, the company is making changes to improve the cold-start time for the front-end fleet, such as moving the front-end server cache to a dedicated fleet and moving large AWS services such as CloudWatch to a separate front-end fleet.

Alerting customers to the issue also didn't go smoothly.

"Outside of the service issues, we experienced some delays in communicating service status to customers during the early part of this event," Amazon said, pointing out that it uses its Service Health Dashboard to alert customers of broad operational issues and its Personal Health Dashboard to communicate directly with impacted customers.

In this type of event, Amazon usually posts to its Service Health Dashboard. However, the tool couldn't be updated in the typical way because it relies on a service called Cognito, which was itself affected by the outage. Though the company turned to a manual backup method of updating the dashboard, the updates were delayed because Amazon's support engineers weren't sufficiently familiar with that tool.

"Going forward, we have changed our support training to ensure that our support engineers are regularly trained on the backup tool for posting to the Service Health Dashboard," Amazon added.

As more organizations and individuals rely on the cloud for key devices and services, an event like this inevitably affects a growing number of people. AWS customers are especially exposed, as Amazon's cloud platform is among the most widely used. This means organizations need to prepare for cloud disruptions just as they would prepare for problems with on-premises services, and that requires a proper data protection and recovery plan to ensure your own business doesn't suffer when trouble arises in the cloud.