Five lessons from Microsoft's Azure cloud outage

Microsoft has released a preliminary report from its investigation into how a fierce storm in southern Texas caused the recent service brownout.

The large public cloud platforms have multiple layers of redundancy, with spare compute, networking, and infrastructure waiting in the wings if disaster strikes.

So how can a major platform such as Microsoft's Azure suffer the sort of outages that happened earlier this month?

Here are five lessons from the report about the limitations of cloud resiliency.

Power surges can overwhelm multiple systems

Lightning strikes close to one of the datacenters for Azure's South Central US region generated a voltage surge in the center's power supply that triggered a switch to generator power.

At the same time, the surges shut down the datacenter's mechanical cooling system, despite surge suppressors being in place.

A 'thermal buffer' designed to protect systems was quickly depleted and the temperature rose so rapidly that an automated shutdown was unable to protect some hardware, and a "significant number of storage servers were damaged, as well as a small number of network devices and power units".

Microsoft is now undertaking a "detailed forensic analysis of the impacted datacenter hardware and systems, in addition to a thorough review of the datacenter recovery procedures".

Failover datacenters don't guarantee uninterrupted service

In the aftermath of the initial outage, Microsoft had to recover Azure software load balancers (SLBs) for storage scale units, which "are critical in the Azure networking stack, managing the routing of both customer and platform service traffic".

However, replacing damaged infrastructure and recovering customer data took "time due to the number of servers damaged, and the need to work carefully to maintain customer data integrity above all else".

Microsoft says it had to focus on recovering data, rather than shifting the service to a failover datacenter, because the failover datacenter held an incomplete copy of the data.

"A failover would have resulted in limited data loss due to the asynchronous nature of geo replication," it says.
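To see why asynchronous replication implies some data loss on failover, here is a minimal sketch in Python. It is a toy model, not Azure's storage implementation: the class `AsyncReplicatedStore`, its methods, and the `blob-N` keys are all hypothetical names chosen for illustration. The primary acknowledges a write immediately, and the copy to the secondary happens later, so any write still waiting in the queue would be missing if the secondary took over.

```python
from collections import deque


class AsyncReplicatedStore:
    """Toy model of asynchronous geo-replication (hypothetical, for illustration)."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self._pending = deque()  # writes acknowledged but not yet copied

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self, max_items):
        # Copy up to max_items queued writes to the secondary, oldest first.
        for _ in range(min(max_items, len(self._pending))):
            key, value = self._pending.popleft()
            self.secondary[key] = value

    def keys_lost_on_failover(self):
        # Keys whose latest value the secondary does not hold right now.
        return sorted(
            key
            for key, value in self.primary.items()
            if self.secondary.get(key) != value
        )


store = AsyncReplicatedStore()
for i in range(5):
    store.write(f"blob-{i}", i)

store.replicate(3)  # the primary fails before the last two writes are copied
lost = store.keys_lost_on_failover()
```

In this sketch `lost` holds the two keys that were written but never replicated. Replicating synchronously would close that gap, at the cost of making every write wait for the remote region.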

Microsoft is now evaluating the future hardware design of storage scale units to increase resilience to environmental factors and "determining software changes to automate and accelerate recovery".

Not every cloud service is equally resilient

Microsoft says the issues customers worldwide experienced accessing Azure Service Manager (ASM) were due to "insufficient resiliency" in the global service, whose primary data store is in Azure's South Central US region.

It contrasts ASM with the newer Azure Resource Manager APIs, made available in recent years, which store data "in every Azure region".

"Although ASM is a global service, it does not support automatic failover. As a result of the impact in South Central US, ASM requests experienced higher call latencies, timeouts or failures when performing service management operations," the report states.

Redundancy doesn't guarantee a service won't degrade

As traffic for the Azure Active Directory service was routed from South Central US to other datacenters, the significant increase in authentication requests triggered a throttling of the service.

"Our automatic throttling mechanisms engaged, so some customers continued to experience high latencies and timeouts," the report states, adding that the problems were alleviated and resolved within hours.
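The report does not say how Azure Active Directory's throttling works. As a rough illustration of the general idea, the sketch below uses a token bucket, a common rate-limiting technique, which is an assumption here and not something the report describes. The names `TokenBucket` and `simulate`, and all the rates and capacities, are hypothetical. Requests within the provisioned rate pass, while a surge of rerouted traffic drains the bucket and starts to be rejected.

```python
class TokenBucket:
    """Simple token-bucket rate limiter (illustrative; not Azure AD's mechanism)."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = 0.0               # timestamp of the previous request

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def simulate(requests_per_sec, seconds, bucket):
    """Send evenly spaced requests and count how many are accepted or rejected."""
    accepted = rejected = 0
    for i in range(requests_per_sec * seconds):
        if bucket.allow(i / requests_per_sec):
            accepted += 1
        else:
            rejected += 1
    return accepted, rejected


# Normal load stays under the provisioned rate of 100 requests per second.
normal = simulate(50, 2, TokenBucket(rate_per_sec=100, capacity=100))

# Rerouted traffic triples the load, so the limiter starts rejecting requests.
surge = simulate(300, 2, TokenBucket(rate_per_sec=100, capacity=100))
```

Under normal load nothing is rejected. Under the surge, a share of requests is turned away, which a client would see as the timeouts and failures the report mentions.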

A small number of failures can ripple throughout the platform

The interconnected nature of cloud services and their dependencies led to a number of knock-on failures.

Data analysis and ingestion in Azure Application Insights was impacted by the outage, due to "a dependency on Azure Active Directory and platform services that provide data routing", according to the report.

Visual Studio Team Services hosted in the South Central US region were also down, which in turn meant that customers hosted in the US were unable to use Release Management and Package Management services, and that build and release pipelines using the Hosted macOS queue failed.

Additionally, a number of services dependent on Azure Service Manager (ASM) were unavailable, and Microsoft says it's now reviewing every internal service to identify dependencies on the ASM API.
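Reviewing dependencies of this kind amounts to walking a dependency graph. The sketch below is a simplified reading of the relationships described above, not Microsoft's actual service map: the `DEPENDS_ON` table is an assumption, the "Hypothetical customer dashboard" entry is invented to show a second hop, and `impacted` is a hypothetical helper. Given a set of failed services, it finds everything that depends on them, directly or indirectly.

```python
from collections import deque

# Simplified, partly hypothetical map: service -> services it depends on.
DEPENDS_ON = {
    "Application Insights": ["Azure Active Directory", "Data routing"],
    "Release Management": ["Visual Studio Team Services (South Central US)"],
    "Package Management": ["Visual Studio Team Services (South Central US)"],
    # Invented entry, to show a failure travelling more than one hop.
    "Hypothetical customer dashboard": ["Application Insights"],
}


def impacted(depends_on, failed):
    """Return every service that transitively depends on a failed service."""
    # Invert the map: dependency -> services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed services.
    seen = set(failed)
    queue = deque(failed)
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen - set(failed)


hit = impacted(DEPENDS_ON, {"Azure Active Directory"})
```

Here a single failure in Azure Active Directory reaches Application Insights and, through it, the invented dashboard, even though the dashboard never calls Azure Active Directory itself.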
