Microsoft has detailed the root causes of its Windows Azure service interruption that occurred on 29 February — and I’m not sure whether I’m reading a report from a three-person service desk, or a multinational computing platform.
The original bug came down to incrementing the year of a date without checking that the result was valid, producing a date of 29 February 2013 on some encryption certificates.
“The leap-day bug is that the GA [guest agent] calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail,” wrote Microsoft server and cloud corporate vice president Bill Laing in a blog post.
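To illustrate the failure mode Laing describes, here is a minimal Python sketch of the same arithmetic; it is not Azure’s guest-agent code (which isn’t public), just the naive year bump next to one duration-based alternative that avoids it.

```python
from datetime import date, timedelta

issued = date(2012, 2, 29)  # a certificate created on leap day

# Naive approach, roughly what the guest agent reportedly did:
# keep the month and day and simply bump the year.
try:
    valid_to = issued.replace(year=issued.year + 1)
except ValueError as exc:
    # 29 February 2013 does not exist, so Python refuses to build it,
    # much as certificate creation failed inside Azure.
    print("naive year bump failed:", exc)

# Expressing validity as a duration sidesteps the problem entirely.
valid_to = issued + timedelta(days=365)
print("duration-based valid-to:", valid_to)  # 2013-02-28
```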
The bug was triggered at midnight, Coordinated Universal Time, on 29 February, and began to spread throughout Azure’s clusters. Once the system’s internal error threshold was hit 75 minutes later, the affected virtual machines were flagged as suffering hardware faults and automatically replicated to other clusters, thereby spreading the issue and increasing internal traffic, which degraded the responsiveness of customer-facing tools.
Twelve hours after the mayhem started, a fix was pushed to Azure, and Microsoft declared the service restored for a majority of customers.
But another bug was lurking, waiting to be unleashed on the seven clusters that were partially updated at this time.
Because these servers were in the middle of an automatic update when the leap-year bug hit them, Microsoft chose to “blast” the fix directly to them, bypassing the usual hours-long deployment constraints.
“Unfortunately, in our eagerness to get the fix deployed, we had overlooked the fact that the update package we created with the older HA [host agent] included the networking plug-in that was written for the newer HA, and the two were incompatible. The networking plug-in is responsible for configuring a VM’s virtual network, and without its functionality a VM has no networking capability. Our test of the single server had not included testing network connectivity to the VMs on the server, which was not working,” wrote Laing.
The update was pushed to every virtual machine in the seven clusters, including previously healthy VMs, disconnecting them all from the network. Almost three hours later, a corrected fix was “blasted” to the clusters, which had them mostly working another two and a half hours later, although a number of servers needed to be manually restored and validated by Microsoft staff.
The last update of the saga was posted to the Windows Azure dashboard 34 hours after the incident began.
All Azure customers will receive a 33 per cent credit for the billing month, regardless of whether they were affected by the outage.
As a fellow programmer, I’m sure that somewhere in my code archive, I’ve incremented a year variable poorly and concatenated it to produce a bad date. A lot of the time, those errors would go unnoticed, because some languages silently normalise an impossible date, rolling it over to the next valid one.
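Whether the slip gets noticed depends on which date API catches it. As a rough Python illustration (the lenient path calls through to C’s mktime, so its behaviour can vary by platform):

```python
import time
from datetime import datetime

# Python's datetime rejects the impossible date outright...
try:
    datetime(2013, 2, 29)
except ValueError as exc:
    print("strict API:", exc)  # day is out of range for month

# ...whereas C-style mktime, exposed as time.mktime, silently
# normalises the same fields to 1 March 2013 on most platforms.
lenient = time.mktime((2013, 2, 29, 0, 0, 0, 0, 0, -1))
print("lenient API:", time.strftime("%Y-%m-%d", time.localtime(lenient)))
```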
Using a bad date in a certificate, though, changes all that. For an organisation as large as Microsoft, with its focus on Azure as the computing platform of the future, someone inside should have tested against edge cases exactly like this one — they do happen every four years.
Not using a numeric timestamp, such as Unix epoch time, and translating it to text only at the last possible moment, is a bit baffling; but that’s spilled milk, and Microsoft says that it’s fixed now.
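For what it’s worth, a sketch of the approach I have in mind, in Python and assuming a hypothetical one-year validity period (Azure’s actual certificate lifetimes may differ), would look something like this:

```python
import time

ONE_YEAR_IN_SECONDS = 365 * 24 * 60 * 60   # assumed validity period

issued = time.time()                        # seconds since the Unix epoch
valid_to = issued + ONE_YEAR_IN_SECONDS     # plain arithmetic, no calendar maths

# Only convert to a calendar date at the point of display or serialisation.
print(time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime(valid_to)))
```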
I do have sympathy for the people who were involved in trying to fix the system with minimal downtime, and who, in the process, managed to introduce a new bug. Any time a human is involved, this can happen, and it’s a shame that it had to occur on a platform as large as Windows Azure.