Microsoft Offers Credits for 'Leap Day' Azure Outage
To make up for a string of outages that were caused by a software bug in its Azure cloud services Microsoft is granting affected customers a 33% credit for the time they were left stranded during the Feb. 29 failure.
[BACKGROUND: Microsoft's Azure cloud suffers serious outage]
The problem stemmed from two overlapping circumstances: that Feb. 29 comes around only every four years and that when Azure initializes virtual machines for customer applications, a certificate is exchanged and given a valid-to date of one year. When certificates were issued starting 4 p.m. PST on Feb. 28, they were given a valid-to date of Feb. 29, 2013 which won't occur and was therefore interpreted as invalid.
This glitch set off a series of retries that also failed and led to the conclusion by the system that the hardware on which the virtual machines were running had failed. That led to attempts to migrate the failed virtual machines to other server hardware within the same Azure cluster, which consists of about 1,000 physical servers.
The migrated VMs also failed to initialize for the same reason and more and more hardware was judged failed until a threshold was reached that halted all attempts to reincarnate virtual machines anywhere in affected clusters. That allowed those clusters to stay in service at reduced levels, the blog says.
Azure also shut down the customer service management platform so customers couldn't add applications or expand capacity for running applications, both of which would have made the problem worse by calling for new virtual machines. "This is the first time we've ever taken this step," the blog says. Running applications were left intact.
It took 13 hours and 23 minutes to patch the bug in all but seven Azure clusters. Those seven were in the midst of a software upgrade, and so posed a separate problem. Should the host agents and guest agents that were exchanging the invalid certificates be upgraded to the newest patched versions or restored to the old versions but patched?
They decided on the latter, but that didn't work out because they didn't also revert to an earlier version of the network plug-in that configures a virtual machine's network. The new network plug-in was incompatible with the old host agents and guest agents. The result was that all virtual machines in those seven clusters were disconnected from the network.
The affected clusters included servers for Access Control Service (ACS) and Windows Azure Service Bus, both of which failed as a result. Cleaning up this problem entirely took until 2:15 a.m. March 1, the blog says.
Microsoft is taking three steps to prevent a similar problem. First, it will test for time incompatibilities in its software. It will also change its fault isolation so the system doesn't assume a hardware failure in this type of circumstance. And third, it will allow for a graceful degradation of customer management rather than turning the platform off altogether. This will allow blocking new virtual machines or expansion of old ones but continue to allow management of existing virtual machines.
The company is also upgrading its detection so issues are discovered and addressed more quickly. It is also upgrading the customer dashboard to remain more available in crises.
Because customer service lines were swamped, customers had to wait a long time for help, so the company is reevaluating staffing and considering better use of blogs, Twitter and Facebook to get the word out about problems.
To help during recovery from outages, the company is creating internal software tools, setting priorities to reestablish customer services more quickly and give customers better visibility into what progress is being made to restore services.
Read more about software in Network World's Software section.