A system configuration mistake caused the outage that affected Windows Azure customers in western Europe last week, according to Microsoft.
As a result, Microsoft's public cloud application hosting and development platform was unavailable for about two and a half hours on Thursday. Microsoft didn't say how many customers were affected.
At issue was a "safety valve" mechanism in the Azure network infrastructure designed to prevent cascading network failures. It does so by capping the number of connections that network hardware devices accept.
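Microsoft has not published how the mechanism is implemented, but the general idea of a connection cap can be illustrated with a minimal Python sketch. The class name, method names and the limit of 2 are hypothetical and chosen only for illustration; real network hardware enforces such limits in firmware, not in application code.

```python
class ConnectionLimiter:
    """Hypothetical "safety valve": refuses connections beyond a fixed cap."""

    def __init__(self, max_connections):
        if max_connections < 1:
            raise ValueError("max_connections must be at least 1")
        self.max_connections = max_connections
        self.active = 0

    def try_accept(self):
        """Return True and count the connection, or False if the cap is reached."""
        if self.active >= self.max_connections:
            return False
        self.active += 1
        return True

    def release(self):
        """Free one slot when a connection closes."""
        if self.active > 0:
            self.active -= 1


# Assumed cap of 2 connections, purely for demonstration.
limiter = ConnectionLimiter(max_connections=2)
results = [limiter.try_accept() for _ in range(3)]
print(results)  # [True, True, False]
```

The third request is refused because the cap is already reached. If the cap is set too low for the capacity actually behind the device, legitimate traffic hits the limit as well.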
"Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity," wrote Mike Neil, Windows Azure general manager, in a blog post.
A sudden rise in the affected cluster's usage led to the "safety valve" threshold being exceeded, which generated a storm of network management alerts. "The increased management traffic in turn triggered bugs in some of the cluster's hardware devices, causing these to reach 100% CPU utilization impacting data traffic," Neil wrote.
Microsoft resolved the immediate problem by raising the affected cluster's "safety valve" limits. To prevent a recurrence, the company is patching the identified bugs in the networking hardware devices and improving its network monitoring systems so that they can identify and address connectivity issues before they cause outages.
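The mismatch Neil describes, a device limit left unchanged after capacity grew, is the kind of drift an automated validation step could flag. The following Python sketch is an illustration only and does not reflect Microsoft's actual tooling. The `Device` fields, the `connections_per_unit` ratio and the sample values are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    capacity_units: int    # hypothetical measure of the capacity behind the device
    connection_limit: int  # configured "safety valve" cap


def find_undersized_limits(devices, connections_per_unit):
    """Return names of devices whose cap is below what their capacity calls for."""
    return [
        d.name
        for d in devices
        if d.connection_limit < d.capacity_units * connections_per_unit
    ]


# Assumed inventory: "edge-2" gained capacity but kept its old limit.
inventory = [
    Device("edge-1", capacity_units=10, connection_limit=1000),
    Device("edge-2", capacity_units=20, connection_limit=1000),
]
print(find_undersized_limits(inventory, connections_per_unit=100))  # ['edge-2']
```

Running a check of this kind whenever capacity is added would surface a stale limit before a surge in usage does.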
Forrester Research analyst James Staten said that PaaS (platform as a service) clouds such as Azure are very complex and highly automated environments, and sometimes glitches crop up in production that can't be anticipated in test environments. "This appears to be one of those cases," he said via email.
As new features, heavier use and other factors come into play over time, administrators have to adjust and optimize the running system, and occasionally something will break, he said.
"Should it be something clients should be concerned about? Not really. It is an example of the kinds of things that can happen in a cloud environment. But far worse things are more common in a typical enterprise data center," Staten said.
Staten added that IT chiefs and developers planning to host applications in the cloud need to design and configure them to be fault tolerant. "That is a fundamental shift in thinking most developers and enterprise operations teams need to understand when embarking on cloud deployments," he said.
"These types of outages are learning opportunities for both the cloud admins and cloud customers. Rather than view these incidents as indictments of cloud, they should be seen as opportunities to improve your use of the cloud," he added.
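One common way to build the fault tolerance Staten recommends is to retry a failed call and then fall back to a deployment in another region. The Python sketch below is a minimal illustration under stated assumptions: the function names, the region names and the simulated failure are hypothetical, and a production version would add delays between retries and catch narrower exception types.

```python
def call_with_failover(operations, attempts_per_operation=2):
    """Try each operation in order, retrying it before moving to the next one."""
    last_error = None
    for operation in operations:
        for _ in range(attempts_per_operation):
            try:
                return operation()
            except Exception as error:  # broad on purpose for this sketch
                last_error = error
    raise RuntimeError("all operations failed") from last_error


def fetch_from_west_europe():
    # Simulates the primary region being unreachable.
    raise ConnectionError("West Europe unavailable")


def fetch_from_north_europe():
    # Simulates a healthy secondary region.
    return "response from North Europe"


print(call_with_failover([fetch_from_west_europe, fetch_from_north_europe]))
```

Here the call to the primary region fails twice, so the request is served by the secondary region instead of failing outright.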
Juan Carlos Perez covers enterprise communication/collaboration suites, operating systems, browsers and general technology breaking news for The IDG News Service. Follow Juan on Twitter at @JuanCPerezIDG.