The second of two lengthy outages that hit Visual Studio Online in November was caused by the same kind of issues as the first, Microsoft has just disclosed.
Visual Studio Online, which allows developers to plan and track software development projects and share code, was unavailable for just over seven and a half hours on Monday of last week. That outage followed another one the week before.
In both cases, updates to Azure were to blame, and procedures that hadn’t been followed delayed the process of finding a fix.
That’s according to Brian Harry, a Microsoft Technical Fellow who this week explained what went wrong in a candid, detailed blog post and in an interview.
His postmortem account is a model of transparency for cloud services, with details of what went wrong and why last week’s outage was “probably 3-4 hours longer than it had to be due to inefficiencies.” The detail is key to running cloud services, according to Harry, who works as the Product Unit Manager for Team Foundation Server.
“With service outages, in a modern devops world it’s all about root cause analysis. It’s important we get the service back up, but root cause comes first, because just getting the service back up with the threat it will happen again is not a victory,” he said on Thursday.
It may also become a model for other Microsoft services, including Azure, where some customers were disappointed with communications during the last outage. “When we have a major issue with a high-profile service everybody cares; you get all kinds of people involved in trying to help communicate. The end result was less transparent and less empathetic communications than I think we would want. As a result you’re going to see changes in the way we communicate about Azure outages. You really have to have someone willing to stand out there and say ‘I own this and this is what I’m doing about it,’” he said.
In the latest outage case, an update to the SQL Azure cloud database service included a new feature designed to automatically find and repair databases with unusually high numbers of errors. Using the gradual rollout system that Microsoft calls “flighting,” this was loaded in one SQL Azure region, where it caused problems for a Visual Studio Online procedure that generates a lot of duplicate record errors—but is designed to ignore them. Trying to handle the 170,000 exceptions per minute being generated—and successfully ignored—took so many resources that it made a key database lock up in a few hours.
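To make the "generate errors, then ignore them" pattern concrete, here is a minimal sketch. It is not Microsoft's code: it uses SQLite from the Python standard library rather than SQL Azure, and the table name `items` and both function names are hypothetical. Harry's account does not say how the Visual Studio Online procedure was written or changed; the sketch only contrasts an insert loop that raises and swallows a duplicate-key error for every repeated record with one that asks the database to skip duplicates so no error is raised at all.

```python
import sqlite3
from typing import Iterable


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a single unique-key table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
    return conn


def insert_ignoring_errors(conn: sqlite3.Connection, ids: Iterable[int]) -> int:
    """Insert each id; swallow duplicate-key errors and return how many were raised."""
    errors = 0
    for item_id in ids:
        try:
            conn.execute("INSERT INTO items (id) VALUES (?)", (item_id,))
        except sqlite3.IntegrityError:
            # The caller does not care about duplicates, but the database
            # still had to raise (and anything watching errors still sees) each one.
            errors += 1
    return errors


def insert_without_errors(conn: sqlite3.Connection, ids: Iterable[int]) -> None:
    """Insert ids, letting SQLite skip duplicates silently instead of raising."""
    conn.executemany(
        "INSERT OR IGNORE INTO items (id) VALUES (?)",
        [(item_id,) for item_id in ids],
    )


def row_count(conn: sqlite3.Connection) -> int:
    """Return the number of rows currently in the items table."""
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Both functions leave the table in the same state; the difference is how many errors the database produces along the way, which is the number an error-driven repair feature would react to.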
Some things in Microsoft’s procedure for handling cloud problems went the way they were supposed to. Monitoring systems spotted the problem late Sunday night, over an hour before customers in Europe started tweeting about not being able to access their accounts. The update was being rolled out one region at a time, unlike the previous Azure update mistakenly deployed in multiple locations. Once the Visual Studio Online problem was identified, the update was stopped before it deployed in the next region, and Harry believes no other Azure customers were affected. The new feature also came with the option to turn it off.
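The one-region-at-a-time rollout described above can be sketched in a few lines. This is an illustration under stated assumptions, not Azure's deployment tooling: `staged_rollout`, `deploy` and `is_healthy` are hypothetical names, and the health check stands in for whatever monitoring signal a real system would use.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    regions: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to one region at a time and stop at the first unhealthy region.

    Returns the regions that received the update, including the failing one.
    """
    deployed: List[str] = []
    for region in regions:
        deploy(region)
        deployed.append(region)
        if not is_healthy(region):
            # Halt here so the update never reaches the next region.
            break
    return deployed
```

The point of the design is containment: a bad update is limited to the regions already touched, which matches Harry's belief that no other Azure customers were affected.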
But it took more than three hours to identify that the problem was caused by the SQL Azure change, because the Azure operations team didn’t have documentation covering the new feature or even showing it was part of the update. Even when they knew the update was the problem and the feature could be turned off, it took another 45 minutes to work out how, and then another hour to find and fix an unrelated bug in the roaming feature that lets developers have their Visual Studio settings automatically synced to another machine.
Coping with the always-changing cloud
The details about the outage are a window into how Microsoft is building cloud services.
“In any large-scale system there will be failure, and you have got to be resilient to those failures,” Harry said. “There are things we are doing to become more resilient, but the level of investment to do this well is quite high, and it takes time.”
All but one of the main Visual Studio Online systems have been converted into smaller redundant services; the one that remains, Shared Platform Service, was the system affected here. A “circuit breaker” system, for turning off specific features or groups of users to keep the rest of the system available, won’t cover all features for another two to three months and isn’t yet mature enough to trip automatically.
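A manually operated circuit breaker of the kind described, one that can switch off a feature for everyone or only for a group of users, might look roughly like the sketch below. The class and method names are hypothetical, and nothing here reflects how the Visual Studio Online team actually built theirs; in keeping with the description, the breaker trips only when an operator calls `trip`, never on its own.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class CircuitBreakers:
    """Manual switches for features and (feature, user group) pairs."""

    disabled_features: Set[str] = field(default_factory=set)
    disabled_groups: Set[Tuple[str, str]] = field(default_factory=set)

    def trip(self, feature: str, group: Optional[str] = None) -> None:
        """Turn a feature off for everyone, or for one user group only."""
        if group is None:
            self.disabled_features.add(feature)
        else:
            self.disabled_groups.add((feature, group))

    def reset(self, feature: str, group: Optional[str] = None) -> None:
        """Turn a feature back on (for everyone, or for one user group)."""
        if group is None:
            self.disabled_features.discard(feature)
        else:
            self.disabled_groups.discard((feature, group))

    def allows(self, feature: str, group: str) -> bool:
        """Return True if this feature may run for a user in this group."""
        return (
            feature not in self.disabled_features
            and (feature, group) not in self.disabled_groups
        )
```

A request handler would call `allows("roaming", user_group)` before doing the work and return a degraded response if the answer is no, so a failing feature cannot take the rest of the service down with it.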
As Azure becomes a critical underpinning for other Microsoft services, there are also questions about coordinating changes, which are staggered across different regions. Harry is keen to see an Azure-wide “canary” region, similar to the fast ring in the Windows 10 technical preview and the Office 365 First Release program. “Imagine if any customer could sign up to have resources in that region, so that not only do we get to test our services all together as we roll out, but our customers who are building on Azure could choose to have some fraction of testing or production in the Azure canary region and get an early peek at changes that are coming.”
The role of developers in devops
Most cloud outages are caused either by changes or by combinations of problems outside the service that expose hidden problems. The trick is spotting something going wrong before the situation becomes critical, and responding. “We get many terabytes of telemetry a day,” Harry said. “You need tools to search it, mine it and understand what it’s telling you.”
Getting this right involves developers as well as operations, which helps explain the pattern of layoffs at Microsoft earlier in the year; they were aimed at bringing those teams closer together. “We’re on a journey of transferring more responsibility for things you would traditionally call ops to the development team,” he said. The engineering team on Visual Studio Online is now in charge of deploying the code they write.
Another change came after an outage where an alert hours in advance indicated a problem, “but it was buried in noisy alerts and it looked like alerts that are traditionally ignored—so they ignored it.” Now developers are responsible for closing alerts, and if an alert is too easily triggered and gets ignored, that includes fixing it to be more useful.
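One way a team could find the alerts that are "too easily triggered" is to compare how often each alert fires with how often anyone acts on it. The sketch below is an assumption about how that might be done, not a description of Microsoft's tooling; the function name, the event format and both thresholds are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisy_alerts(
    events: Iterable[Tuple[str, bool]],
    min_fired: int = 10,
    max_action_rate: float = 0.1,
) -> List[str]:
    """Return alert names that fire often but are rarely acted on.

    Each event is (alert_name, acted_on). An alert is flagged when it fired
    at least `min_fired` times and at most `max_action_rate` of those
    firings led to action.
    """
    fired: Counter = Counter()
    acted: Counter = Counter()
    for name, acted_on in events:
        fired[name] += 1
        if acted_on:
            acted[name] += 1
    return sorted(
        name
        for name, count in fired.items()
        if count >= min_fired and acted[name] / count <= max_action_rate
    )
```

Alerts on the resulting list are candidates for retuning or removal, so that a genuine early warning is not buried among signals everyone has learned to ignore.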
This is all part of the way Microsoft is doing devops at scale. It’s not just the operations team that gets paged in the middle of the night when things go wrong. Even senior executives take turns carrying pagers overnight for major incidents. As well as the 24-7 monitoring team, Harry has developers around the world who can assess the problem, with an engineer on call for each service available in 15 minutes.
That 15-minute window is the Visual Studio Online team policy, not a company-wide one. “Each team is finding their way to how they manage this,” he said. In this outage, reaching someone who understood the SQL Azure change took over an hour.
Making that work comes down to not just rotating who is on call, but how leaders focus on understanding what went wrong—and not who was to blame.
“When I say we, I often mean we, Microsoft,” Harry explains. “It’s not my purpose to point fingers and say that team needs to improve, but to really think as one company and to think about accountability in a slightly bigger way. One of my first rules is, everybody is allowed to make a mistake; nobody is allowed to repeat a mistake.”