A Gmail glitch that hit close to 50 percent of the webmail service’s users has been fixed after about 10 hours, ending one of the longest, most widespread Gmail disruptions in years.
Affected users endured email delivery delays and difficulties downloading attachments due to a still unexplained bug first acknowledged by Google at around 10:30 a.m. U.S. Eastern Time Monday. The company declared it patched at 10 p.m.
On its Google Apps Status site, the company pegged the start of the problem at close to 9 a.m. and its resolution at 6:30 p.m.
On Tuesday, Google offered more details about the cause of the problem and the steps it’s taking to prevent it from happening again.
The cause was a “very rare” dual network failure, which brought down two separate, redundant network paths, according to a blog post from Sabrina Farmer, senior site reliability engineering manager for Gmail.
“The two network failures were unrelated, but in combination they reduced Gmail’s capacity to deliver messages to users,” she wrote.
Over the next few weeks, Google staffers will work on bulking up network and backup capacity for Gmail, as well as on making Gmail’s message delivery more resilient in the event of a network crash, according to Farmer.
“Finally, we’re updating our internal practices so that we can more quickly and effectively respond to network issues,” she wrote.
The issue affected individuals who use the free version of Gmail as well as businesses, schools and government agencies that pay for it as part of the Google Apps cloud collaboration and email suite.
In the U.S., the disruption covered most of the workday on both coasts, which heightened the impact of the bug for millions.
People who depend on Gmail for critical tasks took to Twitter, discussion groups and other online forums to express their frustration.
The last time Google gave an official figure for active Gmail users was more than a year ago, when it said there were more than 425 million.
Assuming conservatively that the service now has about 450 million active users, Monday’s disruption likely affected more than 200 million users, plus senders on other email platforms whose messages weren’t received in a timely fashion.
Google said that the severity and length of the impact varied among users. About 29 percent of messages received were delayed by an average of 2.6 seconds, but some mail was “severely delayed.”
“We apologize for the duration of today’s event; we’re aware that prompt email delivery is an important part of the Gmail experience, and today’s experience fell far short of our standards,” the company wrote on the status site.
The incident is a big deal for both Google and those affected, but it shouldn’t on its own dissuade CIOs from using the suite, said Forrester Research analyst TJ Keitt.
“Data centers hosting multi-tenant collaboration services aren’t immune to disruptions. So, when they happen, the way to judge the vendor is on how well they identify and resolve the problem, and then inform the public to how they resolved the issue,” Keitt said.
By that criterion, Google’s updates during the incident could have been more transparent and detailed about the nature of the problem and the strength of the fix that was put in place, he said via email.
“They have clearly not communicated this publicly, so I hope they’ve been forthcoming with this information with their clients,” Keitt said.
Meanwhile, Matthew Cain, a Gartner analyst, said the incident raises fundamental questions about what is considered downtime, especially as it relates to service-level agreements from cloud application vendors.
“If message delivery is delayed 15 minutes, is that considered downtime? What about 2 hours?” he said via email. “The move to cloud email puts a spotlight on these essential questions about how to meter and compensate for subpar messaging performance that is not traditionally classified as ‘downtime.’”
Updated 10:15 a.m. 9/24/2013 with information from Google’s Sabrina Farmer