What have we learned from Google's latest outage? That 99.9 percent uptime doesn't matter during the other one-tenth of one percent.
Gmail is the core of the Google Apps suite that is targeting Microsoft Office. Imagine Google does that successfully and tens, maybe hundreds of millions of users' connected offices go offline simultaneously due to some Google glitch.
(My colleague Ian Paul agrees that the outage casts a dark cloud over cloud computing.)
That prospect ought to be enough for sensible people to let others enjoy Google's growing pains. It is also why Gmail and Google Apps users are wise to retain other ways of getting their work done. But if we can't rely on Google Apps, why are we using them?
Yes, I know Exchange servers can crash, too. And I don't want to slam Google too hard, just balance what Google can give us with what the company can also inadvertently take away.
In a blog post, Gmail's "site reliability czar" tells us that Google noticed the outage almost immediately and began steps to deal with the problem, which still took "about 100 minutes" to solve.
"I'd like to apologize to all of you — today's outage was a Big Deal," Czar Ben Treynor wrote in his mea culpa, which concluded the outage was caused by server overload that itself was caused by workers taking other servers offline for maintenance.
The remaining servers couldn't rebalance the load and we know what followed. This is very similar to what happened in May, when 14 percent of Google's search capability was lost.
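To make that failure mode concrete, here is a toy sketch of how pulling a few servers out of a pool can tip the rest over. It is not Google's actual load balancer; the function name, the capacities, and the even-split rule are all assumptions made purely for illustration.

```python
def surviving_servers(total_load, capacities):
    """Toy model: split the load evenly across the servers still up,
    drop any server whose share exceeds its capacity, and repeat
    until the pool is stable or empty. All numbers are hypothetical."""
    alive = list(capacities)
    while alive:
        share = total_load / len(alive)
        still_ok = [c for c in alive if c >= share]
        if len(still_ok) == len(alive):
            return alive  # every remaining server can carry its share
        alive = still_ok  # overloaded servers drop out; load shifts again
    return []


# Ten servers, each able to handle 100 units, carry 800 units comfortably.
print(surviving_servers(800, [100] * 10))

# Take three offline for maintenance and the same load overwhelms the rest.
print(surviving_servers(800, [100] * 7))
```

In this model the pool has headroom right up until it doesn't: the same traffic that was fine on ten servers leaves nothing standing on seven.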
While Google's claim of 99.9 percent availability sounds laudable, it is clearly not enough. Google needs to add another nine, to 99.99 or better, if it wants to earn and keep customer trust.
The outage calls into question the whole idea of cloud computing and whether Google and other vendors are asking users to put too many eggs into baskets that can't reliably hold them. It's one thing for a single company's e-mail system to fail, another when "the majority" of Gmail users lose mailbox access.
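For a sense of what one extra nine buys, the arithmetic can be sketched in a few lines. The measurement windows here (a 365-day year and a 30-day month) are my assumptions; Google's availability figure may be calculated over a different period.

```python
def allowed_downtime_minutes(availability_percent, period_days):
    """Minutes of downtime permitted over a period at a given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


for target in (99.9, 99.99):
    per_year = allowed_downtime_minutes(target, 365)
    per_month = allowed_downtime_minutes(target, 30)
    print(f"{target}%: {per_year:.1f} min/year, {per_month:.1f} min per 30 days")
```

Under those assumptions, 99.9 percent allows roughly 8.8 hours of downtime a year, or about 43 minutes in a 30-day month, so a single 100-minute outage uses up more than two months' worth. At 99.99 percent the yearly allowance shrinks to under an hour.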
For companies that rely on Google Apps for all their productivity applications, a prolonged outage could result in employees being sent home, perhaps at full pay, for the duration.
Google won't say how many users were affected yesterday, which makes me wonder whether the company even really knows. The Google Apps Status Dashboard didn't provide much real-time detail about the outage, giving the impression that the company had some trouble determining what the problem actually was.
Google needs to find a way to offer more information to users, more quickly, and make it more easily obtainable.
Again, I don't want to be too hard on Google. This technology, at least the way Google implements it and the scale at which it is being done, is still fairly new. Google should, over time, solve the problem, and someday mass outages like yesterday's may become a distant memory.
But, if Google fails to dramatically improve reliability and somehow manages to successfully challenge Microsoft Office, many businesses could find themselves in a world of hurt someday.
I wonder how well Czar Treynor slept last night?