A lightning strike in Dublin took out a power transformer. In and of itself, that isn’t all that unusual or noteworthy, but this particular lightning strike also impacted the backup power systems at Amazon’s cloud data center, knocking the service offline. Looking back, there are some lessons to be learned both for Amazon, and for businesses that rely on cloud services.
We’re talking about a massive Amazon data center. Data centers are built from the ground up with backups and failovers designed to address virtually any scenario and ensure the survivability and availability of the data center no matter what sort of catastrophe strikes. Amazon, of course, has redundant mechanisms in place, but obviously they didn’t work in this case.
On its Service Health Dashboard site for the European EC2 cloud service, Amazon explains, “Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them. Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization.”
In a nutshell, the lightning strike was direct and powerful enough that it simultaneously took out the transformer and the phase control system needed to bring the backup generators online. Amazon is in the process of restoring service and data for customers, a process that is taking longer than expected and has required Amazon to add server capacity to handle the load.
So, what are the lessons to be learned here? Amazon should conduct a postmortem once the service is fully recovered. First, it should analyze the circumstances that led to primary and backup power being knocked out at the same time. It should determine the likelihood of such an event occurring again and what, if anything, can be done to avoid it. Perhaps the backup power should be on a different grid from the primary power, or maybe this is such a fluke incident that such an investment is cost-prohibitive.
Next, Amazon should review the recovery and restoration process. It should consider the hurdles it has encountered, such as needing additional server capacity to handle the load, and it should revise its incident response procedures to make any future disaster recovery operations more effective and efficient.
If you are a customer of Amazon, of Microsoft (which was also affected by the Dublin lightning storm), or of any other cloud data or server service, there are lessons to be learned as well. As I explained a few months ago following a cloud outage for Amazon in the United States, “Don’t use cloud services unless you can adequately answer the question ‘what happens to my business if the cloud service is unavailable?’”
You should have your own redundancy and disaster recovery systems in place. Depending on how crucial your cloud server or data storage is to normal business operations, you could contract with more than one cloud service provider to hedge your bets and prevent an outage at one provider from taking down all of your operations at once.
You should also make sure you understand the failover and redundancy mechanisms offered by your cloud provider. Amazon offers Availability Zones that enable customers to set up their own redundancy within the cloud.
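To make that kind of redundancy concrete, here is a minimal sketch in Python of the failover idea: try one location first and fall back to the next if it is unreachable. It applies whether the second location is another provider or another Availability Zone. Everything in it is a hypothetical placeholder rather than any real provider's API: `fetch_with_failover`, `primary_fetch`, `secondary_fetch`, and the `orders.csv` key are invented for illustration, and the primary is hard-coded to fail so that the fallback path is visible.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def fetch_with_failover(
    providers: Sequence[Tuple[str, Callable[[str], bytes]]],
    key: str,
) -> Tuple[str, bytes]:
    """Try each provider in order and return the first successful result."""
    errors: List[str] = []
    for name, fetch in providers:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure means "move on to the next one"
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real storage clients (assumptions, not real APIs).
def primary_fetch(key: str) -> bytes:
    # Simulates the primary data center being offline.
    raise ConnectionError("primary data center is offline")


def secondary_fetch(key: str) -> bytes:
    # Simulates a healthy copy of the data held somewhere else.
    return f"data for {key}".encode()


PROVIDERS = [("primary", primary_fetch), ("secondary", secondary_fetch)]

if __name__ == "__main__":
    source, data = fetch_with_failover(PROVIDERS, "orders.csv")
    print(f"served by {source}: {data!r}")
```

A sketch like this only helps if the data has already been copied to the second location before the outage; the fallback cannot retrieve what was never replicated.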
The ultimate lesson, though, is that nothing is 100 percent guaranteed. Even the most reliable service can be knocked offline by a fluke natural disaster, or even catastrophic human error. Your mission is to develop a system that enables you to continue business operations no matter what.