My initial thoughts concern disaster recovery and Amazon's services. In its last significant outage, in April, a network configuration change led to an outage of services in the eastern United States. This outage raises other questions. Why isn't Amazon deploying a redundant power source, such as a diesel-powered backup? Maybe it did, but the fire blew out a portion of that utility service, so a more serious disaster emerged from an initial transformer explosion.
How could this be addressed? How about failover to services in another geographic location in Europe? This didn't happen. I can only guess that building out another data center is cost-prohibitive at this time, and that is why Amazon doesn't have another European data center. The rest of the article mentions that it will take Amazon up to two more days to bring up the remaining servers.
It mentions that a significant amount of time is being taken to start all of the servers up again. It also states that Microsoft, which has services in the same data center, does not have the same weakness. I wonder why this is; data replication should be a high priority, especially when Amazon lacks full-scale data center disaster recovery.
So, it looks as if Amazon has more cloud service weaknesses that are surfacing under operational stress. How can mid-sized and small businesses that outsource their web applications to Amazon's cloud protect themselves? It's clear that Amazon supports cloud applications where profitable. I suggest that those firms create a very detailed, per-application SLA (Service Level Agreement) that lists global uptime, performance objectives, and the penalties that apply when service doesn't meet those objectives.
In my last couple of articles, I outlined questions to ask the service provider that reveal a current application's architecture. These questions can be asked about every application that a company wants a cloud provider to manage. This information, along with uptime requirements and performance statistics, can be combined to form the SLA.
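To make the idea concrete, here is a minimal sketch of how per-application SLA terms might be recorded and checked against measured figures. Everything in it is an assumption for illustration: the `AppSLA` fields, the `service_credit` rule (one flat credit per missed objective), and the sample numbers are hypothetical and do not come from any Amazon agreement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppSLA:
    """Hypothetical per-application SLA terms (all values are illustrative)."""
    app_name: str
    uptime_target_pct: float      # e.g. 99.9 means 99.9% uptime per period
    max_response_ms: float        # performance objective
    monthly_fee: float            # what the customer pays for this application
    credit_pct_per_breach: float  # share of the fee credited per missed objective


def measured_uptime_pct(total_minutes: int, downtime_minutes: int) -> float:
    """Uptime as a percentage of the measurement period."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def service_credit(sla: AppSLA, uptime_pct: float, avg_response_ms: float) -> float:
    """Credit owed to the customer: one penalty share per missed objective."""
    breaches = 0
    if uptime_pct < sla.uptime_target_pct:
        breaches += 1
    if avg_response_ms > sla.max_response_ms:
        breaches += 1
    return sla.monthly_fee * sla.credit_pct_per_breach / 100.0 * breaches


if __name__ == "__main__":
    # Assumed example: a 30-day month (43,200 minutes) with two days of downtime.
    sla = AppSLA("storefront", 99.9, 500.0, 2000.0, 10.0)
    uptime = measured_uptime_pct(43_200, 2_880)
    print(f"uptime: {uptime:.2f}%")
    print(f"credit: {service_credit(sla, uptime, 420.0):.2f}")
```

A real agreement would likely use tiered penalties and per-region measurements; the flat per-breach credit here is only the simplest rule that shows how the uptime, performance, and penalty terms fit together.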
It is likely that Amazon and other major cloud providers will not support extensive disaster recovery plans until SLAs penalize them into delivering that service well. Well-defined SLAs lead to global trade growth because they help ensure that business runs well globally. This business handshake builds trust between the two parties. And we all know that 'Trust is Trade.'
This story, "Lessons Learned From a Recent Amazon Outage," was originally published by CSO.