The uproar surrounding the partial outage of Amazon’s EC2 cloud services platform got new life last Friday, when the company released a detailed postmortem of the incident. The summary includes a surprising level of detail about the root cause of the outage, information on a service credit for impacted customers, and, finally, an apology from Amazon.

To recap: on April 21st, Amazon made a configuration change during a network upgrade, setting off a cascading series of events that resulted in what it calls a “re-mirroring storm.” As a consequence, Amazon’s storage service was essentially “stuck” and unable to locate new storage space for either new or existing customers. This led to a significant period of degraded functionality and downtime for major Web 2.0 sites such as Reddit, Foursquare and HootSuite, as well as plenty of bad PR for Amazon, a cloud computing pioneer.

Interestingly, though, some other sites were only moderately affected, or not affected at all, most notably Netflix. Why did some sites sink, while others sailed through the storm? As with any complex system, there is no one answer. However, organizations that “architected for failure” tended to fare better than those that did not. Netflix, which recently released its lessons learned from the outage, built its platform around the assumption that services and/or zones within EC2 could be unavailable for extended periods of time.

Clearly, not every EC2 customer has the technical chops to build Netflix-like applications, and Amazon needs to make it easier to increase redundancy by taking advantage of multiple availability zones. However, this is a public cloud platform, and customers that did not take full advantage of Amazon’s redundant architecture, or that did not create their own replicated solutions, ended up paying the price.
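To make the idea of architecting for failure concrete, here is a minimal sketch of zone failover: a request is tried against each availability zone in turn, and the first healthy zone serves it. The names `call_with_failover`, `ZoneUnavailable`, `fake_request` and the zone labels are hypothetical and used only for illustration; this is not Amazon’s API or Netflix’s implementation.

```python
from typing import Callable, List, Sequence


class ZoneUnavailable(Exception):
    """Raised by a (hypothetical) zone client when its zone cannot serve a request."""


def call_with_failover(zones: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each zone in order and return the first successful response."""
    errors: List[str] = []
    for zone in zones:
        try:
            return request(zone)
        except ZoneUnavailable as exc:
            # Record the failure and move on to the next zone.
            errors.append(f"{zone}: {exc}")
    # Every zone failed: surface all the errors together.
    raise RuntimeError("all zones failed: " + "; ".join(errors))


def fake_request(zone: str) -> str:
    # Simulated outage (an assumption for this example): the first zone is down.
    if zone == "zone-a":
        raise ZoneUnavailable("storage stuck")
    return f"served from {zone}"


result = call_with_failover(["zone-a", "zone-b"], fake_request)
print(result)
```

The sketch only covers routing around a failed zone. A real deployment would also need data replicated across zones ahead of time, which is where most of the cost and effort lies.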

While architecting for failure in the cloud may appear to be a purely IT responsibility, it’s not. TPI views cloud as one component of your service delivery strategy, and regardless of whether those components are delivered in-house or outsourced, effective planning across all of them is vital.

Business continuity planning (BCP) and disaster recovery (DR) are two critical parts of this planning process. Unlike a traditional outsourcing agreement, in which the supplier takes on a significant level of responsibility for delivering the BCP and DR plans, the public cloud requires that customers retain a significant amount, if not all, of this responsibility, as well as the associated risk. In return they get a highly scalable, cost-effective and fast-to-provision computing platform.

Bottom line: Some organizations are finding out the hard way what happens when they don’t integrate cloud with their overall service delivery strategy. By including business continuity planning and disaster recovery in the cloud architecture design process, enterprises can significantly reduce the risk of business disruption.

Originally posted by Stanton Jones, Chief Information Officer, TPI on Consider the Source


Amazon’s Outage: Architecting for Failure in the Cloud


May 6, 2011 12:30 AM


Blogger Profile: Consider the Source
TPI is the leader in guiding organizations through effective, lasting transformation of their business support operations. Around the globe we have helped hundreds of clients reduce operating risks, streamline complex operations, improve the cost of support functions, achieve sustainable improvements and make competitive gains. Deciding to change and successfully transitioning existing operations to new service delivery models is hard and replete with risks. While the decisions are never formulaic, the hard-earned lessons of hundreds of prior evaluations are invaluable.

Posted by Sue Ansell at May 6, 2011 12:30 AM

Categories: Cloud computing, Outsourcing
