Amazon explains its cloud storage outage

updated 11:25 am EDT, Fri April 29, 2011

 

Issues credit to customers, promises changes


Amazon has offered an apology and a detailed explanation for the disruption of its cloud storage service last week. During an operation to increase network capacity, traffic to the Elastic Block Store (EBS) system was incorrectly rerouted to a slower secondary backup network. The service outage affected a number of websites, including Foursquare and Reddit.

Normally, data is synchronized simultaneously across several nodes. If a node detects that its partner isn't responding, it assumes a failure and asks the server to spawn a replacement. This process is automatic and usually happens so quickly that human intervention isn't needed. However, the shift to the slower network overwhelmed the peer-to-peer synchronization system, resulting in a massive bottleneck beginning at 12:47 AM PDT on April 21.

Engineers quickly diagnosed the nature of the problem, but restoring the balance of network traffic manually is a delicate and involved task. Within two hours, Amazon engineers were able to tamp down the network traffic without affecting other functions, and by 11:30 AM they had resolved the problem for all but 13 percent of the EBS volumes. Finding extra capacity to fully restore service required physically moving new servers into the affected data storage clusters. That operation didn't begin until 2:00 AM the following day, April 22.

Rebalancing the network traffic manually took most of the next two days, but by 6:50 PM on April 23, normal system operation was finally restored. However, 2.2 percent of the "stuck" volumes would have to be recovered manually. Eventually, all but 0.07 percent of the data was recovered.

The company is reviewing its procedures for making changes to its network, stating that it will "increase automation" to prevent similar human errors in the future. Amazon is also promising a list of other improvements, such as having more capacity available for recovery.

Amazon also issued an apology to its EC2 customers, stating:

Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.


Customers are getting a 10-day credit, whether their service was affected or not. [via All Things Digital]



By Electronista Staff
