
Amazon explains its cloud storage outage

updated 11:25 am EDT, Fri April 29, 2011

Issues credit to customers, promises changes

Amazon has offered an apology and a detailed explanation for the disruption of its cloud storage service last week. During an operation to increase network capacity, traffic to the Elastic Block Store (EBS) system was incorrectly rerouted to a slower secondary backup network. The service outage affected a number of websites, including Foursquare and Reddit.

In normal operation, data is synchronized simultaneously across several nodes. If a node detects that its partner isn't responding, it assumes a failure and asks the server to spawn a replacement. This process is automatic and usually happens so quickly that human intervention isn't needed. However, the shift to the slower network overwhelmed the peer-to-peer synchronization system, resulting in a massive bottleneck beginning at 12:47 AM PDT on April 21.
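To illustrate why a slower network can trigger many replacement requests at once, here is a minimal sketch of timeout-based failure detection. It is not Amazon's implementation: the Node class, the replacement_requests function, the timeout value, and the five-second delay are all hypothetical, chosen only to show the general pattern described above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    partner: str
    last_heard: float  # time (seconds) at which the partner last responded


def replacement_requests(nodes, now, timeout):
    """Return the names of nodes that would ask for a new partner.

    A node assumes its partner has failed when it has heard nothing
    for longer than `timeout` seconds (a hypothetical threshold).
    """
    return [n.name for n in nodes if now - n.last_heard > timeout]


# Healthy network: only one node has a silent partner.
nodes = [
    Node("a", "b", last_heard=9.5),
    Node("b", "a", last_heard=9.8),
    Node("c", "d", last_heard=2.0),
]
healthy = replacement_requests(nodes, now=10.0, timeout=1.0)

# Slow network: every response arrives 5 seconds late (assumed delay),
# so every node concludes that its partner has failed at the same time.
delayed = [Node(n.name, n.partner, n.last_heard - 5.0) for n in nodes]
congested = replacement_requests(delayed, now=10.0, timeout=1.0)

print(healthy)
print(congested)
```

In this toy model, a delay that affects every link makes healthy partners look dead, so requests for replacements arrive all at once rather than one at a time.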

Engineers quickly diagnosed the nature of the problem, but restoring the balance of network traffic manually is a delicate and involved task. Within two hours, Amazon engineers were able to tamp down the network traffic without affecting other functions, and by 11:30 AM they had resolved the problem for all but 13 percent of the EBS volumes. Finding extra capacity to fully restore service required physically moving new servers into the affected data storage clusters. That operation didn't begin until 2:00 AM the following day, April 22.

Rebalancing the network traffic manually took most of the next two days, but by 6:50 PM on April 23, normal system operation was finally restored. However, 2.2 percent of the "stuck" volumes would have to be recovered manually. Eventually, all but 0.07 percent of the data was recovered.

The company is reviewing its procedures for making changes to its network, stating that it will "increase automation" to prevent similar human errors in the future. Amazon is also promising other improvements, such as having more capacity available for recovery.

Amazon also issued an apology to its EC2 customers, stating:

Last, but certainly not least, we want to apologize. We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.

Customers are getting a 10-day credit, whether their service was affected or not. [via All Things Digital]

By Electronista Staff

