(The Hosting News) – Amazon has released a report that details last week’s server crash that resulted in massive downtime for sites hosted on its cloud-based Amazon Web Services.
The company reported that the outage was caused by an incorrectly executed traffic shift during a capacity upgrade on Amazon Web Services' primary network.
The accident was catastrophic for services reliant on the cloud platform. The report states, “Unlike a normal network interruption, this change disconnected both the primary and secondary network simultaneously, leaving the affected nodes completely isolated from one another”.
In the aftermath, Amazon was able to contain the crash to fewer availability zones, and a majority of the services were back up and running later on the 21st.
After detailing its response, AWS described efforts to improve its system in order to prevent similar events from occurring in the future.
“We will audit our change process and increase the automation to prevent this mistake from happening in the future”, AWS stated.
The company also noted there was room to improve when it came to coordinating with its customers and stated, “We switched to more regular updates part of the way through this event and plan to continue with similar frequency of updates in the future. In addition, we are already working on how we can staff our developer support team more expansively in an event such as this, and organize to provide early and meaningful information, while still avoiding speculation.”
The AWS Team apologized for the inconvenience and concluded by noting, “We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes”.
With the apology, AWS has promised a 10-day credit to customers affected by the outage.
The recent Amazon Web Services crash has led some analysts to question the overall stability of the cloud and what can be done to improve it.
For the entire report, visit: http://aws.amazon.com/message/65648/