Earlier this week, Amazon Web Services suffered a significant outage that brought down a number of online services and affected Apple’s iCloud platform. While little was known at the time about what caused the lengthy outage, Amazon has now published a blog post detailing exactly what went wrong, attributing it to human error.
In a note posted on the Amazon Web Services blog, Amazon explained that the Amazon Simple Storage Service (S3) team was debugging an issue that was causing the S3 billing system to run more slowly than expected. During this process, an S3 team member ran a command intended to remove a small number of servers from a subsystem used by the billing process. As Amazon describes it:
Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests.
The company goes on to explain that S3 subsystems are designed to tolerate the removal or failure of significant capacity with no customer impact. In this case, however, so much capacity was removed that the affected subsystems had to be fully restarted, and because of the massive growth S3 has experienced in recent years, restarting the servers and running the necessary safety checks took longer than expected.
To prevent such an issue from happening in the future, Amazon has modified its tooling to remove server capacity more slowly and added safeguards that prevent a removal from taking any subsystem below its minimum required capacity. The company is also working to speed up the process of restarting subsystems and running the associated safety checks, in part by re-partitioning the index subsystem into smaller sections to shorten recovery time.
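The kind of safeguard described above can be illustrated with a minimal Python sketch. This is not Amazon's tooling: the `Subsystem` class, the `plan_removal` function, and every number used here are hypothetical, chosen only to show how a minimum-capacity check and a per-step limit might reject a mistyped input and slow down a legitimate removal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subsystem:
    """Hypothetical model of a subsystem's server pool (not an AWS API)."""
    name: str
    active_servers: int
    min_required: int  # assumed floor below which the subsystem cannot serve requests


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int) -> List[int]:
    """Split a removal into small batches, refusing it if it breaches the floor."""
    if requested < 0 or max_per_step <= 0:
        raise ValueError("requested must be >= 0 and max_per_step must be > 0")

    # Safeguard 1: never let a removal take the subsystem below its minimum.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise ValueError(
            f"removing {requested} servers would leave {subsystem.name} with "
            f"{remaining}, below its minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, at most max_per_step servers at a time.
    batches: List[int] = []
    left = requested
    while left > 0:
        step = min(left, max_per_step)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=100, min_required=80)
    print(plan_removal(index, requested=12, max_per_step=5))
```

Under these assumed numbers, a request to remove 12 servers is split into batches of 5, 5, and 2, while a mistyped request for 50 servers raises an error before anything is removed.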
The Amazon Web Services outage had a significant impact on the Internet on Tuesday, primarily in the eastern portion of the United States. Apple relies on AWS for some of its iCloud operations, so iCloud performance was slowed for some users as well. Amazon ends its post by apologizing for the problems:
Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.