On 28 February 2017 the Amazon Web Services (AWS) S3 cloud storage service failed, bringing down major websites and services, including Imgur, Medium and the Docker Registry Hub. The reason why might just surprise you!
A detailed postmortem reveals what brought down the S3 service: an employee apparently typed a command input "incorrectly," causing a larger-than-expected set of servers to be removed from service. The servers in question supported two other S3 subsystems, setting off a chain reaction that required a full restart of those critical systems.
The restart is what led to the S3 failure. The cheap and popular Amazon service is used mainly for file storage, which is why many websites were technically still up, even if they loaded slowly or failed to load images. At one point even the AWS dashboard was down because of the S3 issue.
"While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years," an explanation by the AWS team reads. "S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."
Amazon says it is putting safeguards in place to ensure such an outage will not happen again. These include limiting the ability of its debugging tools to take multiple subsystems offline, and partitioning S3 into smaller "cells" that engineers can take offline and debug individually without affecting other parts of the service.
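To make the idea of a safeguard concrete, here is a minimal sketch of how a capacity-removal tool might refuse a mistyped input. It is not Amazon's actual tooling: the `Subsystem` class, the `remove_servers` function, the `max_per_call` limit and the `min_required` floor are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would take too much capacity offline."""


@dataclass
class Subsystem:
    # All fields are hypothetical; real tooling would track far more state.
    name: str
    active_servers: int
    min_required: int  # assumed floor below which the subsystem cannot serve traffic


def remove_servers(subsystem: Subsystem, count: int, max_per_call: int = 5) -> int:
    """Remove `count` servers and return how many remain active.

    Rejects requests that exceed a per-call limit or that would push the
    subsystem below its minimum capacity, so a typo fails loudly instead
    of silently removing a large set of servers.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    if count > max_per_call:
        raise CapacityError(
            f"refusing to remove {count} servers from {subsystem.name}: "
            f"limit is {max_per_call} per call"
        )
    if subsystem.active_servers - count < subsystem.min_required:
        raise CapacityError(
            f"refusing to remove {count} servers from {subsystem.name}: "
            f"would drop below the minimum of {subsystem.min_required}"
        )
    subsystem.active_servers -= count
    return subsystem.active_servers
```

With a subsystem of 100 servers and a floor of 80, asking to remove 3 servers succeeds, while asking for 30 (say, an extra digit typed by mistake) raises `CapacityError` and leaves the subsystem untouched.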
Either way, if Amazon's explanation is accurate, talk about the awesome power of a simple typo!