Introduction
Most IT professionals, and AWS users in particular, will by now be aware of Amazon’s significant outage on 28th February 2017, when they managed to “break the internet” as a result of a major failure of the AWS S3 service. The official incident report is available on the AWS website.
The key factors behind the outage, and the associated impact, were as follows:
- The AWS S3 storage service in the North Virginia (us-east-1) region was unavailable for several hours.
- The service outage affected the entire region, spanning all Availability Zones (AZs).
- The root cause of the outage was human error by a member of the AWS internal team while performing an apparently routine maintenance task.
- The S3 service is used by hundreds of high-profile websites, internet services, and other corporate infrastructure, which were unable to serve content as a result of the outage.
- Non-Amazon services that were affected by the outage included various Apple services (App Store, Apple Music and iCloud backups), Slack, Docker, Yahoo Mail, and other high-profile websites and services.
What the S3 Outage Teaches Us
Amazon’s incident report is interesting not because of what it tells us about the incident, but because of what it tells us about the limitations and risks of using cloud-based infrastructure in general, and AWS in particular.
Below we outline some key points to consider when planning your cloud infrastructure, to protect you from similar outages in future.
1. Design and Plan for Human Factors
Two decades ago the aviation industry identified the “automation paradox”. As automation was increasingly introduced to commercial aircraft systems, big improvements in reliability and safety were achieved. Removing the human pilot from the loop during stages of flight such as an instrument approach eliminated the possibility of human error and improved safety.
But automation also results in far more complex systems. This increases the possibility of humans being unable to deal with system problems quickly enough when they do arise, along with the possibility of simple human errors and oversights having a magnified impact compared with a more manual system. A simple data entry error on a flight management system can create a lethal flight path.
AWS and other cloud infrastructure platforms have the same problem. Everything is automated. Everything is software defined. Manual steps are eliminated, and human errors and related outages can be avoided. But at the same time, one small human error while operating the automation – one click of a mouse button, or one script run incorrectly – can have very serious consequences.
Probably nobody has ever accidentally gone to a datacentre and then accidentally unplugged a whole rack of servers, and then accidentally thrown them in the trash. But you can guarantee that there are AWS EC2 instances being accidentally trashed every day of the week, made possible by the double-edged sword of automation.
When Amazon themselves have just suffered from this combination of human factors and the dangers of automation, you shouldn’t kid yourself that the same can’t happen in your own organisation.
Understand the human factors behind your systems reliability. Identify where automation makes it super easy to make huge mistakes, as well as super easy to get the job done.
2. Multiple Availability Zones != Full Redundancy
Regions are a core concept of AWS, with most AWS resources living inside a specific region that you must choose at creation time. Within a region, we have Availability Zones, which in theory provide fault tolerance within that region, since they are geographically separated and use independent infrastructure including power and networking.
However, hardware failures are only one possible failure mode that high-reliability systems need to be designed for. As hardware reliability continues to improve due to new technologies such as solid-state storage, hardware failures become increasingly rare. But at the same time, the complexity of the distributed software systems running on this hardware means that the software still has lots of failure modes, as do the humans who run the systems.
Don’t be fooled into thinking that a “multi-AZ” architecture on AWS automatically equates to a “high availability” architecture.
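To make this concrete, here is a minimal sketch using boto3, with hypothetical resource names, showing how easy a “multi-AZ” deployment is to express, and how firmly it remains pinned to a single region:

```python
# Sketch only: an Auto Scaling group spread across three Availability Zones.
# The group and launch configuration names are hypothetical, and the launch
# configuration is assumed to exist already.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-launch-config",
    MinSize=3,
    MaxSize=6,
    # Three AZs, but all of them live inside us-east-1, so a region-wide,
    # software-level failure (as in the S3 outage) affects every one of them.
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)
```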
3. Region-Wide Failures are Possible
In the recent outage, there wasn’t a failure of an AZ at all. The failure was entirely caused by non-hardware issues, and propagated across the entire North Virginia region. There is little to no isolation of services within a region at a logical (software) level, which means that a common-mode failure within a region, due to non-hardware problems, is very possible.
Wherever possible, use a multi-region architecture on AWS, not just a multi-AZ architecture.
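One simple pattern, sketched below with boto3 and hypothetical bucket names, is to serve reads from a primary region and fall back to a replicated bucket in a second region when the primary is unavailable:

```python
# Sketch only: read from a primary-region bucket, falling back to a replica
# in a second region if the primary region is down. Bucket names and regions
# are hypothetical, and the replica is assumed to be kept up to date by
# cross-region replication (see section 4 below).
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = [
    {"region": "us-east-1", "bucket": "assets-us-east-1"},   # primary
    {"region": "eu-west-1", "bucket": "assets-eu-west-1"},   # fallback replica
]

def fetch_object(key):
    for target in REGIONS:
        s3 = boto3.client("s3", region_name=target["region"])
        try:
            return s3.get_object(Bucket=target["bucket"], Key=key)["Body"].read()
        except (ClientError, EndpointConnectionError):
            continue  # this region is unreachable or erroring, try the next one
    raise RuntimeError("object unavailable in all configured regions")
```

The same idea can also be applied at the DNS level, using health checks to fail traffic over between regional endpoints.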
4. Most AWS Services are Tightly Coupled to a Region
Whenever you create an S3 bucket, you have to choose a region where you want that bucket to live. The same is true of Amazon’s DynamoDB (NoSQL) service – you have to pick a region before you create your DynamoDB infrastructure. Most AWS services are architected in this way, forcing you to “put all your eggs in one basket”, where the basket is an AWS region.
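As a small illustration, the boto3 sketch below (with hypothetical bucket and table names) shows that the region has to be chosen up front for both services:

```python
# Sketch only: both S3 buckets and DynamoDB tables are created in, and tied
# to, a single region. Names below are hypothetical.
import boto3

# The bucket lives in eu-west-1 and nowhere else.
s3 = boto3.client("s3", region_name="eu-west-1")
s3.create_bucket(
    Bucket="example-bucket",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# The table likewise exists only in the region of the client that created it.
dynamodb = boto3.client("dynamodb", region_name="eu-west-1")
dynamodb.create_table(
    TableName="example-table",
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```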
So AWS infrastructure can be highly distributed within a region, but it’s not so easy to produce truly globally distributed infrastructure. For example, S3 and DynamoDB both support “cross-region replication”, but you have to design, implement and manage this yourself. You can’t just say “give me an S3 bucket and a DynamoDB database that is anywhere in the world that it needs to be”.
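For example, enabling S3 cross-region replication means versioning both buckets, providing an IAM role, and defining the replication rules yourself. A rough sketch, with hypothetical names and ARNs:

```python
# Sketch only: S3 cross-region replication has to be configured explicitly.
# Bucket names, region choices and the IAM role ARN below are hypothetical,
# and both buckets must already exist with versioning enabled.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Versioning is a prerequisite for replication on both source and destination.
s3.put_bucket_versioning(
    Bucket="assets-us-east-1",
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_replication(
    Bucket="assets-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # hypothetical
        "Rules": [
            {
                "ID": "replicate-to-eu-west-1",
                "Prefix": "",              # replicate every object
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::assets-eu-west-1"},
            }
        ],
    },
)
```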
By comparison, Google Cloud Storage (equivalent to AWS S3) does provide the concept of “Multi-Regional” buckets. So with no extra effort you can create a storage location that spans multiple regions, although the regions will always be on the same continent. Google Cloud Datastore (equivalent to AWS DynamoDB) also provides cross-region data storage, without the need to manually configure replication.
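As a rough illustration (the bucket name is hypothetical, and the exact API may vary between client-library versions), creating a multi-regional GCS bucket is little more than a location choice:

```python
# Sketch only: with the google-cloud-storage client library, a multi-regional
# bucket is just a location choice at creation time, with no replication to
# manage. The bucket name is hypothetical.
from google.cloud import storage

client = storage.Client()

# "US" is a multi-region location: the bucket's data is stored redundantly
# across several regions on that continent.
bucket = client.create_bucket("example-multi-regional-bucket", location="US")
print(bucket.location)  # "US"
```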
Fully managed or “serverless” infrastructure still requires considerable effort to implement in a multi-region architecture.
5. Design for Common Mode Failures
Adding redundancy adds little value, and only a minimal increase in reliability, if the redundant instances you have created are all subject to the same failure modes and those failures can occur simultaneously. Common mode failures are typically either software-level faults that can propagate across all of your infrastructure, or human-error events where an honest mistake can create a failure across all of your redundant and otherwise very reliable infrastructure.
Think beyond hardware when identifying common mode failures. Consider software faults and human error.
6. Multi-Cloud Architectures for Business Critical Systems
There is no question that AWS and GCP are inherently reliable, high-availability platforms. The vast majority of organisations would not be able to beat them in this area. But as the recent S3 outage showed, they are not perfect. Think carefully before going all-in on a single cloud platform.
Where possible, architect your software in a “cloud-neutral” way, or at least with an appropriate abstraction layer, so that you can move to a different cloud platform later if you need to, or, even better, run your infrastructure across both AWS and GCP all of the time.
This can be tricky to achieve in practice, especially in the area of data storage when a single-source-of-truth is required. But it is very achievable in other areas such as commodity compute resources, or read-only content delivery to the web.
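As a sketch of the abstraction-layer approach (the names and method choices are ours, not a standard API), application code can be written against a minimal storage interface, with the AWS and GCP implementations chosen at deployment time:

```python
# Sketch of the abstraction-layer idea: application code depends on a small
# storage interface, and the concrete AWS or GCP implementation is chosen at
# deployment time. Class and bucket names are illustrative only.
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Minimal cloud-neutral interface for object storage."""

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

class S3BlobStore(BlobStore):
    def __init__(self, bucket: str):
        import boto3
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def get(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

class GCSBlobStore(BlobStore):
    def __init__(self, bucket: str):
        from google.cloud import storage
        self._bucket = storage.Client().bucket(bucket)

    def get(self, key: str) -> bytes:
        return self._bucket.blob(key).download_as_bytes()

    def put(self, key: str, data: bytes) -> None:
        self._bucket.blob(key).upload_from_string(data)
```

Application code then works against BlobStore, so switching provider, or running against both providers at once, becomes a configuration decision rather than a rewrite.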
Take advantage of the multitude of mature cloud platforms that are now available. Build systems to work on two or more of them.
7. The Cloud is Beyond your Control
There are numerous benefits to cloud-based infrastructure, and software-defined infrastructure in particular, that we don’t need to cover here. Thanks to AWS and Google Cloud Platform, Priocept is able to deliver projects for its clients faster, better and cheaper than it could before, and has more fun in the process.
But the fact is that you are relinquishing control of your infrastructure to AWS or Google or Microsoft, and they are not infallible. The centralised, shared nature of cloud infrastructure also means that the provider must carry out tasks that are essential for supporting infrastructure or other customers you don’t care about, but which introduce risks for your own systems. If you have a very stable, well-proven system which “just works” and never changes, and for which you have invested years of effort making it rock-solid, perhaps you should keep it that way and maintain control, even if it means swimming against the tide of “move everything to the cloud”.
If you decide to relinquish control of your business operation to AWS or GCP, maybe have a Plan B ready.
The Mysterious “Established Playbook”
The S3 incident report includes the following:
At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
We are intrigued by what “established playbook” refers to exactly. “Playbook” is not official AWS product terminology, although in the general sense it implies some kind of document that describes a set of procedures to be followed manually, which could be the root of the issue. “Playbook” could also refer to Ansible if it relates to an automated process, but we would have expected Amazon to have been using their own technology (CloudFormation?) to manage the servers (presumably EC2 instances?) that they refer to. The playbook also appears to have lacked any kind of protection against inappropriate input parameters. If you have any insight into how AWS operate their infrastructure in this area, please let us know.
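In the meantime, and purely as a hypothetical sketch (we have no knowledge of Amazon’s internal tooling), the kind of protection we have in mind is not complicated: a capacity-removal command can simply refuse to act on an input that would take out more than a small fraction of the fleet.

```python
# Hypothetical sketch only - we have no knowledge of Amazon's internal tooling.
# It illustrates the kind of guard rail that limits the blast radius of a
# mistyped input to a capacity-removal command.
import sys

MAX_REMOVAL_FRACTION = 0.05   # never remove more than 5% of a fleet in one step

def remove_servers(requested: int, fleet_size: int) -> None:
    if requested <= 0:
        sys.exit("nothing to remove")
    if requested / fleet_size > MAX_REMOVAL_FRACTION:
        # A fat-fingered input (e.g. an extra digit) is refused rather than
        # silently taking out a large chunk of the subsystem.
        sys.exit(
            f"refusing to remove {requested} of {fleet_size} servers: "
            f"exceeds the {MAX_REMOVAL_FRACTION:.0%} safety limit"
        )
    print(f"removing {requested} of {fleet_size} servers...")
    # ... actual decommissioning logic would go here ...
```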