Amazon outage and the auto-immune vulnerabilities of resiliency
Today is Judgment Day, when Skynet becomes self-aware. It is, apparently, also a very, very bad day for Amazon Web Services.
Lots of people have raised questions today about what Amazon’s difficulties today mean for the future of cloud IaaS. My belief is that this doesn’t do anything to the adoption curve — but I do believe that customers who rely upon Amazon to run their businesses will, and should, think hard about the resiliency of their architectures.
It’s important to understand what did and did not happen today. There’s been a popular impression that “EC2 is down”. It’s not. To understand what happened, though, some explanation of Amazon’s infrastructure is necessary.
Amazon divides its infrastructure into “regions”. You can think of a region as basically analogous to “a data center”. For instance, US-East-1 is Amazon’s Northern Virginia data center, while US-West-1 is Amazon’s Silicon Valley data center. Each region, in turn, is divided into multiple “availability zones” (AZs). You can think of an AZ is basically analogous to “a cluster” — it’s a grouping of physical and logical resources. Each AZ is designated by letters — for instance, US-East-1a, US-East-1b, etc. However, each of these designations are customer-specific (which is why Amazon’s status information cannot easily specify which AZ is affected by a problem).
Amazon’s virtual machine offering is the Elastic Compute Cloud (EC2). When you provision an EC2 “instance” (Amazon’s term for a VM), you also get an allocation of “instance storage”. Instance storage is transient — it exists only as long as the VM exists. Consequently, it’s not useful for storing anything that you actually want to keep. To get persistent storage, you use Amazon’s Elastic Block Store (EBS), which is basically just network-attached storage. Many people run databases on EC2 that are backed by EBS, for instance. Because that’s such a common use case, Amazon offers the Relational Database Service (RDS), which is basically an EC2 instance running MySQL.
Amazon’s issues today are with EBS, and with RDS, both in the US-East-1 region. (My guess is that the issues are related, but Amazon has not specifically stated that they are.) Customers who aren’t in the US-East-1 region aren’t affected (customers always choose which region and specific AZs they run in). Customers who don’t use EBS or RDS are also unaffected. However, use of EBS is highly commonplace, and likely just about everyone using EC2 for a production application or Web site is reliant upon EBS. Consequently, even though EC2 itself has been running just fine, the issues have nevertheless had a major impact on customers. If you’re storing your data on EBS, the issues with EBS have made your data inaccessible, or they’ve made access to that data slow and unreliable. Ditto with RDS. Obviously, if you can’t get to your data, you’re not going to be doing much of anything.
In order to get Amazon’s SLA for EC2, you, as a customer, have to run your application in multiple AZs within the same region. Running in multiple AZs is supposed to isolate you from the failure of any single AZ. In practice, of course, this only provides you so much protection — since the AZs are typically all in the same physical data center, anything that affects that whole data center would probably affect all the AZs. Similarly, the AZs are not totally isolated from one another, either physically or logically.
However, when you create an EBS volume, you place it in a specific availability zone, and you can only attach that EBS volume to EC2 instances within that same availability zone. That complicates resiliency, since if you wanted to fail over into another AZ, you’d still need access to your data. That means if you’re going to run in multiple AZs, you have to replicate your data across multiple AZs.
One of the ways you can achieve this is with the Multi-AZ option of RDS. If you’re running a MySQL database and can do so within the constraints of RDS, the multi-AZ option lets you gain the necessary resiliency for your database without having to replicate EBS volumes between AZs.
As one final caveat, data transfer within a region is free and fast — it’s basically over a local LAN, after all. By contrast, Amazon charges you for transfers between regions, which goes over the Internet and has the attendant cost and latency.
Consequently, there are lots of Amazon customers who are running in just a single region. A lot of those customers may be running in just a single AZ (because they didn’t architect their app to easily run in multiple AZs). And of the ones who are running in multiple AZs, a fair number are reliant upon the multi-AZ functionality of RDS.
That’s why today’s impacts were particularly severe. US-East-1 is Amazon’s most popular region. The problems with EBS impacted the entire region, as did the RDS problems (and multi-AZ RDS was particularly impacted), not just a single AZ, so if you were multiple-AZ but not multi-region, the resiliency you were theoretically getting was of no help to you. Today, people learned that it’s not necessarily adequate to run in multiple AZs. (Justin Santa Barbara has a good post about this.)
My perspective on this is pretty much exactly what I would tell a traditional Web hosting customer who’s running only in one data center: If you want more resiliency, you need to run in more than one data center. And on Amazon, if you want more resiliency, you need to not only be multi-AZ but also multi-region.
Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means that you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was EBS and RDS which weren’t, and neither of those services have SLAs.
So how did Amazon end up with a problem that affected all the AZs within the US-East-1 region? Well, according to their status dashboard, they had some sort of network problem last night in their east coast data center. That problem resulted in their automated resiliency mechanisms attempting to re-mirror a large number of EBS volumes. This impacted one of the AZs, but it also overloaded the control infrastructure for EBS in that region. My guess is that RDS also uses this same storage infrastructure, so the capacity shortages and whatnot created by all of this activity ended up also impacting RDS.
My colleague Jay Heiser, who follows, among other things, risk management, calls this “auto-immune disease” — i.e., resiliency mechanisms can sometimes end up causing you harm. (We’ve seen auto-immune problems happen before in a prior Amazon S3 outage, as well as a Google Gmail outage.) The way to limit auto-immune damage is isolation — ensuring limits to the propagation.
Will some Amazon customers pack up and leave? Will some of them swear off the cloud? Probably. But realistically, we’re talking about data centers, and infrastructure, here. They can and do fail. You have to architect your app to have continuous availability across multiple data centers, if it can never ever go down. Whether you’re running your own data center, running in managed hosting, or running in the cloud, you’re going to face this issue. (Your problems might be different — i.e., your own little data center isn’t going to have the kind of complex problem that Amazon experienced today — but you’re still going to have downtime-causing issues.)
There are a lot of moving parts in cloud IaaS. Any one of them going wrong can bork your entire site/application. Your real problem is appropriate risk mitigation — the risk of downtime and its attendant losses, versus the complications and technical challenges and costs created by infrastructure redundancy.