Amazon outage and the auto-immune vulnerabilities of resiliency

Today is Judgment Day, when Skynet becomes self-aware. It is, apparently, also a very, very bad day for Amazon Web Services.

Lots of people have raised questions about what Amazon’s difficulties today mean for the future of cloud IaaS. My belief is that this doesn’t do anything to the adoption curve — but I do believe that customers who rely upon Amazon to run their businesses will, and should, think hard about the resiliency of their architectures.

It’s important to understand what did and did not happen today. There’s been a popular impression that “EC2 is down”. It’s not. To understand what happened, though, some explanation of Amazon’s infrastructure is necessary.

Amazon divides its infrastructure into “regions”. You can think of a region as basically analogous to “a data center”. For instance, US-East-1 is Amazon’s Northern Virginia data center, while US-West-1 is Amazon’s Silicon Valley data center. Each region, in turn, is divided into multiple “availability zones” (AZs). You can think of an AZ as basically analogous to “a cluster” — it’s a grouping of physical and logical resources. Each AZ is designated by a letter — for instance, US-East-1a, US-East-1b, etc. However, each of these designations is customer-specific (which is why Amazon’s status information cannot easily specify which AZ is affected by a problem).
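To illustrate why AZ letters don’t line up across accounts, here’s a purely hypothetical sketch in Python — the account IDs, zone names, and hash-based shuffle are all invented stand-ins; Amazon’s actual mapping mechanism is opaque to customers:

```python
import hashlib
import random

# Hypothetical: four physical zones behind a region. AWS maps the letters
# "a", "b", "c"... onto them differently for each account, so "us-east-1a"
# for one customer may be a different physical zone than for another.
PHYSICAL_ZONES = ["zone-1", "zone-2", "zone-3", "zone-4"]

def az_mapping(account_id: str, region: str = "us-east-1") -> dict:
    """Return an invented per-account mapping of AZ names to physical zones."""
    seed = int(hashlib.sha256(f"{account_id}:{region}".encode()).hexdigest(), 16)
    zones = PHYSICAL_ZONES[:]
    random.Random(seed).shuffle(zones)  # deterministic per account
    return {f"{region}{letter}": zone for letter, zone in zip("abcd", zones)}

# Two accounts see the same AZ *names*, but not necessarily the same
# physical zones behind them.
alice = az_mapping("111111111111")
bob = az_mapping("222222222222")
print(alice["us-east-1a"], bob["us-east-1a"])
```

The upshot: when Amazon’s dashboard says “one availability zone” is impacted, there’s no single letter it can name that means the same thing to everyone.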

Amazon’s virtual machine offering is the Elastic Compute Cloud (EC2). When you provision an EC2 “instance” (Amazon’s term for a VM), you also get an allocation of “instance storage”. Instance storage is transient — it exists only as long as the VM exists. Consequently, it’s not useful for storing anything that you actually want to keep. To get persistent storage, you use Amazon’s Elastic Block Store (EBS), which is basically just network-attached storage. Many people run databases on EC2 that are backed by EBS, for instance. Because that’s such a common use case, Amazon offers the Relational Database Service (RDS), which is basically an EC2 instance running MySQL.
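To make the instance-storage-versus-EBS distinction concrete, here’s a toy Python model — the classes and fields are illustrative, not Amazon’s API:

```python
class EBSVolume:
    """Network-attached storage; exists independently of any instance."""
    def __init__(self, az: str):
        self.az = az
        self.data = {}

class EC2Instance:
    """A VM with transient instance storage."""
    def __init__(self, az: str):
        self.az = az
        self.instance_store = {}  # destroyed with the instance
        self.attached = None

    def attach(self, volume: EBSVolume):
        self.attached = volume

    def terminate(self):
        self.instance_store = None  # instance storage is gone for good
        self.attached = None        # the EBS volume itself lives on

vol = EBSVolume("us-east-1a")
vm = EC2Instance("us-east-1a")
vm.attach(vol)
vm.instance_store["tmp"] = "scratch data"   # lost at termination
vol.data["orders"] = ["#1001", "#1002"]     # survives termination
vm.terminate()
print(vol.data)
```

This is why production databases go on EBS rather than instance storage — and why an EBS outage hurts even when EC2 itself is healthy.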

Amazon’s issues today are with EBS, and with RDS, both in the US-East-1 region. (My guess is that the issues are related, but Amazon has not specifically stated that they are.) Customers who aren’t in the US-East-1 region aren’t affected (customers always choose which region and specific AZs they run in). Customers who don’t use EBS or RDS are also unaffected. However, use of EBS is highly commonplace, and likely just about everyone using EC2 for a production application or Web site is reliant upon EBS. Consequently, even though EC2 itself has been running just fine, the issues have nevertheless had a major impact on customers. If you’re storing your data on EBS, the issues with EBS have made your data inaccessible, or they’ve made access to that data slow and unreliable. Ditto with RDS. Obviously, if you can’t get to your data, you’re not going to be doing much of anything.

In order to get Amazon’s SLA for EC2, you, as a customer, have to run your application in multiple AZs within the same region. Running in multiple AZs is supposed to isolate you from the failure of any single AZ. In practice, of course, this only provides you so much protection — since the AZs are typically all in the same physical data center, anything that affects that whole data center would probably affect all the AZs. Similarly, the AZs are not totally isolated from one another, either physically or logically.

However, when you create an EBS volume, you place it in a specific availability zone, and you can only attach that EBS volume to EC2 instances within that same availability zone. That complicates resiliency, since if you wanted to fail over into another AZ, you’d still need access to your data. That means if you’re going to run in multiple AZs, you have to replicate your data across multiple AZs.
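A sketch of that constraint, with hypothetical helpers — real attachment goes through Amazon’s API, and real cross-AZ replication would use something like DRBD or application-level mirroring:

```python
class AttachError(Exception):
    pass

def attach(volume_az: str, instance_az: str) -> None:
    # Illustrative rule: an EBS volume can only be attached to an
    # instance in the same availability zone.
    if volume_az != instance_az:
        raise AttachError(f"volume in {volume_az} cannot attach in {instance_az}")

def replicate(primary: dict, replica: dict) -> None:
    """Application-level replication: copy data to a volume in another AZ."""
    replica.clear()
    replica.update(primary)

primary_vol = {"orders": 42}   # volume living in us-east-1a
replica_vol = {}               # volume living in us-east-1b
replicate(primary_vol, replica_vol)

# A failover instance in us-east-1b must use the replica, not the primary:
try:
    attach("us-east-1a", "us-east-1b")
except AttachError:
    attach("us-east-1b", "us-east-1b")  # the replica attaches fine
```
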

One of the ways you can achieve this is with the Multi-AZ option of RDS. If you’re running a MySQL database and can do so within the constraints of RDS, the multi-AZ option lets you gain the necessary resiliency for your database without having to replicate EBS volumes between AZs.
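The idea behind Multi-AZ, reduced to a toy Python class — this is my sketch of synchronous replication plus failover, not RDS’s actual implementation:

```python
class MultiAZDatabase:
    """Sketch: every write goes synchronously to a standby in another AZ,
    so a failover loses no committed data."""
    def __init__(self, primary_az: str, standby_az: str):
        self.primary_az, self.standby_az = primary_az, standby_az
        self.primary, self.standby = {}, {}

    def write(self, key, value):
        self.primary[key] = value
        self.standby[key] = value  # synchronous replication to the standby

    def fail_over(self):
        # Promote the standby; it already holds every committed write.
        self.primary_az, self.standby_az = self.standby_az, self.primary_az
        self.primary, self.standby = self.standby, self.primary

db = MultiAZDatabase("us-east-1a", "us-east-1b")
db.write("user:1", "alice")
db.fail_over()
print(db.primary_az, db.primary["user:1"])
```

The catch, as today demonstrated, is that both halves of that pair still live in the same region.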

As one final caveat, data transfer within a region is free and fast — it’s basically over a local LAN, after all. By contrast, Amazon charges you for transfers between regions, which go over the Internet with the attendant cost and latency.

Consequently, there are lots of Amazon customers who are running in just a single region. A lot of those customers may be running in just a single AZ (because they didn’t architect their app to easily run in multiple AZs). And of the ones who are running in multiple AZs, a fair number are reliant upon the multi-AZ functionality of RDS.

That’s why today’s impact was particularly severe. US-East-1 is Amazon’s most popular region. The EBS and RDS problems affected the entire region, not just a single AZ (and multi-AZ RDS was particularly hard-hit), so if you were multi-AZ but not multi-region, the resiliency you were theoretically getting was of no help to you. Today, people learned that it’s not necessarily adequate to run in multiple AZs. (Justin Santa Barbara has a good post about this.)

My perspective on this is pretty much exactly what I would tell a traditional Web hosting customer who’s running only in one data center: If you want more resiliency, you need to run in more than one data center. And on Amazon, if you want more resiliency, you need to not only be multi-AZ but also multi-region.
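At its simplest, multi-region failover is health-checked traffic steering — something like this hypothetical selector (in practice you’d do it with DNS and health checks, and you’d need your full stack and data already running in the second region):

```python
def pick_region(health: dict, preferred: str = "us-east-1") -> str:
    """Steer traffic to the preferred region when it's healthy,
    otherwise to any healthy region."""
    if health.get(preferred):
        return preferred
    for region, ok in health.items():
        if ok:
            return region
    raise RuntimeError("no healthy region")

print(pick_region({"us-east-1": False, "us-west-1": True}))  # us-west-1
```

The routing is the easy part; keeping the second region’s data warm enough to serve traffic is where the cost and complexity live.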

Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means you should expect about 4.4 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was EBS and RDS which weren’t, and neither of those services has an SLA.
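The arithmetic behind that figure, if you want to plug in other SLA levels:

```python
def allowed_downtime_hours(sla: float, hours_per_year: float = 8766.0) -> float:
    """Hours of downtime per year permitted by an availability SLA.
    8766 is the average hours in a year, including leap years."""
    return (1.0 - sla) * hours_per_year

print(round(allowed_downtime_hours(0.9995), 2))  # ≈ 4.38 hours
```
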

So how did Amazon end up with a problem that affected all the AZs within the US-East-1 region? Well, according to their status dashboard, they had some sort of network problem last night in their east coast data center. That problem resulted in their automated resiliency mechanisms attempting to re-mirror a large number of EBS volumes. This impacted one of the AZs, but it also overloaded the control infrastructure for EBS in that region. My guess is that RDS also uses this same storage infrastructure, so the capacity shortages and whatnot created by all of this activity ended up also impacting RDS.
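A toy model of that failure mode — the volume counts and capacity limit are invented, but it shows how a region-wide storm of individually legitimate requests can swamp a shared control plane:

```python
def remirror_storm(volumes: int, control_plane_capacity: int) -> dict:
    """Toy model: after a network event, every affected volume asks the
    regional control plane for space to re-mirror. Requests beyond
    capacity pile up as backlog, stalling everyone's API calls."""
    served = min(volumes, control_plane_capacity)
    backlog = volumes - served
    return {"served": served, "backlog": backlog, "degraded": backlog > 0}

# A normal day: a handful of re-mirrors, no backlog.
print(remirror_storm(volumes=50, control_plane_capacity=1000))
# The outage scenario: a region's worth of volumes, all at once.
print(remirror_storm(volumes=100_000, control_plane_capacity=1000))
```

Each re-mirror request is exactly what the system is supposed to do — the pathology is in the aggregate.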

My colleague Jay Heiser, who follows, among other things, risk management, calls this “auto-immune disease” — i.e., resiliency mechanisms can sometimes end up causing you harm. (We’ve seen auto-immune problems happen before in a prior Amazon S3 outage, as well as a Google Gmail outage.) The way to limit auto-immune damage is isolation — ensuring limits to the propagation.
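One classic isolation mechanism is the bulkhead: give each zone (or tenant, or client) its own fixed budget of shared capacity, so an overload in one can’t starve the rest. A minimal sketch, with invented limits:

```python
class Bulkhead:
    """Per-zone request budget; requests beyond the limit are shed
    rather than allowed to consume shared capacity."""
    def __init__(self, limit: int):
        self.limit = limit
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.limit:
            return False  # shed load instead of propagating the overload
        self.in_flight += 1
        return True

zones = {z: Bulkhead(limit=100) for z in ("1a", "1b", "1c")}

# Zone 1a melts down and floods in 10,000 requests...
accepted = sum(zones["1a"].try_acquire() for _ in range(10_000))

# ...but 1b and 1c still have their full budgets untouched.
print(accepted, zones["1b"].in_flight)
```

Shedding load in the sick zone is painful, but it keeps the sickness from becoming regional.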

Will some Amazon customers pack up and leave? Will some of them swear off the cloud? Probably. But realistically, we’re talking about data centers, and infrastructure, here. They can and do fail. If your app can never, ever go down, you have to architect it for continuous availability across multiple data centers. Whether you’re running your own data center, running in managed hosting, or running in the cloud, you’re going to face this issue. (Your problems might be different — i.e., your own little data center isn’t going to have the kind of complex problem that Amazon experienced today — but you’re still going to have downtime-causing issues.)

There are a lot of moving parts in cloud IaaS. Any one of them going wrong can bork your entire site/application. Your real problem is appropriate risk mitigation — the risk of downtime and its attendant losses, versus the complications and technical challenges and costs created by infrastructure redundancy.

Posted on April 21, 2011, in Infrastructure. 23 Comments.

  1. Jonathan Tersen

    “data transfer within a region is free and fast — it’s basically over a local LAN, after all.”

    Are you certain of that? Because Amazon’s own page would seem to indicate it is charged:

    “Data transferred between Amazon EC2 instances located in different Availability Zones in the same Region will be charged Regional Data Transfer.”

    …which is $0.01 per GB. Transfer between different AWS *services* within a region is free. But EC2 to EC2 across AZ’s, I believe actually costs their customers for the transit.

  2. Yes, strictly speaking you’re correct — transfer is free between services, and free between instances in the same AZ with private IPs, but there’s a regional transfer charge otherwise.

  3. Lydia,

    The Amazon outage also highlights well known risks regarding the management of complex environments. The majority of outages are caused by maintenance procedures. The most common issue with these procedures involves human error. Firms that employ mature change management practices can limit the likelihood and impact of these errors. I covered this topic in a blog post – http://www.dangreller.com/archives/370

    Dan Greller
    Invisible Laws Blog

