Amazon outage and the auto-immune vulnerabilities of resiliency

Today is Judgment Day, when Skynet becomes self-aware. It is, apparently, also a very, very bad day for Amazon Web Services.

Lots of people have raised questions about what Amazon’s difficulties today mean for the future of cloud IaaS. My belief is that this doesn’t do anything to the adoption curve, but I do believe that customers who rely upon Amazon to run their businesses will, and should, think hard about the resiliency of their architectures.

It’s important to understand what did and did not happen today. There’s been a popular impression that “EC2 is down”. It’s not. To understand what happened, though, some explanation of Amazon’s infrastructure is necessary.

Amazon divides its infrastructure into “regions”. You can think of a region as basically analogous to “a data center”. For instance, US-East-1 is Amazon’s Northern Virginia data center, while US-West-1 is Amazon’s Silicon Valley data center. Each region, in turn, is divided into multiple “availability zones” (AZs). You can think of an AZ as basically analogous to “a cluster”: it’s a grouping of physical and logical resources. Each AZ is designated by a letter suffix, for instance US-East-1a, US-East-1b, and so on. However, each of these designations is customer-specific (which is why Amazon’s status information cannot easily specify which AZ is affected by a problem).
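
For a concrete sense of how regions and AZs surface programmatically, here is a minimal sketch using Amazon’s Python SDK (boto3, which post-dates this post); the region name is just an example, and the AZ list you get back is specific to your account.

```python
import boto3

# Each API client is scoped to a single region; US-East-1 is Northern Virginia.
ec2 = boto3.client("ec2", region_name="us-east-1")

# List the availability zones as *this account* sees them. The letter
# suffixes (us-east-1a, us-east-1b, ...) are mapped per customer, which is
# why "us-east-1a" isn't the same physical AZ for everyone.
for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
    print(zone["ZoneName"], zone["State"])
```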

Amazon’s virtual machine offering is the Elastic Compute Cloud (EC2). When you provision an EC2 “instance” (Amazon’s term for a VM), you also get an allocation of “instance storage”. Instance storage is transient — it exists only as long as the VM exists. Consequently, it’s not useful for storing anything that you actually want to keep. To get persistent storage, you use Amazon’s Elastic Block Store (EBS), which is basically just network-attached storage. Many people run databases on EC2 that are backed by EBS, for instance. Because that’s such a common use case, Amazon offers the Relational Database Service (RDS), which is basically an EC2 instance running MySQL.
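
To illustrate the instance-storage-versus-EBS distinction, here is a rough sketch (boto3 again) that creates a persistent EBS volume and attaches it to an instance; the instance ID, size, and AZ are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Instance storage vanishes when the instance does; an EBS volume persists
# independently of any instance. Note that the volume is created in a
# specific availability zone.
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach it to an instance that lives in that same AZ (placeholder ID).
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
```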

Amazon’s issues today are with EBS, and with RDS, both in the US-East-1 region. (My guess is that the issues are related, but Amazon has not specifically stated that they are.) Customers who aren’t in the US-East-1 region aren’t affected (customers always choose which region and specific AZs they run in). Customers who don’t use EBS or RDS are also unaffected. However, use of EBS is highly commonplace, and likely just about everyone using EC2 for a production application or Web site is reliant upon EBS. Consequently, even though EC2 itself has been running just fine, the issues have nevertheless had a major impact on customers. If you’re storing your data on EBS, the issues with EBS have made your data inaccessible, or they’ve made access to that data slow and unreliable. Ditto with RDS. Obviously, if you can’t get to your data, you’re not going to be doing much of anything.

In order to get Amazon’s SLA for EC2, you, as a customer, have to run your application in multiple AZs within the same region. Running in multiple AZs is supposed to isolate you from the failure of any single AZ. In practice, of course, this only provides you so much protection — since the AZs are typically all in the same physical data center, anything that affects that whole data center would probably affect all the AZs. Similarly, the AZs are not totally isolated from one another, either physically or logically.
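
In practice, “running in multiple AZs” means deliberately spreading identical instances across zones. A minimal sketch, with a placeholder AMI and an old-style instance type, might look like this:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one identical instance in each of two AZs; in a real deployment
# you'd put a load balancer and health checks in front of these.
for az in ("us-east-1a", "us-east-1b"):
    ec2.run_instances(
        ImageId="ami-xxxxxxxx",          # placeholder AMI
        InstanceType="m1.small",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": az},
    )
```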

However, when you create an EBS volume, you place it in a specific availability zone, and you can only attach that EBS volume to EC2 instances within that same availability zone. That complicates resiliency, since if you wanted to fail over into another AZ, you’d still need access to your data. That means if you’re going to run in multiple AZs, you have to replicate your data across multiple AZs.
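
One way to stage data into a second AZ is snapshot-and-restore: EBS snapshots are regional, so a snapshot taken of a volume in one AZ can be materialized as a new volume in another. This is a point-in-time copy rather than continuous replication (which you would typically do at the database or application layer); the volume ID below is a placeholder.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Snapshot a volume that lives in us-east-1a; snapshots exist at the
# region level, not the AZ level.
snap = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",     # placeholder volume ID
    Description="cross-AZ staging copy",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Materialize the snapshot as a fresh volume in a different AZ.
ec2.create_volume(SnapshotId=snap["SnapshotId"], AvailabilityZone="us-east-1b")
```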

One of the ways you can achieve this is with the Multi-AZ option of RDS. If you’re running a MySQL database and can do so within the constraints of RDS, the multi-AZ option lets you gain the necessary resiliency for your database without having to replicate EBS volumes between AZs.
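
As a sketch of what that looks like, the Multi-AZ option is a single flag at database-creation time (boto3 again; the identifier, size, and credentials are placeholders):

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# MultiAZ=True asks RDS to maintain a synchronous standby in another AZ
# and fail over to it automatically; you never manage the standby yourself.
rds.create_db_instance(
    DBInstanceIdentifier="example-db",    # placeholder name
    Engine="mysql",
    DBInstanceClass="db.m1.small",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
    MultiAZ=True,
)
```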

As one final caveat, data transfer within a region is free and fast — it’s basically over a local LAN, after all. By contrast, Amazon charges you for transfers between regions, which goes over the Internet and has the attendant cost and latency.

Consequently, there are lots of Amazon customers who are running in just a single region. A lot of those customers may be running in just a single AZ (because they didn’t architect their app to easily run in multiple AZs). And of the ones who are running in multiple AZs, a fair number are reliant upon the multi-AZ functionality of RDS.

That’s why today’s impact was particularly severe. US-East-1 is Amazon’s most popular region, and the problems with EBS and RDS affected the entire region, not just a single AZ (multi-AZ RDS was hit particularly hard). So if you were multi-AZ but not multi-region, the resiliency you were theoretically getting was of no help to you. Today, people learned that running in multiple AZs is not necessarily adequate. (Justin Santa Barbara has a good post about this.)

My perspective on this is pretty much exactly what I would tell a traditional Web hosting customer who’s running only in one data center: If you want more resiliency, you need to run in more than one data center. And on Amazon, if you want more resiliency, you need to not only be multi-AZ but also multi-region.

Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means you should expect that you can have roughly 4.4 hours of total region downtime each year without Amazon violating its SLA. Note, by the way, that this outage does not actually violate that SLA. The SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. By that definition, EC2 was just fine in this case. It was EBS and RDS which weren’t, and neither of those services has an SLA.
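
The arithmetic behind that figure is simple; 0.05% of a year works out to a bit under four and a half hours:

```python
# Back-of-the-envelope downtime budget implied by a 99.95% SLA.
sla = 0.9995
hours_per_year = 365 * 24                      # 8,760 hours
allowed_downtime = (1 - sla) * hours_per_year
print(f"{allowed_downtime:.2f} hours/year")    # ~4.38 hours
```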

So how did Amazon end up with a problem that affected all the AZs within the US-East-1 region? Well, according to their status dashboard, they had some sort of network problem last night in their east coast data center. That problem resulted in their automated resiliency mechanisms attempting to re-mirror a large number of EBS volumes. This impacted one of the AZs, but it also overloaded the control infrastructure for EBS in that region. My guess is that RDS also uses this same storage infrastructure, so the capacity shortages and whatnot created by all of this activity ended up also impacting RDS.

My colleague Jay Heiser, who follows, among other things, risk management, calls this “auto-immune disease”: resiliency mechanisms can sometimes end up causing you harm. (We’ve seen auto-immune problems happen before in a prior Amazon S3 outage, as well as a Google Gmail outage.) The way to limit auto-immune damage is isolation, ensuring that there are limits on how far the reaction can propagate.
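
Purely as an illustration of that principle (and emphatically not a description of how EBS actually works), an automated recovery system can bound its own blast radius by capping how much recovery work it will attempt concurrently; the names below are hypothetical.

```python
import threading

class RecoveryThrottle:
    """Illustrative cap on concurrent automated recovery actions, so that a
    mass failure can't stampede the control plane everyone depends on."""

    def __init__(self, max_concurrent: int = 10):
        self._slots = threading.Semaphore(max_concurrent)

    def try_remirror(self, volume_id: str, remirror_fn) -> bool:
        # Non-blocking acquire: if the recovery budget is exhausted, defer
        # this volume rather than piling more load onto shared infrastructure.
        if not self._slots.acquire(blocking=False):
            return False   # caller should queue it and retry with backoff
        try:
            remirror_fn(volume_id)
            return True
        finally:
            self._slots.release()
```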

Will some Amazon customers pack up and leave? Will some of them swear off the cloud? Probably. But realistically, we’re talking about data centers and infrastructure here. They can and do fail. If your app can never, ever go down, you have to architect it for continuous availability across multiple data centers. Whether you’re running your own data center, running in managed hosting, or running in the cloud, you’re going to face this issue. (Your problems might be different: your own little data center isn’t going to have the kind of complex problem that Amazon experienced today, but you’re still going to have downtime-causing issues.)

There are a lot of moving parts in cloud IaaS, and any one of them going wrong can bork your entire site or application. Your real task is appropriate risk mitigation: weighing the risk of downtime and its attendant losses against the complications, technical challenges, and costs created by infrastructure redundancy.


Posted on April 21, 2011, in Infrastructure.

  1. Jonathan Tersen

    “data transfer within a region is free and fast — it’s basically over a local LAN, after all.”

    Are you certain of that? Because Amazon’s own page would seem to indicate it is charged:

    “Data transferred between Amazon EC2 instances located in different Availability Zones in the same Region will be charged Regional Data Transfer.”

    …which is $0.01 per GB. Transfer between different AWS *services* within a region is free, but EC2-to-EC2 traffic across AZs does, I believe, cost their customers for the transit.

  2. Yes, strictly speaking you’re correct — transfer is free between services, and free between instances in the same AZ with private IPs, but there’s a regional transfer charge otherwise.

  3. Lydia,

    The Amazon outage also highlights well-known risks regarding the management of complex environments. The majority of outages are caused by maintenance procedures, and the most common issue with those procedures is human error. Firms that employ mature change management practices can limit the likelihood and impact of these errors. I covered this topic in a blog post: http://www.dangreller.com/archives/370

    Dan Greller
    Invisible Laws Blog
