Cloud IaaS is not magical, and the Amazon reboot-a-thon
Randy Bias has blogged about Amazon mandating instance reboots for hundreds, perhaps thousands, of instances (Amazon’s term for VMs). Affected instances seem to be scheduled for reboots over the next couple of weeks. Speculation is that the reboots are to patch a recently-reported vulnerability in the Xen hypervisor, which is the virtualization technology that underlies Amazon’s EC2. The GigaOm story gives some links, and the CRN story discusses customer pain.
Maintenance reboots are not new on Amazon, and are detailed on Amazon’s documentation about scheduled maintenance. The required reboots this time are instance reboots, which are easily accomplished — just point-and-click to reboot on your own schedule rather than Amazon’s (although you cannot delay past the scheduled reboot). Importantly, instance reboots do not result in a change of IP address nor do they erase the data in instance storage (which is normally non-persistent).
For some customers, of course, a reboot represents a headache, and it results in several minutes of downtime for that instance. Also, since this is peak retail season, it is already a sensitive, heavy-traffic time for many businesses, so the timing of this widespread maintenance is problematic for many customers.
However, cloud IaaS isn’t magical. If these customers were using dedicated hosting, they would still be subject to mandated reboots for security patches — hosting providers generally offer some flexibility on scheduling such reboots, but not aa lot (and sometimes none at all if there’s an exploit in the wild). If these customers were using a provider that uses live migration technology (like VMotion on a VMware-virtualized cloud), they might be spared reboots for system reasons, but they might still be subject to reboots for mandated operating system patches.
Given that what’s underlying EC2 are ordinary physical servers running virtualization without a live migration technology in use, customers should reasonably expect that they will be subject to reboots — server-level (what Amazon calls a system reboot), as well as instance-level — and also anticipate that they may sometimes need to reboot for their own guest OS patches and the like (assuming that they don’t simply patch their AMIs and re-launch their instances, arguably a more “cloudy” way to approach this problem).
What makes this rolling scheduled maintenance remarkable is its sheer scale. Hosting providers typically have a few hundred customers and a few thousand servers. Mass-market VPS hosters have lots of VPS containers, but there’s a roughly 1:1 VPS:customer ratio and a small-business-centricity that doesn’t lead to this kind of hullabaloo. Amazon’s largest competitor is estimated to be around the 100,000 VM mark. Only the largest cloud IaaS providers have more than 2,000 VMs. Consequently, this involves a virtually unprecedented number of customers and mission-critical systems.
Amazon has actually been very good about not taking down its cloud customers for extended maintenance windows. (I can think of one major Amazon competitor that took down one whole data center for an eight-hour maintenence evidently involving a total outage this past weekend, and which regularly has long-downtime maintenance windows in general.) A reboot is an inconvenience, but if you are running production infrastructure, you should darn well think about how to handle the occasional reboot, including reboots that affect a significant percentage of your infrastructure, because reboots are not likely to go away in IaaS anytime soon.
To hammer on the point again: Cloud IaaS is not magical. It still requires management, and it still has some of the foibles of both physical servers and non-cloud virtualization. Being able to push a button and get infrastructure is nice, but the responsibility to manage that infrastructure doesn’t go away — it’s just that many cloud customers manage to delay the day of reckoning when the attention they haven’t paid to management comes back to bite them.
If you run infrastructure, regardless of whether it’s in your own data center, in hosting, or in cloud IaaS, you should have a plan for “what happens if I need to mass-reboot my servers?” because it is something that will happen. And add “what if I have to do that immediately?” to the list, because that is also something that will happen, because mass exploits and worms certainly have not gone away.
(Gartner clients only: Check out a note by my security colleagues, “Address Concentration Risk in Public Cloud Deployments and Shared-Service Delivery Models to Avoid Unacceptable Losses“.)
Amazon and the power of default choices
Estimates of Amazon’s revenues in the cloud IaaS market vary, but you could put it upwards of $1 billion in 2011 and not cause too much controversy. That’s a dominant market share, comprised heavily of early adopters but at this point, also drawing in the mainstream business — particularly the enterprise, which has become increasingly comfortable adopting Amazon services in a tactical manner. (Today, Amazon’s weakness is the mid-market — and it’s clear from the revenue patterns, too, that Amazon’s competitors are mostly winning in the mid-market. The enterprise is highly likely to go with Amazon, although it may also have an alternative provider such as Terremark for use cases not well-suited to Amazon.)
There are many, many other providers out there who are offering cloud IaaS, but Amazon is the brand that people know. They created this market; they have remained synonymous with it.
That means that for many organizations that are only now beginning to adopt cloud IaaS (i.e., traditional businesses that already run their own data centers), Amazon is the default choice. It’s the provider that everyone looks at because they’re big — and because they’re #1, they’re increasingly perceived as a safe choice. And because Amazon makes it superbly easy to sign up and get started (and get started for free, if you’re just monkeying around), there’s no reason not to give them a whirl.
Default choices are phenomenally powerful. (You can read any number of scientific papers and books about this.) Many businesses believe that they’ve got a low-risk project that they can toss on cloud IaaS and see what happens next. Or they’ve got an instant need and no time to review all the options, so they simply do something, because it’s better than not doing something (assuming that the organization is one in which people who get things done are not punished for not having filled out a form in triplicate first).
Default choices are often followed by inertia. Yeah, the company put a project on Amazon. It’s running fine, so people figure, why mess with it? They’ve got this larger internal private cloud story they’re working on, or this other larger cloud IaaS deal they’re working on, but… well, they figure, they can migrate stuff later. And it’s absolutely true that people can and do migrate, or in many cases, build a private cloud or add another cloud IaaS provider, but a high enough percentage of the time, whatever they stuck out there remains at Amazon, and possibly begins to accrete other stuff.
This is increasingly leaving the rest of the market trying to pry customers away from a provider they’re already using. It’s absolutely true that Amazon is not the ideal provider for all use cases. It’s absolutely true that any number of service providers can tell me endless stories of customers who have left Amazon for them. It’s probably true, as many service providers claim, that customers who are experienced with Amazon are better educated about the cloud and their needs, and therefore become better consumers of their next cloud provider.
But it does not change the fact that Amazon has been working on conquering the market one developer at a time, and that in turn has become the bean-counters in business saying, hey, shouldn’t we be using these Amazon guys?
This is what every vendor wants: For the dude at the customer to be trying to explain to his boss why he’s not using them.
This is increasingly my client inquiry pattern: Client has decided they are definitively not using Amazon (for various reasons, sometimes emotional and sometimes well thought out) and are looking at other options, or they are looking at cloud IaaS and are figuring that they’ll probably use Amazon or have even actually deployed stuff on Amazon (even if they have done zero reading or evaluation). Two extremes.
Results of Symposium workshop on Amazon
I promised the attendees at my Gartner Symposium workshop, called “Using Amazon Web Services“, that I would post the notes from the session, so here they are — with some context for public consumption.
A workshop is a structured, facilitated discussions that are designed to assist participants in working through a problem, coming up with best practices, etc. This one had thirty people, all from IT organizations that were either using Amazon or planning to use Amazon.
Because I didn’t know what level of experience with Amazon the workshop attendees would have, I actually prepared two workshops in advance. One of them was a highly structured work-through of preparing to use Amazon in a more formal way (i.e., not a single developer with a credit card or the like), and the other was a facilitated sharing of challenges and best practices amongst current adopters. As the room skewed heavily towards people who already had a deployment well under way, this workshop focused on the latter.
I started the workshop with introductions — people, companies, current use cases. Then, I asked attendees to share their use cases in more details in their smaller working groups. This turned into a set of active discussions that I allowed extra time for, before I asked each of the group to make a list of their most significant challenges in adopting/using Amazon, and their solutions if any. Throughout, I circulated the room, listening and, rarely, commenting. Each group then shared their findings, and I offered some commentary and then did an open Q&A (with some more participant sharing of their answers to questions).
Broadly, I would say that we had three types of people in the room. We had folks from the public sector and education, who were at a relatively early stage in adoption; we had people who were test/dev oriented but in a significant way (i.e., formal adoption, not a handful of developers doing their thang); and we had people who were more e-business oriented (including people from net-native businesses like SaaS, as well as traditional businesses with a hosting type of need), although that could be test/dev or production. Most of the people were mid-level IT management with direct responsibility for the Amazon services.
Some key observations:
Dealing with the financial aspects of moving to the cloud is hard. Understanding the return on investment, accurately estimating costs in advance, comparing costs to internal costs, and understanding the details of billing were common challenges of the participants. Moreover, it raises the issue of “is capital king or is expense king?” Although the broader industry is constantly talking about how people are trying to move to expense rather than capital, workshop participants frequently asserted that it was easier for them to get capital than to up their recurring expenses. (As a side note, I have found that to be a frequent assertion in both inquiry and conference 1-on-1s.) Finally, user management, cost control, and turning resources on/off appropriately were problematic in the financial context.
Move low-risk workloads first. The workshop participants generally assessed Amazon as being suitable only to test/dev, non-mission-critical workloads, and things that had specifically been designed with Amazon’s characteristics in mind. Participants recommended a risk profile of apps, and moving low-risk apps first. They also saw their security organizations as being a barrier to adoption. Many had issues with their Legal departments either trying to prevent use of services or causing issues in the contracting process (what Amazon calls an Enterprise Agreement); participants recommended not involving Legal until after adopting the service.
Performance is a problem. Performance was cited as a frequent issue, especially storage performance, which participants noted was unsuitable to their production applications, and one participant made the key point that many test/dev situations also require highly performant storage (something he had first discovered when his ILM strategy placed test/dev storage at a lower more commodity tier and it impacted his developers).
Know what your SLA isn’t. Amazon’s limited SLAs were cited as an issue, particularly the mismatch in what many users thought the SLA was versus what it actually was, and what it’s actually turned out to be in practice (given Amazon’s outages this year). Participants also stressed business continuity planning in this context.
Integration is a challenge. Participants noted that going to test/dev in the cloud, while maintaining production in an internal data center, splits the software development lifecycle across data centers. This can be overcome to some degree with the appropriate tools, but still creates challenges and sometimes outright problems. Also, because speed of deployment is such a driving factor to go to the cloud, there is a resulting fragmentation of solutions. A service catalog would help some of these issues.
Data management can be a challenge. Participants were worried about regulatory compliance and the “where is my data?” question. Inexperienced participants were often not aware that non-S3 data is generally local to an availability zone. But even beyond that, there’s the question of what data is being put where by the cloud users. Participants with larger amounts of data also faced challenges in moving data in and out of the cloud.
Amazon isn’t the right provider for all workloads in the cloud. Several workshop participants used other cloud IaaS providers in addition to Amazon, for a variety of other reasons — greater ease of use for users who didn’t need complex things, enterprise-grade availability and performance, better manageability, security capabilities, and so forth.
I have conducted cloud workshops and what Gartner calls analyst/user roundtables at a bunch of our conferences now, and it’s always interesting what the different audiences think about, and how much it’s evolving over time. Compared to last year’s Symposium, the state of the art of Amazon adoption amongst conference attendees has clearly advanced hugely.
Amazon and Equinix partner for Direct Connect
Amazon has introduced a new connectivity option called AWS Direct Connect. In plain speak, Direct Connect allows an Amazon customer to get a cross-connect between his own network equipment and Amazon’s, in some location where the two companies are physically colocated. In even plainer speak, if you’re an Equinix colocation customer in their Ashburn, Virginia (Washington DC) data center campus, you can get a wire run between your cage and Amazon’s, which gives you direct connectivity between your router and theirs.
This is relatively cheap, as far as such things go. Amazon imposes a “port charge” for the cross-connect at $0.30/hour for 1 Gbps or $2.25/hour for 10 Gbps (on a practical level, since cross-connects are by definition nailed up 100% of the time, about $220/month and $1625/month respectively), plus outbound data transfer at $0.02/GB. You’ll also pay Equinix for the cross-connect itself (I haven’t verified the prices for these, but I’d expect they would be around $500 and $1500 per month). And, of course, you have to pay Equinix for the colocation of whatever equipment you have (upwards of $1000/month+ per rack).
Direct Connect has lots of practical uses. It provides direct, fast, private connectivity between your gear in colocation and whatever Amazon services are in Equinix Ashburn (and non-Internet access to AWS in general), vital for “hybrid cloud” use cases and enormously useful for people who, say, have PCI-compliant e-commerce sites with huge databases Oracle RAC and black-box encryption devices, but would like to put some front-end webservers in the cloud. You can also buy whatever connectivity you want from your cage in Equinix, so you can take that traffic and put it over some less expensive Internet connection (Amazon’s bandwidth fees are one of the major reasons customers leave them), or you can get private networking like ethernet or MPLS VPN (an important requirement for enterprise customers who don’t want their traffic to touch the Internet at all).
This is not a completely new thing — Amazon has quietly offered private peering and cross-connects to important customers for some time now, in Equinix. But this now makes cross-connects into a standard option with an established price point, which is likely to have far greater uptake than the one-off deals that Amazon has been doing.
It’s not a fully-automated service — the sign-up is basically used to get Amazon to grant you an authorization so that you can put in an Equinix work order for the cross-connect. But it’s an important step in the right direction. (I’ve previously noted the value of this partnership in a blog post called “Why Cloud IaaS Customers Care About a Colo Option“. Also, for Gartner clients, see my research note “Customers Need Hybrid Cloud Compute IaaS” for a detailed analysis.)
This is good for Equinix, too, for the obvious reasons. For quite some time now, I’ve been evangelizing the importance of carrier-neutral colocation as a “cloud hub”, envisioning a future where these providers facilitate cross-connect infrastructures between cloud users and cloud providers. Widespread adoption of this model would allow an enterprise to say, get a single rack of network equipment at Equinix (or Telecity or Interxion, etc.), and then cross-connect directly to all of their important cloud suppliers. It would drive cross-connect density, differentiation and stickiness at the carrier-neutral colo providers who succeed in being the draw for these ecosystems.
It’s worth noting that this doesn’t grant Amazon a unique capability, though. Just about every other major cloud IaaS provider already offers colocation and private connectivity options. But it’s a crucial step for Amazon towards being suitable for more typical enterprise use cases. (And as a broader long-term ecosystem play, customers may prefer using just one or two “cloud hubs” like an Equinix location for their “cloud backhaul” onto private connectivity, especially if they have gateway devices.)
Gartner research related to Amazon’s outage
In the wake of Amazon’s recent outage, we know we have Gartner clients who are interested in what we’ve written about Amazon in the past, and our existing recommendations for using cloud IaaS, and managing cloud-related risks. While we’re comfortable with our current advice, we’re also in the midst of some internal debate about what new recommendations may emerge out of this event, I’m posting a list of research notes that clients may find helpful as they sort through their thinking. This is just a reading list; it is by no means a comprehensive list of Gartner research related to Amazon or cloud IaaS. If you are a client, you may want to do your own search of the research, or ask our client services folks for help.
I will mark notes as “Core” (available to regular Gartner clients), “GBL” (available to technology and service provider clients who have subscribed to Gartner for Business Leaders or a product with similar access to research targeted at vendors), or “ITP” (available to clients of the Burton Group’s services, known as Gartner for IT Professionals post-acquisitions).
If you are specifically concerned about this particular Amazon outage and its context, and you want to read just one cautionary note, read Will Your Data Rain When the Cloud Bursts?, by my colleague Jay Heiser. It’s specifically about the risk of storage failure in the public cloud, and what you should ask your provider about their recoverability.
You might also be interested in our Cloud Computing: Infrastructure as a Service research round-up, for research related to both external cloud IaaS, and internal private clouds.
We first profiled Amazon EC2 in-depth in the November 2008 note, Is Amazon EC2 Right For You? (Core). It provides a brief overview of EC2, and examines the business case for using it, what applications are suited to using it, and the operational considerations. While some of the information is now outdated, the core questions outlined there are still valid. I am currently in the process of writing an update to this note, which will be out in a few weeks.
A deeper-dive profile can be found in the November 2009 note, Amazon EC2: Is It Ready For the Enterprise? (ITP). This goes into more technical detail (although it is also slightly out of date), and looks at it from an “enterprise readiness” standpoint, including suitability to run certain types of workloads, and a view on security and risk.
Amazon was one of the vendors profiled in our December 2010 multi-provider evaluation, Magic Quadrant for Cloud Infrastructure as a Service and Web Hosting (Core). The evaluation is focused in the context of EC2. This is the most recent competitive view of the market that we’ve published. Our thinking on some of these vendors has changed since the time it was published (and we are working on writing an update, in the form of an MQ specific to public cloud); if you are currently evaluating cloud IaaS, or any part of Amazon Web Services, we encourage you to call and place an inquiry.
We did an in-depth profile for Amazon S3 in the November 2008 note, A Look at Amazon’s S3 Cloud-Computing Storage Service (Core). This note is now somewhat outdated, but please do make a client inquiry if you want to get our current thinking.
The October 2010 note, in Cloud Storage Infrastructure-as-a-Service Providers, North America (Core), provides a “who’s who” list of quick profiles of the major cloud storage providers.
An in-depth examination of cloud storage, focused on the technology and market more so than the vendors (although it does have a chart of competitive positioning), is given in the December 2010 note, Market Profile: Cloud-Storage Service Providers, 2011 (ITP).
The major cloud storage vendors are profiled in some depth in the June 2010 note, Competitive Landscape: Cloud Storage Infrastructure as a Service, North America, 2010 (GBL).
Other Amazon-Specific Things
The June 2009 note, Software on Amazon’s Elastic Compute Cloud: How to Tell Hype From Reality (Core), explores the issues of running commercial software on Amazon EC2, as well as how to separate vendor claims of Amazon partnerships from the reality of what they’re doing.
Amazon was one of the vendors who responded to the cloud rights and responsibilities published by the Gartner Global IT Council for Cloud Services. Their response, and Gartner commentary on it, can be found in Vendor Response: How Providers Address the Cloud Rights and Responsibilities (Core).
Amazon’s Elastic MapReduce service is profiled in the January 2011 note, Hadoop and MapReduce: Big Data Analytics (ITP).
Cloud IaaS, in General
A seven-part note, the top-level note of which is Evaluating Cloud Infrastructure as a Service (Core), goes into extensive detail about the range of options available in cloud IaaS provider, and how to evaluate those providers. You are highly encouraged to read it to understand the full range of market options; there’s a lot more to the market than just Amazon.
To understand the breadth of the market, and the players in particular segments, read Market Insight: Structuring the Cloud Compute IaaS Market (GBL). This is targeted at vendors ho want to understand buyer profiles and how they map to the offerings in the market.
Help with evaluating what type of data center solution is right for you can be found in the framework laid out in Data Center Sourcing: Cloud, Host, Co-Lo, or Do It Yourself (ITP).
Help with evaluating your application’s suitability for a move to the cloud can be found in Migrating Applications to the Cloud: Rehost, Refactor, Revise, Rebuild, or Replace? (ITP), which takes an in-depth look at the factors you should consider when evaluating your application portfolio in a cloud context.
We’ve recently produced a great deal of research related to cloud sourcing. A catalog of that research can be found in Manage Risk and Unexpected Costs During the Cloud Sourcing Revolution (Core). There’s a ton of critical advice there, especially with regard to contracting, that make these notes a must-read.
We provide a framework for evaluating cloud security and risks in Developing a Cloud Computing Security Strategy (ITP). This offers a deep dive into security and compliance issues, including how to build a cross-functional team to deal with these issues.
We take a look at assessment and auditing frameworks for cloud computing, in Determining Criteria for Cloud Security Assessment: It’s More than a Checklist (ITP). This goes deep into detail on risk assessment, assessment of provider controls, and the emerging industry standards for cloud security.
We caution about the risks of expecting that a cloud provider will have such a high level of reliability that a business continuity and recoverability are no long necessary, in Will Your Data Rain When the Cloud Bursts? (Core). This note is specifically primarily focused on data recoverability.
We provide a framework for cloud risk mitigation in Managing Availability and Performance Risks in the Cloud: Expect the Unexpected (ITP). This provides solid advice on planning your bail-out strategy, distributing your applications/data/services, and buying cyber-risk insurance.
If you are using a SaaS provider, and you’re concerned about their underlying infrastructure, we encourage you to ask them a set of Critical Questions. There are three research notes, covering Infrastructure, Security, and Recovery (all Core). These notes are somewhat old, but the questions are still valid ones.
Amazon outage and the auto-immune vulnerabilities of resiliency
Today is Judgment Day, when Skynet becomes self-aware. It is, apparently, also a very, very bad day for Amazon Web Services.
Lots of people have raised questions today about what Amazon’s difficulties today mean for the future of cloud IaaS. My belief is that this doesn’t do anything to the adoption curve — but I do believe that customers who rely upon Amazon to run their businesses will, and should, think hard about the resiliency of their architectures.
It’s important to understand what did and did not happen today. There’s been a popular impression that “EC2 is down”. It’s not. To understand what happened, though, some explanation of Amazon’s infrastructure is necessary.
Amazon divides its infrastructure into “regions”. You can think of a region as basically analogous to “a data center”. For instance, US-East-1 is Amazon’s Northern Virginia data center, while US-West-1 is Amazon’s Silicon Valley data center. Each region, in turn, is divided into multiple “availability zones” (AZs). You can think of an AZ is basically analogous to “a cluster” — it’s a grouping of physical and logical resources. Each AZ is designated by letters — for instance, US-East-1a, US-East-1b, etc. However, each of these designations are customer-specific (which is why Amazon’s status information cannot easily specify which AZ is affected by a problem).
Amazon’s virtual machine offering is the Elastic Compute Cloud (EC2). When you provision an EC2 “instance” (Amazon’s term for a VM), you also get an allocation of “instance storage”. Instance storage is transient — it exists only as long as the VM exists. Consequently, it’s not useful for storing anything that you actually want to keep. To get persistent storage, you use Amazon’s Elastic Block Store (EBS), which is basically just network-attached storage. Many people run databases on EC2 that are backed by EBS, for instance. Because that’s such a common use case, Amazon offers the Relational Database Service (RDS), which is basically an EC2 instance running MySQL.
Amazon’s issues today are with EBS, and with RDS, both in the US-East-1 region. (My guess is that the issues are related, but Amazon has not specifically stated that they are.) Customers who aren’t in the US-East-1 region aren’t affected (customers always choose which region and specific AZs they run in). Customers who don’t use EBS or RDS are also unaffected. However, use of EBS is highly commonplace, and likely just about everyone using EC2 for a production application or Web site is reliant upon EBS. Consequently, even though EC2 itself has been running just fine, the issues have nevertheless had a major impact on customers. If you’re storing your data on EBS, the issues with EBS have made your data inaccessible, or they’ve made access to that data slow and unreliable. Ditto with RDS. Obviously, if you can’t get to your data, you’re not going to be doing much of anything.
In order to get Amazon’s SLA for EC2, you, as a customer, have to run your application in multiple AZs within the same region. Running in multiple AZs is supposed to isolate you from the failure of any single AZ. In practice, of course, this only provides you so much protection — since the AZs are typically all in the same physical data center, anything that affects that whole data center would probably affect all the AZs. Similarly, the AZs are not totally isolated from one another, either physically or logically.
However, when you create an EBS volume, you place it in a specific availability zone, and you can only attach that EBS volume to EC2 instances within that same availability zone. That complicates resiliency, since if you wanted to fail over into another AZ, you’d still need access to your data. That means if you’re going to run in multiple AZs, you have to replicate your data across multiple AZs.
One of the ways you can achieve this is with the Multi-AZ option of RDS. If you’re running a MySQL database and can do so within the constraints of RDS, the multi-AZ option lets you gain the necessary resiliency for your database without having to replicate EBS volumes between AZs.
As one final caveat, data transfer within a region is free and fast — it’s basically over a local LAN, after all. By contrast, Amazon charges you for transfers between regions, which goes over the Internet and has the attendant cost and latency.
Consequently, there are lots of Amazon customers who are running in just a single region. A lot of those customers may be running in just a single AZ (because they didn’t architect their app to easily run in multiple AZs). And of the ones who are running in multiple AZs, a fair number are reliant upon the multi-AZ functionality of RDS.
That’s why today’s impacts were particularly severe. US-East-1 is Amazon’s most popular region. The problems with EBS impacted the entire region, as did the RDS problems (and multi-AZ RDS was particularly impacted), not just a single AZ, so if you were multiple-AZ but not multi-region, the resiliency you were theoretically getting was of no help to you. Today, people learned that it’s not necessarily adequate to run in multiple AZs. (Justin Santa Barbara has a good post about this.)
My perspective on this is pretty much exactly what I would tell a traditional Web hosting customer who’s running only in one data center: If you want more resiliency, you need to run in more than one data center. And on Amazon, if you want more resiliency, you need to not only be multi-AZ but also multi-region.
Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means that you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was EBS and RDS which weren’t, and neither of those services have SLAs.
So how did Amazon end up with a problem that affected all the AZs within the US-East-1 region? Well, according to their status dashboard, they had some sort of network problem last night in their east coast data center. That problem resulted in their automated resiliency mechanisms attempting to re-mirror a large number of EBS volumes. This impacted one of the AZs, but it also overloaded the control infrastructure for EBS in that region. My guess is that RDS also uses this same storage infrastructure, so the capacity shortages and whatnot created by all of this activity ended up also impacting RDS.
My colleague Jay Heiser, who follows, among other things, risk management, calls this “auto-immune disease” — i.e., resiliency mechanisms can sometimes end up causing you harm. (We’ve seen auto-immune problems happen before in a prior Amazon S3 outage, as well as a Google Gmail outage.) The way to limit auto-immune damage is isolation — ensuring limits to the propagation.
Will some Amazon customers pack up and leave? Will some of them swear off the cloud? Probably. But realistically, we’re talking about data centers, and infrastructure, here. They can and do fail. You have to architect your app to have continuous availability across multiple data centers, if it can never ever go down. Whether you’re running your own data center, running in managed hosting, or running in the cloud, you’re going to face this issue. (Your problems might be different — i.e., your own little data center isn’t going to have the kind of complex problem that Amazon experienced today — but you’re still going to have downtime-causing issues.)
There are a lot of moving parts in cloud IaaS. Any one of them going wrong can bork your entire site/application. Your real problem is appropriate risk mitigation — the risk of downtime and its attendant losses, versus the complications and technical challenges and costs created by infrastructure redundancy.
Amazon’s dedicated instances
Back in December, I blogged about the notion of Just Enough Privacy — the idea that cloud IaaS customers could share a common pool of physical servers, yet have the security concerns of shared infrastructure addressed through provisioning rules that would ensure that once a “private” customer got a virtual machine provisioned on a physical server, no other customers would then be provisioned onto that server for the duration of that VM’s life. Customers are far more willing to share network and storage than they are compute, because they’re worried about hypervisor security, so this approach addresses a significant amount of customer paranoia with no real negative impact to the provider.
Amazon has just added EC2 Dedicated Instances, which are pretty much exactly what I wrote about previously. For $10 an hour per region with single-tenancy, plus a roughly 20% uplift to the normal Amazon instance costs, you can have single-tenant servers. There are some minor configuration complications, and dedicated reserved instances have their own pricing (and are therefore separate from regular reserved instances), but all in all, these combine with the recently-released VPC features for a reasonably elegant set of functionality.
The per-region charge carries a significant premium over any wasted capacity. An extra-large instance is a full physical server; it’s 8x larger than a small instance, and its normal pricing is exactly 8x, $0.68/hour vs. a small’s $0.085/hour (Linux pricing). Nothing costs more than a quadruple extra large high-memory instance ($2.48/hour), also a full physical server. Dedicated tenancy should never waste more than a full physical server’s worth of capacity, so the “wasted” capacity carries around a 15x premium on normal instances and a 4x premium on the expensive high-memory instances, compared to if that capacity had simply been sold as a multi-tenant server. It’s basically a nuisance charge for really small customers, and not even worth thinking about by larger customers (it’s a lot less than the cost of a cocktail at a nice bar in San Francisco). All in all, it’s pretty attractive financially for Amazon, since they’re getting a 20%-ish premium on the instance charges themselves, too. (And if retail is the business of pennies, those pennies still add up when you have enough customers.)
Amazon has been on a real roll since the start of the year — the extensive VPC enhancements, the expansion of the Identity and Access Management features, and the CloudFormation templates are among the key enhancements. And the significance of the Citrix/Amazon partnership announcement shouldn’t be overlooked, either.
Amazon Simple Email Service
Last week, Amazon launched its Simple Email Service (SES). SES is an outbound SMTP service, accessible via API or easily integrated into common SMTP servers (Amazon provides instructions for sendmail and postfix). It has built-in rate-limiting and feedback loop statistics (rejected, bounced, complaints). It’s $0.10 per thousand messages. EC2 customers get 2000 messages for free each month. You do, however, have to pay for data transfer.
Sending email from EC2 has long been a challenge. For the obvious reasons, Amazon has had anti-spam measures in place, and the EC2 infrastructure itself is also likely to be automatically eyeballed with suspicion by the anti-spam mechanisms on the receiving email servers. Although addressing issues with Elastic IPs and reverse DNS has helped somewhat, Amazon has struggled with reputation management for its EC2 address blocks, despite attempting to police outbound SMTP from those blocks.
There are various third-party email services (bare-bones as well as sophisticated) that EC2 users have used to work around the problem. Sometimes it’s thrown in as part of another service; for instance, DataPipe includes an external SMTP service as part of its managed services for EC2. Pricewise, though, SES wins hands-down over both a raw delivery service like AuthSMTP and a fancier one like Sendgrid.
Amazon isn’t providing the super-sophisticated capabilities that email marketing campaign companies can provide, but it is providing one really vital element — feedback loop statistics, something that is useful to companies sending both transactional and bulk email. For some customers, that’s all they’re looking for — raw sends and the feedback loop. When you look apples-to-apples, though, Amazon is more than a full order of magnitude cheaper than the comparable traditional services. That represents a real potential shake-up for that industry, whether the target customer is a small business or an enterprise. Also, it’s potentially a very interesting way for those companies to offer a simple service on somebody else’s low-cost infrastructure, as Mailchimp STS now does.
My colleagues Matt Cain (email infrastructure) and Adam Sarner (e-marketing) and I will be issuing an event note to Gartner clients in the future, looking at this development in greater detail.
Amazon’s Elastic Beanstalk
Amazon recently released a new offering called the Elastic Beanstalk. At its heart, it is a simplified interface to EC2 and its ancillary services (load-balancing, auto-scaling, and monitoring integrated with alerts), along with an Amazon-maintained AMI containing Linux and Apache Tomcat (an open source Java EE application server), and a deployment mechanism for a Java app (in the form of a WAR file), which notably adds tighter integration with Eclipse, a popular IDE.
Many people are calling this Amazon’s PaaS foray. I am inclined to disgree that it is PaaS (although Amazon does have other offerings which are PaaS, such as SimpleDB and SQS). Rather, I think this is still IaaS, but with a friendlier approach to deployment and management. It is developer-friendly, although it should be noted that in its current release, there is no simplification of any form of storage persistence — no easy configuration of EBS or friendly auto-adding of RDS instances, for example. Going to the database tab in the Elastic Beanstalk portion of Amazon’s management console just directs you to documentation about storage options on AWS. Almost no one is going to be running a real app without a persistence mechanism, so the Beanstalk won’t be truly turnkey until this is simplified accordingly.
Because Elastic Beanstalk fully exposes the underlying AWS resources and lets you do whatever you want with them, the currently-missing feature capabilities aren’t a limitation; you can simply use AWS in the normal way, while still getting the slimmed-down elegance of the Beanstalk’s management interfaces. Also notably, it’s free — you’re paying only for the underlying AWS resources.
Amazon exemplifies the idea of IT services industrialization, but in order to address the widest possible range of use cases, Amazon needs to be able to simplify and automate infrastructure management that would otherwise require manual work (i.e., either the customer needs to do it himself, or he needs managed services). I view Elastic Beanstalk and its underlying technologies as an important advancement along Amazon’s path towards automated management of infrastructure. In its current incarnation, it eases developer on-boarding — but in future iterations, it could become a key building-block in Amazon’s ability to serve the more traditional IT buyer.
Gartner is NOT dissing Amazon’s cloud
I’ve now seen a number of press reports and some related writing, about the Magic Quadrant for Cloud Infrastrucure as a Service and Web Hosting, that I feel mischaracterize statements made on the MQ in ways that they were certainly not intended to be taken, and in some cases, mischaracterize the nature of a Magic Quadrant itself. I feel compelled to try to attempt to make some things explicitly clear. Specifically:
This is not just a Cloud IaaS MQ. As the title says, this is a Cloud IaaS AND Web Hosting MQ. You should not interpret a vendor who is a Leader in the MQ as necessarily being a “cloud leader”, for instance; they are simply a leader in the context of the overall market, of which we have forecasted 25% of the revenue to be cloud IaaS by the end of 2011. You should look at the execution axis as favoring vendors whose positioning fits well with the immediate, relatively conservative needs of typical Gartner clients, and the vision axis as favoring vendors whose strategy makes them likely to succeed in a more cloud-oriented future.
The Magic Quadrant is not tiered. Specifically, the Challengers quadrant is not “better” than the Visionaries quadrant, nor is the reverse true. Indeed, Visionaries may be better positioned to succeed in the future than Challengers are, since they tend to be companies who have good roadmaps and are evolving quickly. (And Niche Players might be highly focused and fantastic at what they do, note.)
The Magic Quadrant rates relative positions of vendors in an overall market. Importantly, it does not rate products or services; these are only a component of the overall rating (in this particular MQ, about a third). You should never judge a vendor’s position as indicating that it necessarily has a better service, especially with respect to your specific use case. It’s especially important in this particular MQ because most of the vendors are not pure-plays, and their cloud IaaS service might be much better or much worse than their portfolio as a whole.
Strengths and Cautions are statements about a vendor, not the reasons for their rating. The statements are things that we think it is important for a prospective customer to know when they’re thinking about working with this vendor. They are distinct from the criteria scores that underly the graph. In many cases, the vendor has not lost or gained points specifically for the thing that is called out, but it’s something distinctive, or that readers might not be aware, or is a common misunderstanding from readers, or is even just simply pointing out a best practice when dealing with a particular vendor.
At no point do we say that Amazon’s cloud service is unproven. Amazon is positioned as a Visionary, and as a category, Visionaries are typically companies who have a relatively short track record in the evaluated market as a whole (yes, the boilerplate language for the category uses “unproven”). Pure-play cloud vendors are still emerging, which makes this characterization fit pretty well. While Amazon has obviously been at the pure-cloud-IaaS business longer than any other vendor on the Magic Quadrant, they are newcomers to the overall market assessed by the MQ, which is now about 15 years old. MQ Visionaries are pioneering new territory. That shouldn’t be regarded as a bad thing.
We are not “dissing” Amazon. Some writers have been trying to imply that we don’t think much of Amazon’s cloud service. Nowhere does the report state this. The report certainly attempts to present meaningful strengths and cautions for Amazon, as it does for every vendor. Amazon has by far the highest rating on vision. Its execution score is based on its ability to serve the whole host of evaluated enterprise use cases in the Magic Quadrant, which, if you think about it, indicates that Amazon must have scored well on the self-managed IaaS use case, since that is the only one of the three evaluated use cases that are considered by the Magic Quadrant, and Amazon doesn’t serve the other two use cases at all. Use of Amazon or any other vendor should be considered in light of your use case and requirements.
Obviously, there are plenty of people who are interested in understanding more about the thinking and market observations that led to this Magic Quadrant, and possibly some substantial confusion on the part of people who don’t have access to the larger body of research as to Gartner’s views on cloud IaaS and so forth. I’ve blogged a fair amount recently to try to clear up some points of confusion. However, I am mindful of Gartner’s policies for social media use by analysts, and believe that it would be inappropriate for me to methodically blog about market evolution, market segmentation, and the use cases and adoption patterns that we are seeing from our clients — the things that would be necessary in order to fully lay out an explanation of our market view and how it led to this particular MQ. Instead, I should be writing research notes for paying clients, and I intend to do exactly that.
If you are a Gartner client, you are welcome to place an inquiry to discuss the Magic Quadrant and any other related matters; I’m happy to discuss these at length. And do please read the full Magic Quadrant and related research. (Non-clients can only read the non-interactive document.)