Author Archives: Lydia Leong

App categorization and the commodity vs. dependable cloud

At the SLA@SOI conference, my colleague Drue Reeves gave a presentation on the dependable cloud, which he defined as “a cloud service that has the availability, security, scalability, and risk management necessary to host enterprise applications… at a reasonable price.” We’ll be publishing research on this in the months to come, so this blog post contains relatively early-stage musings on my part.

We need enterprise-grade, dependable cloud infrastructure as a service (IaaS). But there’s also a place in the world for commodity cloud IaaS. They serve different sorts of use cases, different categories of applications. (Everything in this post refers to IaaS, but I’m just saying “cloud” for convenience.)

There are four types of applications that will move into the cloud:

  • Existing enterprise applications, capable of being virtualized
  • New enterprise-class applications, almost certainly Web-based
  • Internet-class applications, Web 1.0 and early Web 2.0
  • Global-class applications, highly sophisticated super-scalable Web 2.0 and beyond

Enterprise-class applications are generally characterized by the expectation that the underlying infrastructure is at least as reliable, performant, and secure as traditional enterprise data center infrastructure. They expect resilience at the infrastructure layer. Over the last decade, applications of this type have generally been written as three-tier, Web-based apps. Nevertheless, these apps often scale vertically rather than horizontally (scale up rather than scale out). Moreover, a very large percentage of them are small applications — ones that use a core or less of a modern CPU — so even if they could scale out on multiple VMs, it often doesn’t make sense, from a capacity-efficiency standpoint, to deploy them that way.

In the future, while an increasing percentage of new business applications will be obtained as SaaS, rather than being internally-hosted COTS apps or in-house-written apps, and more will be deployed onto business process management (BPM) suite platforms or the like, businesses will still be writing custom apps of this sort. So we will continue to need dependable infrastructure.

Moreover, many enterprise-class applications are written not just by business IT, but also by external vendors, whether ISVs, SaaS providers, or others. Even tech companies that make their living off their websites may write enterprise-class apps. Indeed, many such apps have previously used managed hosting for the underlying infrastructure, and these companies expect infrastructure dependability.

By contrast, Internet-class applications are written to scale out. They might or might not be written to be easily distributed. They assume sufficient scale that there is an expectation that at least some things can fail without causing widespread failure, although there may still be particularly vulnerable points in the app and underlying infrastructure — the database, for instance. Resilience is generally built into the application, but these are not apps designed to withstand the Chaos Monkey.

Finally, global-class applications are written to be scale-out, highly-distributed, and to withstand massive infrastructure failures. All the resiliency is built into the application; the underlying infrastructure is assumed to be fragile. Simple underlying infrastructure components that fail cleanly and quickly (rather than dying slow deaths of degradation) are prized, because they are cheap to buy and cheap to replace; all the intelligence resides in software.

Global-class applications can use commodity cloud infrastructure, as can other use cases that do not expect a dependable cloud. Internet-class applications can also use commodity cloud infrastructure, but unless efforts are made to move more resiliency into the application layer, there are risk management issues here, and depending upon scale and needs, a dependable cloud may be preferable to commodity cloud. Enterprise-class applications need a dependable cloud.

Where resiliency resides is an architectural choice. There is no One True Way. Building resilience into the app may be the most cost-effective choice for applications which need to have “Internet scale”, but it may add unwarranted and unnecessary complexity to many other applications, making dependable infrastructure the more cost-effective choice.

Akamai and Riverbed partner on SaaS delivery

Akamai and Riverbed have signed a significant partnership deal to jointly develop solutions that combine Internet acceleration with WAN optimization. The two companies will be incorporating each other’s technologies into their platforms; this is a deep partnership with significant joint engineering, and it is probably the most significant partnership that Akamai has done to date.

Akamai has been facing increasing challenges to its leadership in the application acceleration market — what Akamai’s financial statements term “value added services”, including their Dynamic Site Accelerator (DSA) and Web Application Accelerator (WAA) services, which are B2C and B2B bundles, respectively, built on top of the same acceleration delivery network (ADN) technology. Vendors such as Cotendo (especially via its AT&T partnership), CDNetworks, and EdgeCast now have services that compete directly with what has been, for Akamai, a very high-margin, very sticky service. This market is facing severe pricing pressure, due not just to competition, but also to the delta between the cost of these services and standard CDN caching. (In other words, as basic CDN services get cheaper, application acceleration also needs to get cheaper, in order to demonstrate sufficient ROI, i.e., business value of performance, above just buying the less expensive solution.)

While Akamai has had interesting incremental innovations and value-adds since it obtained this technology via the 2007 acquisition of Netli, it has, until recently, enjoyed a monopoly on these services, and therefore hasn’t needed to do any groundbreaking innovation. While the internal enterprise WAN optimization market has been heavily competitive (between Riverbed, Cisco, and many others), other CDNs largely only began offering competitive ADN solutions in the last year. Now, while Akamai still leads in performance, it badly needs to open up some differentiation and new potential target customers, or it risks watching ADN solutions commoditize just the way basic CDN services have.

The most significant value proposition of the joint Akamai/Riverbed solution is this:

Despite the fundamental soundness of the value proposition of ADN services, most SaaS providers use only a basic CDN service, or no CDN at all. The same is true of other providers of cloud-based services. Customers, however, frequently want accelerated services, especially if they have end-users in far-flung corners of the globe; the most common problem is poor performance for end-users in Asia-Pacific when the service is based in the United States. Yet, today, getting that acceleration either requires the SaaS provider to buy an ADN service themselves (which is hard to justify for only one customer, especially for multi-tenant SaaS), or requires the SaaS provider to allow the customer to deploy hardware in its data center (for instance, a Riverbed Steelhead WOC).

With the solution that this partnership is intended to produce, customers won’t need a SaaS provider’s cooperation to deploy an acceleration solution — they can buy it as a service and have the acceleration integrated with their existing Riverbed solution. It adds significant value to Riverbed’s customers, and it expands Akamai’s market opportunity. It’s a great idea, and in fact, this is a partnership that probably should have happened years ago. Better late than never, though.

3Crowd, a new fourth-generation CDN

3Crowd has unveiled its master plan with the recent launch of its CrowdCache product. Previously, 3Crowd had a service called CrowdDirector, essentially load-balancing for content providers who use multiple CDNs. CrowdCache is much more interesting, and it gives life and context to the existence of CrowdDirector. CrowdCache is a small, free, Java application that you can deploy onto a server, which turns it into a CDN cache. You then use CrowdDirector, which you pay for as-a-service on a per-object-request basis, to provide all the intelligence on top of that cache. CrowdDirector handles the request routing, management, analytics, and so forth. What you get, in the end, at least in theory, is a turnkey CDN.

I consider 3Crowd to be a fourth-generation CDN. (I started writing about 4th-gen CDNs back in 2008; see my blog posts on CDN overlays and MediaMelon, on the launch of CDN aggregator Aflexi, and 4th-gen CDNs and the launch of Conviva).

To recap, first-generation CDNs use a highly distributed edge model (think: Akamai), second-generation CDNs use a somewhat more concentrated but still highly distributed model (think: Speedera), and third-generation CDNs use a megaPOP model of many fewer locations (think: Limelight and most other CDNs founded in the 2005-2008 timeframe). These are heavily capital-intensive models that require owning substantial server assets.

Fourth-generation CDNs, by contrast, represent a shift towards a more software-oriented model. These companies own limited (or even no) delivery assets themselves. Some of these are not (and will not be) so much CDNs themselves, as platforms that reside in the CDN ecosystem, or CDN enablers. Fourth-generation CDNs provide software capabilities that allow their customers to turn existing delivery assets (whether in their own data centers, in the cloud, or sometimes even on clients using peer-to-peer) into CDN infrastructure. 3Crowd fits squarely into this fourth-generation model.

3Crowd is targeting three key markets: content providers who have spare capacity in their own data centers and would like to deliver content using that capacity before they resort to their CDN; Web hosters who want to add a CDN to their service offerings; and carriers who want to build CDNs of their own.

In this last market segment, especially, 3Crowd will compete against Cisco, Juniper (via the Ankeena acquisition), Alcatel-Lucent (via the Velocix acquisition), EdgeCast, Jet-Stream, and other companies that offer CDN-building solutions.

No doubt 3Crowd will also get some do-it-yourselfers who will decide to use 3Crowd to build their own CDN using cloud IaaS from Amazon or the like. This is part of what’s generating buzz for the company now, since their “Garage Startup” package is totally free.

I also think there’s potentially an enterprise play here, for organizations that need to deliver content both internally and externally: they could potentially use 3Crowd to deploy an eCDN internally, along with an Internet CDN hosted on a cloud provider, substituting for caches from Blue Coat or the like. There are lots of additional things that 3Crowd needs to be viable in that space, but it’s an interesting thing to think about.

3Crowd has federation ambitions as well. Once they have a bunch of customers using their platform, they’d like to operate a marketplace in which capacity can be traded, and, of course, to enable more private federation deals, something that tends to be of interest to regional carriers with local CDN ambitions, who look to federation as a way of competing with the global CDNs.

Conceptually, what 3Crowd has done is not unique. Velocix, for instance, has similar hopes with its Metro product. There is certainly plenty of competition for infrastructure for the carrier CDN market (most of the world’s carriers have woken up over the last year and realized that they need a CDN strategy of some sort, even if their ambitions do not go farther than preventing their broadband networks from being swamped by video). What 3Crowd has done that’s notable is to emphasize an easy-to-deploy, complete, integrated solution that runs on commodity infrastructure resources, along with a relatively sophisticated feature set.

The baseline price seemed pretty cheap to me at first, and then I did some math. At the baseline pricing for a start-up, it’s about 2 cents per 10,000 requests. If you’re doing small-object delivery at 10 KB per file, ten thousand requests is about 100 MB of content. So 1 GB of content delivered as 10 KB files would cost you 20 cents. That’s not cheap, since that’s just the 3Crowd cost — you still have to supply the servers and the network bandwidth. By comparison, Rackspace Cloud Files CDN-enabled delivery via Akamai is 18 cents per GB for the actual content delivery. Anyone doing enough volume to actually have a full CDN contract and not pushing their bits through a cloud CDN is going to see pricing a lot lower than 18 cents, too.

However, the pricing dynamics are quite different for video. If you’re doing delivery of relatively low-quality, YouTube-like social video, for instance, your average file size is probably more like 10 MB. So 10,000 requests is 100 GB of content, making the per-GB surcharge a mere 0.02 cents ($0.0002). This is an essentially negligible amount. Consequently, the request-based pricing model makes 3Crowd far more cost-effective as a solution for video and other large-file-centric CDNs than it does for small object delivery.
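To make the arithmetic explicit, here is a minimal sketch of the per-GB surcharge implied by per-request pricing; the price and file sizes are just the illustrative figures from above, not actual 3Crowd list pricing.

```python
# Per-GB surcharge implied by per-request CDN pricing (illustrative figures only).
PRICE_PER_10K_REQUESTS = 0.02  # USD, the rough start-up baseline discussed above

def surcharge_per_gb(avg_file_size_bytes):
    """Return the per-GB surcharge (USD) for a given average object size."""
    requests_per_gb = 1e9 / avg_file_size_bytes
    return PRICE_PER_10K_REQUESTS * requests_per_gb / 10_000

print(round(surcharge_per_gb(10 * 1024), 4))       # 10 KB objects: ~$0.20 per GB
print(round(surcharge_per_gb(10 * 1024 ** 2), 4))  # 10 MB videos:  ~$0.0002 per GB
```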

I certainly have plenty more thoughts on this, both specific to 3Crowd, and to the 4th-gen CDN and carrier CDN evolutionary path. I’m currently working on a research note on carrier CDN strategy and implementation, so keep an eye out for it. Also, I know many of the CDN watchers who read my blog are probably now asking themselves, “What are the implications for Akamai, Limelight, and Level 3?” If you’re a Gartner client, please feel free to call and make an inquiry.

Gartner research related to Amazon’s outage

In the wake of Amazon’s recent outage, we know we have Gartner clients who are interested in what we’ve written about Amazon in the past, and in our existing recommendations for using cloud IaaS and managing cloud-related risks. While we’re comfortable with our current advice, we’re also in the midst of some internal debate about what new recommendations may emerge out of this event. In the meantime, I’m posting a list of research notes that clients may find helpful as they sort through their thinking. This is just a reading list; it is by no means a comprehensive list of Gartner research related to Amazon or cloud IaaS. If you are a client, you may want to do your own search of the research, or ask our client services folks for help.

I will mark notes as “Core” (available to regular Gartner clients), “GBL” (available to technology and service provider clients who have subscribed to Gartner for Business Leaders or a product with similar access to research targeted at vendors), or “ITP” (available to clients of the Burton Group’s services, known as Gartner for IT Professionals post-acquisition).

If you are specifically concerned about this particular Amazon outage and its context, and you want to read just one cautionary note, read Will Your Data Rain When the Cloud Bursts?, by my colleague Jay Heiser. It’s specifically about the risk of storage failure in the public cloud, and what you should ask your provider about their recoverability.

You might also be interested in our Cloud Computing: Infrastructure as a Service research round-up, for research related to both external cloud IaaS, and internal private clouds.

Amazon EC2

We first profiled Amazon EC2 in-depth in the November 2008 note, Is Amazon EC2 Right For You? (Core). It provides a brief overview of EC2, and examines the business case for using it, what applications are suited to using it, and the operational considerations. While some of the information is now outdated, the core questions outlined there are still valid. I am currently in the process of writing an update to this note, which will be out in a few weeks.

A deeper-dive profile can be found in the November 2009 note, Amazon EC2: Is It Ready For the Enterprise? (ITP). This goes into more technical detail (although it is also slightly out of date), and looks at it from an “enterprise readiness” standpoint, including suitability to run certain types of workloads, and a view on security and risk.

Amazon was one of the vendors profiled in our December 2010 multi-provider evaluation, Magic Quadrant for Cloud Infrastructure as a Service and Web Hosting (Core). Amazon’s evaluation there is focused on EC2. This is the most recent competitive view of the market that we’ve published. Our thinking on some of these vendors has changed since the time it was published (and we are working on writing an update, in the form of an MQ specific to public cloud); if you are currently evaluating cloud IaaS, or any part of Amazon Web Services, we encourage you to call and place an inquiry.

Amazon S3

We did an in-depth profile for Amazon S3 in the November 2008 note, A Look at Amazon’s S3 Cloud-Computing Storage Service (Core). This note is now somewhat outdated, but please do make a client inquiry if you want to get our current thinking.

The October 2010 note, Cloud Storage Infrastructure-as-a-Service Providers, North America (Core), provides a “who’s who” list of quick profiles of the major cloud storage providers.

An in-depth examination of cloud storage, focused on the technology and market more so than the vendors (although it does have a chart of competitive positioning), is given in the December 2010 note, Market Profile: Cloud-Storage Service Providers, 2011 (ITP).

The major cloud storage vendors are profiled in some depth in the June 2010 note, Competitive Landscape: Cloud Storage Infrastructure as a Service, North America, 2010 (GBL).

Other Amazon-Specific Things

The June 2009 note, Software on Amazon’s Elastic Compute Cloud: How to Tell Hype From Reality (Core), explores the issues of running commercial software on Amazon EC2, as well as how to separate vendor claims of Amazon partnerships from the reality of what they’re doing.

Amazon was one of the vendors who responded to the cloud rights and responsibilities published by the Gartner Global IT Council for Cloud Services. Their response, and Gartner commentary on it, can be found in Vendor Response: How Providers Address the Cloud Rights and Responsibilities (Core).

Amazon’s Elastic MapReduce service is profiled in the January 2011 note, Hadoop and MapReduce: Big Data Analytics (ITP).

Cloud IaaS, in General

A seven-part note, the top-level note of which is Evaluating Cloud Infrastructure as a Service (Core), goes into extensive detail about the range of options available among cloud IaaS providers, and how to evaluate those providers. You are highly encouraged to read it to understand the full range of market options; there’s a lot more to the market than just Amazon.

To understand the breadth of the market, and the players in particular segments, read Market Insight: Structuring the Cloud Compute IaaS Market (GBL). This is targeted at vendors who want to understand buyer profiles and how they map to the offerings in the market.

Help with evaluating what type of data center solution is right for you can be found in the framework laid out in Data Center Sourcing: Cloud, Host, Co-Lo, or Do It Yourself (ITP).

Help with evaluating your application’s suitability for a move to the cloud can be found in Migrating Applications to the Cloud: Rehost, Refactor, Revise, Rebuild, or Replace? (ITP), which takes an in-depth look at the factors you should consider when evaluating your application portfolio in a cloud context.

Risk Management

We’ve recently produced a great deal of research related to cloud sourcing. A catalog of that research can be found in Manage Risk and Unexpected Costs During the Cloud Sourcing Revolution (Core). There’s a ton of critical advice there, especially with regard to contracting, that makes these notes a must-read.

We provide a framework for evaluating cloud security and risks in Developing a Cloud Computing Security Strategy (ITP). This offers a deep dive into security and compliance issues, including how to build a cross-functional team to deal with these issues.

We take a look at assessment and auditing frameworks for cloud computing, in Determining Criteria for Cloud Security Assessment: It’s More than a Checklist (ITP). This goes deep into detail on risk assessment, assessment of provider controls, and the emerging industry standards for cloud security.

We caution about the risks of expecting that a cloud provider will have such a high level of reliability that business continuity and recoverability planning are no longer necessary, in Will Your Data Rain When the Cloud Bursts? (Core). This note is primarily focused on data recoverability.

We provide a framework for cloud risk mitigation in Managing Availability and Performance Risks in the Cloud: Expect the Unexpected (ITP). This provides solid advice on planning your bail-out strategy, distributing your applications/data/services, and buying cyber-risk insurance.

If you are using a SaaS provider, and you’re concerned about their underlying infrastructure, we encourage you to ask them a set of Critical Questions. There are three research notes, covering Infrastructure, Security, and Recovery (all Core). These notes are somewhat old, but the questions are still valid ones.

Huawei’s cloud computing ambitions

I was recently in China, visiting Gartner clients and prospects in Beijing and Shanghai, and attending Huawei’s analyst summit.

Why Huawei? I am an analyst covering services, after all, not equipment. But the hardware and software vendors that enable the cloud — the entirety of the ecosystem, so to speak — have become more and more important to my coverage, particularly as service providers try to figure out what technology they should use in their clouds. And so, when Huawei extended an invitation to come hear about their plans in cloud computing, I agreed to fly halfway around the world to listen.

For those of you who are not acquainted with Huawei, they’re a roughly $23B networking equipment manufacturer — the largest supplier of telecom operator gear in the world, having recently surpassed Ericsson for that position.

At their analyst summit, Huawei announced a number of grand ambitions — to become one of the major global device manufacturers (Huawei’s phones and tablets are based on Android), to aggressively grow into the enterprise networking equipment business, and to become an all-in-one solutions provider for cloud computing. That includes Huawei’s product portfolio of modular and container-based data center solutions, servers, storage (via the Huawei-Symantec JV), data center and wide-area networking equipment, and the “cloud stack” of software needed to offer cloud IaaS, whether for an enterprise building a private cloud, or a service provider building a highly scalable public cloud. It also includes a suite of content delivery network (CDN) enablement solutions, targeted at network operators, and integrated into the cloud offering.

This is obviously a grandly ambitious plan, considering that it comes from a vendor that most non-carriers have never heard of, and which faces considerable prejudice in the United States. (Huawei recently got into a kerfuffle over its acquisition of the assets of 3Leaf Systems — the US government recommended rejecting, on national security grounds, Huawei’s purchase of the patent portfolio and hiring of some people out of a defunct Silicon Valley start-up with no customers to speak of, a ridiculous tempest in a teapot if there ever was one.)

Huawei will join HP, IBM, Dell, and the VCE coalition, among others, in competing to deliver turnkey cloud infrastructure solutions. The product portfolio they claim to have, and the portfolio they’re developing, are highly ambitious and R&D-driven. But the technical problems that Huawei is tackling are genuinely difficult, so due caution needs to be exercised: there is, as yet, no proof that Huawei’s cloud technology scales as claimed.

Huawei will also face a significant barrier in the United States, given the political climate and suspicions about Huawei’s ties to the Chinese government, particularly the military and state security apparatus. This gets us back to the demand of cloud customers to know the underlying components of their solutions. If a service provider chooses to build on Huawei’s technology, will customers trust the solution?

Still, it’s an interesting entry into the cloud-building market from what to me, at least, was an unexpected quarter, and Huawei will clearly be a company to watch going forward — they have a track record of aggressive revenue growth, and plenty of money to throw at R&D.

What CenturyLink is Getting with Savvis

I scribbled off a quick blog post on the CenturyLink acquisition of Savvis but didn’t have time to delve into it in detail at the time. This is a bit of a follow-up.

Savvis has three core businesses:

  • Colocation: Savvis has carrier-diverse (though not, strictly speaking, carrier-neutral), high-quality colocation in data centers around the world. It is one of the most significant players in retail colocation for enterprises. It also has a substantial financial vertical play in its proximity hosting for low-latency trading.
  • Managed Hosting: Savvis is among the market share leaders in managed hosting. It is a highly capable provider, ranked for years as a Leader (and at or near the top of the pack) in Gartner’s Magic Quadrant for the market.
  • Networking: Although Savvis has a history as an ISP, its significant networking assets came with the acquisition of Cable and Wireless North America (which included a substantial amount of MCI assets that had to be divested in the MCI-WorldCom acquisition), a deal Savvis did in order to get Exodus. The networking business has been in slow decline for years, although it has some usefulness in competing with BT Radianz, in the proximity hosting context.

As part of their managed hosting business, Savvis has built a significant portfolio of cloud IaaS products. Savvis has historically had a tendency to overcomplicate their product lines, and cloud has been no exception to this rule. The most significant elements in the portfolio are Virtual Intelligent Hosting (utility managed hosting, equivalent to Terremark Infinistructure, AT&T Synaptic Hosting, etc.), and Symphony VPDC (self-service public cloud, equivalent to Terremark Enterprise Cloud, AT&T Synaptic CaaS, Verizon CaaS, NaviSite NaviCloud, etc.), which is divided into tiers of service quality. Savvis also has an array of private cloud services.

We consider Savvis to be highly competitive in enterprise-class cloud IaaS. They do not necessarily have the best service, featurewise, and they are a relatively expensive option, but they have done a credible job of incorporating security into their architecture, emphasized in RFP responses, in a way that customers respond to very strongly. Savvis has also done a good job layering managed services on top of their cloud offerings, and has begun to compete quite aggressively in the cloud-enabled data center outsourcing market segment, targeting mid-market companies.

In short, CenturyLink is buying a very high-quality set of assets. Qwest has a colocation business, but it is marred by poor customer service (it’s pretty hard to deliver poor customer service in a business as simple as colocation, but Qwest has historically managed to do so, although quality varies by data center). Qwest also has a managed hosting business, but it has historically been subpar relative to the market and well behind the hosting businesses of AT&T and Verizon. Qwest’s forays into cloud computing are embryonic. Consequently, CenturyLink is vastly accelerating its entry into this business with the Savvis acquisition. Also, given the capabilities gap between CenturyLink and Savvis, customers can probably expect little if any disruption from the acquisition.

It’s clear that carriers, even the less visionary ones, now feel that they need to have solutions to address data center needs, not just networking needs. While some carriers were able to articulate a vision around this relatively early on — AT&T notably — all the other network operators are quickly falling in line, albeit with varying degrees of vision and commitment.

CenturyLink buys Savvis

Ever since Verizon bought Terremark, and Time Warner Cable bought NaviSite earlier this year, people have speculated about the fate of Savvis. Well, today we know: CenturyLink is buying Savvis.

CenturyLink is a carrier that rolled up a lot of rural telco assets before going on to digest Qwest. Acquiring Savvis signals its cloud computing ambitions — few carriers can afford to be without a cloud strategy, and apparently CenturyLink has decided to buy one rather than build one. CenturyLink didn’t have much in the way of hosting assets pre-Qwest, and Qwest’s hosting assets were weak; with the exception of pure-plays like Amazon, nearly everyone in the cloud IaaS business has a hosting background. Moreover, while Qwest has been trying to get into the cloud, it is not a player to speak of.

With the Savvis buy, if CenturyLink is smart, Savvis gets largely left alone to continue what’s been a pretty successful colocation, hosting, and cloud IaaS business; Savvis incorporates the Qwest data center assets and kicks Qwest’s hosting business to the curb (migrating the customers onto Savvis managed hosting); and Savvis stops fooling with a networking business, save for what’s necessary to deliver proximity hosting. CenturyLink has announced that it will be consolidating its hosting assets with Savvis and having the Savvis CEO run the unit, so that’s a good sign, at least.

I do believe that early industry consolidation may be bad for cloud innovation, but there’s a certain inevitability to the big network operators picking up leading cloud IaaS providers.

The really interesting question now is whether Rackspace is a target. Their model doesn’t fit anywhere near as well into a carrier, given their focus on customer service — arguably their culture would be annihilated in just about any merger with a likely buyer. Moreover, they have focused upon commodity cloud, while carriers are typically far more interested in enterprise cloud. But that doesn’t mean they’re not a takeover target anyway.

Why transparency matters in the cloud

A number of people have asked if the advice that Gartner is giving to clients about the cloud, or about Amazon, has changed as a result of Amazon’s outage. The answer is no, it hasn’t.

In a nutshell:

1. Every cloud IaaS provider should be evaluated individually. They’re all different, even if they seem to be superficially based on the same technology. The best provider for you will depend upon your use case and requirements. You absolutely can run mission-critical applications in the cloud — you just need to choose the right provider and the right solution, and architect your application accordingly.

2. Just like infrastructure in your own data center, cloud IaaS requires management, governance, and a business continuity / disaster recovery plan. Know your risks, and figure out what you’re going to do to mitigate them.

3. If you’re using a SaaS vendor, you need to vet their underlying infrastructure (regardless of whether it’s their own data center, colo, hosting, or cloud).

The irony of the cloud is that you’re theoretically just buying something as a service without worrying about the underlying implementation details — but most savvy cloud computing buyers actually peer at the underlying implementation in grotesquely more detail than, say, most managed hosting customers ever look at the details of how their environment is implemented by the provider. The reason for this is that buyers lack adequate trust that the providers will actually offer the availability, performance, and security that they claim they will.

Without transparency, buyers cannot adequately assess their risks. Amazon provides some metrics about what certain services are engineered to (S3 durability, for instance), but there are no details for most of them, and where there are metrics, they are usually for narrow aspects of the service. Moreover, very few of their services actually carry SLAs, and those SLAs are narrow and specific (as everyone discovered in this last outage: it was EBS and RDS that were down, and neither has an SLA, while EC2 was technically unaffected, so nobody’s going to be able to claim SLA credits).

Without objectively understanding their risks, buyers cannot determine what the most cost-effective path is. Your typical risk calculation multiplies the probability of downtime by the cost of downtime. If the cost to mitigate the risk is lower than this figure, then you’re probably well-advised to go do that thing; if not, then, at least in terms of cold hard numbers, it’s not worth doing (or you’re better off thinking about a different approach that alters the probability of downtime, the cost of downtime, or the mitigation strategy).
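As a minimal sketch of that calculation (all numbers below are invented purely for illustration, not drawn from any actual provider or customer):

```python
# Simple expected-loss comparison for a downtime risk (illustrative numbers only).
p_outage = 0.05            # assumed annual probability of a serious outage
cost_of_outage = 200_000   # assumed business cost of that outage, in USD
cost_to_mitigate = 15_000  # assumed annual cost of extra redundancy, in USD

expected_annual_loss = p_outage * cost_of_outage  # $10,000 per year

if cost_to_mitigate < expected_annual_loss:
    print("Mitigation is cheaper than the expected loss; probably worth doing.")
else:
    print("Mitigation costs more than the expected loss; "
          "consider a different approach or accept the risk.")
```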

Note that this kind of risk calculation can go out the window if the real risk is not well understood. Complex systems — and all global-class computing infrastructures are enormously complex under the covers — have nondeterministic failure modes. This is a fancy way of saying, basically, that these systems can fail in ways that are entirely unpredictable. They are engineered to be resilient to ordinary failure, and that’s the engineering risk that a provider can theoretically tell you about. It’s the weird one-offs that nobody can predict that are likely to result in lengthy outages of unknown, unknowable duration.

It’s clear from reading Amazon customer reactions, as well as talking to clients (Amazon customers and otherwise) over the last few days, that customers came to Amazon with very different sets of expectations. Some were deep in rose-colored-glasses land, believing that Amazon was sufficiently resilient that they didn’t have to really invest in resiliency themselves (and for some of them, a risk calculation may have made it perfectly sane for them to run just as they were). Others didn’t trust the resiliency, and used Amazon for non-mission-critical workloads, or, if they viewed continuous availability as critical, ran multi-region infrastructures. But what all of these customers have in common is the simple fact that they don’t really know how much resiliency they should be investing in, because Amazon doesn’t reveal enough details about its infrastructure for them to be able to accurately judge their risk.

Transparency does not necessarily mean having to reveal every detail of underlying implementation (although plenty of buyers might like that). It may merely mean releasing enough details that people can make calculations. I don’t have to know the details of the parts in a disk drive to be able to accept a mean time between failure (MTBF) or annualized failure rate (AFR) from the manufacturer, for instance. Transparency does not necessarily require the revelation of trade secrets, although without trust, transparency probably includes the involvement of external auditors.

Gartner clients may find the following research notes helpful:

and also some older notes on critical questions to ask your SaaS provider, covering the topics of infrastructure, security, and recovery.

Amazon outage and the auto-immune vulnerabilities of resiliency

Today is Judgment Day, when Skynet becomes self-aware. It is, apparently, also a very, very bad day for Amazon Web Services.

Lots of people have raised questions about what Amazon’s difficulties today mean for the future of cloud IaaS. My belief is that this doesn’t do anything to the adoption curve — but I do believe that customers who rely upon Amazon to run their businesses will, and should, think hard about the resiliency of their architectures.

It’s important to understand what did and did not happen today. There’s been a popular impression that “EC2 is down”. It’s not. To understand what happened, though, some explanation of Amazon’s infrastructure is necessary.

Amazon divides its infrastructure into “regions”. You can think of a region as basically analogous to “a data center”. For instance, US-East-1 is Amazon’s Northern Virginia data center, while US-West-1 is Amazon’s Silicon Valley data center. Each region, in turn, is divided into multiple “availability zones” (AZs). You can think of an AZ as basically analogous to “a cluster” — it’s a grouping of physical and logical resources. Each AZ is designated by a letter — for instance, US-East-1a, US-East-1b, etc. However, each of these designations is customer-specific (which is why Amazon’s status information cannot easily specify which AZ is affected by a problem).

Amazon’s virtual machine offering is the Elastic Compute Cloud (EC2). When you provision an EC2 “instance” (Amazon’s term for a VM), you also get an allocation of “instance storage”. Instance storage is transient — it exists only as long as the VM exists. Consequently, it’s not useful for storing anything that you actually want to keep. To get persistent storage, you use Amazon’s Elastic Block Store (EBS), which is basically just network-attached storage. Many people run databases on EC2 that are backed by EBS, for instance. Because that’s such a common use case, Amazon offers the Relational Database Service (RDS), which is basically an EC2 instance running MySQL.

Amazon’s issues today are with EBS, and with RDS, both in the US-East-1 region. (My guess is that the issues are related, but Amazon has not specifically stated that they are.) Customers who aren’t in the US-East-1 region aren’t affected (customers always choose which region and specific AZs they run in). Customers who don’t use EBS or RDS are also unaffected. However, use of EBS is highly commonplace, and likely just about everyone using EC2 for a production application or Web site is reliant upon EBS. Consequently, even though EC2 itself has been running just fine, the issues have nevertheless had a major impact on customers. If you’re storing your data on EBS, the issues with EBS have made your data inaccessible, or they’ve made access to that data slow and unreliable. Ditto with RDS. Obviously, if you can’t get to your data, you’re not going to be doing much of anything.

In order to get Amazon’s SLA for EC2, you, as a customer, have to run your application in multiple AZs within the same region. Running in multiple AZs is supposed to isolate you from the failure of any single AZ. In practice, of course, this only provides you so much protection — since the AZs are typically all in the same physical data center, anything that affects that whole data center would probably affect all the AZs. Similarly, the AZs are not totally isolated from one another, either physically or logically.

However, when you create an EBS volume, you place it in a specific availability zone, and you can only attach that EBS volume to EC2 instances within that same availability zone. That complicates resiliency, since if you wanted to fail over into another AZ, you’d still need access to your data. That means if you’re going to run in multiple AZs, you have to replicate your data across multiple AZs.
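To make that constraint concrete, here is a minimal sketch using the boto Python library (boto 2.x-era calls; the AMI ID, zone, volume size, and device name are placeholders, not recommendations):

```python
# Sketch: EBS volumes live in a specific AZ and can only attach to instances in that AZ.
# The AMI ID, zone, size, and device name below are placeholders.
import time
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# Launch an instance pinned to a single availability zone within the region.
reservation = conn.run_instances('ami-00000000',
                                 instance_type='m1.small',
                                 placement='us-east-1a')
instance = reservation.instances[0]
while instance.state != 'running':
    time.sleep(5)
    instance.update()

# The EBS volume must be created in the *same* AZ to be attachable;
# a volume created in us-east-1b could not be attached to this instance.
volume = conn.create_volume(10, 'us-east-1a')  # 10 GB in us-east-1a
while volume.status != 'available':
    time.sleep(5)
    volume.update()

conn.attach_volume(volume.id, instance.id, '/dev/sdf')
```

Running across AZs therefore means replicating the data itself, at the application or database layer; you cannot simply reattach the same volume in another zone.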

One of the ways you can achieve this is with the Multi-AZ option of RDS. If you’re running a MySQL database and can do so within the constraints of RDS, the multi-AZ option lets you gain the necessary resiliency for your database without having to replicate EBS volumes between AZs.

As one final caveat, data transfer within a region is free and fast — it’s basically over a local LAN, after all. By contrast, Amazon charges you for transfers between regions, which goes over the Internet and has the attendant cost and latency.

Consequently, there are lots of Amazon customers who are running in just a single region. A lot of those customers may be running in just a single AZ (because they didn’t architect their app to easily run in multiple AZs). And of the ones who are running in multiple AZs, a fair number are reliant upon the multi-AZ functionality of RDS.

That’s why today’s impacts were particularly severe. US-East-1 is Amazon’s most popular region. The problems with EBS impacted the entire region, as did the RDS problems (and multi-AZ RDS was particularly impacted), not just a single AZ, so if you were multi-AZ but not multi-region, the resiliency you were theoretically getting was of no help to you. Today, people learned that it’s not necessarily adequate to run in multiple AZs. (Justin Santa Barbara has a good post about this.)

My perspective on this is pretty much exactly what I would tell a traditional Web hosting customer who’s running only in one data center: If you want more resiliency, you need to run in more than one data center. And on Amazon, if you want more resiliency, you need to not only be multi-AZ but also multi-region.

Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means you should expect that you can have about 4.4 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was EBS and RDS that weren’t, and neither of those services has an SLA.
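For reference, the back-of-the-envelope arithmetic behind that figure (ignoring the SLA’s actual measurement windows and exclusions):

```python
# Allowable downtime per year under a 99.95% availability SLA (rough arithmetic).
hours_per_year = 365 * 24                # 8,760 hours
sla = 0.9995
allowed_downtime_hours = hours_per_year * (1 - sla)
print(round(allowed_downtime_hours, 1))  # ~4.4 hours per year
```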

So how did Amazon end up with a problem that affected all the AZs within the US-East-1 region? Well, according to their status dashboard, they had some sort of network problem last night in their east coast data center. That problem resulted in their automated resiliency mechanisms attempting to re-mirror a large number of EBS volumes. This impacted one of the AZs, but it also overloaded the control infrastructure for EBS in that region. My guess is that RDS also uses this same storage infrastructure, so the capacity shortages and whatnot created by all of this activity ended up also impacting RDS.

My colleague Jay Heiser, who follows, among other things, risk management, calls this “auto-immune disease” — i.e., resiliency mechanisms can sometimes end up causing you harm. (We’ve seen auto-immune problems happen before in a prior Amazon S3 outage, as well as a Google Gmail outage.) The way to limit auto-immune damage is isolation — ensuring limits to the propagation.

Will some Amazon customers pack up and leave? Will some of them swear off the cloud? Probably. But realistically, we’re talking about data centers, and infrastructure, here. They can and do fail. You have to architect your app to have continuous availability across multiple data centers, if it can never ever go down. Whether you’re running your own data center, running in managed hosting, or running in the cloud, you’re going to face this issue. (Your problems might be different — i.e., your own little data center isn’t going to have the kind of complex problem that Amazon experienced today — but you’re still going to have downtime-causing issues.)

There are a lot of moving parts in cloud IaaS. Any one of them going wrong can bork your entire site/application. Your real problem is appropriate risk mitigation — the risk of downtime and its attendant losses, versus the complications and technical challenges and costs created by infrastructure redundancy.

Cloud as a Business Executive Forum

On Wednesday, April 13th, I will be speaking at Joyent’s Cloud as a Business Executive Forum, an all-day event that they’re holding at the Le Meridien Hotel in San Francisco.

I’ll be delivering a presentation on the cloud computing opportunity for service providers, as follows:

Service providers face unprecedented opportunity to grow their revenues and profits and deepen their customer relationships as more and more SMBs and enterprises consider moving their applications to the cloud. But how big, exactly, is the cloud market? From which market segments is the growth coming? How can service providers capitalize most effectively on the growth in the market? How can service providers maximize their margins as cloud products and services rapidly commoditize?

(Per Gartner’s rules for analysts speaking at these kinds of events, my presentation is a market view only, and does not advocate for Joyent.)

The event also includes presentations by Joyent CEO David Young, and chief scientist Jason Hoffman, along with a roundtable discussion with Joyent’s customers and a look at Joyent’s SmartDataCenter software.

If you’re a service provider and are interested in attending, please contact Joyent Sales directly.

The event also anchors my usual monthly visit to the Bay Area, so if you’re interested in meeting in person, please contact your Gartner account executive. (I think my meeting slots have mostly been taken, but these are always somewhat fluid as executive schedules change, and even if I don’t see you this time around, I’ll be back in May.)
