Blog Archives

Call for vendors — 2013 Cloud IaaS Magic Quadrant

It’s that time of the year again, a little bit early — we’re trying to refresh the Cloud IaaS Magic Quadrant on a nine-month cycle rather than a yearly cycle, reflecting the faster pace of the market.

A pre-qualification survey, intended to gather quantitative metrics and information about each provider’s service, will be going out very soon.

If you are a cloud IaaS provider, and you did not receive the 2012 survey, and you would like to receive the 2013 survey, please email Michele dot Severance at Gartner dot com to request to be added to the contact list. You must be authorized to speak for your company. Please note we cannot work with PR firms for the Magic Quadrant; if you are a PR agency and you think that your client should be participating, you should get in touch with your client and have your client contact Michele.

If you did receive the 2012 survey, you should be receiving email from Michele within the next few days, asking you to confirm that you’re the right contact, or to pass the request along to the correct person.

If you’re unsure whether you’re a cloud IaaS provider by this MQ’s definitions, consider the following:

  • Are you selling a service? (That means you’re not selling hardware or software.)
  • Are you offering compute, storage, and network resources? (You can’t be, say, just a cloud storage provider.)
  • Is your offering fully standardized? (It’s identical for every customer, not a reference architecture that you customize.)
  • Can customers self-service? (Once approved as customers, they can go to your portal and push buttons to immediately, with zero human intervention, obtain/destroy/configure/manage their infrastructure resources. Managed services can be optional.)
  • Can you meter by the hour? (You can either sell by the hour, or you can offer monthly capacity where usage is metered hourly. Having to take a VM for a full month is hosting, not IaaS.)
  • Do you have at least one multi-tenant cloud IaaS offering? (Customers must share a capacity pool for the offering to be considered multi-tenant.)
  • Do you consider your competition to be offerings such as Amazon EC2, Verizon Terremark’s Enterprise Cloud, or CSC’s CloudCompute? (If not, you’re probably confused about what cloud IaaS is.)

The best guide to this year’s Magic Quadrant is last year’s Magic Quadrant. Read the interactive MQ if you’re a client, or the free reprint if you’re not.

Please note that receiving a survey does not in any way indicate that we believe that your company is likely to qualify; we simply allow surveys to go to all interested parties (assuming that they’re not obviously wrong fits, like software companies without an IaaS offering).

The status for this Magic Quadrant will be periodically updated on its status page.

The myth of zero downtime

Every time there’s been a major Amazon outage, someone always says something like, “Regular Web hosters and colocation companies don’t have outages!” I saw an article in my Twitter stream today, and finally decided that the topic deserves a blog post. (The article seemed rather linkbait-ish, so I’m not going to link it.)

It is an absolute myth that you will not have downtime in colocation or Web hosting. It is also a complete myth that you won’t have downtime in cloud IaaS run by traditional Web hosting or data center outsourcing providers.

The typical managed hosting customer experiences roughly one outage a year. This figure comes from thirteen years of asking Gartner clients, day in and day out, about their operational track record. These outages are typically related to hardware failure, although sometimes they are related to service provider network outages (often caused by device misconfiguration, which can obliterate any equipment or circuit redundancy). Some customers are lucky enough to never experience any outages over the course of a given contract (usually two to three years for complex managed hosting), but this is actually fairly rare, because most customers’ architectures can withstand only the most trivial infrastructure failures. (Woe betide the customer who has a serious hardware failure on a database server.) The “one outage a year” figure does not include any outages that the customer might have caused himself through application failure.

The typical colocation facility in the US is built to Tier III standards, with a mathematical expected availability of about 99.98%. In Europe, colocation facilities are often built to Tier II standards instead, for an expected availability of about 99.75%. Many colocation facilities do indeed manage to go for many years without an outage. So do many enterprise data centers — including Tier I facilities that have no redundancy whatsoever. The mathematics of the situation don’t say that you will have an outage — these are merely probabilities over the long term. Moreover, an additional share of downtime will be caused by human error. Single-data-center kings who proudly proclaim that their one data center has never had an outage have gotten lucky.
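
For the curious, the back-of-the-envelope math here is simple — a quick sketch in Python, using the approximate tier availability figures cited above:

    # Rough downtime implied by an availability percentage, over a year.
    MINUTES_PER_YEAR = 365 * 24 * 60

    def downtime_minutes_per_year(availability_pct):
        """Expected unavailable minutes per year at a given availability percentage."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100.0)

    for label, pct in [("Tier III (~99.98%)", 99.98), ("Tier II (~99.75%)", 99.75)]:
        hours = downtime_minutes_per_year(pct) / 60.0
        print("%s: about %.1f hours of expected downtime per year" % (label, hours))
    # Tier III works out to roughly 1.8 hours a year; Tier II to roughly 21.9 hours.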

The amount of publicity that a data center outage gets is directly related to its tenant constituency. The outage at the 365 Main colocation facility in San Francisco a few years back was widely publicized, for instance, because that facility happened to house a lot of Internet properties, including ones directly associated with online publications. There have been significant outages at many other colocation facilities over the years, though, that were never noted in the press — I’ve found out about them because they were mentioned by end-user clients, or because the vendor disclosed them.

Amazon outages — and indeed, more broadly, outages at large-scale providers like Google — get plenty of press because of their mass effects, and the fact that they tend to impact large Internet properties, making the press aware that there’s a problem.

Small cloud providers often have brief outages — and long maintenance windows, and sometimes lengthy maintenance downtimes. You’re rolling the dice wherever you go. Don’t assume that just because you haven’t read about an outage in the press, it hasn’t occurred. Whether you decide on managed hosting, dedicated hosting, colocation, or cloud IaaS, you want to know a provider’s track record — their actual availability over a multi-year period, not excluding maintenance windows. Especially for global businesses with 24×7 uptime requirements, it’s not okay to be down at 5 am Eastern, which is prime-time in both Europe and Asia.

Sure, there are plenty of reasons to worry about availability in the cloud, especially the possibility of lengthy outages made worse by the fundamental complexity that underlies many of these infrastructures. But you shouldn’t buy into the myth that your local Web hoster or colocation provider necessarily has better odds of availability, especially if you have a non-redundant architecture.

Some clarifications on HP’s SLA

I corresponded with some members of the HP cloud team in email, and then colleagues and I spoke with HP on the phone, after my last blog post, “Cloud IaaS SLAs can be meaningless.” HP provided some useful clarifications, which I’ll detail below, but I haven’t changed my fundamental opinion, although arguably the nuances make the HP SLA slightly better than the AWS SLA.

The most significant difference between the SLAs is that HP’s SLA is intended to cover a single-instance failure, where you can’t replace that single instance; AWS requires that all of your instances in at least two AZs be unavailable. HP requires that you try to re-launch that instance in a different AZ, but a failure of that launch attempt in any of the other AZs in the region will be considered downtime. You do not need to be running in two AZs all the time in order to get the SLA; for the purposes of the SLA clause requiring two AZs, the launch attempt into a second AZ counts.

HP begins counting downtime when, post-instance-failure, you make the launch API call that is destined to fail — downtime begins to accrue 6 minutes after you make that unsuccessful API call. (To be clear, the clock starts when you issue the API call, not when the call has actually failed, from what I understand.) When the downtime clock stops is unclear, though — it stops when the customer has managed to successfully re-launch a replacement instance, but there’s no clarity regarding the customer’s responsibility for retry intervals.
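
To make the mechanics concrete, here’s a minimal sketch of how that downtime window appears to be computed, per my reading; the function and parameter names are mine, not HP’s:

    from datetime import datetime, timedelta

    GRACE = timedelta(minutes=6)   # downtime starts accruing 6 minutes after the failed launch call

    def hp_sla_downtime(failed_launch_call_at, successful_relaunch_at):
        """Downtime credited under the HP SLA, as I understand it: the clock starts
        six minutes after the unsuccessful launch API call, and stops when a
        replacement instance finally launches. Retry obligations in between are
        unspecified."""
        start = failed_launch_call_at + GRACE
        if successful_relaunch_at <= start:
            return timedelta(0)
        return successful_relaunch_at - start

    # Example: failed launch call at 02:00, replacement finally running at 02:40.
    print(hp_sla_downtime(datetime(2012, 12, 1, 2, 0), datetime(2012, 12, 1, 2, 40)))   # 0:34:00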

(In discussion with HP, I raised the issue of this potentially resulting in customers hammering the control plane with requests in mass outages, along with intervals when the control plane might have degraded response and some calls succeed while others fail, etc. — i.e., the unclear determination of when downtime ends, and whether customers trying to fulfill SLA responsibilities contribute to making an outage worse. HP was unable to provide a clear answer to this, other than to discuss future plans for greater monitoring transparency, and automation.)

I’ve read an awful lot of SLAs over the years — cloud IaaS SLAs, as well as SLAs for a whole bunch of other types of services, cloud and non-cloud. The best SLAs are plain-language comprehensible. The best don’t even need examples for illustration, although it can be useful to illustrate anything more complicated. Both HP and AWS sin in this regard, and frankly, many providers who have good SLAs still force you through a tangle of verbiage to figure out what they intend. Moreover, most customers are fundamentally interested in solution SLAs — “is my stuff working”, regardless of what elements have failed. Even in the world of cloud-native architecture, this matters — one just has to look at the impact of EBS and ELB issues in previous AWS outages to see why.

Cloud IaaS SLAs can be meaningless

In infrastructure services, the purpose of an SLA (or, for that matter, the liability clause in the contract) is not “give the customer back money to compensate for the customer’s losses that resulted from this downtime”. Rather, the monetary guarantees involved are an expression of shared risk. They represent a vote of confidence — how sure is the provider of its ability to deliver to the SLA, and how much money is the provider willing to bet on that? At scale, there are plenty of good, logical reasons to fear the financial impact of mass outages — the nature of many cloud IaaS architectures creates a possibility of mass failure that only rarely occurs in other services like managed hosting or data center outsourcing. IaaS, like traditional infrastructure services, is vulnerable to catastrophes in a data center, but it is additionally vulnerable to logical and control-plane errors.

Unfortunately, cloud IaaS SLAs can readily be structured to make it unlikely that you’ll ever see a penny of money back — greatly reducing the provider’s financial risks in the event of an outage.

Amazon Web Services (AWS) is the poster-child for cloud IaaS, but the AWS SLA also has the dubious status of “worst SLA of any major cloud IaaS provider”. (It’s notable that, in several major outages, AWS did voluntary givebacks — for some outages, there were no applicable SLAs.)

HP has just launched its OpenStack-based Public Cloud Compute into general availability. Unfortunately, HP’s SLA is arguably even worse.

Both companies have chosen to express their SLAs in particularly complex terms. For the purposes of this post, I am simplifying all the nuances; I’ve linked to the actual SLA text above for the people who want to go through the actual word salad.

To understand why these SLAs are practically useless, you need to understand a couple of terms. Both providers divide their infrastructure into “regions”, a grouping of data centers that are geographically relatively close to one another. Within each region are multiple “availability zones” (AZs); each AZ is a physically distinct data center (although a “data center” may comprise multiple physical buildings). Customers obtain compute in the form of virtual machines known as “instances”. Each instance has ephemeral local storage; there is also a block storage service that provides persistent storage (typically used for databases and anything else you want to keep). A block storage volume resides within a specific AZ, and can only be attached to a compute instance in that same AZ.
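
For illustration, here’s roughly what those concepts look like through the EC2-style API via the boto library — a sketch only, with a placeholder AMI ID, and credentials assumed to be configured elsewhere:

    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")

    # Launch an instance into a specific availability zone within the region.
    reservation = conn.run_instances("ami-00000000",            # placeholder AMI ID
                                     instance_type="m1.small",
                                     placement="us-east-1a")
    instance = reservation.instances[0]
    # (In practice you'd wait for the instance to reach the 'running' state here.)

    # A block storage volume lives in a single AZ, and can only be attached to an
    # instance in that same AZ.
    volume = conn.create_volume(10, "us-east-1a")                # 10 GB volume in us-east-1a
    conn.attach_volume(volume.id, instance.id, "/dev/sdf")       # would fail across AZs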

AWS measures availability over the course of a year, rather than monthly, like other providers (including HP) do. This is AWS’s hedge against a single short outage in a month, especially since even a short availability-impacting event takes time to recover from. 99.95% monthly availability only permits about 21 minutes of downtime; 99.95% yearly availability permits nearly four and a half hours of downtime, cumulative over the course of the year.
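
The arithmetic behind those figures is straightforward — a quick sketch, using a 30-day month:

    def allowed_downtime_minutes(availability_pct, period_minutes):
        """Downtime budget implied by an availability percentage over a measurement period."""
        return period_minutes * (1 - availability_pct / 100.0)

    per_month = allowed_downtime_minutes(99.95, 30 * 24 * 60)    # ~21.6 minutes in a 30-day month
    per_year = allowed_downtime_minutes(99.95, 365 * 24 * 60)    # ~262.8 minutes, about 4.4 hours,
                                                                 #  all of which can be spent on one event
    print(per_month, per_year / 60.0)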

However, AWS and HP both define their SLA not in terms of instance availability, or even AZ availability, but in terms of region availability. In the AWS case, a region is considered unavailable if you’re running instances in at least two AZs within that region, and in both of those AZs, your instances have no external network connectivity and you can’t launch instances in that AZ that do; this is metered in five-minute intervals. In the HP case, a region is considered unavailable if an instance within that region can’t respond to API or network requests, you are currently running in at least two AZs, and you cannot launch a replacement instance in any AZ within that region; the downtime clock doesn’t start ticking until there’s more than 6 minutes of unavailability.
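
To show just how much has to go wrong before either downtime clock starts, here’s a rough paraphrase of those two definitions as boolean checks — heavily simplified, with names of my own invention, not the providers’ terms:

    def aws_region_unavailable(running_azs, down_azs):
        """AWS: you're running instances in at least two AZs, and in at least two of
        those AZs the instances have no external connectivity and replacement
        launches also fail ('down_azs')."""
        return len(running_azs) >= 2 and len(running_azs & down_azs) >= 2

    def hp_region_unavailable(instance_down, running_azs, region_azs, launch_failed_azs):
        """HP: an instance is unresponsive, you're currently running in at least two
        AZs, and a replacement cannot be launched in any AZ in the region."""
        return instance_down and len(running_azs) >= 2 and region_azs <= launch_failed_azs

    # Example: two AZs in use, one AZ completely down, launches still work elsewhere --
    # neither SLA's downtime clock starts.
    print(aws_region_unavailable({"az1", "az2"}, {"az1"}))                             # False
    print(hp_region_unavailable(True, {"az1", "az2"}, {"az1", "az2", "az3"}, {"az1"}))  # False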

(Update: HP provided some clarifications.)

Every AZ that a customer chooses to run in effectively imposes a cost. An AZ, from an application architecture point of view, is basically a data center, so running in multiple AZs within a region is basically like running in multiple data centers in the same metropolitan region. That’s close enough to do synchronous replication. But it’s still a pain to have to do this, and many apps don’t lend themselves well to a multi-data-center distributed architecture. Also, that means paying to store your data in every AZ that you need to run in. Being able to launch an instance doesn’t do you very much good if it doesn’t have the data it needs, after all. The AWS SLA essentially forces you to replicate your data in two AZs; the HP one makes you do this for all the AZs within a region. Most people are reasonably familiar with the architectural patterns for two data centers; once you add a third and more, you’re further departing from people’s comfort zones, and all HP has to do is to decide they want to add another AZ in order to essentially force you to do another bit of storage replication if you want to have an SLA.

(I should caveat the former by saying that this applies if you want to be able to usefully run workloads within the context of the SLA. Obviously you could just choose to put different workloads in different AZs, for instance, and not bother trying to replicate into other AZs at all. But HP’s “all AZs not available” is certainly worse than AWS’s “two AZs not available”.)

Amazon has a flat giveback of 10% of the customer’s monthly bill in the month in which the most recent outage occurred. HP starts its giveback at 5% and caps it at 30% (for less than 99% availability), but it covers strictly the compute portion of the month’s bill.

HP has a fairly nonspecific claim process; Amazon requires that you provide the instance IDs and logs proving the outage. (In practice, Amazon does not seem to have actually required detailed documentation of outages.)

Neither HP nor Amazon SLA their management consoles; the create-and-launch instance APIs are implicitly part of their compute SLAs. More importantly, though, neither HP nor Amazon SLA their block storage services. Many workloads are dependent upon block storage. If the storage isn’t available, it doesn’t matter if the virtual machine is happily up and running — it can’t do anything useful. For an example of why this matters, you need look no further than the previous Amazon EBS outages, where the compute instances were running happily, but tons of sites were down because they were dependent on data stores on EBS (and used EBS-backed volumes to launch instances, etc.).

Contrast these messes to, say, the simplicity of the Dimension Data (OpSource) SLA. The compute SLA is calculated per-VM (i.e., per-instance). The availability SLA is 100%; credits start at 5% of the monthly bill for the region, and go up to 100%, based on cumulative downtime over the course of the month (5% for every hour of downtime). One caveat: maintenance windows are excluded (although in practice, maintenance windows seem to affect the management console rather than VM uptime). The norm among IaaS competitors is actually strong SLAs with decent givebacks that don’t require you to run in multiple data centers.
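
To put the giveback structures side by side, here’s a minimal sketch based on the descriptions above; the HP SLA target and its intermediate tiers are my assumptions, not quotes from the SLA text:

    def aws_credit(monthly_bill):
        """AWS: flat 10% of the bill for the month in which the outage occurred."""
        return 0.10 * monthly_bill

    def hp_credit(monthly_compute_bill, availability_pct, sla_target_pct=99.95):
        """HP: giveback starts at 5% and caps at 30% of the compute portion of the
        bill; the 30% applies below 99% availability. The target and intermediate
        tiers used here are assumptions."""
        if availability_pct >= sla_target_pct:
            return 0.0
        rate = 0.30 if availability_pct < 99.0 else 0.05
        return rate * monthly_compute_bill

    def dimension_data_credit(monthly_region_bill, downtime_hours):
        """Dimension Data (OpSource): 5% of the region's monthly bill per hour of
        downtime, capped at 100%."""
        return min(1.0, 0.05 * downtime_hours) * monthly_region_bill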

Amazon’s SLA gives enterprises heartburn. HP had the opportunity to do significantly better here, and hasn’t. To me, it’s a toss-up which SLA is worse. HP has a monthly credit period and an easier claim process, but I think that’s totally offset by HP essentially defining an outage as something impacting every AZ in a region — something which can happen if there’s an AZ failure coupled with a massive control-plane failure in a region, but is otherwise unlikely.

Customers should expect that the likelihood of a meaningful giveback is basically nil. If a customer needs to, say, mitigate the fact that he’s losing a million dollars an hour when his e-commerce site is down, he should be buying cyber-risk insurance. The provider absorbs a certain amount of contractual liability, as well as the compensation given by the SLA, but this is pretty trivial — everything else is really the domain of the insurance companies. (Probably little-known fact: Amazon has started letting cyber-risk insurers inspect AWS operations so that they can estimate risk and write policies for AWS customers.)

The meta-info of my OpenStack note

I’ve been reading the social media reactions to my recent note on OpenStack, “Don’t Let OpenStack Hype Distort Your Selection of a Cloud Management Platform in 2012” (that’s a client link; a free public reprint without the executive summary is also available), and wanted to respond to some comments that are more centered on the research process and publication process than on the report itself. So, here are a number of general assertions:

Gartner doesn’t do commissioned research. Ever. Repeat: Gartner, unlike almost every other analyst firm, doesn’t do commissioned research — ever. Most analyst firms will do “joint research” or “commissioned whitepapers” or the like — research where a vendor is paying for the note to be written. About a decade ago, Gartner stopped this practice, because management felt it could be seen as compromising neutrality. No vendor paid for that OpenStack note to be written, directly or indirectly. Considering that most of the world’s largest IT vendors are significant participants in OpenStack, and plenty of small ones are as well, and a bunch of them are undoubtedly unhappy about the publication of that note, if Gartner’s interests were oriented around vendors, we certainly wouldn’t have published research practically guaranteed to upset a lot of vendors.

Gartner earns the overwhelming majority of its revenue from IT buyers. About 80% of Gartner’s revenues come from our IT buyer clients (we call them “end-users”). We don’t shill for vendors, ever, because our bread-and-butter comes from IT buyers, who trust us for neutral advice. Analysts are interested in helping our end-user clients make the technology decisions that are best for their business. We also want to help vendors (including those who are involved with free or commercial open-source) succeed in better serving end-users — which often means that we will be critical of vendor efforts. Our clients are asking about OpenStack, and every example of hype in that note comes directly from client interactions. I wrote that note because the volume of OpenStack queries from clients was ramping up, and we needed written research to address it.

Gartner analysts are not compensated on commissions of any sort. Many other analyst firms have incentives for analysts that are tied to revenue or publicity — be quoted in the press X times, sell reports, sell consulting, sell strategy days with vendors, etc. Gartner doesn’t do any of that, and hasn’t for about a decade. Our job, as Gartner analysts, is to try to offer the best advice we can to our clients. Sometimes, of course, we will be wrong, but we try hard. It’s not an academic exercise; our end-user clients have business outcomes that ride on their technology decisions.

Gartner doesn’t dislike open source. As a collective entity, Gartner tends to be cautious in its stances, as our end-user clients tend to be mid-market and enterprise IT executives who are fairly risk-averse; our analysis of all solutions, including OSS, tends to be from that perspective. But we write extensively about open source; we have analysts devoted to the topic, plus everyone covers the OSS relevant to their own coverage. We consider OSS a business strategy like anything else. In fact, we’ve been particularly vocal about how we feel that cloud is driving OSS adoption across a broad spectrum of solutions, and we have advocated that an IT organization’s move to cloud is a great time to consider replacing proprietary technology with OSS. (You’ll note that a whole section of the report predicts OpenStack’s eventual success, by the way, so this isn’t a prediction of doom, just an observation of present stumbling-blocks on the road to maturity.)

Gartner research notes are Gartner opinion, not an individual analyst’s opinion. Everything that Gartner publishes as a research note (excluding things like blog posts, which we consider off-the-clock, personal time and not a corporate activity) undergoes a peer review process. While notes do slip through the cracks (i.e., get published without sufficiently broad or deep review), our internal processes require analysts to get review from everyone who touches a coverage area. My OpenStack note was very broadly peer reviewed — by other analysts who cover cloud infrastructure, cloud management platforms, and open source software, as well as a bunch of related areas that OpenStack touches. (As a result of that review, the original note almost quadrupled in size, split into one note on OSS CMPs in general, and one note on OpenStack itself.) I also asked for our Ombudsman’s office, which deals with vendor complaints, to review the note to make sure that it seemed fair, balanced, and free of inflammatory language, and they (and my manager) also asked questions about potentially controversial sections, in order to ensure they were backed by facts. Among other things, these processes are intended to ensure that individual analyst bias is eliminated to as large an extent as possible. That process is part of why Gartner’s opinions often sound highly conservative, but when we take a stance, it is usually neither casual nor one analyst’s opinion.

The publication of this note was not a shock to the vendors involved. Most of the vendors were aware that this note was coming; it was a work in progress over the summer. Rackspace, as the owner of OpenStack at the time that this was placed in the publication pipeline, was entitled to a formal review and discussion prior to its publication (as we do for any research that covers a vendor’s product in a substantive way). I had spoken to many of the other vendors in advance of its publication, letting them know it was coming (although since it was pre-Foundation they did not have advance review). The evolving OpenStack opinions of myself and other Gartner analysts have long been known to the vendors.

It would have been easier not to write anything. I have been closely following OpenStack since its inception, and I have worked with many of the OpenStack vendors since the early days of the project. I have a genuine interest in seeing them, and OpenStack, succeed, and I hope that the people that I and other analysts have dealt with know that. Many individuals have confided in me, and other Gartner analysts, about the difficulties they were having with the OpenStack effort. We value these relationships, and the trust they represent, and we want to see these people and their companies succeed. I was acutely careful to not betray any individual confidences when writing that report, ensuring that any concerns surfaced by the vendors had been said by multiple people and organizations, so that there would be no tracebacks. I am aware, however, that I aired everyone’s collective dirty laundry in public. I hope that making the conversation public will help the community rally around some collective goals that will make OpenStack mainstream adoption possible. (I think Rackspace’s open letter implicitly acknowledges the issues that I raised, and I highly encourage paying attention to its principles.)

You will see other Gartner analysts be more active in OpenStack coverage. I originally picked up OpenStack coverage because I have covered Rackspace for the last decade, and in its early days it was mostly a CMP for service providers. Enterprise adoption has begun, and so its primary home for coverage is going to be our CMP analysts (folks like Alessandro Perilli, Donna Scott, and Ronni Colville), although those of us who cover cloud IaaS (especially myself, Kyle Hilgendorf, and Doug Toombs) will continue to provide coverage from a service provider perspective. Indeed, our coverage of OSS CMPs (CloudStack, Eucalyptus, OpenNebula, etc.) has been ramping up substantially of late. We’re early in the market, and you can expect to see us track the maturation of these solutions.

Servers are cheap, talent is expensive

Of late, I’ve been talking to Amazon customers who are saying, you know, AWS gives us a ton of benefits, it makes a lot of things easy and fast that used to be hard, but in the end, we could do this ourselves, and probably do it at comparable cost or a cost that isn’t too much higher. These are customers that are at some reasonable scale — a take-back would involve dozens if not hundreds of physical server deployments — but aren’t so huge that the investment would be leveraged over, say, tens of thousands of servers.

Most people don’t choose cloud IaaS for lowered costs, unless they have very bursty or unpredictable workloads. Instead, they choose it for increased business agility, which to most people means “getting applications, and thus new business capabilities, more quickly”.

But there’s another key reason to not do it yourself: The war for talent.

The really bright, forward-thinking people in your organization — the people who you would ordinarily rely upon to deploy new technologies like cloud — are valuable. The fact that they’re usually well-paid is almost inconsequential compared to the fact that these are often the people who can drive differentiated, innovative business value for your organization, and they’re rare. Even if you have open headcount, finding those “A” players can be really, really tough, especially if you want a combination of cutting-edge technical skills with the personal traits — drive, follow-through, self-starting, thinking out of the box, communication skills, and so on — that make for top-tier engineers.

Just because you can do it yourself doesn’t mean that you should. Even if your engineers think they’re just as smart as Amazon’s engineers (which they might well be), and are chomping at the bit to prove it. If you can outsource a capability that doesn’t generate competitive advantage for you, then you can free your best people to work on the things that do generate competitive advantage. You can work on the next cool thing… and challenge your engineers to prove their brilliance by dreaming up something that hasn’t been done before, solving the challenges that deliver business value to your organization. Assuming, of course, that your culture provides an environment receptive to such innovation.

Thoughts on cloud IaaS market share

As part of our qualification survey for the cloud IaaS Magic Quadrant, we ask providers for detailed information about their revenue, provisioned capacity, and usage, under a very strict nondisclosure agreement. We treat this data with a healthy dose of skepticism (and we do our own models, channel-check, talk to contacts in the financial industry who’ve seen disclosures, and otherwise apply a filter of cynicism to it), but post-scrubbing, there are a number of very interesting data points that come out of the aggregated whole.

Three teasers:

Growth is huge, but is disproportionately concentrated on a modest number of vendors. Obviously, everyone knows that Amazon is a behemoth, but among the other vendors, there are stark differences in growth. Of course, some small vendors post huge growth on a percentage basis (we went from $100k to $2m! yay!), so raw percentages aren’t the right way to look at this. Instead, what’s interesting is relative market share once you eliminate Amazon from the numbers. The data suggests that to succeed in this market, you need one of two things — a giant sales channel with a ton of feet on the street and existing relationships, or excellent online marketing and instant online sign-ups. (A third possible route is making it easy for people to white-label and resell your service.)

Most vendors are still not at scale. Despite huge growth, most vendors remain under $10 million in revenue, and even the club above $20 million in pure public cloud IaaS revenue numbers only 20-odd vendors. Even that club is often still at a scale where Amazon could probably casually provide that as spot capacity in one AZ. By market share, Amazon is a de facto monopoly, although this market doesn’t have the characteristics of monopoly markets; the sheer number of competing vendors and the early stage of the market suggest shake-ups to come.

Customers love dedicated compute nodes. An increasing number of vendors offer dedicated compute nodes — i.e., a guarantee that a customer’s VMs won’t share a physical server with another customer’s VMs. That can be done on temporarily-dedicated hardware (like Amazon’s Dedicated Instances) or on an allocation of hardware that’s contractually the customer’s for a lengthier period of time (often a dedicated-blade option for vCloud Powered providers). For most providers who offer this option, customers seem to overwhelmingly choose it over VMs on shared hosts, even though it represents a price premium. Note that in most of these cases, the network and storage are still shared, although vendors may label this “private cloud” nevertheless. (We believe Amazon’s DI to be an exception to this, by the way, due to its very high price premium, especially for small numbers of instances; this is an effect of DIs being spread out over many servers rather than consolidated, like other providers do it.)

Foundational Gartner research notes on cloud IaaS

In light of the upcoming Magic Quadrant work, I thought it would be useful to highlight research that I and others have published that is important in the context of this MQ. These notes lay out how we see the market, and consequently, the lens through which we’ll be evaluating the service providers.

I want to stress that service providers do not need to agree with our perspective in order to rate well. We admire those who march to their own particular beat, as long as it results in true differentiation and, more importantly, customer wins and happy customers — a different perspective can allow a service provider to serve its particular segments of the market more effectively. However, such providers need to be able to clearly articulate that vision and to back it up with data that supports their world-view.

That said, if you are a service provider, these are the research notes that it might be helpful to be familiar with (sorry, clients only):

Pricing and Buyer’s Guide for Web Hosting and Cloud Infrastructure, 2012. Our market definitions are described here, in case you’re confused about what we consider to be cloud IaaS.

Competitive Landscape: New Entrants to the Cloud IaaS Market Face Tough Competitive Challenges. This describes the competitive landscape and the challenges of differentiating in this market. It also profiles two successful providers, Amazon and CSC, in detail. This is critical reading to understand what we believe does and does not differentiate providers.

Market Insight: Structuring the Cloud Compute IaaS Market. This presents our market segmentation; each segment is associated with a buyer profile. While our thinking has been refined since this was published in early 2011, it is still an extremely important view into our thinking about customer needs.

Evaluating Cloud Infrastructure as a Service. This seven-part set of research notes describes the range of IaaS capabilities offered across the market, from the technology itself to how service is done. This provides important terminology, and is also useful for determining how competitive your offering really is. (Note that this is an early-2011 note set, so the state of the art has advanced since then.)

Evaluation Criteria for Public Cloud IaaS Providers. Our Technical Professionals research provides extremely detailed criteria for large enterprises that are evaluating providers. While the customer requirements are somewhat different in other segments, like the mid-market, these criteria should give you an extremely strong idea of the kinds of things that we think are important to customers. The Magic Quadrant evaluation criteria will not be identical (because it is broader than just large-enterprise), but this is the kind of thing you should be thinking about.

Market Trends: Public and Private Cloud Infrastructure Converge into On-Demand Infrastructure Fabrics. This describes our view of how the service provider cloud infrastructure platforms will evolve, including providing a perspective on public vs. private cloud, and developer-class vs. enterprise-class cloud.

Best Practice: Evaluate Isolation Mechanisms in Public and Private Cloud IaaS. Many service providers are using “private cloud” in ways we consider actively deceptive. This note provides a warning to IT buyers, and discusses the kinds of isolation options that are available. This emphasizes our insistence that providers be transparent about their isolation mechanisms and security controls.

Less-critical notes that cover narrower topics, which you may nevertheless want to read:

Market Insight: Customers Need Hybrid Cloud Compute Infrastructure as a Service. This describes customer requirements for “hybrid” scenarios — the need for cloud bridging into the enterprise data center, physical-virtual hybrid environments, hybrid hosting, and multi-cloud environments.

Infrastructure as a Service in the Cloud Services Value Chain. This describes the overall place of IaaS in the value chain. It explains market evolution and how this impacts upstream and downstream technology vendors; it provides our viewpoint on the channel.

Toolkit: Mitigating Risks in Cloud Infrastructure as a Service. This provides a fairly comprehensive checklist for risk assessment. You may want to think about how well your solution addresses this list of risks.

Delivery Models for Application Infrastructure in the Cloud: Beware the Lure of False PaaS. This covers licensing models for software and middleware in the cloud, and contrasts IaaS with PaaS. Pay particular attention to the importance of software marketplaces.

If you are not a Gartner client, please note that many of these topics have been covered in my blog in the past, if at a higher level (and generally in a mode where I am still working out my thinking, as opposed to a polished research position).

Call for vendors – 2012 Cloud IaaS Magic Quadrant

We’re about to kick off Gartner’s 2012 Cloud IaaS Magic Quadrant.

A pre-qualification survey, intended to gather quantitative metrics and basic information about each provider’s service, will be going out very soon.

If you are a cloud compute IaaS provider — that means you offer, as a service, fully-automated, self-service compute, storage, and network infrastructure that is available on-demand (think “by the hour” and not “by the month”) — and you did not receive a survey last year but would like to receive one this year, please contact me via email at Lydia dot Leong at Gartner dot com.

Note: This is not hosting and this is not data center outsourcing. You should have a fully-standardized offering — one that is identical for every customer, not a reference architecture that you customize — and customers should be self-servicing (i.e., they go to your portal and push buttons to immediately, with zero human intervention, obtain/destroy/configure/manage their infrastructure), although you can optionally provide managed services.

Also note: This is not for software or hardware vendors. This is for service providers.

Bottom line: If you don’t consider yourself to be in competition with Amazon EC2 or the Terremark Enterprise Cloud, to take two well-known examples, this is not your Magic Quadrant.

Please note that receiving a survey does not in any way indicate that we believe that your company is likely to qualify; we simply allow surveys to go to all interested parties (assuming that they’re not obviously wrong fits, like software companies without an IaaS offering).

Do Amazon’s APIs matter?

For those who have been wondering where I personally stand in the brouhaha over Amazon, Citrix, Eucalyptus, CloudStack, OpenStack, Rackspace, HP, and so on, along with the broader competitive market that includes VMware, Microsoft, and the Four Horsemen of management tools… I should state up-front that I hold the optimistic viewpoint that I want everyone to be as successful as possible — service providers, commercial vendors, open-source projects, and the customers and users that depend upon them.

I feel that the more competent the competition in a market, the more that everyone in the ecosystem is motivated to do better, and the more customers benefit as a result. Customers benefit from better technology, lower costs, more responsive sales, and differentiated approaches to the market. Clearly, competition can hurt companies, but especially with emerging technology markets, competition often results in making the pie bigger for everyone, by expanding the range of customers that can be served — although yes, sometimes weaker competitors will be culled from the herd.

I believe that companies are best served by being the best they can be — you can target a competitor by responding on a tactical basis, and sometimes you want to, but for your optimal long-term success, you should strive to be great yourself. Obsessing over what your competitors are doing can easily distract companies from doing the right thing on a long-term strategic basis.

That said:

Dan Woods over on Forbes has written a blog post about questions around Amazon’s API strategy, and Jim Plamondon (Rackspace Developer Relations) has posted a comment on my blog about Amazon ecosystem zombiefication.

I’ve been thinking about the implications of Amazon API compatibility, and the degree to which it is or isn’t to Amazon’s advantage to encourage other people to build Amazon-compatible clouds.

I think it comes down to the following: If Amazon believes that they can innovate faster, drive lower costs, and deliver better service than all of their competitors that are using the same APIs (or, for that matter, enterprises who are using those same APIs), then it is to their advantage to encourage as many ways to “on-ramp” onto those APIs as possible, with the expectation that they will switch onto the superior Amazon platform over time.

But I would also argue that all this nattering about the basic semantics of provisioning bare resource elements is largely a waste of time for most people. None of the APIs for provisioning compute and storage (whether EC2/S3/EBS or their counterparts in other clouds) are complicated things at their core. They’re almost always wrapped with an abstraction layer, third-party library, or management tool. However, APIs may matter to people who are building clouds, because they implicitly express the underlying conceptual framework of the system, and the richness of the API semantics constrains what can be expressed and therefore what can be controlled via the API; the constraints of the Amazon APIs force everyone else to express richer concepts in some other way.
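
As an illustration of that point, here’s roughly what provisioning looks like through one such third-party library, Apache Libcloud — a sketch with placeholder credentials and arbitrary image/size choices, not an endorsement of any particular tool:

    from libcloud.compute.types import Provider
    from libcloud.compute.providers import get_driver

    # The same handful of calls work against EC2, Rackspace, and many other drivers;
    # only the Provider constant and the credentials change.
    Driver = get_driver(Provider.EC2)
    conn = Driver("ACCESS_KEY_ID", "SECRET_KEY")     # placeholder credentials

    sizes = conn.list_sizes()
    images = conn.list_images()                      # on EC2 this is a long list; a sketch only

    # The core semantics: "give me a VM of this size, from this image."
    node = conn.create_node(name="demo-node", size=sizes[0], image=images[0])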

But the battle will increasingly not be fought at this very basic level of ‘how do I get raw resources’. I recognize that building a cloud infrastructure platform at scale and with a lot of flexibility is a very difficult problem (although a simple and rigid one is not an especially difficult problem, as you can see from the zillion CMPs out in the market). But it’s not where value is ultimately created for users.

Value for users is ultimately created at the layers above the core infrastructure. Everyone has to get core infrastructure right, but the real question is: How quickly can you build value-added services, and how well does the adaptability of your core infrastructure allow you to serve a broad range of use cases (or serve a narrow range of use cases in a fashion superior to everyone else) and to deliver new capabilities to your users?