Category Archives: Infrastructure
I recently wrote a Twitter thread about cloud risk and resilience that drew a lot of interest, so I figured I’d expand on it in a blog post. I’ve been thinking about cloud resilience a lot recently, given that clients have been asking about how they manage their risks.
Inquiries about this historically come in waves, almost always triggered by incidents that raise awareness (unfortunately often because the customer has been directly impacted). A wave generally spans a multi-week period, causing waves to bleed into one another. Three distinct sets come to mind over the course of 2021:
- The Azure AD outages earlier this year had a huge impact on client thinking about concentration risks and critical service dependencies — often more related to M365 than Azure, though (and exacerbated by the critical dependency that many organizations have on Teams during this pandemic). Azure AD is core to SSO for many organizations, making its resilience enormously impactful. These impacts are still very top of mind for many clients, months later.
- The Akamai outage (and other CDN outages with hidden dependencies) this summer raised application and infrastructure dependency awareness, and came as a shock to many customers, as Akamai has generally been seen as a bedrock of dependability.
- The near-daily IBM Cloud “Severity 1” outages over the last month have drawn selective client mentions, rather than a wave, but add to the broader pattern of cloud risk concerns. (To my knowledge, there has been no public communication from IBM regarding root cause of these issues. Notifications indicate the outages are multi-service and multi-regional, often impacting all Gen 2 multizone regions. Kubernetes may be something of a common factor, to guess from the impact scope.)
Media amplification of outage awareness appears to have a lot to do with how seriously they’re taken by customers — or non-customers. Affecting stuff that’s consumed by end-users — i.e. office suites, consumer websites, etc. — gets vastly more attention than things that are “just” a really bad day for enterprise ops people. And there’s a negative halo effect — i.e. if Provider X fails, it tends to raise worries about all their competitors too. But even good media explanations and excellent RCAs tend to be misunderstood by readers — and even by smart IT people. This leads, in turn, to misunderstanding why cloud services fail and what the real risks are.
I recently completed my writing on a note about HA and failover (DR) patterns in cloud IaaS and PaaS, with a light touch on application design patterns for resilience. However, concerns about cloud resilience applies just as much — if not more so — to SaaS, especially API SaaS, which creates complicated and deep webs of dependencies.
You can buy T-shirts, stickers, and all manner of swag that says, “The cloud is just somebody else’s computer.” Cute slogan, but not true. Cloud services — especially at massive scale — are incredibly complex software systems. Complex software systems don’t fail the way a “computer” fails. The cloud exemplifies the failure principles laid out by Richard Cook in his classic “How Complex Systems Fail“.
As humans, we are really bad at figuring out the risk of complex systems, especially because the good ones are heavily defended against failure. And we tend to over-index on rare but dramatic risks (a plane crash) versus more commonplace risks (a car crash).
If you think about “my application hosted on AWS” as “well, it’s just sitting on a server in an AWS data center rather than mine”, then at some point in time, the nature of a failure is going to shock you, because you are wrong.
Cloud services fail after all of the resiliency mechanisms have failed (or sometimes, gone wrong in ways that contribute to the failure). Cloud services tend to go boom because of one or more software bugs, likely combined with either a configuration error or some kind of human error (often related to the deployment process for new configs and software versions). They are only rarely related to a physical failure — and generally the physical failure only became apparent to customers because the software intended to provide resilience against it failed in some fashion.
Far too many customers still think about cloud failure as a simple, fundamentally physical thing. Servers fail, so we should use more than one. Data centers fail, so we should be able to DR into another. Etc. But that model is wrong for cloud and for the digital age. We want to strive for continuous availability and resilience (including graceful degradation and other ways to continue business functionality when the application fails). And we have to plan for individual services failures rather than total cloud failure (whether in an AZ, region, or globally). Such failures can be small-scale, and effectively merely “instability”, rather than an “outage” — and therefore demands apps that are resilient to service errors.
So as cloud buyers, we have to think about our risks differently, and we need to architect and operate differently. But we also need to trust our providers — and trust smartly. To that end, cloud providers need to support us with transparency, so we can make more informed decisions. Key elements of that include:
- Publicly-documented engineering service-level objectives (SLOs), which are usually distinct from the financially-backed SLAs. This is what cloud providers design to internally and measure themselves against, and knowing that helps inform our own designs and internal SLOs for our apps.
- Service architecture documentation that helps us understand the ways a service is and isn’t resilient, so we can design accordingly.
- Documented service dependency maps, which allow us to see the chain of dependencies for each of the services we use, allowing us to think about if Service X is really the best fallback alternative if Service Y goes down, as well as inform our troubleshooting.
- Public status dashboards, clearly indicating the status of services, with solid historical data that allows us to see the track record of service operations. This helps with our troubleshooting and user communication.
- Public outage root-cause analysis (RCA), which allow us to understand why outages occurred, and receive a public pledge as to what will be done to prevent similar failures in the future. A historical archive of these is also a valuable resource.
- Change transparency that could help predict stability concerns. Because so many outages end up being related to new deployments / config changes, and the use of SRE principles, including error budgets, is pretty pervasive amongst cloud providers, there is often an interesting pattern to outages. Changes tend to freeze when the error budget is exceeded, leading to an on-and-off pattern of outages; instability can resume at intervals unpredictable to the customer.
Mission-critical cloud applications are becoming commonplace — both in the pervasive use of SaaS, along with widespread production use of IaaS and PaaS. It’s past time to modernize thinking about cloud operations, cloud resilience, and cloud BC/DR. Cloud risk management needs to be about intelligent mitigation and not avoidance, as forward-thinking businesses are will not accept simply avoiding the cloud at this point.
I am interested in your experiences with resilience as well as cloud instability and outages. Feel free to DM me on Twitter to chat about it.
I’m seeing various bits of angst around “Is it safe to use cloud services, if my business can be suspended or terminated at any time?” and I thought I’d take some time to explain how cloud providers (and other Internet ecosystem providers, collectively “service providers” [SPs] in this blog post) enforce their Terms of Service (ToS) and Acceptable Use Policies (AUPs).
The TL;DR: Service providers like money, and will strive to avoid terminating customers over policy violations. However, providers do routinely (and sometimes automatically) enforce these policies, although they vary in how much grace and assistance they offer with issues. You don’t have to be a “bad guy” to occasionally run afoul of the policies. But if your business is permanently unwilling or unable to comply with a particular provider’s policies, you cannot use that provider.
AUP enforcement actions are rarely publicized. The information in this post is drawn from personal experience running an ISP abuse department; 20 years of reviewing multiple ISP, hosting, CDN, and cloud IaaS contracts on a daily basis; many years of dialogue with Gartner clients about their experiences with these policies; and conversations with service providers about these topics. Note that I am not a lawyer, and this post is not legal advice. I would encourage you to work with your corporate counsel to understand service provider contract language and its implications for your business, whenever you contract for any form of digital service, whether cloud or noncloud.
The information in this post is intended to be factual; it is not advice and there is a minimum of opinion. Gartner clients interested in understanding how to negotiate terms of service with cloud providers are encouraged to consult our advice for negotiating with Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), or with SaaS providers. My colleagues will cheerfully review your contracts and provide tailored advice in the context of client inquiry.
Click-thrus, negotiated contracts, ToS, and AUPs
Business-to-business (B2B) service provider agreements have taken two different forms for more than 20 years. There are “click-through agreements” (CTAs) that present you with a online contract that you click to sign; consequently, they are as-is, “take it or leave it” legal documents that tend to favor the provider in terms of business risk mitigation. Then there are negotiated contracts — “enterprise agreements” (EAs) that typically begin with more generous terms and conditions (T&Cs) that better balance the interests of the customer and the provider. EAs are typically negotiated between the provider’s lawyers and the customer’s procurement (sourcing & vendor management) team, as well as the customer’s lawyers (“counsel”).
Some service providers operate exclusively on either CTAs or EAs. But most cloud providers offer both. Not all customers may be eligible to sign an EA; that’s a business decision. Providers may set a minimum account size, minimum spend, minimum creditworthiness, etc., and these thresholds may be different in different countries. Providers are under no obligation to either publicize the circumstances under which an EA is offered, or to offer an EA to a particular customer (or prospective customer).
While in general, EAs would logically be negotiated by all customers who can qualify, providers do not necessarily proactively offer EAs. Furthermore, some companies — especially startups without cloud-knowledgeable sourcing managers — aren’t aware of the existence of EAs and therefore don’t pursue them. And many businesses that are new to cloud services don’t initially negotiate an EA, or take months to do that negotiation, operating on a CTA in the meantime. Therefore, there are certainly businesses that spend a lot of money with a provider, yet only have a CTA.
Terms of service are typically baked directly into both CTAs and EAs — they are one element of the T&Cs. As a result, a business on an EA benefits both from the greater “default” generosity of the EA’s T&Cs over the provider’s CTA (if the provider offers both), as well as whatever clauses they’ve been able to negotiate in their favor. In general, the bigger the deal, the more leverage the customer has to negotiate favorable T&Cs, which may include greater “cure time” for contract breaches, greater time to pay the bill, more notice of service changes, etc.
AUPs, on the other hand, are separate documents incorporated by reference into the T&Cs. They are usually a superset or expansion/clarification of the things mentioned directly in the ToS. For instance, the ToS may say “no illegal activity allowed”, and the AUP will give examples of prohibited activities (important since what is legal varies by country). AUPs routinely restrict valid use, even if such use is entirely legal. Service providers usually stipulate that an AUP can change with no notice (which essentially allows a provider to respond rapidly to a change in the regulatory or threat environment).
Unlike the EA T&Cs, an AUP is non-negotiable. However, an EA can be negotiated to clarify an AUP interpretation, especially if the customer is in a “grey area” that might be covered by the AUP even if the use is totally legitimate (i.e. a security vendor that performs penetration testing on other businesses at their request, may nevertheless ask for an explicit EA statement that such testing doesn’t violate the AUP).
Prospective customers of a service provider can’t safely make assumptions about the AUP intent. For example, some providers might not exclude even a fully white-hat pen-testing security vendor from the relevant portion of the AUP. Some providers with a gambling-excluding AUP may not choose to do business with an organization that has, for instance, anything to do with gambling, even if that gambling is not online (which can get into grey areas like, “Is running a state lottery a form of gambling?”). Some providers operating data centers in countries without full freedom of the press may be obliged to enforce restrictions on what content a media company can host in those regions. Anyone who could conceivably violate the AUP as part of the routine course of business would therefore want to gain clarity on interpretation up front — and get it in writing in an EA.
What does AUP enforcement look like?
If you’re not familiar with AUPs or why they exist and must be enforced, I encourage you to read my post “Terms of Service: from anti-spam to content takedown” first.
AUP enforcement is generally handled by a “fraud and abuse” department within a service provider, although in recent years, some service providers have adopted friendlier names, like “trust and safety”. When an enforcement action is taken, the customer is typically given a clear statement of what the violation is, any actions taken (or that will be taken within X amount of time if the violation isn’t fixed), and how to contact the provider regarding this issue. There is normally no ambiguity, although less technically-savvy customers can sometimes have difficulties understanding why what they did wrong — and in the case of automatic enforcement actions, the customer may be entirely puzzled by what they did to trigger this.
There is almost always a split in the way that enforcement is handled for customers on a CTA, vs customers on an EA. Because customers on a CTA undergo zero or minimal verification, there is no presumption that those customers are legit good actors. Indeed, some providers may assume, until proven otherwise, that such customers exist specifically to perpetuate fraud and/or abuse. Customers on an EA have effectively gone through more vetting — the account team has probably done the homework to figure out likely revenue opportunity, business model and drivers for the sale, etc. — and they also have better T&Cs, so they get the benefit of the doubt.
Consequently, CTA customers are often subject to more stringent policies and much harsher, immediate enforcement of those policies. Immediate suspension or termination is certainly possible, with relatively minimal communication. (To take a public GCP example: GCP would terminate without means to protest as recently as 2018, though that has changed. Its suspension guidelines and CTA restrictions offer clear statements of swift and significantly automated enforcement, including prevention of cryptocurrency mining for CTA customers who aren’t on invoicing, even though it’s perfectly legal.) The watchword for the cloud providers is “business risk management” when it comes to CTA customers.
Customers that are on a CTA but are spending a lot of money — and seem to be legit businesses with an established history on the platform — generally get a little more latitude in enforcement. (And if enforcement is automated, there may be a sliding threshold for automated actions based on spend history.) Similarly, customers on a CTA but who are actively negotiating an EA or engaged in the enterprise sales process may get more latitude.
Often-contrary to the handling of CTA customers, providers usually assume an EA customer who has breached the AUP has done so unintentionally. (For instance, one of the customer’s salespeople may have sent spam, or a customer VM may have been compromised and is now being used as part of a botnet.) Consequently, the provider tends to believe that what the customer needs is notification that something is wrong, education on why it’s problematic, and help in addressing the issue. EA customers are often completely spared from any automated form of policy enforcement. While business risk management is still a factor for the service provider, this is often couched politely as helping the customer avoid risk exposure for the customer’s own business.
Providers do, however, generally firmly hold the line on “the customer needs to deal with the problem”. For instance, I’ve encountered cloud customers who have said to me, “Well, my security breach isn’t so bad, and I don’t have time/resources to address this compromised VM, so I’d like more than 30 days grace to do so, how do I make my provider agree?” when the service provider has reasonably taken the stance that the breach potentially endangers others, and mandated that the customer promptly (immediately) address the breach. In many cases, the provider will offer technical assistance if necessary. Service providers vary in their response to this sort of recalcitrance and the extent of their enforcement actions. For instance, AWS normally takes actions against the narrowest feasible scope — i.e. against only the infrastructure involved in the policy violation. Broadly, cloud providers don’t punish customers for violations, but customers must do something about such violations.
Some providers have some form of variant of a “three strikes” policy, or escalating enforcement. For instance, if a customer has repeated issues — for example, it seems unable implement effective anti-spam compliance for itself, or it constantly fails to maintain effective security in a way that could impact other customers or the cloud provider’s services, or it can’t effectively moderate content it offers to the public, or it can’t prevent its employees from distributing illegally copied music using corporate cloud resources — then repeated warnings and limited enforcement actions can turn into suspensions or termination. Thus, even EA customers are essentially obliged treat every policy violation as something that they need to strive to ensure will not recur in the future. Resolution of a given violation is not evidence that the customer is in effective compliance with the agreement.
It’s not unusual for entirely legitimate, well-intentioned businesses to breach the ToS or AUP, but this is normally rare; a business might do this once or twice over the course of many years. New cloud customers on a CTA may also innocently exhibit behaviors that trigger automated enforcement actions that use algorithms to look for usage patterns that may be indicative of fraud or abuse. Service providers will take enforcement actions based on the customer history, the contractual agreement, and other business-risk and customer-relationship factors.
Intent matters. Accidental breaches are likely to be treated with a great deal more kindness. If breaches recur, though, the provider is likely to want to see evidence that the business has an effective plan for preventing further such issues. Even if the customer is willing to absorb the technical, legal, or business risks associated with a violation, the service provider is likely to insist that the issue be addressed — to protect their own services, their own customers, and for the customer’s own good.
(Update: Gartner clients, I have published a research note: “What is the risk of actually losing your cloud provider?“)
Clients have recently been asking a lot more questions about the comparative resilience of cloud providers.
Identity services are a particular point of concern (for instance, the Azure AD outage of October 1st and Google Cloud IAM outage of March 26th) since when identity is down, the customer can’t access the cloud provider’s control plane (and it may impact service use in general) — plus there’s generally no way for the customer to work around such issues.
The good news is, hyperscale cloud providers do a pretty good job of being robust. However, the risk of smaller, more hosting-like providers can be much higher — and there are notable differences between the hyperscalers, too.
Operations folks know: Everything breaks. Physical stuff fails, software is buggy, and people screw up (a lot). A provider can try its best to reduce the number of failures, limit the “blast radius” of a problem, limit the possibility of “cascading failures”, and find ways to mitigate the impact on users. But you can’t avoid failure entirely. Systems that are resilient recover quickly from failure.
If you chop off the head of a hydra, it grows back — quickly. We can think about five key factors — heads of the hydra — that influence the robustness, resilience, and observed (“real world”) availability of cloud services:
- Physical design: The design of physical things, such as the data center and the hardware used to deliver services.
- Logical (software) design: The design of non-physical things, especially software — all aspects of the service architecture that is not related to a physical element.
- Implementation quality: The robustness of the actual implementation, encompassing implementation skill, care and meticulousness, and the effectiveness of quality-assurance (QA) efforts.
- Deployment processes: The rollout of service changes is the single largest cause of operational failures in cloud services. The quality of these processes, the automation used in the processes, and the degree to which humans are given latitude to use good judgment (or poor judgment) thus have a material impact on availability.
- Operational processes: Other operational processes, such as monitoring, incident management — and, most importantly, problem management — impact the cloud provider’s ability to react quickly to problems, mitigate issues, and ensure that the root causes of incidents are addressed. Both proactive and reactive maintenance efforts can have an impact on availability.
A sixth factor, Transparency, isn’t directly related to keeping the hydra alive, but matters to customers as they plan for their own application architectures and risk management — contributing to customer resilience. Transparency includes making architectural information to customers, as well as delivering outage-related visibility and insight to customers. Customers need real-world info — like current and historical outage reports and the root-cause-analysis port-mortems that offer insight into what went wrong and why (and what the provider is doing about it).
When you think about cloud service resilience (or the resilience of your own systems), think about it in terms of those factors. Don’t think about it like you think about on-premises systems, where people often think primarily about hardware failures or a fire in the data center. Rather, you’re dealing with systems where software issues are almost always the root cause. Physical robustness still matters, but the other four factors are largely about software.
Preface added 20 November 2020: This post received a lot more attention than I expected. I must reiterate that it is not in any way an endorsement. Indeed, sparkly pink unicorns are, by their nature, fanciful. Caution must be exercised, as sparkly pink glitter can conceal deficiencies in the equine body.
Digging into my archive of past predictions… In a research note on the convergence of public and private cloud, published almost exactly eight years ago in July 2012, I predicted that the cloud IaaS market would eventually deliver a service that delivered a full public cloud experience as if it were private cloud — at the customer’s choice of data center, in a fully single-tenant fashion.
Since that time, there have been many attempts to introduce public-cloud-consistent private cloud offerings. Gartner now has a term, “distributed cloud”, to refer to the on-premises and edge services delivered by public cloud providers. AWS Outposts deliver, as a service, a subset of AWS’s incredibly rich product porfolio. Azure Stack (now Azure Stack Hub) delivers, as software, a set of “Azure-consistent” capabilities (meaning you can transfer your scripts, tooling, conceptual models, etc., but it only supports a core set of mostly infrastructure capabilities). Various cloud MSPs, notably Avanade, will deliver Azure Stack as a managed service. And folks like IBM and Google want you to take their container platform software to facilitate a hybrid IT model.
But no one has previously delivered what I think is what customers really want:
- Location of the customer’s choice
- Single-tenant; no other customer shares the hardware/service; data guaranteed to stay within the environment
- Isolated control plane and private self-service interfaces (portal, API endpoints); no tethering or dependence on the public cloud control plane, or Internet exposure of the self-service interfaces
- Delivered as a service with the same pricing model as the public cloud services; not significantly more expensive than public cloud as long as minimum commitment is met
- All of the provider’s services (IaaS+PaaS), identical to the way that they are exposed in the provider’s public cloud regions
Why do customers want that? Because customers like everything the public cloud has to offer — all the things, IaaS and PaaS — but there are still plenty of customers who want it on-premises and dedicated to them. They might need it somewhere that public cloud regions generally don’t live and may never live (small countries, small cities, edge locations, etc.), they might have regulatory requirements they believe they can only meet through isolation, they may have security (even “national security”) requirements that demand isolation, or they may have concerns about the potential to be cut off from the rest of the world (as the result of sanctions, for instance). And because when customers describe what they want, they inevitably ask for sparkly pink unicorns, they also want all that to be as cheap as a multi-tenant solution.
And now it’s here, and given that it’s 2020… the sparkly pink unicorn comes from Oracle. Specifically, the world now has Oracle Dedicated Regions Cloud @ Customer. (Which I’m going to shorthand as OCI-DR, even though you can buy Oracle SaaS hosted on this infrastructure) OCI’s region model, unlike its competitors, has always been all-services-in-all-regions, so the OCI-DR model continues that consistency.
In an OCI-DR deal, the customer basically provides colo (either their own data center or a third party colo) to Oracle, and Oracle delivers the same SLAs as it does in OCI public cloud. The commit is very modest — it’s $6 million a year, for a 3-year minimum, per OCI-DR Availability Zone (a region can have multiple AZs, and you can also buy multiple regions). There are plenty of cloud customers that easily meet that threshold. (The typical deal size we see for AWS contracts at Gartner is in the $5 to $15 million/year range, on 3+ year commitments.) And the pricing model and actual price for OCI-DR services is identical to OCI’s public regions.
The one common pink sparkly desire that OCI doesn’t meet is the ability to use your own hardware, which can help customers address capex vs. opex desires, may have perceived cost advantages, and may address secure supply chain requirements. OCI-DR uses some Oracle custom hardware, and the hardware is bundled as part of the service.
I predict that this will raise OCI’s profile as an alternative to the big hyperscalers, among enterprise customers and even among digital-native customers. Prior to today’s announcement, I’d already talked to Gartner clients who had been seriously engaged in sales discussions on OCI-DR; Oracle has quietly been actively engaged in selling this for some time. Oracle has made significant strides (surprisingly so) in expanding OCI’s capabilities over this last year, so when they say “all services” that’s now a pretty significant portfolio — likely enough for more customers to give OCI a serious look and decide whether access to private regions is worth dealing with the drawbacks (OCI’s more limited ecosystem and third-party tool support probably first and foremost).
As always, I’m happy to talk to Gartner clients who are interested in a deeper discussion. We’ve recently finished our Solution Scorecards (an in-depth assessment of 270 IaaS+PaaS capabilities), including our new assessment of OCI. The scores are summarized in a publicly-reprinted document. The full scorecard has been published, and the publicly-available summary says, “OCI’s overall solution score is 62 out of 100, making it a scenario-specific option for technical professionals responsible for cloud production deployments.”
We’ve just completed our 2019 evaluations of cloud IaaS providers, resulting in a new Magic Quadrant, Critical Capabilities, and six Solution Scorecards — one for each of the providers included in the Magic Quadrant. This process has also resulted in fresh benchmarking data within Gartner’s Cloud Decisions tool, a SaaS offering available to Gartner for Technical Professionals clients, which contains benchmarks and monitoring results for many cloud providers.
As part of this, we are pleased to introduce Gartner’s new Solution Scorecards, an updated document format for what we used to call In-Depth Assessments. Solution Scorecards assess an individual vendor solution against our recently-revised Solution Criteria (formerly branded Evaluation Criteria). They are highly detailed documents — typically 60 pages or so, assessing 265 individual capabilities as well as providing broader recommendations to Gartner clients.
The criteria are always divided into Required, Preferred, and Optional categories — essentially, things that everyone wants (and where they need to compensate/risk-mitigate if something is missing), things that most people want but can live without or work around readily, and things that are use case-specific. The Required, Preferred, and Optional criteria are weighted into a 4:2:1 ratio in order to calculate an overall Solution Score.
If you are a Gartner for Technical Professionals client, the scorecards are available to you today. You can access them from the links below (Gartner paywall):
- Amazon Web Services
- Microsoft Azure
- Google Cloud Platform
- IBM Cloud
- Oracle Cloud Infrastructure
- Alibaba Cloud International (English-language offerings available outside of China)
We will be providing a comparison of these vendors and their Solution Scorecards at the annual “Cloud Wars” presentation at the Gartner Catalyst conference — one of numerous great reasons to come to San Diego the week of August 11th (or Catalyst UK in London the week of September 15th)! Catalyst has tons of great content for cloud architects and other technical professionals involved in implementing cloud computing.
Note that we are specifically assessing just the integrated IaaS+PaaS offerings — everything offered through a single integrated self-service experience and on a single contract. Also, only cloud services count; capabilities offered as software, hosting, or a human-managed service do not count. Capabilities also have to be first-party.
Also note that this is not a full evaluation of a cloud provider’s entire portfolio. The scorecards have “IaaS” in the title, and the scope is specified clearly in the Solution Criteria. For the details of which specific provider services or products were or were not evaluated, please refer to each specific Scorecard document.
All the scores are current as of the end of March, and count only generally-available (GA) capabilities. Because it takes weeks to work with vendors for them to review and ensure accuracy, and time to edit and publish, some capabilities will have gone beta or GA since that time; because we only score what we’re able to test, the evaluation period has a cut-off date. After that, we update the document text for accuracy but we don’t change the numerical scores. We expect to update the Solution Scorecards approximately every 6 months, and working to increase our cadence for evaluation updates.
This year’s scores vs. last year’s
When you review the scores, you’ll see that broadly, the scores are lower than they were in 2018, even though all the providers have improved their capabilities. There are several reasons why the 2019 scores are lower than in previous years. (For a full explanation of the revision of the Solution Criteria in 2019, see the related blog post.)
First, for many feature-sets, several Required criteria were consolidated into a single multi-part criterion with “table stakes” functionality; missing any part of that criterion caused the vendor to receive a “No” score for that criterion (“Yes” is 1 point; “No” is zero points; there is no partial credit). The scorecard text explains how the vendor does or does not meet each portion of a criterion. The text also mentions if there is beta functionality, or if a feature was introduced after the evaluation period.
Second, many criteria that were Preferred in 2018 were promoted to Required in 2019, due to increasing customer expectations. Similarly, many criteria that were Optional in 2018 are now Preferred. We introduced some brand-new criteria to all three categories as well, but providers that might have done well primarily on table-stakes Required functionality in previous years may have scored lower this year due to the increased customer expectations reflected by revised and new criteria.
Customizing the scores
The solution criteria, with all of the criteria detail, is available to all Gartner for Technical Professionals clients, and comes with a spreadsheet that allows you to score any provider yourself; we also provide a filled-out spreadsheet with each Solution Scorecard so you can adapt the evaluation for your own needs. The Solution Scorecards are similarly transparent on which parts of a criterion are or aren’t met, and we link to documentation that provides evidence for each point (in some cases Gartner was provided with NDA information, in which case we tell you how to get that info from the provider).
This allows you to customize the scores as you see fit. Thus, if you decide that getting 3 out of 4 elements of a criteria is good enough for you, or you think that the thing they miss isn’t relevant to you, or you want to give the provider credit for newly-released capabilities, or you want to do region-specific scoring, you can modify the spreadsheet accordingly.
If you’re a Gartner client and are interested in discussing the solution criteria, assessment process, and the cloud providers, please schedule an inquiry or a 1-on-1 at Catalyst. We’d be happy to talk to you!
In February of this year, we revised the Evaluation Criteria for Cloud IaaS (Gartner paywall). The evaluation criteria (now rebranded Solution Criteria) are essentially the sort of criteria that prospective customers typically include in RFPs. They are highly detailed technical criteria, along with some objectively-verifiable business capabilities (such as elements in a technical support program, enterprise ISV partnerships, ability to support particular compliance requirements, etc.).
The Solution Criteria are intended to help cloud architects evaluate cloud IaaS providers (and integrated IaaS+PaaS providers such as the hyperscale cloud providers), whether public or private, or assess their own internal private cloud. We are about to publish Solution Scorecards (formerly branded In-Depth Assessments) for multiple providers; Gartner analysts assess these solutions hands-on and determine whether or not they have capabilities that meet the requirements of a criterion.
The TL;DR version
In summary, we revised the Solution Criteria extensively in 2019, and the results were as follows:
- The criteria have been updated to reflect the current IaaS+PaaS market.
- Expectations are significantly higher than in previous years.
- Expectations have been aligned to other Gartner research, taking into account customer wants and needs in the relevant market, not just in a cloud-specific context.
- Many capabilities have been consolidated and are now required.
- Most vendor scores in the Solution Scorecards have dropped dramatically since last year, and there is a much broader spread of vendor scores.
The Evolution of Customer Demands
The Evaluation Criteria (EC) for Cloud IaaS was first published in 2012. It received a significant update every other year (each even-numbered year) thereafter. When first written, the EC reflected the concerns of our clients at the time, many of whom were infrastructure and operations (I&O) professionals with VMware backgrounds. With each iteration, the EC evolved significantly, yet incrementally.
In the meantime, the market moved extremely quickly. The market evolution towards cloud integrated IaaS and PaaS (IaaS+PaaS) providers, and the market exit (or strategic de-investment) of many of the “commodity” providers, radically changed the structure and nature of the market over time. Cloud IaaS providers weren’t just expected to provide “hardware infrastructure”, but also “software infrastructure”, including all of the necessary management and automation. This essentially forced these providers into introducing services that compete in many IT markets and in an extraordinary number of software niches.
Furthermore, as the market matured, the roles and expectations of our clients also evolved significantly. The focus shifted to enterprise-wide initiatives, rather than project-based adoption. Digital business transformation elevated the importance of cloud-native workloads, while IT transformation emphasized the need for high-quality cloud migration of existing workloads. The notion that a cloud IaaS provider could successfully run all, or almost all, of a customer’s IT became part of the assumptions that needed to underpin the provider evaluation process.
Today’s cloud IaaS customers have high expectations. Experienced customers are becoming more sophisticated, but late adopters also have high expectations of a provider that have to be met to help the customer overcome barriers to adoption.
For 2019, we decided to take a look at the EC“from scratch”, in order to try to construct a list of criteria that are the most relevant to the initiatives of customers today. In many cases, our clients are trying to pick a primary strategic IaaS provider. In other cases, our clients already have a primary provider but are trying to pick a strategic secondary provider as they implement a multicloud strategy. Finally, some of our clients are choosing a provider for a tactical need, but still need to understand that provider’s capabilities in detail.
Constructing the Revision
The revision needed to keep a similar number of criteria (in order to keep the assessment time manageable and the assessment itself at a readable length) — we ended up with 265 for 2019.
In order to keep the total number of criteria down, we needed to consolidate closely-related criteria into a single criterion. Many criteria became multi-part as a result. We tried to consolidate the “table stakes” functionality that could be assumed to be a part of all (or almost all) cloud IaaS offerings, in order to make room for more differentiated capabilities.
We tried to be as vendor-neutral as possible. The evaluation criteria have evolved since the initial 2012 introduction; when we introduced new criteria in the past, we often ended up with criteria requirements that closely mirrored the feature-set of the first provider to offer a capability, since that provider shaped customer expectations. In this 2019 revision, we tried to go back to the core customer requirements, without concern as to whether cloud provider implementations fully aligned with those requirements — the criteria are intended to reflect what customers want and not what vendors offer. There are requirements that no vendors meet, but which we often hear our clients ask for; in such cases we tried to phrase those requirements in ways that are reasonable and implementable at scale, as it’s okay for the criteria to be somewhat aspirational for the market.
We tried to make sure that the criteria were worded using standard Gartner terms or general market terminology, avoiding vendor-specific terms. (Note that because vendors not-infrequently adopt Gartner terms, there were cases where providers had adopted terminology from earlier versions of EC, and we made no attempt to alter such terms.)
We tried to keep to requirements, without dictating implementation, where possible. However, we had to keep in mind that in cloud IaaS, where there are customers who want fine-grained visibility and control over the infrastructure, there still must be implementation specificity when the customer explicitly wants those elements exposed.
Defining the Criteria
During the process of determining the criteria, we sought input broadly within Gartner, both in terms of discussing the criteria with other analysts as well as incorporating things from existing Gartner written research. (And the criteria reflect, as much as possible, the discussions we’ve had with clients about what they’re looking for, and what they’re putting into their RFPs.)
In some cases, we needed input from specialists in a topic. In some areas of technology, clients who need to have deep-dive discussions on features may talk almost exclusively to analysts specialized in those areas. Those analysts are familiar with current requirements as well as the future of those technology areas, and are thus the best source for determining those needs. For example, areas such as machine learning and IoT are primarily covered by analysts with those specializations, even when the customers are implementing cloud solutions. There are also areas, such as Security, where we have detailed cloud recommendations from those teams. So we extensively incorporated their input..
We also looked at non-cloud capabilities when there were market gaps relative to customer desires. There are areas where either cloud providers do not currently have capabilities, or where those capabilities are relatively nascent. Thus, we needed to identify where customers are using on-premises solutions, and want cloud solutions. We also needed to determine what the “minimum viable product” should be for the purposes of constructing a criterion around it.
Feedback from non-cloud analysts was also important because it identified areas where clients were not using a cloud solution because of something that was missing. In many cases, these were not technology features, but issues around transparency, or the lack of solutions acceptable on a global basis.
Finally, the way that customers source solutions, build applications, and manage their data is changing. We tried to ensure that the new criteria aligned with these trends.
Because more and more of our clients are deploying cloud solutions globally, every criterion also had some requirements as to its global availability. These are used only for advisory purposes and are not part of scoring.
The vendors were allowed to give feedback on the criteria prior to publication. We wanted to check if the criteria were reasonable, and seemed fair. We incorporated feedback that constituted good, vendor-neutral suggestions that aligned to customer requirements.
The End Results
When you see the Solution Scorecards, you may be surprised by lower scores on the part of many of the providers. We’re being transparent about the Evaluation Criteria (Solution Criteria) revision in order to help you understand why the scores are lower.
The lower scores were an unintentional side-effect of the revision, but reflect, to some degree, the state of the market relative to the very high expectations of customers. Note that this year’s lower scores do not indicate that providers have “gone backwards” or removed capabilities; they just reflect the provider’s status against a raised bar of customer expectations.
We expect that when we update the scorecards in the second half of this year, scores will increase, as many of the vendors have since introduced missing capabilities, or will do so by the next update. We retain confidence that the solution criteria are a good reflection of a broad range of current customer expectations. Because many vendors are doing a good job of listening to what customers and prospects want, and planning accordingly, we think that the solution criteria will also be reflected in future vendor roadmaps and market development.
We discuss the Solution Scorecards and scores in a separate blog post.
The Critical Capabilities for Public Cloud IaaS, 2016 has now been published. The Critical Capabilities is a technical assessment of public cloud IaaS offerings against a set of use cases — cloud-native applications, general business applications, application development environments, batch computing, and (new for 2016) the Internet of Things. It’s part of our integrated series of cloud IaaS assessments and complements our Magic Quadrant for Cloud IaaS (Gartner clients: see interactive version).
We are now launching right back into the Magic Quadrant cycle for 2017, with the goal of publishing a new Magic Quadrant in April 2017, and a new Critical Capabilities shortly thereafter.
A lot has happened since the early-2016 research process for our 2016 Magic Quadrant and Critical Capabilities cycle for this market. Multiple providers have launched new offerings and are phasing out their previous offerings, and there are some important new market entrants. We want to make sure that our research notes offer current representations of provider capabilities. (Usefully, a shift to April publication also gets us back to a schedule that aligns with our infrastructure & operations conference season.)
In previous years, we’ve issued an open invitation for the pre-qualification survey to all cloud IaaS providers. This year, we are not doing so; instead, we have issued invitations only to providers who we believe are highly likely to qualify.
If you are a cloud IaaS provider that did not receive an invitation, but you believe you are highly likely to qualify for inclusion, please email me at Lydia dot Leong at Gartner dot com to discuss it.
Oracle has made multiple previous attempts to enter the cloud IaaS market — most recently (early this year), with the Oracle Compute Cloud. At Oracle OpenWorld this week, however, Oracle announced a brand-new cloud IaaS offering. Oracle hasn’t officially given this a real brand yet, so for the purposes of this blog post, I’ll call it their next-gen cloud.
News of this project leaked last year. Oracle has paid richly to hire an “A” team, so to speak — former long-time senior AWS engineers lead the project, and they’ve recruited heavily from all three hyperscale clud providers in Seattle (AWS, Microsoft Azure, Google Cloud Platform). These are credible product and engineering people who, in my opinion, understand what they need to build and the enormous challenges ahead of them.
The next-gen cloud currently consists of an SDN (capable of both Layer 2 and Layer 3 networking, which is a differentiator), block storage, object storage, and bare-metal servers (thus the initial moniker, “Oracle Bare Metal Cloud”). Virtual machines (VMs) are coming later this year, with containers to follow early next year. Based on a detailed engineering briefing that Oracle provided to myself and my colleagues, I would say that smart and scalable choices seem to have been made throughout. However, I would characterize this early offering as minimum viable product; it is the foundation of a future competitive offering, rather than a competitive offering today.
In the near term, Oracle’s next-gen cloud will be interesting primarily to a general audience in a bare-metal context. Here, Oracle will compete with Packet, and to some lesser degree, the bare-metal cloud offerings from CenturyLink and Rackspace (OnMetal). It is a true software-defined cloud IaaS offering, provisioned in minutes and billed by the hour. This sets it apart from more hosting-like bare-metal offerings such as IBM SoftLayer, Internap, and Cogeco Peer 1.
It is unlikely that Oracle’s announced price-point — 20% below AWS list prices — will be sufficient to move the needle in a market where AWS’s “real” prices are lowered up to 70% by reserved instances (plus AWS negotiates custom discounts), and where Google is already competing intensively on price (especially on negotiated deals) and has an offering substantially more featureful than what Oracle will have in the market in the next year. Good price-performance is table stakes here. This is not a commodity market; providers compete on their capabilities. This is also not about capital investment to build data centers; Oracle can use colocation until they reach a scale where building makes sense, though since such projects can take years, they’ll need to time that properly.
Bare metal, of course, significantly outperforms VMs in some cases — especially high I/O use cases. But bare metal should be thought of as part of a complete offering — a compute option for some of a customer’s workloads. Price-performance should always be considered in the context of the customer’s specific architecture. In the case of Oracle, bare metal and the layer 2 SDN features are important because they are needed for Oracle RAC and for better performance of Oracle application software. Oracle has built the core of their offering around off-box virtualization of networking and storage, which is important for allowing their cloud IaaS offering to smoothly interoperate with other Oracle hardware placed into the same environment, like Exadata appliances.
Overall, this should be seen as a positive move for Oracle, but one with many open questions about its future. As always, if anyone has more detailed questions, I am happy to answer them in the context of client inquiry, and I’ve set aside some time to speak with reporters during this OpenWorld week.
We’re pleased to announce that the 2016 Magic Quadrant for Cloud Infrastructure, Worldwide has been published. (Link requires a Gartner subscription. If you’re not a Gartner client, there are free reprints available through vendors, and various press articles, such as the Tech Republic analysis. Note that press articles do not always accurately reflect our opinions, though.)
Producing the Magic Quadrant is a huge team effort that involves many people across Gartner, including many analysts who aren’t credited as co-authors, administrative support staff, and people in our primary-research and benchmarking groups. The team effort also reflects the way that we produce an entire body of IaaS research as an integrated effort across Gartner’s research divisions. (The approach described below is specific to our IaaS research and may not apply to Gartner’s assessments in other markets.)
Whether you already have a cloud IaaS provider and are just looking for a competitive check-up, you’re thinking of adding one or more additional providers, or you’re just getting started with cloud IaaS, our work can help you find the providers that are right for you.
The TL;DR list of assessments:
- Magic Quadrant (market and technical evaluation)
- Evaluation Criteria (230+ technical and service traits to look for in a provider)
- In-Depth Assessments (detailed assessments of specific providers against the Evaluation Criteria)
- Critical Capabilities (use-case-based technical evaluations; 2016 update coming soon)
- CloudHarmony and Tech Planner Cloud Module (real-world stats and cost-performance comparisons)
- Peer Insights (IT leaders review providers)
(Note that not all of these might be available as part of your current Gartner client subscription.)
Gartner has produced a Magic Quadrant for Cloud IaaS since 2011. The MQ is our overall perspective on the market, looking at the provider solutions from both a technical and business angle. Gartner clients can use the interactive MQ tool to change the weightings of the criteria to suit their own evaluation priorities (if you read the detailed criteria descriptions, there’s an explanation of how each criterion maps to buyer priorities). The interactive MQ can also be used to get a multi-year historical perspective.
The MQ covers public, hosted private, and industrialized outsourced private cloud IaaS; it’s not just a public cloud MQ. We look at multi-tenant and single-tenant, located in either provider or customer premises, cloud IaaS offerings. We also look at the full range of compute options (VMs, bare-metal servers, containers) that are delivered in a cloud model (API-provisionable via automation, and metered by the hour or less), not just VMs. In addition, we consider some integrated PaaS-layer services (we call these cloud software infrastructure services, which include things like database as a service), but we have a separate enterprise application PaaS MQ for pure aPaaS. While we consider the provider’s overall value proposition in the context of cloud IaaS (including their ability to deliver managed services, network services, etc.), this isn’t a general cloud computing or outsourcing MQ.
2016 marks our sixth iteration of a pure cloud IaaS MQ. Previously, in 2009 and 2010, we included cloud IaaS in our hosting MQ, but by 2010, it was already clear that the hosting and cloud IaaS buyer wants and needs were distinctly different. Since 2011, we’ve produced a global cloud IaaS MQ, along with three regional hosting MQs (suitable for customers looking for dedicated servers or managed hosting on a monthly or annual basis), and three regional data center outsourcing MQs (which include customized private cloud services as part of a broader portfolio of infrastructure outsourcing capabiities). Not every infrastructure need can or should be met with cloud IaaS.
The core foundation of our assessment is our Evaluation Criteria for Cloud IaaS. Over the years, we’ve converged the technical-detail questionnaire that we ask providers to fill out during the Magic Quadrant research process, with the Gartner for Technical Professionals (GTP) document that we produce to guide buyers on evaluating providers. This has resulted in nearly 250 service traits that the Evaluation Criteria document categorizes as Required (almost all Gartner clients are likely to want these things and these have the potential to be showstoppers if missing), Preferred (many will want these things), and Optional (use-case-specific needs). This gives us a consistent set of formal definitions for service features — things you can put a clear yes/no to. As a buyer, you can use the Evaluation Criteria to score any cloud IaaS provider — and even score your own IT department’s private cloud.
In the course of doing this particular Magic Quadrant, providers fill out very detailed questionnaires that list these service features and capabilities (broken down even more granularly than in the Evaluation Criteria), indicating whether their service has those traits, and they’re also asked to provide evidence, like documentation. We also ask them to provide other information like the location of their data centers, languages supported across various aspects of service delivery (like portal localization and tech-support languages spoken), a copy of their standard contract and SLAs, and so forth. We score those questionnaires (and check service features against documentation, and with hands-on testing if need be). We also score things like the buyer-friendliness of contracts, based on the presence/absence of particular clauses. Those component scores are used in many different individual scoring categories within the Magic Quadrant.
We also produce a set of In-Depth Assessments for the providers that our clients are most interested in evaluating. The In-Depth Assessments are detailed documents that score an individual provider against the Evaluation Criteria; for every criteria, we explain how the provider does and doesn’t meet it, and we provide links to the corresponding documentation or other evidence. The results of our hands-on testing are noted, as well. For many buyers, this minimizes the need to conduct an RFP that dives into the technical solution; here we’ve done a very detailed fact-based analysis for you, and the provider has verified the accuracy of the information. (Buyer beware, though: Providers sometimes produce something that looks like one of these assessments, even quoting the Gartner definitions, but with their own more generous self-assessment rather than the stringent Gartner-produced assessment!)
Then, we produce Critical Capabilities for Public Cloud IaaS (2016 update still in progress). This technical assessment looks at a single public cloud IaaS offering from each of the providers included in the Magic Quadrant. The same technical traits used in the other assessments are used here, but they are divided into categories of capabilities, and those capabilities are weighted in a set of common use cases. You can also customize your own set of weightings. In addition to providing quantitative scores, we summarize, in a fair amount of detail, the technical capabilities of each evaluated provider. This allows you to get a sense of what providers are likely to be right for your needs, without having to go through the full deep-dive of reading the In-Depth Assessments. (Critical Capabilities are also available to all Gartner clients and reprints may be offered by providers on their websites, whereas the In-Depth Assessments are only available to GTP clients.)
Performance, and price-performance, is important to many buyers. Gartner provides hardware benchmarking via a SaaS offering called Tech Planner. We offer a Cloud Module within Tech Planner that uses technology that we derived from our acquisition of CloudHarmony. We conduct continuous automated testing on many cloud IaaS providers, including all providers in the Magic Quadrant. We benchmark compute performance for the full range of VMs and bare-metal cloud servers offered by the provider, along with storage performance and network performance; we use this to calculate price-performance metrics. We monitor the availability of their services across the globe. We track provisioning times. All this data is used as objective components to the scores within the Magic Quadrant. Much of this data is directly available to Tech Planner customers, who can use these tools to calculate performance-equivalencies as well as determine where workloads will be most cost-effective.
Finally, we collect end-user reviews of cloud IaaS providers, called Peer Insights. IT leaders (who do not need to be Gartner clients) can submit reviews of their providers; we verify that reviews are legitimate, and it’s one of the very few places where you’ll see enter senior IT executives and architects writing detailed reviews of their providers. We use this data, along with vendor-provided customer references, and the many thousands of clients conversations we have each year with cloud IaaS buyers, as part of the fact base for our Magic Quadrant scoring.
More than a dozen analysts are directly involved in all of these assessments, and many more analysts provide peer-review input into those assessments. It’s an enormous effort, involving a great deal of teamwork, to produce this body of interlinked research. We’re always trying to improve its quality, so we welcome your feedback!
You can DM me on Twitter at @cloudpundit or send email to lydia dot leong at gartner.com.
Back in January, I announced the creation of a new Gartner Magic Quadrant for Public Cloud Infrastructure Managed Service Providers. This MQ will evaluate MSPs that deliver managed services on top of Amazon Web Services, Microsoft Azure, or Google Cloud Platform.
We are currently putting together a contact list of providers to survey. We expect to begin this process in late July. We encourage MSPs who are interested in participation to add their names to the contact list.
MSPs should fill out THIS FORM.