I recently wrote a Twitter thread about cloud risk and resilience that drew a lot of interest, so I figured I’d expand on it in a blog post. I’ve been thinking about cloud resilience a lot recently, given that clients have been asking about how they manage their risks.
Inquiries about this historically come in waves, almost always triggered by incidents that raise awareness (unfortunately often because the customer has been directly impacted). A wave generally spans a multi-week period, causing waves to bleed into one another. Three distinct sets come to mind over the course of 2021:
- The Azure AD outages earlier this year had a huge impact on client thinking about concentration risks and critical service dependencies — often more related to M365 than Azure, though (and exacerbated by the critical dependency that many organizations have on Teams during this pandemic). Azure AD is core to SSO for many organizations, making its resilience enormously impactful. These impacts are still very top of mind for many clients, months later.
- The Akamai outage (and other CDN outages with hidden dependencies) this summer raised application and infrastructure dependency awareness, and came as a shock to many customers, as Akamai has generally been seen as a bedrock of dependability.
- The near-daily IBM Cloud “Severity 1” outages over the last month have drawn selective client mentions, rather than a wave, but add to the broader pattern of cloud risk concerns. (To my knowledge, there has been no public communication from IBM regarding root cause of these issues. Notifications indicate the outages are multi-service and multi-regional, often impacting all Gen 2 multizone regions. Kubernetes may be something of a common factor, to guess from the impact scope.)
Media amplification of outage awareness appears to have a lot to do with how seriously outages are taken by customers and non-customers alike. Outages affecting things consumed by end-users (office suites, consumer websites, and so on) get vastly more attention than things that are “just” a really bad day for enterprise ops people. And there’s a negative halo effect: if Provider X fails, it tends to raise worries about all their competitors too. But even good media explanations and excellent RCAs tend to be misunderstood by readers, including smart IT people. This leads, in turn, to misunderstanding why cloud services fail and what the real risks are.
I recently finished writing a note about HA and failover (DR) patterns in cloud IaaS and PaaS, with a light touch on application design patterns for resilience. However, concerns about cloud resilience apply just as much, if not more so, to SaaS, especially API SaaS, which creates complicated and deep webs of dependencies.
You can buy T-shirts, stickers, and all manner of swag that says, “The cloud is just somebody else’s computer.” Cute slogan, but not true. Cloud services, especially at massive scale, are incredibly complex software systems. Complex software systems don’t fail the way a “computer” fails. The cloud exemplifies the failure principles laid out by Richard Cook in his classic “How Complex Systems Fail”.
As humans, we are really bad at figuring out the risk of complex systems, especially because the good ones are heavily defended against failure. And we tend to over-index on rare but dramatic risks (a plane crash) versus more commonplace risks (a car crash).
If you think about “my application hosted on AWS” as “well, it’s just sitting on a server in an AWS data center rather than mine”, then at some point in time, the nature of a failure is going to shock you, because you are wrong.
Cloud services fail after all of the resiliency mechanisms have failed (or sometimes, gone wrong in ways that contribute to the failure). Cloud services tend to go boom because of one or more software bugs, likely combined with either a configuration error or some kind of human error (often related to the deployment process for new configs and software versions). They are only rarely related to a physical failure — and generally the physical failure only became apparent to customers because the software intended to provide resilience against it failed in some fashion.
Far too many customers still think about cloud failure as a simple, fundamentally physical thing. Servers fail, so we should use more than one. Data centers fail, so we should be able to DR into another. Etc. But that model is wrong for cloud and for the digital age. We want to strive for continuous availability and resilience (including graceful degradation and other ways to continue business functionality when the application fails). And we have to plan for individual service failures rather than total cloud failure (whether in an AZ, region, or globally). Such failures can be small-scale, and effectively merely “instability” rather than an “outage”, and they therefore demand apps that are resilient to service errors.
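To make "resilient to service errors" a bit more concrete, here is a minimal Python sketch of one common pattern: retry a flaky dependency with exponential backoff and jitter, then degrade gracefully to a fallback (cached data, a reduced feature, and so on). The names `ServiceError` and `call_with_fallback`, and the retry parameters, are hypothetical illustrations and not any provider's API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceError(Exception):
    """Hypothetical transient error raised by a cloud service client."""


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `primary` on transient errors, then degrade to `fallback`."""
    for attempt in range(attempts):
        try:
            return primary()
        except ServiceError:
            if attempt == attempts - 1:
                break  # out of retries; fall through to the degraded path
            # Exponential backoff with jitter, so many clients do not
            # hammer a struggling service in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    return fallback()
```

A caller might use it as `call_with_fallback(fetch_live_prices, load_cached_prices)`, where both functions are stand-ins for whatever your application does. The point is that the application keeps delivering some business functionality while the service is unstable.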
So as cloud buyers, we have to think about our risks differently, and we need to architect and operate differently. But we also need to trust our providers — and trust smartly. To that end, cloud providers need to support us with transparency, so we can make more informed decisions. Key elements of that include:
- Publicly-documented engineering service-level objectives (SLOs), which are usually distinct from the financially-backed SLAs. This is what cloud providers design to internally and measure themselves against, and knowing that helps inform our own designs and internal SLOs for our apps.
- Service architecture documentation that helps us understand the ways a service is and isn’t resilient, so we can design accordingly.
- Documented service dependency maps, which let us see the chain of dependencies for each of the services we use, so we can judge whether Service X is really the best fallback alternative if Service Y goes down, and which also inform our troubleshooting.
- Public status dashboards, clearly indicating the status of services, with solid historical data that allows us to see the track record of service operations. This helps with our troubleshooting and user communication.
- Public outage root-cause analyses (RCAs), which allow us to understand why outages occurred, and receive a public pledge as to what will be done to prevent similar failures in the future. A historical archive of these is also a valuable resource.
- Change transparency that could help predict stability concerns. Because so many outages end up being related to new deployments / config changes, and the use of SRE principles, including error budgets, is pretty pervasive amongst cloud providers, there is often an interesting pattern to outages. Changes tend to freeze when the error budget is exceeded, leading to an on-and-off pattern of outages; instability can resume at intervals unpredictable to the customer.
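For readers less familiar with error budgets, the arithmetic behind that on-and-off pattern can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (a single availability SLO, a rolling 30-day window, and a hard change freeze once the budget is spent). Real provider policies are not public and will differ.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget left; zero or less means it is spent."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def changes_frozen(slo: float, downtime_minutes: float,
                   window_days: int = 30) -> bool:
    """Assumed policy: freeze deployments once the budget is exhausted."""
    return budget_remaining(slo, downtime_minutes, window_days) <= 0.0
```

With a 99.9% SLO over 30 days, the budget works out to roughly 43 minutes of downtime. One bad deployment can consume all of it, changes freeze, the window rolls forward, and changes (and the associated instability) resume.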
Mission-critical cloud applications are becoming commonplace, both in the pervasive use of SaaS and in the widespread production use of IaaS and PaaS. It’s past time to modernize thinking about cloud operations, cloud resilience, and cloud BC/DR. Cloud risk management needs to be about intelligent mitigation and not avoidance, as forward-thinking businesses will not accept simply avoiding the cloud at this point.
I am interested in your experiences with resilience as well as cloud instability and outages. Feel free to DM me on Twitter to chat about it.
Clients have recently been asking a lot more questions about the comparative resilience of cloud providers.
Identity services are a particular point of concern (for instance, the Azure AD outage of October 1st and Google Cloud IAM outage of March 26th) since when identity is down, the customer can’t access the cloud provider’s control plane (and it may impact service use in general) — plus there’s generally no way for the customer to work around such issues.
The good news is, hyperscale cloud providers do a pretty good job of being robust. However, the risk of smaller, more hosting-like providers can be much higher — and there are notable differences between the hyperscalers, too.
Operations folks know: Everything breaks. Physical stuff fails, software is buggy, and people screw up (a lot). A provider can try its best to reduce the number of failures, limit the “blast radius” of a problem, limit the possibility of “cascading failures”, and find ways to mitigate the impact on users. But you can’t avoid failure entirely. Systems that are resilient recover quickly from failure.
If you chop off the head of a hydra, it grows back — quickly. We can think about five key factors — heads of the hydra — that influence the robustness, resilience, and observed (“real world”) availability of cloud services:
- Physical design: The design of physical things, such as the data center and the hardware used to deliver services.
- Logical (software) design: The design of non-physical things, especially software. This covers all aspects of the service architecture that are not related to a physical element.
- Implementation quality: The robustness of the actual implementation, encompassing implementation skill, care and meticulousness, and the effectiveness of quality-assurance (QA) efforts.
- Deployment processes: The rollout of service changes is the single largest cause of operational failures in cloud services. The quality of these processes, the automation used in the processes, and the degree to which humans are given latitude to use good judgment (or poor judgment) thus have a material impact on availability.
- Operational processes: Other operational processes, such as monitoring, incident management — and, most importantly, problem management — impact the cloud provider’s ability to react quickly to problems, mitigate issues, and ensure that the root causes of incidents are addressed. Both proactive and reactive maintenance efforts can have an impact on availability.
A sixth factor, Transparency, isn’t directly related to keeping the hydra alive, but matters to customers as they plan for their own application architectures and risk management, contributing to customer resilience. Transparency includes making architectural information available to customers, as well as delivering outage-related visibility and insight to customers. Customers need real-world info, like current and historical outage reports and the root-cause-analysis post-mortems that offer insight into what went wrong and why (and what the provider is doing about it).
When you think about cloud service resilience (or the resilience of your own systems), think about it in terms of those factors. Don’t think about it like you think about on-premises systems, where people often think primarily about hardware failures or a fire in the data center. Rather, you’re dealing with systems where software issues are almost always the root cause. Physical robustness still matters, but the other four factors are largely about software.
Many of my client inquiries deal with the seemingly overwhelming complexity of maturing cloud adoption — especially with the current wave of pandemic-driven late adopters, who are frequently facing business directives to move fast but see only an immense tidal wave of insurmountably complex tasks.
A lot of my advice is focused on starting small — or at least tackling reasonably-scoped projects. The following is specifically applicable to IaaS / IaaS+PaaS:
Build a cloud center of excellence. You can start a CCOE with just a single person designated as a cloud architect. Standing up a CCOE is probably going to take you a year of incremental work, during which cloud adoption, especially pilot projects, can move along. You might have to go back and retroactively apply governance and good practices to some projects. That’s usually okay.
Start with one cloud. Don’t go multicloud from the start. Do one. Get good at it (or at least get a reasonable way into a successful implementation). Then add another. If there’s immediate business demand (with solid business-case justifications) for more than one, get an MSP to deal with the additional clouds.
Don’t build a complex governance and ops structure based on theory. Don’t delay adoption while you work out everything you think you’ll need to govern and manage it. If you’ve never used cloud before, the reality may be quite different than you have in your head. Run a sequence of increasingly complex pilot projects to gain practical experience while you do preparatory work in the background. Take the lessons learned and apply them to that work.
Don’t build massive RFPs to choose a provider. Almost all organizations are better off considering their strategic priorities and then matching a cloud provider to those priorities. (If priorities are bifurcated between running the legacy and building new digital capabilities, this might encourage two strategic providers, which is fine and commonplace.) Massive RFPs are a lot of work and are rarely optimal. (Government folks might have no choice, unfortunately.)
Don’t try to evaluate every service. Hyperscale cloud providers have dozens upon dozens of services. You won’t use all of them. Don’t bother to evaluate all of them. If you think you might use a service in the future, and you want to compare that service across providers… well, by the time you get around to implementing it, all of the providers will have radically updated that service, so any work you do now will be functionally useless. Look just at the services that you are certain you will use immediately and in the very near (no more than one year) future. Validate a subset of services for use, and add new validations as needed later on.
Focus on thoughtful workload placement. Decide who your approved and preferred providers are, and build a workload placement policy. Look for “good technical fit” and not necessarily ideal technical fit; integration affinities and similar factors are more important. The time to do a detailed comparison of an individual service’s technical capabilities is when deciding workload placement, not during the RFP phase.
Accept the limits of cloud portability. Cloud providers don’t and will probably never offer commoditized services. Even when infrastructure resources seem superficially similar, there are still meaningful differences, and the management capabilities wrapped around those resources are highly differentiated. You’re buying into ecosystems that have the long-term stickiness of middleware and management software. Don’t waste time on single-pane-of-glass no-lock-in fantasies, no matter how glossily pretty the vendor marketing material is. And no, containers aren’t magic in this regard.
Links are to Gartner research and are paywalled.
Pondering the care and feeding of your multicloud gelatinous cube. (Which engulfs everything in its path, and digests everything organic.)
Most organizations end up multicloud, rather than intending to be multicloud in a deliberate and structured way. So typical tales go like this: The org started doing digital business-related new applications on AWS and now AWS has become the center of gravity for all new cloud-native apps and cloud-related skills. Then the org decided to migrate “boring” LOB Windows-based COTS to the cloud for cost-savings, and lifted-and-shifted them onto Azure (thereby not actually saving money, but that’s a post for another day). Now the org has a data science team that thinks that GCP is unbearably sexy. And there’s a floating island out there of Oracle business applications where OCI is being contemplated. And don’t forget about the division in China, that hosts on Alibaba Cloud…
Multicloud is inevitable in almost all organizations. Cloud IaaS+PaaS spans such a wide swathe of IT functions that it’s impractical and unrealistic to assume that the organization will be single-vendor over the long term. Just like the enterprise tends to have at least three of everything (if not ten of everything), the enterprise is similarly not going to resist the temptation of being multicloud, even if it’s complex and challenging to manage, and significantly increases management costs. It is a rare organization that both has diverse business needs, and can exercise the discipline to use a single provider.
Despite recognizing the giant ooze that we see squelching our way, along with our unavoidable doom, there are things we can do to prepare, govern, and ensure that we retain some of our sanity.
For starters, we can actively choose our multicloud strategy and stance. We can classify providers into tiers, decide what providers are approved for use and under what circumstances, and decide what providers are preferred and/or strategic.
We can then determine the level of support that the organization is going to have for each tier — decide, for instance, that we’ll provide full governance and operations for our primary strategic provider, a lighter-weight approach that leans on an MSP to support our secondary strategic provider, and less support (or no support beyond basic risk management) for other providers.
After that, we can build an explicit workload placement policy that has an algorithm that guides application owners/architects in deciding where particular applications live, based on integration affinities, good technical fit, etc.
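As a rough illustration of what such an algorithm might look like, here is a small Python sketch. Everything in it is an assumption made for the example: the tier names, the weights, and the idea of treating "good technical fit" as a pass/fail filter before ranking on integration affinity and provider tier. Cost is deliberately not an input.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Provider:
    name: str
    tier: str                      # "primary", "secondary", or "tactical"
    capabilities: FrozenSet[str]   # validated services on this provider


@dataclass(frozen=True)
class Workload:
    name: str
    integrates_with: FrozenSet[str]  # providers hosting systems it talks to
    needs: FrozenSet[str]            # capabilities the workload requires


# Hypothetical weights: integration affinity outranks provider tier.
AFFINITY_WEIGHT = 3
TIER_BONUS = {"primary": 2, "secondary": 1, "tactical": 0}


def place(workload: Workload, providers: Sequence[Provider]) -> Optional[Provider]:
    """Pick a provider that is a good (not necessarily ideal) technical fit."""
    # Good technical fit is a filter: the provider must cover the needs.
    candidates = [p for p in providers if workload.needs <= p.capabilities]
    if not candidates:
        return None  # escalate to the cloud architects for a manual decision

    def score(p: Provider) -> int:
        affinity = AFFINITY_WEIGHT if p.name in workload.integrates_with else 0
        return affinity + TIER_BONUS[p.tier]

    return max(candidates, key=score)
```

In practice a policy like this usually lives in a document or decision tree rather than in code, but writing it down this explicitly forces you to state which factors matter and in what order.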
Note that cost-based provider selection and cost-based long-term workload placement are both terrible ideas. This is a constant fight between cloud architects and procurement managers. It is rooted in the erroneous idea that IaaS is a commodity, and that provider pricing advantages are long-term rather than short-lived. Using cost-based placement often leads to higher long-term TCO, not to mention a grand mess with data gravity and thus data management, and fragile application integrations.
See my new research note, “Comparing Cloud Workload Placement Strategies” (Gartner paywall) for a guide to multicloud IaaS / IaaS+PaaS strategies (including when you should pursue a single-cloud approach). In a few weeks, you’ll see the follow-up doc “Designing a Cloud Workload Placement Policy” publish, which provides a guide to writing such policies, with an analysis of different placement factors and their priorities.
A nontrivial chunk of my client conversations is centered on the topic of cloud IaaS/PaaS self-service, and how to deal with development teams (and other technical end-user teams, e.g. data scientists, researchers, hardware engineers, etc.) that use these services. These teams, and the individuals within those teams, often have different levels of competence with the clouds, operations, security, etc. but pretty much all of them want unfettered access.
Responsible governance requires appropriate guidelines (policies) and guardrails, and some managers and architects feel that there should be one universal policy, and everyone — from the highly competent digital business team, to the data scientists with a bit of ad-hoc infrastructure knowledge — should be treated identically for the sake of “fairness”. This tends to be a point of particular sensitivity if there are numerous application development teams with similar needs, but different levels of cloud competence. In these situations, applying a single approach is deadly — either for agility or your crisis-induced ulcer.
Creating a structured, tiered approach, with different levels of self-service and associated governance guidelines and guardrails, is the most flexible approach. Furthermore, teams that deploy primarily using a CI/CD pipeline have different needs from teams working manually in the cloud provider portal, which in turn are different from teams that would benefit from having an easy-vend template that gets provisioned out of a ServiceNow request.
The degree to which each team can reasonably create its own configurations is related to the team’s competence with cloud solution architecture, cloud engineering, and cloud security. Not every person on the team may have a high level of competence; in fact, that will generally not be the case. However, at the very least, for full self-service there needs to be at least one person with strong competencies in each of those areas, who has oversight responsibilities, acts as an expert (provides assistance/mentorship within the team), and does any necessary code review.
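One way to make that rule objective is to encode it. The Python sketch below is a hypothetical illustration: the tier names and the two lower tiers are my own placeholders, and I am reading the rule as "each of the three areas is covered by at least one strongly competent (for example, certified) team member".

```python
from enum import Enum
from typing import Dict, Set

# The three competence areas named above.
AREAS = {"architecture", "engineering", "security"}


class Tier(Enum):
    FULL_SELF_SERVICE = "full self-service"
    GUARDED = "self-service within pre-approved templates"
    VENDED = "request-based provisioning"


def self_service_tier(team: Dict[str, Set[str]]) -> Tier:
    """Map a team to a tier.

    `team` maps each member's name to the set of areas in which that
    member has demonstrated strong competence.
    """
    covered: Set[str] = set().union(*team.values())
    if AREAS <= covered:
        return Tier.FULL_SELF_SERVICE   # every area has an in-team expert
    if covered & AREAS:
        return Tier.GUARDED             # some expertise, tighter guardrails
    return Tier.VENDED                  # no in-team expertise yet
```

For example, a team where one person covers architecture and security and another covers engineering would qualify for full self-service, while a team with only an engineering expert would land in the guarded tier.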
If you use CI/CD, you also want automation of such review in your pipeline, covering your infrastructure-as-code (IaC) and cloud configs and not just the app code (for example, with a tool like Concourse Labs). Even if your whole pipeline isn’t automated, you want review of IaC during the dev stage, and not just when it triggers a cloud security posture management tool (like Palo Alto’s Prisma Cloud or Turbot), whether in dev, test, or production.
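To give a flavor of what automated IaC review means, here is a toy Python sketch of a pre-deployment check. The resource schema (`type`, `public_access`, `source_ranges`, `ports`, `tags`) and the three rules are invented for illustration. It does not reflect how any of the commercial tools mentioned above work, and a real pipeline would parse actual IaC templates rather than hand-built dictionaries.

```python
from typing import Any, Dict, List


def review_iac(resources: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable findings for a list of declared resources."""
    findings: List[str] = []
    for res in resources:
        name = res.get("name", "<unnamed>")
        rtype = res.get("type")

        # Rule 1 (hypothetical): storage must not be publicly readable.
        if rtype == "storage_bucket" and res.get("public_access", False):
            findings.append(f"{name}: bucket allows public access")

        # Rule 2 (hypothetical): no SSH open to the whole internet.
        if (rtype == "firewall_rule"
                and "0.0.0.0/0" in res.get("source_ranges", [])
                and 22 in res.get("ports", [])):
            findings.append(f"{name}: SSH open to the internet")

        # Rule 3 (hypothetical): every resource needs an owner tag.
        if not res.get("tags", {}).get("owner"):
            findings.append(f"{name}: missing owner tag")
    return findings
```

A pipeline step would fail the build when `review_iac(...)` returns any findings, which catches problems at the dev stage rather than after the config is live.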
Who determines “competence”? To avoid nasty internal politics, it’s best to set this standard objectively. Certifications are a reasonable approach, but if your org isn’t the sort that tends to pay for internal certifications or the external certifications (AWS/Azure Solution Architect, DevOps Engineer, Security Engineer, etc.) seem like too high a bar, you can develop an internal training course and certification. It’s not a bad idea for all of your coders (whether app developers, data scientists, etc.) that use the cloud to get some formal training on creating good and secure cloud configurations, anyway.
(For Gartner clients: I’m happy to have a deeper discussion in inquiry. And yes, a formal research note on this is currently going through our editing process and will be published soon.)
Building cloud expertise is hard. Building multicloud expertise is even harder. By “multicloud” in this context, I mean “adopting, within your organization, multiple cloud providers that do something similar” (such as adopting both AWS and Azure).
Integrated IaaS+PaaS providers are complex and differentiated entities, in both technical and business aspects. Add in their respective ecosystems — and the way that “multicloud” vendors, managed service providers (MSPs) etc. often deliver subtly (or obviously) different capabilities on different cloud providers — and you can basically end up with a multicloud katamari that picks up whatever capabilities it randomly rolls over. You can’t treat them like commodities (a topic I cover extensively in my research note on Managing Vendor Lock-In in Cloud IaaS).
For this reason, cloud-successful organizations that build a Cloud Center of Excellence (CCOE), or even just try to wrap their arms around some degree of formalized cloud operations and governance, almost always start by implementing a single cloud provider but plan for a multicloud future.
Successfully multicloud organizations have cloud architects that deeply educate themselves on a single provider, and their cloud team initially builds tools and processes around a single provider — but the cloud architects and engineers also develop some basic understanding of at least one additional provider in order to be able to make more informed decisions. Some basic groundwork is laid for a multicloud future, often in the form of frameworks, but the actual initial implementation is single-cloud.
Governance and support for a second strategic cloud provider is added at a later date, and might not necessarily be at the same level of depth as the primary strategic provider. Scenario-specific (use-case-specific or tactical) providers are handled on a case-by-case basis; the level of governance and support for such a provider may be quite limited, or may not be supported through central IT at all.
Individual cloud engineers may continue to have single-cloud rather than multicloud skills, especially because being highly expert in multiple cloud providers tends to boost market-rate salaries to levels that many enterprises and mid-market businesses consider untenable. (Forget using training-cost payback as a way to retain people; good cloud engineers can easily get a signing bonus more than large enough to deal with that.)
In other words: while more than 80% of organizations are multicloud, very few of them consider their multiple providers to be co-equal.
What sort of org structures work well for helping to drive successful cloud adoption? Every day I talk to businesses and public-sector entities about this topic. Some have been successful. Others are struggling. And the late-adopters are just starting out and want to get it right from the start.
Back in 2014, I started giving conference talks about an emerging industry best practice — the “Cloud Center of Excellence” (CCOE) concept. I published a research note at the start of 2019 distilling a whole bunch of advice on how to build a CCOE, and I’ve spent a significant chunk of the last year and a half talking to customers about it. Now I’ve revised that research, turning it into a hefty two-part note on How to Build a Cloud Center of Excellence: part 1 (organizational design) and part 2 (Year 1 tasks).
Gartner’s approach to the CCOE is fundamentally one that is rooted in the discipline of enterprise architecture and the role of EA in driving business success through the adoption of innovative technologies. We advocate a CCOE based on three core pillars — governance (cost management, risk management, etc.), brokerage (solution architecture and vendor management), and community (driving organizational collaboration, knowledge-sharing, and cloud best practices surfaced organically).
Note that it is vital for the CCOE to be focused on governance rather than on control. Organizations that remain focused on control are less likely to deliver effective self-service, or fully unlock key cloud benefits such as agility, flexibility and access to innovation. Indeed, IT organizations that attempt to tighten their grip on cloud control often face rebellion from the business that actually decreases the power of the CIO and the IT organization.
Also importantly, we do not think that the single-vendor CCOE approaches (which are currently heavily advocated by the professional services organizations of the hyperscalers) are the right long-term solution for most customers. A CCOE should ideally be vendor-neutral and span IaaS, PaaS, and SaaS in a multicloud world, with a focus on finding the right solutions to business problems (which may be cloud or noncloud). And a CCOE is not an IaaS/PaaS operations organization — cloud engineering/operations is a separate set of organizational decisions (I’ll have a research note out on that soon, too).
Please dive into the research (Gartner paywall) if you are interested in reading all the details. I have discussed this topic with literally thousands of clients over the last half-dozen years. If you’re a Gartner for Technical Professionals client, I’d be happy to talk to you about your own unique situation.
Preface added 20 November 2020: This post received a lot more attention than I expected. I must reiterate that it is not in any way an endorsement. Indeed, sparkly pink unicorns are, by their nature, fanciful. Caution must be exercised, as sparkly pink glitter can conceal deficiencies in the equine body.
Digging into my archive of past predictions… In a research note on the convergence of public and private cloud, published almost exactly eight years ago in July 2012, I predicted that the cloud IaaS market would eventually deliver a service that delivered a full public cloud experience as if it were private cloud — at the customer’s choice of data center, in a fully single-tenant fashion.
Since that time, there have been many attempts to introduce public-cloud-consistent private cloud offerings. Gartner now has a term, “distributed cloud”, to refer to the on-premises and edge services delivered by public cloud providers. AWS Outposts deliver, as a service, a subset of AWS’s incredibly rich product portfolio. Azure Stack (now Azure Stack Hub) delivers, as software, a set of “Azure-consistent” capabilities (meaning you can transfer your scripts, tooling, conceptual models, etc., but it only supports a core set of mostly infrastructure capabilities). Various cloud MSPs, notably Avanade, will deliver Azure Stack as a managed service. And folks like IBM and Google want you to take their container platform software to facilitate a hybrid IT model.
But no one has previously delivered what I think is what customers really want:
- Location of the customer’s choice
- Single-tenant; no other customer shares the hardware/service; data guaranteed to stay within the environment
- Isolated control plane and private self-service interfaces (portal, API endpoints); no tethering or dependence on the public cloud control plane, or Internet exposure of the self-service interfaces
- Delivered as a service with the same pricing model as the public cloud services; not significantly more expensive than public cloud as long as minimum commitment is met
- All of the provider’s services (IaaS+PaaS), identical to the way that they are exposed in the provider’s public cloud regions
Why do customers want that? Because customers like everything the public cloud has to offer — all the things, IaaS and PaaS — but there are still plenty of customers who want it on-premises and dedicated to them. They might need it somewhere that public cloud regions generally don’t live and may never live (small countries, small cities, edge locations, etc.), they might have regulatory requirements they believe they can only meet through isolation, they may have security (even “national security”) requirements that demand isolation, or they may have concerns about the potential to be cut off from the rest of the world (as the result of sanctions, for instance). And because when customers describe what they want, they inevitably ask for sparkly pink unicorns, they also want all that to be as cheap as a multi-tenant solution.
And now it’s here, and given that it’s 2020… the sparkly pink unicorn comes from Oracle. Specifically, the world now has Oracle Dedicated Regions Cloud @ Customer. (Which I’m going to shorthand as OCI-DR, even though you can buy Oracle SaaS hosted on this infrastructure.) OCI’s region model, unlike its competitors, has always been all-services-in-all-regions, so the OCI-DR model continues that consistency.
In an OCI-DR deal, the customer basically provides colo (either their own data center or a third party colo) to Oracle, and Oracle delivers the same SLAs as it does in OCI public cloud. The commit is very modest — it’s $6 million a year, for a 3-year minimum, per OCI-DR Availability Zone (a region can have multiple AZs, and you can also buy multiple regions). There are plenty of cloud customers that easily meet that threshold. (The typical deal size we see for AWS contracts at Gartner is in the $5 to $15 million/year range, on 3+ year commitments.) And the pricing model and actual price for OCI-DR services is identical to OCI’s public regions.
The one common pink sparkly desire that OCI doesn’t meet is the ability to use your own hardware, which can help customers address capex vs. opex desires, may have perceived cost advantages, and may address secure supply chain requirements. OCI-DR uses some Oracle custom hardware, and the hardware is bundled as part of the service.
I predict that this will raise OCI’s profile as an alternative to the big hyperscalers, among enterprise customers and even among digital-native customers. Prior to today’s announcement, I’d already talked to Gartner clients who had been seriously engaged in sales discussions on OCI-DR; Oracle has quietly been actively engaged in selling this for some time. Oracle has made significant strides (surprisingly so) in expanding OCI’s capabilities over this last year, so when they say “all services” that’s now a pretty significant portfolio — likely enough for more customers to give OCI a serious look and decide whether access to private regions is worth dealing with the drawbacks (OCI’s more limited ecosystem and third-party tool support probably first and foremost).
As always, I’m happy to talk to Gartner clients who are interested in a deeper discussion. We’ve recently finished our Solution Scorecards (an in-depth assessment of 270 IaaS+PaaS capabilities), including our new assessment of OCI. The scores are summarized in a publicly-reprinted document. The full scorecard has been published, and the publicly-available summary says, “OCI’s overall solution score is 62 out of 100, making it a scenario-specific option for technical professionals responsible for cloud production deployments.”
We’ve just completed our 2019 evaluations of cloud IaaS providers, resulting in a new Magic Quadrant, Critical Capabilities, and six Solution Scorecards — one for each of the providers included in the Magic Quadrant. This process has also resulted in fresh benchmarking data within Gartner’s Cloud Decisions tool, a SaaS offering available to Gartner for Technical Professionals clients, which contains benchmarks and monitoring results for many cloud providers.
As part of this, we are pleased to introduce Gartner’s new Solution Scorecards, an updated document format for what we used to call In-Depth Assessments. Solution Scorecards assess an individual vendor solution against our recently-revised Solution Criteria (formerly branded Evaluation Criteria). They are highly detailed documents — typically 60 pages or so, assessing 265 individual capabilities as well as providing broader recommendations to Gartner clients.
The criteria are always divided into Required, Preferred, and Optional categories — essentially, things that everyone wants (and where they need to compensate/risk-mitigate if something is missing), things that most people want but can live without or work around readily, and things that are use case-specific. The Required, Preferred, and Optional criteria are weighted into a 4:2:1 ratio in order to calculate an overall Solution Score.
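To make the 4:2:1 weighting concrete, here is a minimal Python sketch. It assumes the Solution Score is the weighted share of criteria met, scaled to 0-100; the exact formula isn't spelled out in this post, and the function name and sample counts are hypothetical rather than taken from any real scorecard.

```python
# Weights from the post: Required : Preferred : Optional = 4 : 2 : 1.
WEIGHTS = {"required": 4, "preferred": 2, "optional": 1}


def solution_score(met, total):
    """Weighted percentage of criteria met, on a 0-100 scale (assumed formula)."""
    earned = sum(WEIGHTS[cat] * met[cat] for cat in WEIGHTS)
    possible = sum(WEIGHTS[cat] * total[cat] for cat in WEIGHTS)
    return 100.0 * earned / possible


# Hypothetical counts of criteria per category, and how many a provider meets.
total = {"required": 5, "preferred": 5, "optional": 10}
met = {"required": 4, "preferred": 3, "optional": 2}

print(solution_score(met, total))  # 60.0
```

Under this assumption, missing one Required criterion costs as much as missing four Optional ones, which is why gaps in Required functionality dominate the overall score.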
If you are a Gartner for Technical Professionals client, the scorecards are available to you today. You can access them from the links below (Gartner paywall):
- Amazon Web Services
- Microsoft Azure
- Google Cloud Platform
- IBM Cloud
- Oracle Cloud Infrastructure
- Alibaba Cloud International (English-language offerings available outside of China)
We will be providing a comparison of these vendors and their Solution Scorecards at the annual “Cloud Wars” presentation at the Gartner Catalyst conference — one of numerous great reasons to come to San Diego the week of August 11th (or Catalyst UK in London the week of September 15th)! Catalyst has tons of great content for cloud architects and other technical professionals involved in implementing cloud computing.
Note that we are specifically assessing just the integrated IaaS+PaaS offerings — everything offered through a single integrated self-service experience and on a single contract. Also, only cloud services count; capabilities offered as software, hosting, or a human-managed service do not count. Capabilities also have to be first-party.
Also note that this is not a full evaluation of a cloud provider’s entire portfolio. The scorecards have “IaaS” in the title, and the scope is specified clearly in the Solution Criteria. For the details of which specific provider services or products were or were not evaluated, please refer to each specific Scorecard document.
All the scores are current as of the end of March, and count only generally-available (GA) capabilities. Because it takes weeks to work with vendors for them to review and ensure accuracy, and time to edit and publish, some capabilities will have gone beta or GA since that time; because we only score what we’re able to test, the evaluation period has a cut-off date. After that, we update the document text for accuracy but we don’t change the numerical scores. We expect to update the Solution Scorecards approximately every 6 months, and we are working to increase our cadence for evaluation updates.
This year’s scores vs. last year’s
When you review the scores, you’ll see that broadly, the scores are lower than they were in 2018, even though all the providers have improved their capabilities. There are several reasons why the 2019 scores are lower than in previous years. (For a full explanation of the revision of the Solution Criteria in 2019, see the related blog post.)
First, for many feature-sets, several Required criteria were consolidated into a single multi-part criterion with “table stakes” functionality; missing any part of that criterion caused the vendor to receive a “No” score for that criterion (“Yes” is 1 point; “No” is zero points; there is no partial credit). The scorecard text explains how the vendor does or does not meet each portion of a criterion. The text also mentions if there is beta functionality, or if a feature was introduced after the evaluation period.
Second, many criteria that were Preferred in 2018 were promoted to Required in 2019, due to increasing customer expectations. Similarly, many criteria that were Optional in 2018 are now Preferred. We also introduced some brand-new criteria in all three categories. As a result, providers that did well primarily on table-stakes Required functionality in previous years may have scored lower this year, given the increased customer expectations reflected in the revised and new criteria.
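The all-or-nothing rule for multi-part criteria described above can be illustrated with a short Python sketch. The function name and the four-part example are hypothetical; only the rule itself ("Yes" is 1 point, "No" is zero, no partial credit) comes from the scoring description.

```python
def criterion_points(parts_met):
    """Return 1 only if every part of the criterion is met, else 0."""
    return 1 if all(parts_met) else 0


# Hypothetical "table stakes" criterion made up of four parts.
print(criterion_points([True, True, True, True]))   # 1
print(criterion_points([True, True, True, False]))  # 0
```

This is why consolidation alone can lower a score: a provider that would previously have earned three "Yes" answers out of four separate criteria now earns a single "No".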
Customizing the scores
The Solution Criteria, with all of the criteria detail, are available to all Gartner for Technical Professionals clients, and come with a spreadsheet that allows you to score any provider yourself; we also provide a filled-out spreadsheet with each Solution Scorecard so you can adapt the evaluation for your own needs. The Solution Scorecards are similarly transparent on which parts of a criterion are or aren’t met, and we link to documentation that provides evidence for each point (in some cases Gartner was provided with NDA information, in which case we tell you how to get that info from the provider).
This allows you to customize the scores as you see fit. Thus, if you decide that getting 3 out of 4 elements of a criterion is good enough for you, or you think that the thing they miss isn’t relevant to you, or you want to give the provider credit for newly-released capabilities, or you want to do region-specific scoring, you can modify the spreadsheet accordingly.
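As a rough illustration of those adjustments, here is a Python sketch of a relaxed per-criterion rule. It is not how the spreadsheet is built; the function, its parameters, and the part names are all hypothetical, and they simply model three of the tweaks mentioned above (a lower threshold, ignoring an irrelevant part, and crediting a newly-released capability).

```python
def customized_points(parts, min_parts=None, ignore=(), credit=()):
    """Score one criterion under your own rules.

    parts:     dict of part name -> True/False as assessed in the scorecard
    min_parts: how many parts must be met (default: all relevant parts)
    ignore:    parts that are not relevant to you
    credit:    parts you choose to count as met (e.g. released after cut-off)
    """
    relevant = {
        name: (is_met or name in credit)
        for name, is_met in parts.items()
        if name not in ignore
    }
    needed = len(relevant) if min_parts is None else min(min_parts, len(relevant))
    return 1 if sum(relevant.values()) >= needed else 0


# Hypothetical four-part criterion where the provider misses part "d".
parts = {"a": True, "b": True, "c": True, "d": False}

print(customized_points(parts))                 # 0, the strict published rule
print(customized_points(parts, min_parts=3))    # 1, "3 out of 4 is good enough"
print(customized_points(parts, ignore=("d",)))  # 1, part "d" isn't relevant to you
print(customized_points(parts, credit=("d",)))  # 1, "d" has since been released
```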
If you’re a Gartner client and are interested in discussing the solution criteria, assessment process, and the cloud providers, please schedule an inquiry or a 1-on-1 at Catalyst. We’d be happy to talk to you!
In February of this year, we revised the Evaluation Criteria for Cloud IaaS (Gartner paywall). The evaluation criteria (now rebranded Solution Criteria) are essentially the sort of criteria that prospective customers typically include in RFPs. They are highly detailed technical criteria, along with some objectively-verifiable business capabilities (such as elements in a technical support program, enterprise ISV partnerships, ability to support particular compliance requirements, etc.).
The Solution Criteria are intended to help cloud architects evaluate cloud IaaS providers (and integrated IaaS+PaaS providers such as the hyperscale cloud providers), whether public or private, or assess their own internal private cloud. We are about to publish Solution Scorecards (formerly branded In-Depth Assessments) for multiple providers; Gartner analysts assess these solutions hands-on and determine whether or not they have capabilities that meet the requirements of a criterion.
The TL;DR version
In summary, we revised the Solution Criteria extensively in 2019, and the results were as follows:
- The criteria have been updated to reflect the current IaaS+PaaS market.
- Expectations are significantly higher than in previous years.
- Expectations have been aligned to other Gartner research, taking into account customer wants and needs in the relevant market, not just in a cloud-specific context.
- Many capabilities have been consolidated and are now required.
- Most vendor scores in the Solution Scorecards have dropped dramatically since last year, and there is a much broader spread of vendor scores.
The Evolution of Customer Demands
The Evaluation Criteria (EC) for Cloud IaaS was first published in 2012. It received a significant update every other year (each even-numbered year) thereafter. When first written, the EC reflected the concerns of our clients at the time, many of whom were infrastructure and operations (I&O) professionals with VMware backgrounds. With each iteration, the EC evolved significantly, yet incrementally.
In the meantime, the market moved extremely quickly. The market evolution towards cloud integrated IaaS and PaaS (IaaS+PaaS) providers, and the market exit (or strategic de-investment) of many of the “commodity” providers, radically changed the structure and nature of the market over time. Cloud IaaS providers weren’t just expected to provide “hardware infrastructure”, but also “software infrastructure”, including all of the necessary management and automation. This essentially forced these providers into introducing services that compete in many IT markets and in an extraordinary number of software niches.
Furthermore, as the market matured, the roles and expectations of our clients also evolved significantly. The focus shifted to enterprise-wide initiatives, rather than project-based adoption. Digital business transformation elevated the importance of cloud-native workloads, while IT transformation emphasized the need for high-quality cloud migration of existing workloads. The notion that a cloud IaaS provider could successfully run all, or almost all, of a customer’s IT became part of the assumptions that needed to underpin the provider evaluation process.
Today’s cloud IaaS customers have high expectations. Experienced customers are becoming more sophisticated, but late adopters also have high expectations of a provider that have to be met to help the customer overcome barriers to adoption.
For 2019, we decided to take a look at the EC “from scratch”, in order to try to construct a list of criteria that are the most relevant to the initiatives of customers today. In many cases, our clients are trying to pick a primary strategic IaaS provider. In other cases, our clients already have a primary provider but are trying to pick a strategic secondary provider as they implement a multicloud strategy. Finally, some of our clients are choosing a provider for a tactical need, but still need to understand that provider’s capabilities in detail.
Constructing the Revision
The revision needed to keep a similar number of criteria (in order to keep the assessment time manageable and the assessment itself at a readable length) — we ended up with 265 for 2019.
In order to keep the total number of criteria down, we needed to consolidate closely-related criteria into a single criterion. Many criteria became multi-part as a result. We tried to consolidate the “table stakes” functionality that could be assumed to be a part of all (or almost all) cloud IaaS offerings, in order to make room for more differentiated capabilities.
We tried to be as vendor-neutral as possible. The evaluation criteria have evolved since the initial 2012 introduction; when we introduced new criteria in the past, we often ended up with criteria requirements that closely mirrored the feature-set of the first provider to offer a capability, since that provider shaped customer expectations. In this 2019 revision, we tried to go back to the core customer requirements, without concern as to whether cloud provider implementations fully aligned with those requirements — the criteria are intended to reflect what customers want and not what vendors offer. There are requirements that no vendors meet, but which we often hear our clients ask for; in such cases we tried to phrase those requirements in ways that are reasonable and implementable at scale, as it’s okay for the criteria to be somewhat aspirational for the market.
We tried to make sure that the criteria were worded using standard Gartner terms or general market terminology, avoiding vendor-specific terms. (Note that because vendors not-infrequently adopt Gartner terms, there were cases where providers had adopted terminology from earlier versions of EC, and we made no attempt to alter such terms.)
We tried to keep to requirements, without dictating implementation, where possible. However, we had to keep in mind that in cloud IaaS, where there are customers who want fine-grained visibility and control over the infrastructure, there still must be implementation specificity when the customer explicitly wants those elements exposed.
Defining the Criteria
During the process of determining the criteria, we sought input broadly within Gartner, both in terms of discussing the criteria with other analysts as well as incorporating things from existing Gartner written research. (And the criteria reflect, as much as possible, the discussions we’ve had with clients about what they’re looking for, and what they’re putting into their RFPs.)
In some cases, we needed input from specialists in a topic. In some areas of technology, clients who need to have deep-dive discussions on features may talk almost exclusively to analysts specialized in those areas. Those analysts are familiar with current requirements as well as the future of those technology areas, and are thus the best source for determining those needs. For example, areas such as machine learning and IoT are primarily covered by analysts with those specializations, even when the customers are implementing cloud solutions. There are also areas, such as security, where we have detailed cloud recommendations from those teams, so we extensively incorporated their input.
We also looked at non-cloud capabilities when there were market gaps relative to customer desires. There are areas where either cloud providers do not currently have capabilities, or where those capabilities are relatively nascent. Thus, we needed to identify where customers are using on-premises solutions, and want cloud solutions. We also needed to determine what the “minimum viable product” should be for the purposes of constructing a criterion around it.
Feedback from non-cloud analysts was also important because it identified areas where clients were not using a cloud solution because of something that was missing. In many cases, these were not technology features, but issues around transparency, or the lack of solutions acceptable on a global basis.
Finally, the way that customers source solutions, build applications, and manage their data is changing. We tried to ensure that the new criteria aligned with these trends.
Because more and more of our clients are deploying cloud solutions globally, every criterion also had some requirements as to its global availability. These are used only for advisory purposes and are not part of scoring.
The vendors were allowed to give feedback on the criteria prior to publication. We wanted to check if the criteria were reasonable, and seemed fair. We incorporated feedback that constituted good, vendor-neutral suggestions that aligned to customer requirements.
The End Results
When you see the Solution Scorecards, you may be surprised by lower scores on the part of many of the providers. We’re being transparent about the Evaluation Criteria (Solution Criteria) revision in order to help you understand why the scores are lower.
The lower scores were an unintentional side-effect of the revision, but reflect, to some degree, the state of the market relative to the very high expectations of customers. Note that this year’s lower scores do not indicate that providers have “gone backwards” or removed capabilities; they just reflect the provider’s status against a raised bar of customer expectations.
We expect that when we update the scorecards in the second half of this year, scores will increase, as many of the vendors have since introduced missing capabilities, or will do so by the next update. We retain confidence that the solution criteria are a good reflection of a broad range of current customer expectations. Because many vendors are doing a good job of listening to what customers and prospects want, and planning accordingly, we think that the solution criteria will also be reflected in future vendor roadmaps and market development.
We discuss the Solution Scorecards and scores in a separate blog post.