Cloud self-service doesn’t need to invite the orc apocalypse
I spend quite a bit of time talking to clients about developer self-service, largely in the context of public cloud governance and cloud operations. There are still lots of infrastructure and operations (I&O) executives who instinctively cringe at the notion of developer self-service, as if self-service would open formerly well-defended gates onto a pristine plain of well-controlled infrastructure, and allow a horde of unwashed orcs to overrun the concrete landscape in a veritable explosion of Lego structures, dot-matrix printouts, Snickers wrappers and lost whiteboard marker caps… never to be clean and orderly again.
It doesn’t have to be that way.
Self-service — and more broadly, developer control over infrastructure — isn’t an all-or-nothing proposition. Responsibility can be divided across the application life cycle, so that you can get benefits from “You build it, you run it” without necessarily parachuting your developers into an untamed and unknown wilderness and wishing them luck in surviving because it’s not an Infrastructure & Operations (I&O) team problem any more.
So we ask, instead:
- Will developers design their own infrastructure?
- Will developers control their dev/test environments?
- How much autonomy will developers have in building production environments?
- How much autonomy will developers have for production deployments?
- To what extent are developers responsible for day-to-day production maintenance (patching, OS updates, infrastructure rightsizing, etc.)?
- To what extent are developers responsible for incident management?
- How much help will developers receive for the things they’re responsible for?
I talk to far too many IT leaders who say, “We can’t give developers cloud self-service because we’re not ready for You build it, you run it!” whereupon I need to gently but firmly remind them that it’s perfectly okay to allow your developers full self-service access to development and testing environments, and the ability to build infrastructure as code (IaC) templates for production, without making them fully responsible for production.
This is the subject of my new research note, “How to Empower Technical Teams Through Self-Service Public Cloud IaaS and PaaS”. (Gartner for Technical Professionals paywall)
This is a step along the way to a deeper exploration of finding the right balance between “Dev” and “Ops” in DevOps, which is an organization-specific thing. This is not just a cloud thing; it also impacts the structure of operations on-premises. Every discussion of SRE, platform ops, etc. ultimately revolves around the questions of autonomy, governance, and collaboration, and no two organizations are likely to arrive at the exact same balance. (And don’t get me started on how many orgs rename their I&O teams to SRE teams without actually implementing much if anything from the principles of SRE.)
Resilience: Cloudy without a chance of meatballs
In the wake of AWS’s major US-East-1 incident of December 7th, 2021, I’ve fielded plenty of panicked client inquiry about whether anyone can trust any cloud provider, whether the availability zone model actually works, and whether or not the customer’s current architecture offers adequate resilience for their needs.
I’ve also dealt with more than a handful of journalists who have wanted to push a narrative that AWS customers are fleeing in droves and/or are going multicloud as a result of that outage. Every story I’ve read on that subject has tried its darnedest to imply something which just isn’t true. Yes, many organizations use multiple cloud providers. No, they don’t do so for resilience, but rather, because differing preferences within the organization have led to adopting more than one provider.
The fact that it’s now more than two months since the outage and I’m still talking about it with clients (and my colleagues are too) does reflect how large it looms in the mind of customers — including customers of other cloud providers — though. Indeed, it looms large in the mind of many AWS customers who were not affected, either because they don’t run in US-East-1 or because their failover to another region worked as planned.
At this point, not only have my colleagues and I talked to quite a few organizations but we’ve also talked to providers of disaster recovery software and services. Thus far, it appears that customers that had problems with cross-region recovery during the 12/7/21 incident either violated AWS best practices for cross-region recovery, or violated their vendor’s advice.
That’s not to say that there weren’t two important unpleasant surprises in terms of US-East-1 dependencies:
- The global console URL was pointing to US-East-1 alone (rather than being geo load balanced or the like, which most people would probably have assumed). Customers could get around this by going to a regional console URL instead. I believe (but haven’t confirmed) AWS has now introduced a truly global endpoint with the introduction of the new console experience.
- Route 53 and CloudFront’s control plane APIs are hosted solely in US-East-1. People reasonably expect to be able to make DNS changes during an outage, even though AWS advises that you use health checks for your failover instead.
Either of those two things could have thrown a wrench into cross-region recovery, along with needing to create new S3 buckets (the global namespace conflict checks are done against US-East-1), needing very specific instance types in short supply in the target region, needing to create new IAM roles (which are first created in US-East-1 and then replicated to other regions), and depending on the legacy STS global namespace (also US-East-1 dependent). But by and large, cross-region recovery worked as expected.
Now, there are certainly plenty of people who can’t do fast failover into another region and who therefore sat tight and suffered through the incident, and there’s a nontrivial number of customers who haven’t laid foundations for disaster recovery (however slowly) into another region. I get it — being able to do this kind of recovery requires an investment. You want cloud providers to be so resilient that you don’t need to make that investment yourself. But hope is not a strategy, here.
But the sky did not fall, and the sky is not falling. Cloud has not suddenly become less attractive or significantly more risky. AZ architectures work, but as always, problems with regional services (which are already designed to be multi-AZ) mean that multi-AZ might not be enough for the most critical applications. Cross-region failover works, when properly architected. (Fast and seamless failover and failback are critical, though; major cloud incidents to date have generally been multi-hour, but not multi-day. If you can’t fail over easily and fail back without a lot of effort, you tend to just wait out the outage and hope it’s short.)
Yes, there were significant problems for many customers in US-East-1. API Gateway was essentially down, and many people are dependent on API Gateway to invoke Lambda, and tons of customers use Lambda in a mission-critical fashion. Amazon Connect also depends on API Gateway, and it was also affected. (Other casualties of the backend network issues: ELB launches, S3 private endpoints, Fargate APIs impacting container launches, STS for EKS, and the support APIs.) But EC2 virtual machines continued to function just fine (although you couldn’t launch new ones). The overwhelming majority of AWS services in the region continued to operate unimpacted, and customers who did not have dependencies on affected services were able to continue operating in the region.
In a way, this was a stark demonstration of how much cloud outages are usually confined to specific services… but if a down service is critical to your application, you’re probably boned unless you have a workaround or you can fail over into another region. Unfortunately, far too many customers persist in planning as if physical data center failure were the most likely event. (AWS had one of those in December, too: a power outage in a single data center, thus impacting a percentage of infrastructure in one of the six US-East-1 AZs.)
Yes, the incident was a wake-up call for a lot of cloud customers, and it was a rallying cry for on-premises server-huggers. However, not only is the sky not falling, but there should be no anticipation that it will rain meatballs.
I wrote a number of blog posts months before this outage:
- The cloud is NOT just someone else’s computer
- Multicloud failover is almost always a terrible idea
- Improving cloud resilience through stuff that works
and I still firmly stand by those posts now. (Importantly, I still believe multicloud for resilience is almost always impractical. Successful implementations are vanishingly rare and have horrible drawbacks.) Indeed, I’d been working on a piece of Gartner research with my colleagues Kevin Matheny (who covers application architecture), Stanton Cole and Fintan Quinn (who cover backup and DR), which I’m glad to say has finally published:
Designing Availability and Resilience for Applications in Public Cloud IaaS and PaaS (Gartner for Technical Professionals paywall)
In this, you’ll find what I hope is a pragmatic set of guidance advising that you figure out how critical an application is, and then choose your availability approach and your failover approach accordingly — and not forget the critical importance of designing and implementing resilience within your application. It’s got a lengthy dissection of all the things that can go wrong in the cloud, and what you should be thinking about when you architect. It also contains a sample architectural standard that cloud governance teams can provide to application architects to help them make these decisions. (The main doc runs 65 pages. The impatient will probably find the architectural standard, which is fairly short, to be easier reading.)
My Q1 2022 research agenda
This is the time of year when AR professionals ask analysts what they’re planning for next year. I don’t plan a year in advance. I tend not to even plan a quarter in advance. I write when the mood seizes me, which is probably unfortunate, but given that I write a lot it’s… okay-ish?
But I have a bunch of things drafted (either fully or partially) that should get released in Q1 of next year. I have a general goal of trying to ensure that I back the advice I write for cloud architects with something for the CIO and other executive leaders that provides a bottom-line strategic summary, and/or material for the other teams the architects work with, so that I publish stuff as part of a set, either alone or in collaboration with analysts in other areas.
Cloud resilience (January): It’s increasingly common for clients to ask about architectural standards for HA/DR in the cloud. This note dissects why cloud services break, how to set architectural standards for HA and DR/failover (i.e. when to be multi-AZ, when to do cross-region failover, etc.), and some basic guidance on stability patterns (use of partitioning, bulkheads, backpressure, etc.)
Cloud self-service (January): Thankfully, most organizations are moving away from a service catalog-driven approach to cloud self-service in favor of cloud-native self-service. This note is about how to empower technical teams with self-service, while still providing appropriate governance.
The cloud operating model (February): Many clients are asking about how to organize for the cloud. This will be a triple-note set — one on designing a cloud operating model, one on implementing the operating model, and a colorful infographic summarizing the concept for the CIO and other executive leaders. It combines my previous guidance on Cloud Center of Excellence (CCOE), structuring FinOps and cloud sourcing, etc. with some new work on program management, and takes a deeper look at all the ways you can put this stuff together.
Cloud concentration risk (March): Concentration risk is a hot topic right now, especially in regulated industries. This concern spans IaaS, PaaS, and SaaS, and the dependencies are not always clear, so many organizations have concentration risk they’re not aware of. I intend to write a baseline note that other analysts have committed to contextualizing for audiences in different industries, as well as for cloud managed and professional services providers. While the sourcing risk of concentration remains minimal, the availability risks of concentration can be meaningful. An organization’s risk appetite and the business benefits of concentration should determine what, if any, steps they take to address concentration risk.
IaaS+PaaS provider evaluation update (April): Getting updated vendor evaluation research out in April basically means spending a good chunk of the first quarter doing that evaluation. (My January notes have already been written. And the February ones are mostly complete, so the schedule above isn’t implausible.) We are not currently discussing the form that this evaluation will take. Gartner management will communicate appropriately when the time comes (i.e. please don’t ask me, as I’m not at liberty to discuss it).
Cloud cost overruns may be a business leadership failure
A couple of months back, some smart folks at VC firm Andreessen Horowitz wrote a blog post called “The Cost of Cloud, a Trillion Dollar Paradox”. Among other things, the blog made a big splash because it claimed, quote: “[W]hile cloud clearly delivers on its promise early on in a company’s journey, the pressure it puts on margins can start to outweigh the benefits, as a company scales and growth slows.” It claimed that cloud overspending was resulting in huge loss of market value, and that developers needed incentives to reduce spending.
The blog post is pretty sane, but plenty of people misinterpreted it, or took away only its most sensationalistic aspects. I think it’s critical to keep in mind the following:
Decisions about cloud expenditures are ultimately business decisions. Unnecessarily high cloud costs are the result of business decisions about priorities — specifically, about the time that developers and engineers devote to cost optimization versus other priorities.
For example, when developer time is at a premium, and pushing out features as fast as possible is the highest priority, business leadership can choose to allow the following things that are terrible for cloud cost:
- Developers can ignore all annoying administrative tasks, like rightsizing the infrastructure or turning off stuff that isn’t in active use.
- Architects can choose suboptimal designs that are easier and faster to implement, but which will cost more to run.
- Developers can implement crude algorithms and inefficient code in order to more rapidly deliver a feature, without thinking about performance optimizations that would result in less resource consumption.
- Developers can skip implementing support for more efficient consumption patterns, such as autoscaling.
- Developers can skip implementing deployment automation that would make it easier to automatically rightsize — potentially compounded by implementing the application in ways that are fragile and make it too risky and effortful to manually rightsize.
All of the above is effectively a form of technical debt. In the pursuit of speed, developers can consume infrastructure more aggressively themselves — not bothering to shut down unused infrastructure, running more CI jobs (or other QA tests), running multiple CI jobs in parallel, allocating bigger faster dev/test servers, etc. — but that’s short-term, not an ongoing cost burden the way that the technical debt is. (Note that the same prioritization issues also impact the extent to which developers cooperate in implementing security directives. That’s a tale for another day.)
The more those things are combined — bad designs, poorly implemented, that you can’t easily rightsize or scale — the more that you have a mess that you can’t untangle without significant expenditure of development time.
Now, some organizations will go put together a “FinOps” team to play whack-a-mole with infrastructure — killing/parking stuff that is idle and rightsizing the waste. And that might help short-term, but until you can automate that basic cost hygiene, this is non-value-added people-intensive work. And woe betide you if your implementations are fragile enough that rightsizing is operationally risky.
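The basic cost hygiene in question (parking idle resources, flagging oversized ones) is the kind of thing that can be scripted. Here is a minimal sketch in Python, assuming you already have peak utilization figures from your monitoring tooling. The `Instance` record, the sample fleet, and both thresholds are hypothetical and for illustration only; they are not taken from any provider's API.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical record; a real implementation would populate this
    # from the provider's monitoring service.
    name: str
    vcpus: int
    peak_cpu_pct: float  # peak CPU utilization over the lookback window


# Illustrative thresholds (assumptions; tune for your environment).
IDLE_PEAK_PCT = 5.0
OVERSIZED_PEAK_PCT = 40.0


def recommend(inst: Instance) -> str:
    """Return 'park', 'downsize', or 'keep' for one instance."""
    if inst.peak_cpu_pct < IDLE_PEAK_PCT:
        return "park"      # never meaningfully used: stop it or schedule it off
    if inst.peak_cpu_pct < OVERSIZED_PEAK_PCT and inst.vcpus > 1:
        return "downsize"  # ample headroom even at peak
    return "keep"


# Made-up fleet for demonstration.
fleet = [
    Instance("build-agent-7", vcpus=8, peak_cpu_pct=2.1),
    Instance("api-prod-1", vcpus=16, peak_cpu_pct=31.0),
    Instance("db-prod-1", vcpus=8, peak_cpu_pct=92.0),
]

for inst in fleet:
    print(inst.name, recommend(inst))
```

A recommendation is only the cheap part. Acting on "downsize" safely still depends on the deployment automation and application robustness discussed above.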
Once you’ve got your whack-a-mole down to a nice quick automated cadence, you’ve got to address the application design and implementation technical debt — and invest in the discipline of performance engineering — or you’ll continue paying unnecessarily high bills month after month. (You’d also be oversizing on-prem infrastructure, but people are used to that, and the capital expenditure is money spent, versus the grind of a monthly cloud bill.)
Business leaders have to step up to prioritize cloud cost optimization, or acknowledge that it isn’t a priority, and that it’s okay to waste money on resources as long as the top line is increasing faster. As long as that’s a conscious, articulated decision, that’s fine. But we shouldn’t pretend that developers are inherently irresponsible. Developers, like other employees, respond to incentives, and if they’re evaluated on their velocity of feature delivery, they’re going to optimize their work efforts towards that end.
For more details, check out my new research note called “Is FinOps the Answer to Cloud Cost Governance?” which is paywalled and targeted at Gartner’s executive leader clients — a combination of CxOs and business leaders.
The cloud is NOT just someone else’s computer
I recently wrote a Twitter thread about cloud risk and resilience that drew a lot of interest, so I figured I’d expand on it in a blog post. I’ve been thinking about cloud resilience a lot recently, given that clients have been asking about how they manage their risks.
Inquiries about this historically come in waves, almost always triggered by incidents that raise awareness (unfortunately often because the customer has been directly impacted). A wave generally spans a multi-week period, causing waves to bleed into one another. Three distinct sets come to mind over the course of 2021:
- The Azure AD outages earlier this year had a huge impact on client thinking about concentration risks and critical service dependencies — often more related to M365 than Azure, though (and exacerbated by the critical dependency that many organizations have on Teams during this pandemic). Azure AD is core to SSO for many organizations, making its resilience enormously impactful. These impacts are still very top of mind for many clients, months later.
- The Akamai outage (and other CDN outages with hidden dependencies) this summer raised application and infrastructure dependency awareness, and came as a shock to many customers, as Akamai has generally been seen as a bedrock of dependability.
- The near-daily IBM Cloud “Severity 1” outages over the last month have drawn selective client mentions, rather than a wave, but add to the broader pattern of cloud risk concerns. (To my knowledge, there has been no public communication from IBM regarding root cause of these issues. Notifications indicate the outages are multi-service and multi-regional, often impacting all Gen 2 multizone regions. Kubernetes may be something of a common factor, to guess from the impact scope.)
Media amplification of outage awareness appears to have a lot to do with how seriously they’re taken by customers — or non-customers. Affecting stuff that’s consumed by end-users — i.e. office suites, consumer websites, etc. — gets vastly more attention than things that are “just” a really bad day for enterprise ops people. And there’s a negative halo effect — i.e. if Provider X fails, it tends to raise worries about all their competitors too. But even good media explanations and excellent RCAs tend to be misunderstood by readers — and even by smart IT people. This leads, in turn, to misunderstanding why cloud services fail and what the real risks are.
I recently completed my writing on a note about HA and failover (DR) patterns in cloud IaaS and PaaS, with a light touch on application design patterns for resilience. However, concerns about cloud resilience apply just as much (if not more so) to SaaS, especially API SaaS, which creates complicated and deep webs of dependencies.
You can buy T-shirts, stickers, and all manner of swag that says, “The cloud is just somebody else’s computer.” Cute slogan, but not true. Cloud services, especially at massive scale, are incredibly complex software systems. Complex software systems don’t fail the way a “computer” fails. The cloud exemplifies the failure principles laid out by Richard Cook in his classic “How Complex Systems Fail”.
As humans, we are really bad at figuring out the risk of complex systems, especially because the good ones are heavily defended against failure. And we tend to over-index on rare but dramatic risks (a plane crash) versus more commonplace risks (a car crash).
If you think about “my application hosted on AWS” as “well, it’s just sitting on a server in an AWS data center rather than mine”, then at some point in time, the nature of a failure is going to shock you, because you are wrong.
Cloud services fail after all of the resiliency mechanisms have failed (or sometimes, gone wrong in ways that contribute to the failure). Cloud services tend to go boom because of one or more software bugs, likely combined with either a configuration error or some kind of human error (often related to the deployment process for new configs and software versions). They are only rarely related to a physical failure — and generally the physical failure only became apparent to customers because the software intended to provide resilience against it failed in some fashion.
Far too many customers still think about cloud failure as a simple, fundamentally physical thing. Servers fail, so we should use more than one. Data centers fail, so we should be able to DR into another. Etc. But that model is wrong for cloud and for the digital age. We want to strive for continuous availability and resilience (including graceful degradation and other ways to continue business functionality when the application fails). And we have to plan for individual service failures rather than total cloud failure (whether in an AZ, region, or globally). Such failures can be small-scale, and effectively merely “instability” rather than an “outage”, and therefore demand apps that are resilient to service errors.
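To make "resilient to service errors" concrete, here is a minimal Python sketch of one common pattern: retry a flaky dependency a few times with jittered exponential backoff, then degrade gracefully to a fallback instead of failing the whole request. The function names, the simulated service, and the default parameters are all hypothetical, chosen for illustration; a production version would catch only the specific exceptions your client library raises.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `primary` up to `attempts` times, then return `fallback()`."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:  # broad on purpose for the sketch; narrow this in real code
            if attempt < attempts - 1:
                # Full jitter: wait a random time up to base_delay * 2^attempt.
                sleep(random.uniform(0, base_delay * 2 ** attempt))
    return fallback()


# Hypothetical dependency that is "unstable" rather than fully down:
# it fails twice, then recovers.
calls = {"n": 0}


def flaky_recommendations() -> List[str]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient service error")
    return ["personalized-1", "personalized-2"]


def cached_bestsellers() -> List[str]:
    return ["bestseller-1"]  # degraded, but the page still renders


print(call_with_fallback(flaky_recommendations, cached_bestsellers,
                         sleep=lambda s: None))
```

The `sleep` parameter is injectable only so the example (and any tests) can run without real delays.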
So as cloud buyers, we have to think about our risks differently, and we need to architect and operate differently. But we also need to trust our providers — and trust smartly. To that end, cloud providers need to support us with transparency, so we can make more informed decisions. Key elements of that include:
- Publicly-documented engineering service-level objectives (SLOs), which are usually distinct from the financially-backed SLAs. This is what cloud providers design to internally and measure themselves against, and knowing that helps inform our own designs and internal SLOs for our apps.
- Service architecture documentation that helps us understand the ways a service is and isn’t resilient, so we can design accordingly.
- Documented service dependency maps, which allow us to see the chain of dependencies for each of the services we use, allowing us to think about if Service X is really the best fallback alternative if Service Y goes down, as well as inform our troubleshooting.
- Public status dashboards, clearly indicating the status of services, with solid historical data that allows us to see the track record of service operations. This helps with our troubleshooting and user communication.
- Public outage root-cause analyses (RCAs), which allow us to understand why outages occurred, and receive a public pledge as to what will be done to prevent similar failures in the future. A historical archive of these is also a valuable resource.
- Change transparency that could help predict stability concerns. Because so many outages end up being related to new deployments / config changes, and the use of SRE principles, including error budgets, is pretty pervasive amongst cloud providers, there is often an interesting pattern to outages. Changes tend to freeze when the error budget is exceeded, leading to an on-and-off pattern of outages; instability can resume at intervals unpredictable to the customer.
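For readers unfamiliar with error budgets, the arithmetic is simple. The sketch below, in Python, assumes a time-based availability SLO measured over a rolling window; the function names and the 30-day window are my own illustrative choices, not any provider's published method.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def changes_frozen(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """True once consumed downtime meets or exceeds the budget."""
    return downtime_minutes >= error_budget_minutes(slo, window_days)


budget = error_budget_minutes(0.999)  # "three nines" over 30 days
print(round(budget, 1))                              # 43.2
print(changes_frozen(0.999, downtime_minutes=50.0))  # True
print(changes_frozen(0.999, downtime_minutes=10.0))  # False
```

A team that burns through its roughly 43 minutes pauses risky changes until the window rolls forward. From the outside, that looks like the on-and-off pattern of instability described in the last bullet.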
Mission-critical cloud applications are becoming commonplace, both in the pervasive use of SaaS and in the widespread production use of IaaS and PaaS. It’s past time to modernize thinking about cloud operations, cloud resilience, and cloud BC/DR. Cloud risk management needs to be about intelligent mitigation and not avoidance, as forward-thinking businesses will not accept simply avoiding the cloud at this point.
I am interested in your experiences with resilience as well as cloud instability and outages. Feel free to DM me on Twitter to chat about it.
Refining the Cloud Center of Excellence
What sort of org structures work well for helping to drive successful cloud adoption? Every day I talk to businesses and public-sector entities about this topic. Some have been successful. Others are struggling. And the late-adopters are just starting out and want to get it right from the start.
Back in 2014, I started giving conference talks about an emerging industry best practice — the “Cloud Center of Excellence” (CCOE) concept. I published a research note at the start of 2019 distilling a whole bunch of advice on how to build a CCOE, and I’ve spent a significant chunk of the last year and a half talking to customers about it. Now I’ve revised that research, turning it into a hefty two-part note on How to Build a Cloud Center of Excellence: part 1 (organizational design) and part 2 (Year 1 tasks).
Gartner’s approach to the CCOE is fundamentally one that is rooted in the discipline of enterprise architecture and the role of EA in driving business success through the adoption of innovative technologies. We advocate a CCOE based on three core pillars — governance (cost management, risk management, etc.), brokerage (solution architecture and vendor management), and community (driving organizational collaboration, knowledge-sharing, and cloud best practices surfaced organically).
Note that it is vital for the CCOE to be focused on governance rather than on control. Organizations that remain focused on control are less likely to deliver effective self-service, or fully unlock key cloud benefits such as agility, flexibility and access to innovation. Indeed, IT organizations that attempt to tighten their grip on cloud control often face rebellion from the business that actually decreases the power of the CIO and the IT organization.
Also importantly, we do not think that the single-vendor CCOE approaches (which are currently heavily advocated by the professional services organizations of the hyperscalers) are the right long-term solution for most customers. A CCOE should ideally be vendor-neutral and span IaaS, PaaS, and SaaS in a multicloud world, with a focus on finding the right solutions to business problems (which may be cloud or noncloud). And a CCOE is not an IaaS/PaaS operations organization — cloud engineering/operations is a separate set of organizational decisions (I’ll have a research note out on that soon, too).
Please dive into the research (Gartner paywall) if you are interested in reading all the details. I have discussed this topic with literally thousands of clients over the last half-dozen years. If you’re a Gartner for Technical Professionals client, I’d be happy to talk to you about your own unique situation.
Gartner’s cloud IaaS assessments, 2019 edition
We’ve just completed our 2019 evaluations of cloud IaaS providers, resulting in a new Magic Quadrant, Critical Capabilities, and six Solution Scorecards — one for each of the providers included in the Magic Quadrant. This process has also resulted in fresh benchmarking data within Gartner’s Cloud Decisions tool, a SaaS offering available to Gartner for Technical Professionals clients, which contains benchmarks and monitoring results for many cloud providers.
As part of this, we are pleased to introduce Gartner’s new Solution Scorecards, an updated document format for what we used to call In-Depth Assessments. Solution Scorecards assess an individual vendor solution against our recently-revised Solution Criteria (formerly branded Evaluation Criteria). They are highly detailed documents — typically 60 pages or so, assessing 265 individual capabilities as well as providing broader recommendations to Gartner clients.
The criteria are always divided into Required, Preferred, and Optional categories — essentially, things that everyone wants (and where they need to compensate/risk-mitigate if something is missing), things that most people want but can live without or work around readily, and things that are use case-specific. The Required, Preferred, and Optional criteria are weighted into a 4:2:1 ratio in order to calculate an overall Solution Score.
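To illustrate how a 4:2:1 weighting plays out, here is a small Python sketch. The exact Gartner formula isn't spelled out here, so this assumes one plausible reading: the score is the weighted percentage of criteria met. The per-category counts are made up for the example.

```python
from typing import Dict

# Weights from the 4:2:1 ratio described above.
WEIGHTS = {"required": 4, "preferred": 2, "optional": 1}


def solution_score(met: Dict[str, int], total: Dict[str, int]) -> float:
    """Weighted percentage of criteria met (assumed formula, not Gartner's published one)."""
    earned = sum(WEIGHTS[c] * met[c] for c in WEIGHTS)
    possible = sum(WEIGHTS[c] * total[c] for c in WEIGHTS)
    return 100.0 * earned / possible


# Hypothetical counts, for illustration only.
total = {"required": 100, "preferred": 100, "optional": 65}
met = {"required": 80, "preferred": 60, "optional": 30}

print(round(solution_score(met, total), 1))
```

Under this reading, missing a Required criterion costs four times as much as missing an Optional one, so performance on Required criteria dominates the overall score.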
If you are a Gartner for Technical Professionals client, the scorecards are available to you today. You can access them from the links below (Gartner paywall):
- Amazon Web Services
- Microsoft Azure
- Google Cloud Platform
- IBM Cloud
- Oracle Cloud Infrastructure
- Alibaba Cloud International (English-language offerings available outside of China)
We will be providing a comparison of these vendors and their Solution Scorecards at the annual “Cloud Wars” presentation at the Gartner Catalyst conference — one of numerous great reasons to come to San Diego the week of August 11th (or Catalyst UK in London the week of September 15th)! Catalyst has tons of great content for cloud architects and other technical professionals involved in implementing cloud computing.
Note that we are specifically assessing just the integrated IaaS+PaaS offerings — everything offered through a single integrated self-service experience and on a single contract. Also, only cloud services count; capabilities offered as software, hosting, or a human-managed service do not count. Capabilities also have to be first-party.
Also note that this is not a full evaluation of a cloud provider’s entire portfolio. The scorecards have “IaaS” in the title, and the scope is specified clearly in the Solution Criteria. For the details of which specific provider services or products were or were not evaluated, please refer to each specific Scorecard document.
All the scores are current as of the end of March, and count only generally-available (GA) capabilities. Because it takes weeks to work with vendors for them to review and ensure accuracy, and time to edit and publish, some capabilities will have gone beta or GA since that time; because we only score what we’re able to test, the evaluation period has a cut-off date. After that, we update the document text for accuracy but we don’t change the numerical scores. We expect to update the Solution Scorecards approximately every 6 months, and are working to increase our cadence for evaluation updates.
This year’s scores vs. last year’s
When you review the scores, you’ll see that broadly, the scores are lower than they were in 2018, even though all the providers have improved their capabilities. There are several reasons why the 2019 scores are lower than in previous years. (For a full explanation of the revision of the Solution Criteria in 2019, see the related blog post.)
First, for many feature-sets, several Required criteria were consolidated into a single multi-part criterion with “table stakes” functionality; missing any part of that criterion caused the vendor to receive a “No” score for that criterion (“Yes” is 1 point; “No” is zero points; there is no partial credit). The scorecard text explains how the vendor does or does not meet each portion of a criterion. The text also mentions if there is beta functionality, or if a feature was introduced after the evaluation period.
Second, many criteria that were Preferred in 2018 were promoted to Required in 2019, due to increasing customer expectations. Similarly, many criteria that were Optional in 2018 are now Preferred. We introduced some brand-new criteria to all three categories as well, but providers that might have done well primarily on table-stakes Required functionality in previous years may have scored lower this year due to the increased customer expectations reflected by revised and new criteria.
Customizing the scores
The Solution Criteria document, with all of the criteria detail, is available to all Gartner for Technical Professionals clients, and comes with a spreadsheet that allows you to score any provider yourself; we also provide a filled-out spreadsheet with each Solution Scorecard so you can adapt the evaluation for your own needs. The Solution Scorecards are similarly transparent on which parts of a criterion are or aren’t met, and we link to documentation that provides evidence for each point (in some cases Gartner was provided with NDA information, in which case we tell you how to get that info from the provider).
This allows you to customize the scores as you see fit. Thus, if you decide that getting 3 out of 4 elements of a criterion is good enough for you, or you think that the thing they miss isn’t relevant to you, or you want to give the provider credit for newly-released capabilities, or you want to do region-specific scoring, you can modify the spreadsheet accordingly.
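As a rough sketch of what such a customization could look like outside the spreadsheet, the function below relaxes the strict scoring in two ways: it accepts a minimum number of parts, and it lets you ignore parts you consider irrelevant. The function name, parameters, and part names are all assumptions made for this example and do not describe the actual spreadsheet.

```python
from typing import Dict, Iterable, Optional


def custom_score(
    parts: Dict[str, bool],
    ignore: Iterable[str] = (),
    min_met: Optional[int] = None,
) -> int:
    """Score one criterion as 1 or 0 under buyer-defined rules.

    parts   -- part name -> whether the provider meets it
    ignore  -- part names the buyer considers irrelevant
    min_met -- how many relevant parts must be met (default: all of them)
    """
    ignored = set(ignore)
    relevant = [met for name, met in parts.items() if name not in ignored]
    required = len(relevant) if min_met is None else min(min_met, len(relevant))
    return 1 if sum(relevant) >= required else 0


# Hypothetical criterion with four parts, one of which is not met.
parts = {"part-1": True, "part-2": True, "part-3": True, "part-4": False}

print(custom_score(parts))                     # 0: strict all-or-nothing rule
print(custom_score(parts, min_met=3))          # 1: 3 out of 4 is good enough
print(custom_score(parts, ignore=["part-4"]))  # 1: the missing part is irrelevant
```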
If you’re a Gartner client and are interested in discussing the solution criteria, assessment process, and the cloud providers, please schedule an inquiry or a 1-on-1 at Catalyst. We’d be happy to talk to you!
Updating Gartner’s cloud IaaS evaluation criteria
In February of this year, we revised the Evaluation Criteria for Cloud IaaS (Gartner paywall). The evaluation criteria (now rebranded Solution Criteria) are essentially the sort of criteria that prospective customers typically include in RFPs. They are highly detailed technical criteria, along with some objectively-verifiable business capabilities (such as elements in a technical support program, enterprise ISV partnerships, ability to support particular compliance requirements, etc.).
The Solution Criteria are intended to help cloud architects evaluate cloud IaaS providers (and integrated IaaS+PaaS providers such as the hyperscale cloud providers), whether public or private, or assess their own internal private cloud. We are about to publish Solution Scorecards (formerly branded In-Depth Assessments) for multiple providers; Gartner analysts assess these solutions hands-on and determine whether or not they have capabilities that meet the requirements of a criterion.
The TL;DR version
In summary, we revised the Solution Criteria extensively in 2019, and the results were as follows:
- The criteria have been updated to reflect the current IaaS+PaaS market.
- Expectations are significantly higher than in previous years.
- Expectations have been aligned to other Gartner research, taking into account customer wants and needs in the relevant market, not just in a cloud-specific context.
- Many capabilities have been consolidated and are now required.
- Most vendor scores in the Solution Scorecards have dropped dramatically since last year, and there is a much broader spread of vendor scores.
The Evolution of Customer Demands
The Evaluation Criteria (EC) for Cloud IaaS was first published in 2012. It received a significant update every other year (each even-numbered year) thereafter. When first written, the EC reflected the concerns of our clients at the time, many of whom were infrastructure and operations (I&O) professionals with VMware backgrounds. With each iteration, the EC evolved significantly, yet incrementally.
In the meantime, the market moved extremely quickly. The market evolution towards cloud integrated IaaS and PaaS (IaaS+PaaS) providers, and the market exit (or strategic de-investment) of many of the “commodity” providers, radically changed the structure and nature of the market over time. Cloud IaaS providers weren’t just expected to provide “hardware infrastructure”, but also “software infrastructure”, including all of the necessary management and automation. This essentially forced these providers into introducing services that compete in many IT markets and in an extraordinary number of software niches.
Furthermore, as the market matured, the roles and expectations of our clients also evolved significantly. The focus shifted to enterprise-wide initiatives, rather than project-based adoption. Digital business transformation elevated the importance of cloud-native workloads, while IT transformation emphasized the need for high-quality cloud migration of existing workloads. The notion that a cloud IaaS provider could successfully run all, or almost all, of a customer’s IT became part of the assumptions that needed to underpin the provider evaluation process.
Today’s cloud IaaS customers have high expectations. Experienced customers are becoming more sophisticated, but late adopters also have high expectations of a provider that have to be met to help the customer overcome barriers to adoption.
For 2019, we decided to take a look at the EC “from scratch”, in order to try to construct a list of criteria that are the most relevant to the initiatives of customers today. In many cases, our clients are trying to pick a primary strategic IaaS provider. In other cases, our clients already have a primary provider but are trying to pick a strategic secondary provider as they implement a multicloud strategy. Finally, some of our clients are choosing a provider for a tactical need, but still need to understand that provider’s capabilities in detail.
Constructing the Revision
The revision needed to keep a similar number of criteria (in order to keep the assessment time manageable and the assessment itself at a readable length) — we ended up with 265 for 2019.
In order to keep the total number of criteria down, we needed to consolidate closely-related criteria into a single criterion. Many criteria became multi-part as a result. We tried to consolidate the “table stakes” functionality that could be assumed to be a part of all (or almost all) cloud IaaS offerings, in order to make room for more differentiated capabilities.
We tried to be as vendor-neutral as possible. The evaluation criteria have evolved since the initial 2012 introduction; when we introduced new criteria in the past, we often ended up with criteria requirements that closely mirrored the feature-set of the first provider to offer a capability, since that provider shaped customer expectations. In this 2019 revision, we tried to go back to the core customer requirements, without concern as to whether cloud provider implementations fully aligned with those requirements — the criteria are intended to reflect what customers want and not what vendors offer. There are requirements that no vendors meet, but which we often hear our clients ask for; in such cases we tried to phrase those requirements in ways that are reasonable and implementable at scale, as it’s okay for the criteria to be somewhat aspirational for the market.
We tried to make sure that the criteria were worded using standard Gartner terms or general market terminology, avoiding vendor-specific terms. (Note that because vendors not-infrequently adopt Gartner terms, there were cases where providers had adopted terminology from earlier versions of EC, and we made no attempt to alter such terms.)
We tried to keep to requirements, without dictating implementation, where possible. However, we had to keep in mind that in cloud IaaS, where there are customers who want fine-grained visibility and control over the infrastructure, there still must be implementation specificity when the customer explicitly wants those elements exposed.
Defining the Criteria
During the process of determining the criteria, we sought input broadly within Gartner, both in terms of discussing the criteria with other analysts as well as incorporating things from existing Gartner written research. (And the criteria reflect, as much as possible, the discussions we’ve had with clients about what they’re looking for, and what they’re putting into their RFPs.)
In some cases, we needed input from specialists in a topic. In some areas of technology, clients who need to have deep-dive discussions on features may talk almost exclusively to analysts specialized in those areas. Those analysts are familiar with current requirements as well as the future of those technology areas, and are thus the best source for determining those needs. For example, areas such as machine learning and IoT are primarily covered by analysts with those specializations, even when the customers are implementing cloud solutions. There are also areas, such as Security, where we have detailed cloud recommendations from those teams, so we extensively incorporated their input.
We also looked at non-cloud capabilities when there were market gaps relative to customer desires. There are areas where either cloud providers do not currently have capabilities, or where those capabilities are relatively nascent. Thus, we needed to identify where customers are using on-premises solutions, and want cloud solutions. We also needed to determine what the “minimum viable product” should be for the purposes of constructing a criterion around it.
Feedback from non-cloud analysts was also important because it identified areas where clients were not using a cloud solution because of something that was missing. In many cases, these were not technology features, but issues around transparency, or the lack of solutions acceptable on a global basis.
Finally, the way that customers source solutions, build applications, and manage their data is changing. We tried to ensure that the new criteria aligned with these trends.
Because more and more of our clients are deploying cloud solutions globally, every criterion also had some requirements as to its global availability. These are used only for advisory purposes and are not part of scoring.
The vendors were allowed to give feedback on the criteria prior to publication. We wanted to check if the criteria were reasonable, and seemed fair. We incorporated feedback that constituted good, vendor-neutral suggestions that aligned to customer requirements.
The End Results
When you see the Solution Scorecards, you may be surprised by lower scores on the part of many of the providers. We’re being transparent about the Evaluation Criteria (Solution Criteria) revision in order to help you understand why the scores are lower.
The lower scores were an unintentional side-effect of the revision, but reflect, to some degree, the state of the market relative to the very high expectations of customers. Note that this year’s lower scores do not indicate that providers have “gone backwards” or removed capabilities; they just reflect the provider’s status against a raised bar of customer expectations.
We expect that when we update the scorecards in the second half of this year, scores will increase, as many of the vendors have since introduced missing capabilities, or will do so by the next update. We retain confidence that the solution criteria are a good reflection of a broad range of current customer expectations. Because many vendors are doing a good job of listening to what customers and prospects want, and planning accordingly, we think that the solution criteria will also be reflected in future vendor roadmaps and market development.
We discuss the Solution Scorecards and scores in a separate blog post.
Critical Capabilities launched, new Magic Quadrant starting
The Critical Capabilities for Public Cloud IaaS, 2016 has now been published. The Critical Capabilities is a technical assessment of public cloud IaaS offerings against a set of use cases — cloud-native applications, general business applications, application development environments, batch computing, and (new for 2016) the Internet of Things. It’s part of our integrated series of cloud IaaS assessments and complements our Magic Quadrant for Cloud IaaS (Gartner clients: see interactive version).
We are now launching right back into the Magic Quadrant cycle for 2017, with the goal of publishing a new Magic Quadrant in April 2017, and a new Critical Capabilities shortly thereafter.
A lot has happened since the early-2016 research process for our 2016 Magic Quadrant and Critical Capabilities cycle for this market. Multiple providers have launched new offerings and are phasing out their previous offerings, and there are some important new market entrants. We want to make sure that our research notes offer current representations of provider capabilities. (Usefully, a shift to April publication also gets us back to a schedule that aligns with our infrastructure & operations conference season.)
In previous years, we’ve issued an open invitation for the pre-qualification survey to all cloud IaaS providers. This year, we are not doing so; instead, we have issued invitations only to providers who we believe are highly likely to qualify.
If you are a cloud IaaS provider that did not receive an invitation, but you believe you are highly likely to qualify for inclusion, please email me at Lydia dot Leong at Gartner dot com to discuss it.
Gartner’s cloud IaaS assessments, 2016 edition
We’re pleased to announce that the 2016 Magic Quadrant for Cloud Infrastructure, Worldwide has been published. (Link requires a Gartner subscription. If you’re not a Gartner client, there are free reprints available through vendors, and various press articles, such as the Tech Republic analysis. Note that press articles do not always accurately reflect our opinions, though.)
Producing the Magic Quadrant is a huge team effort that involves many people across Gartner, including many analysts who aren’t credited as co-authors, administrative support staff, and people in our primary-research and benchmarking groups. The team effort also reflects the way that we produce an entire body of IaaS research as an integrated effort across Gartner’s research divisions. (The approach described below is specific to our IaaS research and may not apply to Gartner’s assessments in other markets.)
Whether you already have a cloud IaaS provider and are just looking for a competitive check-up, you’re thinking of adding one or more additional providers, or you’re just getting started with cloud IaaS, our work can help you find the providers that are right for you.
The TL;DR list of assessments:
- Magic Quadrant (market and technical evaluation)
- Evaluation Criteria (230+ technical and service traits to look for in a provider)
- In-Depth Assessments (detailed assessments of specific providers against the Evaluation Criteria)
- Critical Capabilities (use-case-based technical evaluations; 2016 update coming soon)
- CloudHarmony and Tech Planner Cloud Module (real-world stats and cost-performance comparisons)
- Peer Insights (IT leaders review providers)
(Note that not all of these might be available as part of your current Gartner client subscription.)
Gartner has produced a Magic Quadrant for Cloud IaaS since 2011. The MQ is our overall perspective on the market, looking at the provider solutions from both a technical and business angle. Gartner clients can use the interactive MQ tool to change the weightings of the criteria to suit their own evaluation priorities (if you read the detailed criteria descriptions, there’s an explanation of how each criterion maps to buyer priorities). The interactive MQ can also be used to get a multi-year historical perspective.
The MQ covers public, hosted private, and industrialized outsourced private cloud IaaS; it’s not just a public cloud MQ. We look at both multi-tenant and single-tenant cloud IaaS offerings, located on either provider or customer premises. We also look at the full range of compute options (VMs, bare-metal servers, containers) that are delivered in a cloud model (API-provisionable via automation, and metered by the hour or less), not just VMs. In addition, we consider some integrated PaaS-layer services (we call these cloud software infrastructure services, which include things like database as a service), but we have a separate enterprise application PaaS MQ for pure aPaaS. While we consider the provider’s overall value proposition in the context of cloud IaaS (including their ability to deliver managed services, network services, etc.), this isn’t a general cloud computing or outsourcing MQ.
2016 marks our sixth iteration of a pure cloud IaaS MQ. Previously, in 2009 and 2010, we included cloud IaaS in our hosting MQ, but by 2010, it was already clear that the hosting and cloud IaaS buyer wants and needs were distinctly different. Since 2011, we’ve produced a global cloud IaaS MQ, along with three regional hosting MQs (suitable for customers looking for dedicated servers or managed hosting on a monthly or annual basis), and three regional data center outsourcing MQs (which include customized private cloud services as part of a broader portfolio of infrastructure outsourcing capabilities). Not every infrastructure need can or should be met with cloud IaaS.
The core foundation of our assessment is our Evaluation Criteria for Cloud IaaS. Over the years, we’ve converged the technical-detail questionnaire that we ask providers to fill out during the Magic Quadrant research process, with the Gartner for Technical Professionals (GTP) document that we produce to guide buyers on evaluating providers. This has resulted in nearly 250 service traits that the Evaluation Criteria document categorizes as Required (almost all Gartner clients are likely to want these things and these have the potential to be showstoppers if missing), Preferred (many will want these things), and Optional (use-case-specific needs). This gives us a consistent set of formal definitions for service features — things you can put a clear yes/no to. As a buyer, you can use the Evaluation Criteria to score any cloud IaaS provider — and even score your own IT department’s private cloud.
In the course of doing this particular Magic Quadrant, providers fill out very detailed questionnaires that list these service features and capabilities (broken down even more granularly than in the Evaluation Criteria), indicating whether their service has those traits, and they’re also asked to provide evidence, like documentation. We also ask them to provide other information like the location of their data centers, languages supported across various aspects of service delivery (like portal localization and tech-support languages spoken), a copy of their standard contract and SLAs, and so forth. We score those questionnaires (and check service features against documentation, and with hands-on testing if need be). We also score things like the buyer-friendliness of contracts, based on the presence/absence of particular clauses. Those component scores are used in many different individual scoring categories within the Magic Quadrant.
We also produce a set of In-Depth Assessments for the providers that our clients are most interested in evaluating. The In-Depth Assessments are detailed documents that score an individual provider against the Evaluation Criteria; for every criterion, we explain how the provider does and doesn’t meet it, and we provide links to the corresponding documentation or other evidence. The results of our hands-on testing are noted, as well. For many buyers, this minimizes the need to conduct an RFP that dives into the technical solution; here we’ve done a very detailed fact-based analysis for you, and the provider has verified the accuracy of the information. (Buyer beware, though: Providers sometimes produce something that looks like one of these assessments, even quoting the Gartner definitions, but with their own more generous self-assessment rather than the stringent Gartner-produced assessment!)
Then, we produce Critical Capabilities for Public Cloud IaaS (2016 update still in progress). This technical assessment looks at a single public cloud IaaS offering from each of the providers included in the Magic Quadrant. The same technical traits used in the other assessments are used here, but they are divided into categories of capabilities, and those capabilities are weighted in a set of common use cases. You can also customize your own set of weightings. In addition to providing quantitative scores, we summarize, in a fair amount of detail, the technical capabilities of each evaluated provider. This allows you to get a sense of what providers are likely to be right for your needs, without having to go through the full deep-dive of reading the In-Depth Assessments. (Critical Capabilities are also available to all Gartner clients and reprints may be offered by providers on their websites, whereas the In-Depth Assessments are only available to GTP clients.)
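To make the idea of use-case weighting concrete, here is a minimal sketch that combines per-capability scores into a use-case score with a weighted average. The capability names, the score scale, the weights, and the weighted-average formula are all assumptions chosen for illustration; the text above says only that capabilities are weighted per use case and that you can supply your own weightings.

```python
from typing import Dict


def use_case_score(
    capability_scores: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted average of capability scores for one use case.

    Both dictionaries are keyed by capability name; every weighted
    capability must have a score.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(capability_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


# Hypothetical capability scores for one provider (made-up scale).
scores = {"compute": 4.0, "storage": 3.0, "networking": 2.0}

# Hypothetical weightings for two use cases; a buyer could substitute their own.
cloud_native_weights = {"compute": 5, "storage": 3, "networking": 2}
batch_weights = {"compute": 8, "storage": 2, "networking": 0}

print(use_case_score(scores, cloud_native_weights))  # 3.3
print(use_case_score(scores, batch_weights))         # 3.8
```

The same provider scores differently per use case because each use case emphasizes different capabilities, which is the point of customizing the weightings.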
Performance, and price-performance, is important to many buyers. Gartner provides hardware benchmarking via a SaaS offering called Tech Planner. We offer a Cloud Module within Tech Planner that uses technology that we derived from our acquisition of CloudHarmony. We conduct continuous automated testing on many cloud IaaS providers, including all providers in the Magic Quadrant. We benchmark compute performance for the full range of VMs and bare-metal cloud servers offered by the provider, along with storage performance and network performance; we use this to calculate price-performance metrics. We monitor the availability of their services across the globe. We track provisioning times. All this data is used as objective components to the scores within the Magic Quadrant. Much of this data is directly available to Tech Planner customers, who can use these tools to calculate performance-equivalencies as well as determine where workloads will be most cost-effective.
Finally, we collect end-user reviews of cloud IaaS providers, called Peer Insights. IT leaders (who do not need to be Gartner clients) can submit reviews of their providers; we verify that reviews are legitimate, and it’s one of the very few places where you’ll see senior IT executives and architects writing detailed reviews of their providers. We use this data, along with vendor-provided customer references, and the many thousands of client conversations we have each year with cloud IaaS buyers, as part of the fact base for our Magic Quadrant scoring.
More than a dozen analysts are directly involved in all of these assessments, and many more analysts provide peer-review input into those assessments. It’s an enormous effort, involving a great deal of teamwork, to produce this body of interlinked research. We’re always trying to improve its quality, so we welcome your feedback!
You can DM me on Twitter at @cloudpundit or send email to lydia dot leong at gartner.com.