Category Archives: Governance
I spend quite a bit of time talking to clients about developer self-service, largely in the context of public cloud governance and cloud operations. There are still lots of infrastructure and operations (I&O) executives who instinctively cringe at the notion of developer self-service, as if self-service would open formerly well-defended gates onto a pristine plain of well-controlled infrastructure, and allow a horde of unwashed orcs to overrun the concrete landscape in a veritable explosion of Lego structures, dot-matrix printouts, Snickers wrappers and lost whiteboard marker caps… never to be clean and orderly again.
It doesn’t have to be that way.
Self-service — and more broadly, developer control over infrastructure — isn’t an all-or-nothing proposition. Responsibility can be divided across the application life cycle, so that you can get benefits from “You build it, you run it” without necessarily parachuting your developers into an untamed and unknown wilderness and wishing them luck in surviving because it’s not an Infrastructure & Operations (I&O) team problem any more.
So we ask, instead:
- Will developers design their own infrastructure?
- Will developers control their dev/test environments?
- How much autonomy will developers have in building production environments?
- How much autonomy will developers have for production deployments?
- To what extent are developers responsible for day-to-day production maintenance (patching, OS updates, infrastructure rightsizing, etc.)?
- To what extent are developers responsible for incident management?
- How much help will developers receive for the things they’re responsible for?
I talk to far too many IT leaders who say, “We can’t give developers cloud self-service because we’re not ready for You build it, you run it!” whereupon I need to gently but firmly remind them that it’s perfectly okay to allow your developers full self-service access to development and testing environments, and the ability to build infrastructure as code (IaC) templates for production, without making them fully responsible for production.
This is the subject of my new research note, “How to Empower Technical Teams Through Self-Service Public Cloud IaaS and PaaS”. (Gartner for Technical Professionals paywall)
This is a step along the way to a deeper exploration of finding the right balance between “Dev” and “Ops” in DevOps, which is an organization-specific thing. This is not just a cloud thing; it also impacts the structure of operations on-premises. Every discussion of SRE, platform ops, etc. ultimately revolves around the questions of autonomy, governance, and collaboration, and no two organizations are likely to arrive at the exact same balance. (And don’t get me started on how many orgs rename their I&O teams to SRE teams without actually implementing much if anything from the principles of SRE.)
Cloud budget overruns don’t have a singular cause. Instead, they come in a bright rainbow of jelly belly flavors (the Bertie Botts ones, especially, will combine into a non-mouthwatering delight). Each needs different forms of response.
Ungoverned costs. This is the black licorice of FinOps problems. The organization has no idea what it’s spending, really, much less where the money is going, other than the big bills (or often, many little credit card bills) that they pay each month. This requires basic cost hygiene: analyze your cloud bills, get a cost management tool into place and make it useful through some tagging or partitioning discipline.
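As a rough illustration of what tagging discipline buys you, here is a minimal Python sketch that rolls up billing line items by a cost-allocation tag and surfaces the spend nobody has claimed. The record layout, the `team` tag key, and the amounts are all hypothetical; real billing exports have provider-specific fields and far more rows.

```python
from collections import defaultdict

# Hypothetical billing line items; real exports have provider-specific fields.
LINE_ITEMS = [
    {"service": "compute", "cost_cents": 120_000, "tags": {"team": "payments"}},
    {"service": "storage", "cost_cents": 30_000, "tags": {"team": "payments"}},
    {"service": "compute", "cost_cents": 80_000, "tags": {"team": "analytics"}},
    {"service": "compute", "cost_cents": 45_000, "tags": {}},  # nobody owns this
]

UNTAGGED = "(untagged)"


def cost_by_tag(line_items, tag_key):
    """Sum cost per value of tag_key; items missing the tag go to UNTAGGED."""
    totals = defaultdict(int)
    for item in line_items:
        owner = item["tags"].get(tag_key, UNTAGGED)
        totals[owner] += item["cost_cents"]
    return dict(totals)


def untagged_share(line_items, tag_key):
    """Fraction of total spend that cannot be attributed via tag_key."""
    totals = cost_by_tag(line_items, tag_key)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return 0.0
    return totals.get(UNTAGGED, 0) / grand_total


if __name__ == "__main__":
    for owner, cents in sorted(cost_by_tag(LINE_ITEMS, "team").items()):
        print(f"{owner}: ${cents / 100:,.2f}")
    print(f"untagged share: {untagged_share(LINE_ITEMS, 'team'):.0%}")
```

Tracking the untagged share over time is one simple way to tell whether the tagging discipline is actually taking hold.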
Unanticipated usage. This is the sour watermelon flavor of cost overruns — deliciously sweet yet mouth-puckering. In this situation, the organization is the victim of its own cloud success. Cloud has been such a great thing for the organization that more and more unanticipated cloud projects are showing up, blowing out the original budget estimates for cloud resources. Those cloud projects are delivering business value and it doesn’t make sense to say no to them (and even if central IT says no, the cloud costs can usually be paid for out of a line-of-business budget). Nevertheless, it’s causing a lot of organizational angst because central IT or the sourcing team didn’t anticipate this spending. This organization needs to learn to shift its budgeting processes for the digital future, and cloud chargeback will help support future decision-making.
No commitments. This is the minty wrongness of Bertie Botts toothpaste. The organization could get discounts by using public discounting mechanisms for commits (like AWS Savings Plans and Azure Reserved Instances) as well as making a contractual commitment for a negotiated discount. But because the organization feels like they can’t perfectly predict their use and aren’t sure if they’ll use all of what they’re using today, they commit to nothing, therefore ensuring that they spend grotesquely more than they need to. This is universally a terrible idea. Organizations that aren’t in early pilot stage have long-term production applications and some predictability of usage; commit to the stuff you know you’re not killing off.
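The arithmetic behind "commit to the stuff you know you're not killing off" can be sketched in a few lines of Python. This is a deliberately simplified model under stated assumptions: the $1.00/hour rate and the 30% discount are made-up illustrative numbers, not any provider's actual pricing, and real commitment mechanisms have their own terms and units.

```python
def monthly_cost_cents(usage_hours, committed_hours, rate_cents, discount_pct):
    """Monthly cost when committed hours are paid for whether used or not.

    Committed hours are billed at the discounted rate; usage above the
    commitment is billed at the full on-demand rate.
    """
    committed_cost = committed_hours * rate_cents * (100 - discount_pct) // 100
    overage_hours = max(0, usage_hours - committed_hours)
    return committed_cost + overage_hours * rate_cents


def break_even_hours(committed_hours, discount_pct):
    """Usage level below which the commitment costs more than pure on-demand."""
    return committed_hours * (100 - discount_pct) / 100


if __name__ == "__main__":
    RATE_CENTS = 100   # hypothetical $1.00 per instance-hour
    DISCOUNT_PCT = 30  # hypothetical commitment discount

    # Today's usage is 1,000 hours; commit only to a 700-hour baseline.
    print(monthly_cost_cents(1000, 0, RATE_CENTS, DISCOUNT_PCT))    # no commitment
    print(monthly_cost_cents(1000, 700, RATE_CENTS, DISCOUNT_PCT))  # baseline committed
    # Even if usage falls to 600 hours, the commitment is still cheaper.
    print(monthly_cost_cents(600, 700, RATE_CENTS, DISCOUNT_PCT))
    print(monthly_cost_cents(600, 0, RATE_CENTS, DISCOUNT_PCT))
    print(break_even_hours(700, DISCOUNT_PCT))
```

Under these assumed numbers, usage would have to fall below 490 hours, less than half of today's level, before the 700-hour commitment became a loss. That is why a partial commitment to the predictable baseline is usually a safe bet even when the total is uncertain.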
Dev/test waste. This is the mundane bleah-ness of Bertie Botts earwax. Developers are provisioning the biggest things they can get away with (or at least being overaggressive in their estimates of what they need), there are lots of abandoned resources idling away, and dev/test infrastructure that isn’t used outside of business hours isn’t being suspended when unused. This is what cloud cost management tools are great at doing — identifying obvious waste so that it can be eliminated, largely by shutting it down or suspending it, preferably via automation.
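The "suspend it outside business hours" automation boils down to a very small rule. Here's a minimal Python sketch of that rule; the `Resource` record, the environment labels, and the Monday-to-Friday 08:00-18:00 window are assumptions for illustration, and a real implementation would call the provider's API (and handle time zones) rather than just return names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Resource:
    name: str
    environment: str  # hypothetical labels: "dev", "test", or "prod"
    running: bool


# Assumed business hours: Monday-Friday, 08:00-18:00 local time.
BUSINESS_DAYS = range(0, 5)  # datetime.weekday(): Monday is 0
BUSINESS_START_HOUR = 8
BUSINESS_END_HOUR = 18
SUSPENDABLE_ENVIRONMENTS = {"dev", "test"}


def in_business_hours(now: datetime) -> bool:
    """True if `now` falls inside the assumed business-hours window."""
    return (now.weekday() in BUSINESS_DAYS
            and BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR)


def suspend_candidates(resources, now: datetime):
    """Names of running dev/test resources that could be suspended right now."""
    if in_business_hours(now):
        return []
    return [r.name for r in resources
            if r.running and r.environment in SUSPENDABLE_ENVIRONMENTS]
```

Run on a schedule, a check like this never touches production and leaves already-stopped resources alone; abandoned resources need a separate rule (for example, based on last-used time), which this sketch doesn't attempt.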
Too much production headroom. This is the mild weirdness of the Bertie Botts grass flavor. Application teams haven’t implemented autoscaling for applications that can scale horizontally, or they’ve overestimated how much production headroom an application with variable usage needs (which may result in oversizing compute units, or being overly aggressive with autoscaling). This requires implementing autoscaling with some thoughtful tuning of parameters, and possibly a business value conversation on the cost/benefit tradeoff of having higher application performance on a consistent basis.
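To make the tuning tradeoff concrete, here's a minimal Python sketch of a proportional scaling rule: pick the instance count that brings average utilization back to a target. This is the same basic shape as the formula documented for the Kubernetes Horizontal Pod Autoscaler, but the function name, bounds, and numbers here are illustrative assumptions, and real autoscalers add cooldowns and smoothing that aren't shown.

```python
def desired_instances(current_instances, current_util_pct, target_util_pct,
                      min_instances=1, max_instances=100):
    """Instances needed to bring average utilization back to the target."""
    if target_util_pct <= 0:
        raise ValueError("target_util_pct must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = -((-current_instances * current_util_pct) // target_util_pct)
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # Four instances running hot at 90% average utilization:
    for target in (75, 60, 50):
        print(target, desired_instances(4, 90, target))
```

With these assumed numbers, a 75% target asks for 5 instances, 60% asks for 6, and 50% asks for 8. The lower the target, the more headroom you carry and the more you pay for it, which is exactly the cost/benefit conversation to have with the business.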
Wrongsizing production. This is the awful lingering terribleness of Bertie Botts vomit, whose taste you cannot get out of your mouth. Production environments are statically overprovisioned and therefore overly costly. On-prem, 30% utilization is common, but it’s all capex and as long as it’s within budget, no one really cares about the waste. But in the cloud, you pay for that excess resource monthly, forcing you to confront the ongoing cost of the waste.
However, anyone who tells you to “just” rightsize has never actually tried to do this in practice within an enterprise. The problem is that applications that scale vertically typically can’t be easily rightsized. It’s likely difficult-to-impossible to do automatically, due to complicated application installation. The application is fragile and may be mission-critical, so you are cautious about maintenance downtime. And the application team — the only people who really understand how this thing works — is likely busy with other priorities.
If this is your situation, your cloud cost management tool may cause you to cry hopeless tears, because you can see the waste but taking remediation actions is a complicated cross-functional war dance and delicate negotiation that leaves everyone wondering if it wouldn’t have been easier to just keep paying a larger bill.
Suboptimal design and implementation. The controversial popcorn flavor. Architects are sometimes cost-oblivious when they design cloud solutions. They may make bad design choices, or changes in application features and behavior over time may have turned out to make a design choice unexpectedly expensive. Developers may write poorly-performing code that consumes a lot of infrastructure resources, or code that makes excessive (and, cumulatively, expensive) calls to cloud services. Your cloud cost management tools are unlikely to be of any use for detecting these situations. This needs to be addressed through performance engineering, with attention paid to the business value of the time/effort/money necessary to do so — and for many organizations may require bringing in third-party expertise to diagnose the problems and offer recommendations.
Notably, the answer to most of these issues is not “implement a cloud cost management tool”. The challenges aren’t really as simple as a lot of vendors (and talking heads) make them out to be.
Many of my client inquiries deal with the seemingly overwhelming complexity of maturing cloud adoption — especially with the current wave of pandemic-driven late adopters, who are frequently facing business directives to move fast but see only an immense tidal wave of insurmountably complex tasks.
A lot of my advice is focused on starting small — or at least tackling reasonably-scoped projects. The following is specifically applicable to IaaS / IaaS+PaaS:
Build a cloud center of excellence. You can start a CCOE with just a single person designated as a cloud architect. Standing up a CCOE is probably going to take you a year of incremental work, during which cloud adoption, especially pilot projects, can move along. You might have to go back and retroactively apply governance and good practices to some projects. That’s usually okay.
Start with one cloud. Don’t go multicloud from the start. Do one. Get good at it (or at least get a reasonable way into a successful implementation). Then add another. If there’s immediate business demand (with solid business-case justifications) for more than one, get an MSP to deal with the additional clouds.
Don’t build a complex governance and ops structure based on theory. Don’t delay adoption while you work out everything you think you’ll need to govern and manage it. If you’ve never used cloud before, the reality may be quite different from what you have in your head. Run a sequence of increasingly complex pilot projects to gain practical experience while you do preparatory work in the background. Take the lessons learned and apply them to that work.
Don’t build massive RFPs to choose a provider. Almost all organizations are better off considering their strategic priorities and then matching a cloud provider to those priorities. (If priorities are bifurcated between running the legacy and building new digital capabilities, this might encourage two strategic providers, which is fine and commonplace.) Massive RFPs are a lot of work and are rarely optimal. (Government folks might have no choice, unfortunately.)
Don’t try to evaluate every service. Hyperscale cloud providers have dozens upon dozens of services. You won’t use all of them. Don’t bother to evaluate all of them. If you think you might use a service in the future, and you want to compare that service across providers… well, by the time you get around to implementing it, all of the providers will have radically updated that service, so any work you do now will be functionally useless. Look just at the services that you are certain you will use immediately and in the very near (no more than one year) future. Validate a subset of services for use, and add new validations as needed later on.
Focus on thoughtful workload placement. Decide who your approved and preferred providers are, and build a workload placement policy. Look for “good technical fit” and not necessarily ideal technical fit; integration affinities and similar factors are more important. The time to do a detailed comparison of an individual service’s technical capabilities is when deciding workload placement, not during the RFP phase.
Accept the limits of cloud portability. Cloud providers don’t and will probably never offer commoditized services. Even when infrastructure resources seem superficially similar, there are still meaningful differences, and the management capabilities wrapped around those resources are highly differentiated. You’re buying into ecosystems that have the long-term stickiness of middleware and management software. Don’t waste time on single-pane-of-glass no-lock-in fantasies, no matter how glossily pretty the vendor marketing material is. And no, containers aren’t magic in this regard.
Links are to Gartner research and are paywalled.
Responsibility for cloud operations is often a political football in enterprises. Sometimes nobody wants it; it’s a toxic hot potato that’s apparently coated in developer cooties. Sometimes everybody wants it, and some executives think that control over it is going to ensure their next promotion / a handsome bonus / attractiveness for their next job. Frequently, developers and the infrastructure & operations (I&O) orgs clash over it. Sometimes, CIOs decide to just stuff it into a Cloud Center of Excellence team which started out doing architecture and governance, and then finds itself saddled with everything else, too.
Lots of arguments are made for it to live in particular places and to be executed in various ways. There’s inevitably a clash between the “boring” stuff that is basically lifted-and-shifted and rarely changes, and the fast-moving agile stuff. And different approaches to IaaS, PaaS, and SaaS. And and and…
Well, the fact of the matter is that multiple people are probably right. You don’t actually want to take a one-size-fits-all approach. You want to fit operational approaches to your business needs. And you maybe even want to have specialized teams for each major hyperscale provider, even if you adopt some common approaches across a multicloud environment. (Azure vs. non-Azure, i.e. Azure vs. AWS, is a common split, often correlated closely to Windows-based application environments vs Linux-based application environments.)
Ideally, you’re going to be highly automated, agile, cloud-native, and collaborative between developers and operators (i.e. DevOps). But maybe not for everything (i.e. not all apps are under active development).
Plus, once you’ve chosen your basic operations approach (or approaches), you have to figure out how you’re going to handle cloud configuration, release engineering, and security responsibilities. (And all the upskilling necessary to do that well!)
That’s where people tend to really get hung up. How much responsibility can I realistically push to my development teams? How much responsibility do they want? How do I phase in new operational approaches over time? How do I hook this into existing CI/CD, agile, and DevOps initiatives?
There’s no one right answer. However, there’s one answer that is almost always wrong, and that’s splitting cloud operations across the I&O functional silos — i.e., the server team deals with your EC2 VMs, your NetApp storage admin deals with your Azure Blobs, your F5 specialist configures your Google Load Balancers, your firewall team fights with your network team over who controls the VPC config (often settled, badly, by buying firewall virtual appliances), etc.
When that approach is taken, the admins almost always treat the cloud portals like they’re the latest pointy-clicky interface for a piece of hardware. This pretty much guarantees incompetence, lack of coordination, and gross inefficiency. It’s usually terrible regardless of what scale you’re at. Unfortunately, it’s also the first thing that most people try (closely followed by massively overburdening some poor cloud architect with Absolutely Everything Cloud-Related).
What works for most orgs: Some form of cloud platform operations, where cloud management is treated like a “product”. It’s almost an internal cloud MSP approach, where the cloud platform ops team delivers a CMP suite, cloud-enabled CI/CD pipeline integrations, templates and automation, other cloud engineering, and where necessary, consultative assistance to coders and to application management teams. That team is usually on call for incident response, but the first line for incidents is usually the NOC or the like, and the org’s usual incident management team.
But there are lots of options. Gartner clients: Want a methodical dissection of pros and cons; cloud engineering, operating, and administration tasks; job roles; coder responsibilities; security integration; and other issues? Read my new note, “Comparing Cloud Operations Approaches”, which looks at eleven core patterns along with guidance for choosing between them, and making a range of accompanying decisions.
A nontrivial chunk of my client conversations are centered on the topic of cloud IaaS/PaaS self-service, and how to deal with development teams (and other technical end-user teams, e.g. data scientists, researchers, hardware engineers, etc.) that use these services. These teams, and the individuals within those teams, often have different levels of competence with the clouds, operations, security, etc. but pretty much all of them want unfettered access.
Responsible governance requires appropriate guidelines (policies) and guardrails, and some managers and architects feel that there should be one universal policy, and everyone — from the highly competent digital business team, to the data scientists with a bit of ad-hoc infrastructure knowledge — should be treated identically for the sake of “fairness”. This tends to be a point of particular sensitivity if there are numerous application development teams with similar needs, but different levels of cloud competence. In these situations, applying a single approach is deadly — either for agility or your crisis-induced ulcer.
Creating a structured, tiered approach, with different levels of self-service and associated governance guidelines and guardrails, is the most flexible approach. Furthermore, teams that deploy primarily using a CI/CD pipeline have different needs from teams working manually in the cloud provider portal, which in turn are different from teams that would benefit from having an easy-vend template that gets provisioned out of a ServiceNow request.
The degree to which each team can reasonably create its own configurations is related to the team’s competence with cloud solution architecture, cloud engineering, and cloud security. Not every person on the team may have a high level of competence; in fact, that will generally not be the case. However, for full self-service there needs to be at least one person with strong competencies in each of those areas, who has oversight responsibilities, acts as an expert (provides assistance/mentorship within the team), and does any necessary code review.
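One way to keep this rule objective is to write it down as a check rather than a judgment call. The Python sketch below is a hypothetical encoding, not a prescribed method: the area names, the `Member` record, and the "full" versus "guided" tier labels are all invented for illustration. It takes the stricter reading of the rule, that a single person must be strong in all three areas; a looser per-area reading would be a small change.

```python
from dataclasses import dataclass, field

# The three competence areas named in the text; identifiers are hypothetical.
REQUIRED_AREAS = frozenset(
    {"solution_architecture", "cloud_engineering", "cloud_security"}
)


@dataclass
class Member:
    name: str
    strong_areas: set = field(default_factory=set)


def oversight_candidates(team):
    """Members who are strong in every required area."""
    return [m.name for m in team if REQUIRED_AREAS <= m.strong_areas]


def self_service_tier(team):
    """'full' if at least one member can provide oversight, else 'guided'."""
    return "full" if oversight_candidates(team) else "guided"


if __name__ == "__main__":
    digital_team = [
        Member("Ana", {"solution_architecture", "cloud_engineering",
                       "cloud_security"}),
        Member("Ben", {"cloud_engineering"}),
    ]
    data_science_team = [
        Member("Chi", {"cloud_engineering"}),
        Member("Dev"),
    ]
    print(self_service_tier(digital_team))
    print(self_service_tier(data_science_team))
```

In practice the `strong_areas` values would come from whatever objective standard the org adopts, such as certifications or an internal course.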
If you use CI/CD, you also want automated review in your pipeline that covers your infrastructure-as-code (IaC) and cloud configs, not just the app code (using a tool like Concourse Labs). Even if your whole pipeline isn’t automated, you want review of IaC during the dev stage, and not just when it triggers a cloud security posture management tool (like Palo Alto’s Prisma Cloud or Turbot), whether in dev, test, or production.
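To give a feel for what automated IaC review in a pipeline means, here is a toy Python sketch of a policy check that runs over already-parsed resource definitions and reports obviously risky settings. Everything in it is an assumption for illustration: the resource schema, the property names, and the two rules are invented, and it does not represent how Concourse Labs, Prisma Cloud, or Turbot actually work.

```python
# Ports that should not be reachable from the whole internet (assumed policy).
ADMIN_PORTS = {22, 3389}
ANYWHERE = "0.0.0.0/0"


def find_violations(resources):
    """Return (resource_name, message) pairs for obviously risky settings."""
    violations = []
    for res in resources:
        kind = res["type"]
        name = res["name"]
        props = res.get("properties", {})
        if kind == "storage_bucket" and props.get("public_read", False):
            violations.append((name, "bucket allows public read"))
        if (kind == "firewall_rule"
                and props.get("source_cidr") == ANYWHERE
                and props.get("port") in ADMIN_PORTS):
            violations.append(
                (name, f"admin port {props['port']} open to the internet")
            )
    return violations


if __name__ == "__main__":
    # Hypothetical parsed IaC template; a real pipeline would load this
    # from the template files in the change being reviewed.
    template = [
        {"type": "storage_bucket", "name": "logs",
         "properties": {"public_read": False}},
        {"type": "storage_bucket", "name": "exports",
         "properties": {"public_read": True}},
        {"type": "firewall_rule", "name": "ssh-in",
         "properties": {"source_cidr": "0.0.0.0/0", "port": 22}},
        {"type": "firewall_rule", "name": "web-in",
         "properties": {"source_cidr": "0.0.0.0/0", "port": 443}},
    ]
    problems = find_violations(template)
    for name, message in problems:
        print(f"{name}: {message}")
    # A non-zero exit code is what makes the pipeline stage fail.
    raise SystemExit(1 if problems else 0)
```

The point of running a check like this at the dev stage is that the developer sees the problem before anything is deployed, instead of after a posture management tool flags the live resource.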
Who determines “competence”? To avoid nasty internal politics, it’s best to set this standard objectively. Certifications are a reasonable approach, but if your org isn’t the sort that tends to pay for internal certifications or the external certifications (AWS/Azure Solution Architect, DevOps Engineer, Security Engineer, etc.) seem like too high a bar, you can develop an internal training course and certification. It’s not a bad idea for all of your coders (whether app developers, data scientists, etc.) that use the cloud to get some formal training on creating good and secure cloud configurations, anyway.
(For Gartner clients: I’m happy to have a deeper discussion in inquiry. And yes, a formal research note on this is currently going through our editing process and will be published soon.)
Building cloud expertise is hard. Building multicloud expertise is even harder. By “multicloud” in this context, I mean “adopting, within your organization, multiple cloud providers that do something similar” (such as adopting both AWS and Azure).
Integrated IaaS+PaaS providers are complex and differentiated entities, in both technical and business aspects. Add in their respective ecosystems — and the way that “multicloud” vendors, managed service providers (MSPs) etc. often deliver subtly (or obviously) different capabilities on different cloud providers — and you can basically end up with a multicloud katamari that picks up whatever capabilities it randomly rolls over. You can’t treat them like commodities (a topic I cover extensively in my research note on Managing Vendor Lock-In in Cloud IaaS).
For this reason, cloud-successful organizations that build a Cloud Center of Excellence (CCOE), or even just try to wrap their arms around some degree of formalized cloud operations and governance, almost always start by implementing a single cloud provider but plan for a multicloud future.
Successfully multicloud organizations have cloud architects that deeply educate themselves on a single provider, and their cloud team initially builds tools and processes around a single provider — but the cloud architects and engineers also develop some basic understanding of at least one additional provider in order to be able to make more informed decisions. Some basic groundwork is laid for a multicloud future, often in the form of frameworks, but the actual initial implementation is single-cloud.
Governance and support for a second strategic cloud provider is added at a later date, and might not necessarily be at the same level of depth as the primary strategic provider. Scenario-specific (use-case-specific or tactical) providers are handled on a case-by-case basis; the level of governance and support for such a provider may be quite limited, or may not be supported through central IT at all.
Individual cloud engineers may continue to have single-cloud rather than multicloud skills, especially because being highly expert in multiple cloud providers tends to boost market-rate salaries to levels that many enterprises and mid-market businesses consider untenable. (Forget using training-cost payback as a way to retain people; good cloud engineers can easily get a signing bonus more than large enough to deal with that.)
In other words: while more than 80% of organizations are multicloud, very few of them consider their multiple providers to be co-equal.
What sort of org structures work well for helping to drive successful cloud adoption? Every day I talk to businesses and public-sector entities about this topic. Some have been successful. Others are struggling. And the late-adopters are just starting out and want to get it right from the start.
Back in 2014, I started giving conference talks about an emerging industry best practice — the “Cloud Center of Excellence” (CCOE) concept. I published a research note at the start of 2019 distilling a whole bunch of advice on how to build a CCOE, and I’ve spent a significant chunk of the last year and a half talking to customers about it. Now I’ve revised that research, turning it into a hefty two-part note on How to Build a Cloud Center of Excellence: part 1 (organizational design) and part 2 (Year 1 tasks).
Gartner’s approach to the CCOE is fundamentally one that is rooted in the discipline of enterprise architecture and the role of EA in driving business success through the adoption of innovative technologies. We advocate a CCOE based on three core pillars — governance (cost management, risk management, etc.), brokerage (solution architecture and vendor management), and community (driving organizational collaboration, knowledge-sharing, and cloud best practices surfaced organically).
Note that it is vital for the CCOE to be focused on governance rather than on control. Organizations that remain focused on control are less likely to deliver effective self-service, or fully unlock key cloud benefits such as agility, flexibility and access to innovation. Indeed, IT organizations that attempt to tighten their grip on cloud control often face rebellion from the business that actually decreases the power of the CIO and the IT organization.
Also importantly, we do not think that the single-vendor CCOE approaches (which are currently heavily advocated by the professional services organizations of the hyperscalers) are the right long-term solution for most customers. A CCOE should ideally be vendor-neutral and span IaaS, PaaS, and SaaS in a multicloud world, with a focus on finding the right solutions to business problems (which may be cloud or noncloud). And a CCOE is not an IaaS/PaaS operations organization — cloud engineering/operations is a separate set of organizational decisions (I’ll have a research note out on that soon, too).
Please dive into the research (Gartner paywall) if you are interested in reading all the details. I have discussed this topic with literally thousands of clients over the last half-dozen years. If you’re a Gartner for Technical Professionals client, I’d be happy to talk to you about your own unique situation.