FinOps can be a big waste of money
Of late, my colleagues and I have been talking to a lot of clients who want to build a “FinOps team”, which they seem to hope will wave magic wands and reduce their cloud IaaS+PaaS bill. I’m struck by how many clients I talk to don’t have cloud cost problems that are reasonably solvable with FinOps.
Bluntly: For many organizations, there is no reasonable ROI on FinOps (and certainly no sensible business case for building a FinOps team).
This doesn’t mean that the organization shouldn’t manage their cloud finances. It just means that they don’t need to manage their cloud finances in a way that’s meaningfully different from the way that they’ve historically managed IT spending in their on-premises data center. I’ll use the term “FinOps” colloquially here to indicate an organization taking an approach and processes for cloud financial management that are different from their established on-premises IT financial management.
There are lots of common reasons why your organization might not need FinOps. For example:
- You don’t use self-service.Your developers, app management engineers, data scientists, and other technical end-users do not have direct self-service access to cloud services. Instead, all cloud design and provisioning is done by a central infrastructure and operations (I&O) team — or alternatively, all cloud requests go through a service catalog and are manually reviewed and approved. Therefore, nothing happens in the cloud that’s outside of central I&O’s knowledge or control — likely allowing you to manage budgets like you did on-premises.
- You have little to no variability in production: Your applications are allocated a static amount of infrastructure, and/or their usage is almost entirely predictable (for example, they autoscale up during the last week of the month, and then autoscale down after the close of the month). Therefore, your cloud bill for each application is essentially the same every month. You should nevertheless configure budget alerts in case something weird happens that makes usage spike, but that likely will be a one-time thing when the application is first deployed, perhaps with a once-a-year review.
- You’re not spending much money in the cloud. If you’re not spending much money, even a significant percentage reduction in spend (which you could potentially get, for instance, by eliminating all cloud dev/test VMs that aren’t used any longer and could simply be turned off) won’t be that many hard dollars of savings. Putting into place automation that automatically hibernates or deprovisions unused infrastructure may have a useful ROI, but playing manual whack-a-mole that involves a lot of people (whether in paperwork or actually mucking with infrastructure) almost certainly wastes more money in labor time than it saves in cloud costs.
- You don’t have infrastructure-hungry applications. Enterprises often don’t have the voracious scale-out cloud-native applications that are common in digital-native companies, or they only have a small handful of those applications. You might be spending significant money in the cloud, but it’s spread across dozens, hundreds, or even thousands of small applications. Therefore, even if you could cut the necessary capacity for a given application in half, it wouldn’t generate much in the way of monthly cost savings — likely not enough to justify the time of the people doing the work. Lots of enterprises run boring everyday “paperwork” apps on a VM or two (or these days, a container or two). A single-VM app often runs at 40% utilization at max, because of powers-of-two cloud VM sizing, so dropping a “T-shirt” size results in half the capacity and maybe 90% utilization, which many enterprises feel is uncomfortably tight. (And lots of organizations are slightly oversized across the board because they took the “safe” estimate of capacity needs from their cloud migration tools.)
Buying FinOps tools and allocating people to FinOps activities can cost you more than it saves.
Most people launch FinOps practices by purchasing a cloud cost optimization tool of some sort (i.e. a “FinOps tool”). Complicated FinOps processes and/or having a lot of teams and applications you have to corral within your cloud cost governance framework probably result in the genuine need to purchase a third-party FinOps tool — but those tools probably don’t represent a positive ROI until you’re spending more at least a million dollars a year. And then you have to remember that the percentage-of-cloud-spend pricing scheme of those tools can mean that you’re giving the FinOps-tool vendor a pile of money for service elements that they have no optimization capabilities for.
But in many cases, the cost of a tool will be dwarfed by the expense of the employees to do this work, especially in organizations who are making a misguided effort to hire a “FinOps team”. Not only does FinOps represent finance and sourcing overhead, but also cloud operations and engineering overhead — and, most of all, developer overhead (and overhead for any other technical team being asked to do cloud optimization work). If you go further and end up hiring a team that does performance engineering, those people are super rare and expensive.
In other words, being somewhat oversized in the cloud — or being somewhat inefficient in your application code — is a form of insidious creeping technical debt. But it’s the kind of technical debt that tends to linger, because when you look at the business case to actually go after that technical debt, there’s inadequate ROI to justify it. (Indeed, on-premises, people historically haven’t much cared. They throw hardware at the problem and run heavily oversized anyway. Nobody thinks about it because there was capital budget to buy the gear and once the gear was purchased, there wasn’t much reason to contemplate whether the money was efficiently used.)
Moreover, does your business actually want your highly-paid application development teams to chase performance issues in their code, or do they want them adding new features that will deliver new functionality to the business, saving you money elsewhere in your business processes and/or delivering something that will be compelling to customers, thus increasing your top-line revenue?
I certainly think it’s important for nearly all organizations to do some cloud financial management, which they will probably support with tooling. They’ve got to do the basics of cloud cost hygiene (preventing gross waste), budget alerts (to gain rapid awareness of accidents), spend allocation (showback/chargeback) and discount-related planning (what’s necessary for commits, reserved instances, saving plans etc.) — but even there the effort needs to be proportional to the potential cost savings.
But full-ceremony FinOps, so to speak, is usually something better left for big money-pit applications where cloud engineer or developer effort can have a significant impact on cost — for organizations with substantial self-service and no culture of cost discipline, or for the big spenders where even moving the needle a little bit on things like basic hygiene can have a pretty large absolute dollar effort relative to the investment.
GreenOps for sustainability must parallel FinOps for cost
Cloud customers are trying to make meaningful sustainability decisions. To really reduce carbon impact (or other types of environmental impact), they need the transparency to understand the impact of their architectural decisions. Just like they need to be able to estimate the cost of a solution, they need to be able to estimate its environmental impact. They need to be able to get an estimate of what the “environmental bill” will be based on the region (and maybe zone), services, and service options they choose. To the extent possible, they then need to see what impact they’re actually generating based on actual utilization.
In other words, they need “GreenOps” the way that they need “FinOps” (using FinOps as a generic term for cloud financial management in this context). And because sustainability is not just carbon impact, they’ll probably eventually need to see a multidimensional set of metrics (or a way to create a custom metric that weights different things that are important to them, like water impact vs carbon impact).
Cloud providers have relatively decent cost tools — cost calculators that allow you to choose solution elements and estimate your bill, cost reporting of various sorts, and so forth. Similarly, the third-party FinOps tooling ecosystem provides good visibility and recommendations to customers.
We don’t really need totally new dashboards and tools for sustainability. What we really need is an extension to the existing cloud cost optimization tools (and the cost transparency and billing APIs that enable those tools) to display environmental impacts as well, so we can manage them alongside our costs. Indeed, most customers will want to make trade-offs between their environmental footprint and costs. For instance, are they potentially willing to pay more to lower their greenhouse gas emissions?
Of course, there are many ways to measure sustainability and many different types of impacts, and not all of them are well suited to this kind of granular breakdown — but drawing a GreenOps parallel to FinOps would help customers extend the tools and processes that they already use (or are developing) for cost management to the emerging need for sustainability management.
The cloud budget overrun rainbow of flavors
Cloud budget overruns don’t have a singular cause. Instead, they come in a bright rainbow of jelly belly flavors (the Bertie Botts ones, especially, will combine into a non-mouthwatering delight). Each needs different forms of response.
Ungoverned costs. This is the black licorice of FinOps problems. The organization has no idea what it’s spending, really, much less where the money is going, other than the big bills (or often, many little credit card bills) that they pay each month. This requires basic cost hygiene: analyze your cloud bills, get a cost management tool into place and make it useful through some tagging or partitioning discipline.
Unanticipated usage. This is the sour watermelon flavor of cost overruns — deliciously sweet yet mouth-puckering. In this situation, the organization is the victim of its own cloud success. Cloud has been such a great thing for the organization that more and more unanticipated cloud projects are showing up, blowing out the original budget estimates for cloud resources. Those cloud projects are delivering business value and it doesn’t make sense to say no to them (and even if central IT says no, the cloud costs can usually be paid for out of a line-of-business budget). Nevertheless, it’s causing a lot of organizational angst because central IT or the sourcing team didn’t anticipate this spending. This organization needs to learn to shift its budgeting processes for the digital future, and cloud chargeback will help support future decision-making.
No commitments. This is the minty wrongness of Bertie Botts toothpaste. The organization could get discounts by using public discounting mechanisms for commits (like AWS Savings Plans and Azure Reserved Instances) as well making a contractual commitment for a negotiated discount. But because the organization feels like they can’t perfectly predict their use and aren’t sure if they’ll use all of what they’re using today, they commit to nothing, therefore ensuring that they spend grotesquely more than they could be. This is universally a terrible idea. Organizations that aren’t in early pilot stage have long-term production applications and some predictability of usage; commit to the stuff you know you’re not killing off.
Dev/test waste. This is the mundane bleah-ness of Bertie Botts earwax. Developers are provisioning the biggest things they can get away with (or at least being overaggressive in their estimates of what they need), there are lots of abandoned resources idling away, and dev/test infrastructure that isn’t used outside of business hours isn’t being suspended when unused. This is what cloud cost management tools are great at doing — identifying obvious waste so that it can be eliminated, largely by shutting it down or suspending it, preferably via automation.
Too much production headroom. This is the mild weirdness of the Bertie Botts grass flavor. Application teams haven’t implemented autoscaling for applications that can scale horizontally, or they’ve overestimated how much production headroom an application with variable usage needs (which may result in oversizing compute units, or being overly aggressive with autoscaling). This requires implementing autoscaling with some thoughtful tuning of parameters, and possibly a business value conversation on the cost/benefit tradeoff of having higher application performance on a consistent basis.
Wrongsizing production. This is the awful lingering terribleness of Bertie Botts vomit, whose taste you cannot get out of your mouth. Production environments are statically overprovisioned and therefore overly costly. On-prem, 30% utilization is common, but it’s all capex and as long as it’s within budget, no one really cares about the waste. But in the cloud, you pay for that excess resource monthly, forcing you to confront the ongoing cost of the waste.
However, anyone who tells you to “just” rightsize has never actually tried to do this in practice within an enterprise. The problem is that applications that scale vertically typically can’t be easily rightsized. It’s likely difficult-to-impossible to do automatically, due to complicated application installation. The application is fragile and may be mission-critical, so you are cautious about maintenance downtime. And the application team — the only people who really understand how this thing works — is likely busy with other priorities.
If this is your situation, your cloud cost management tool may cause you to cry hopeless tears, because you can see the waste but taking remediation actions is a complicated cross-functional war dance and delicate negotiation that leaves everyone wondering if it wouldn’t have been easier to just keep paying a larger bill.
Suboptimal design and implementation. The controversial popcorn flavor. Architects are sometimes cost-oblivious when they design cloud solutions. They may make bad design choices, or changes in application features and behavior over time may have turned out to make a design choice unexpectedly expensive. Developers may write poorly-performing code that consumes a lot of infrastructure resources, or code that makes excessive (and, cumulatively, expensive) calls to cloud services. Your cloud cost management tools are unlikely to be of any use for detecting these situations. This needs to be addressed through performance engineering, with attention paid to the business value of the time/effort/money necessary to do so — and for many organizations may require bringing in third-party expertise to diagnose the problems and offer recommendations.
Notably, the answer to most of these issues is not “implement a cloud cost management tool”. The challenges aren’t really as simple as a lot of vendors (and talking heads) make them out to be.
Cloud cost overruns may be a business leadership failure
A couple of months back, some smart folks at VC firm Andreesen Horowitz wrote a blog post called “The Cost of Cloud, a Trillion Dollar Paradox“. Among other things, the blog made a big splash because it claimed, quote: “[W]hile cloud clearly delivers on its promise early on in a company’s journey, the pressure it puts on margins can start to outweigh the benefits, as a company scales and growth slows.” It claimed that cloud overspending was resulting in huge loss of market value, and that developers needed incentives to reduce spending.
The blog post is pretty sane, but plenty of people misinterpreted it, or took away only its most sensationalistic aspects. I think it’s critical to keep in mind the following:
Decisions about cloud expenditures are ultimately business decisions. Unnecessarily high cloud costs are the result of business decisions about priorities — specifically, about the time that developers and engineers devote to cost optimization versus other priorities.
For example, when developer time is at a premium, and pushing out features as fast as possible is the highest priority, business leadership can choose to allow the following things that are terrible for cloud cost:
- Developers can ignore all annoying administrative tasks, like rightsizing the infrastructure or turning off stuff that isn’t in active use.
- Architects can choose suboptimal designs that are easier and faster to implement, but which will cost more to run.
- Developers can implement crude algorithms and inefficient code in order to more rapidly deliver a feature, without thinking about performance optimizations that would result in less resource consumption.
- Developers can skip implementing support for more efficient consumption patterns, such as autoscaling.
- Developers can skip implementing deployment automation that would make it easier to automatically rightsize — potentially compounded by implementing the application in ways that are fragile and make it too risky and effortful to manually rightsize.
All of the above is effectively a form of technical debt. In the pursuit of speed, developers can consume infrastructure more aggressively themselves — not bothering to shut down unused infrastructure, running more CI jobs (or other QA tests), running multiple CI jobs in parallel, allocating bigger faster dev/test servers, etc. — but that’s short-term, not an ongoing cost burden the way that the technical debt is. (Note that the same prioritization issues also impact the extent to which developers cooperate in implementing security directives. That’s a tale for another day.)
The more those things are combined — bad designs, poorly implemented, that you can’t easily rightsize or scale — the more that you have a mess that you can’t untangle without significant expenditure of development time.
Now, some organizations will go put together a “FinOps” team to play whack-a-mole with infrastructure — killing/parking stuff that is idle and rightsizing the waste. And that might help short-term, but until you can automate that basic cost hygiene, this is non-value-added people-intensive work. And woe betide you if your implementations are fragile enough that rightsizing is operationally risky.
Once you’ve got your whack-a-mole down to a nice quick automated cadence, you’ve got to address the application design and implementation technical debt — and invest in the discipline of performance engineering — or you’ll continue paying unnecessarily high bills month after month. (You’d also be oversizing on-prem infrastructure, but people are used to that, and the capital expenditure is money spent, versus the grind of a monthly cloud bill.)
Business leaders have to step up to prioritize cloud cost optimization — or acknowledge that it isn’t a priority, and that it’s okay to waste money on resources as long as the top line is increasing faster. As long that’s a conscious, articulated decision, that’s fine. But we shouldn’t pretend that developers are inherently irresponsible. Developers, like other employees, respond to incentives, and if they’re evaluated on their velocity of feature delivery, they’re going to optimize their work efforts towards that end.
For more details, check out my new research note called “Is FinOps the Answer to Cloud Cost Governance?” which is paywalled and targeted at Gartner’s executive leader clients — a combination of CxOs and business leaders.