FinOps can be a big waste of money
Of late, my colleagues and I have been talking to a lot of clients who want to build a “FinOps team”, which they seem to hope will wave magic wands and reduce their cloud IaaS+PaaS bill. I’m struck by how many clients I talk to don’t have cloud cost problems that are reasonably solvable with FinOps.
Bluntly: For many organizations, there is no reasonable ROI on FinOps (and certainly no sensible business case for building a FinOps team).
This doesn’t mean that the organization shouldn’t manage their cloud finances. It just means that they don’t need to manage their cloud finances in a way that’s meaningfully different from the way that they’ve historically managed IT spending in their on-premises data center. I’ll use the term “FinOps” colloquially here to indicate an organization taking an approach and processes for cloud financial management that are different from their established on-premises IT financial management.
There are lots of common reasons why your organization might not need FinOps. For example:
- You don’t use self-service.Your developers, app management engineers, data scientists, and other technical end-users do not have direct self-service access to cloud services. Instead, all cloud design and provisioning is done by a central infrastructure and operations (I&O) team — or alternatively, all cloud requests go through a service catalog and are manually reviewed and approved. Therefore, nothing happens in the cloud that’s outside of central I&O’s knowledge or control — likely allowing you to manage budgets like you did on-premises.
- You have little to no variability in production: Your applications are allocated a static amount of infrastructure, and/or their usage is almost entirely predictable (for example, they autoscale up during the last week of the month, and then autoscale down after the close of the month). Therefore, your cloud bill for each application is essentially the same every month. You should nevertheless configure budget alerts in case something weird happens that makes usage spike, but that likely will be a one-time thing when the application is first deployed, perhaps with a once-a-year review.
- You’re not spending much money in the cloud. If you’re not spending much money, even a significant percentage reduction in spend (which you could potentially get, for instance, by eliminating all cloud dev/test VMs that aren’t used any longer and could simply be turned off) won’t be that many hard dollars of savings. Putting into place automation that automatically hibernates or deprovisions unused infrastructure may have a useful ROI, but playing manual whack-a-mole that involves a lot of people (whether in paperwork or actually mucking with infrastructure) almost certainly wastes more money in labor time than it saves in cloud costs.
- You don’t have infrastructure-hungry applications. Enterprises often don’t have the voracious scale-out cloud-native applications that are common in digital-native companies, or they only have a small handful of those applications. You might be spending significant money in the cloud, but it’s spread across dozens, hundreds, or even thousands of small applications. Therefore, even if you could cut the necessary capacity for a given application in half, it wouldn’t generate much in the way of monthly cost savings — likely not enough to justify the time of the people doing the work. Lots of enterprises run boring everyday “paperwork” apps on a VM or two (or these days, a container or two). A single-VM app often runs at 40% utilization at max, because of powers-of-two cloud VM sizing, so dropping a “T-shirt” size results in half the capacity and maybe 90% utilization, which many enterprises feel is uncomfortably tight. (And lots of organizations are slightly oversized across the board because they took the “safe” estimate of capacity needs from their cloud migration tools.)
Buying FinOps tools and allocating people to FinOps activities can cost you more than it saves.
Most people launch FinOps practices by purchasing a cloud cost optimization tool of some sort (i.e. a “FinOps tool”). Complicated FinOps processes and/or having a lot of teams and applications you have to corral within your cloud cost governance framework probably result in the genuine need to purchase a third-party FinOps tool — but those tools probably don’t represent a positive ROI until you’re spending more at least a million dollars a year. And then you have to remember that the percentage-of-cloud-spend pricing scheme of those tools can mean that you’re giving the FinOps-tool vendor a pile of money for service elements that they have no optimization capabilities for.
But in many cases, the cost of a tool will be dwarfed by the expense of the employees to do this work, especially in organizations who are making a misguided effort to hire a “FinOps team”. Not only does FinOps represent finance and sourcing overhead, but also cloud operations and engineering overhead — and, most of all, developer overhead (and overhead for any other technical team being asked to do cloud optimization work). If you go further and end up hiring a team that does performance engineering, those people are super rare and expensive.
In other words, being somewhat oversized in the cloud — or being somewhat inefficient in your application code — is a form of insidious creeping technical debt. But it’s the kind of technical debt that tends to linger, because when you look at the business case to actually go after that technical debt, there’s inadequate ROI to justify it. (Indeed, on-premises, people historically haven’t much cared. They throw hardware at the problem and run heavily oversized anyway. Nobody thinks about it because there was capital budget to buy the gear and once the gear was purchased, there wasn’t much reason to contemplate whether the money was efficiently used.)
Moreover, does your business actually want your highly-paid application development teams to chase performance issues in their code, or do they want them adding new features that will deliver new functionality to the business, saving you money elsewhere in your business processes and/or delivering something that will be compelling to customers, thus increasing your top-line revenue?
I certainly think it’s important for nearly all organizations to do some cloud financial management, which they will probably support with tooling. They’ve got to do the basics of cloud cost hygiene (preventing gross waste), budget alerts (to gain rapid awareness of accidents), spend allocation (showback/chargeback) and discount-related planning (what’s necessary for commits, reserved instances, saving plans etc.) — but even there the effort needs to be proportional to the potential cost savings.
But full-ceremony FinOps, so to speak, is usually something better left for big money-pit applications where cloud engineer or developer effort can have a significant impact on cost — for organizations with substantial self-service and no culture of cost discipline, or for the big spenders where even moving the needle a little bit on things like basic hygiene can have a pretty large absolute dollar effort relative to the investment.
Cloud adoption will fail because of the skills gap
In order to adopt cloud IaaS and PaaS successfully (and arguably, to adopt SaaS optimally), an organization needs skills. Most of all, it needs technical skills for the whole application lifecycle in the cloud — the ability to architect applications (and their underlying stacks) for the cloud, develop for the cloud, secure the cloud, run and manage and govern the cloud environments and the applications in those environments. The more cloud-natively you can do these things, the better.
If you can’t do these things in cloud-native patterns (often because you’re migrating your legacy), you at least want to try to modernize and cloud-optimize — to leverage PaaS rather than IaaS, to automate everything you reasonably can, and otherwise exploit the cloud capabilities to maximum effectiveness. This, too, requires skills.
Cloud skills needs — and associated “soft skills” and mindset — are needed in infrastructure and operations (I&O) teams and security and risk management (SRM) teams. They’re needed in application teams, data science teams, and other technical end-user teams that exploit cloud services, along with enterprise architecture and other architecture teams. There are also nontechnical skills that have to be built in the appropriate teams — effective cloud sourcing, effective cloud financial management, and so on.
My colleagues and I have previously written that the cloud skills gap has reached a crisis level in many organizations. Organizational timelines for cloud adoption, cloud migration and cloud maturity are being impacted by the inability to hire and retain the people with the necessary qualifications.
There are lots of reasons for the skills gap — insufficient number of trained and experienced people to meet demand, escalating salaries plus a globalized market for talent that results in NYC banks nabbing skilled cloud architects working for enterprises in Iowa or Missouri (or Poland) for NYC banking salaries, and the quality of the opportunities available.
For instance, an increasing number of the technical professionals I talk to care more about good executive support for the cloud program, a cloud team that’s executing well and doing smart things, an opportunity to bring their best selves to work (excellent team management, great colleagues, feeling valued, etc.), and strong belief in the organization’s mission, than they do about pay per se. It isn’t just about pay — but at a lot of slow-moving enterprises where the pay isn’t great, there are also cultural issues that make highly-skilled cloud professionals feel out of place and not valued.
While many organizations are trying to retrain existing I&O personnel especially, these efforts can fail because the DevOps emphasis of successful cloud-optimized or cloud-native adoption results in fundamentally different jobs. Not only does this require the development of strong automation (and thus coding) skills, but it also results in a more project-driven workday, greater autonomy (but also more self-starting and self-motivation), and more communication and collaboration with application teams and other cloud users. Those who prefer “IT factory work”, solitarily executing repetitive ClickOps tasks driven by service requests, generally don’t enjoy the change in the nature of a job.
Organizations that can’t retain cloud-trained staff often react by leaning more heavily on the people that remain, which jacks up stress levels, leads to resentment, and often turns into a spiral of departures. Contractors can help fill the gap — if the organization is willing to spend the money to hire them.
Many organizations are successfully bridging the gap with consulting (professional services) and managed services (a Gartner survey showed about three-quarters of organizations use such services for at least a portion of their cloud IaaS+PaaS adoption). Many cloud managed services deals include explicit training and gradual handover to the customer’s personnel, allowing the customer to take over bit by bit as their team gets comfortable. However, MSPs, SIs, and other outsourcers are also struggling to fulfill the demand, which leads to both project delays as well as throwing less-qualified bodies into the mix in order to try to meet contractual obligations and grow revenues.
I believe that we are rapidly reaching the point where the skills gap is not only endangering the ability of individual organizations to fulfill their cloud computing ambitions, but where we may begin to see systemic back-off from cloud ambitions, resulting most notably in cancelled or substantially scaled-back cloud migrations as a common market pattern. (Disclaimer: This is a personal statement scribbled while eating lunch. It is not a peer-reviewed Gartner position.) Also, note that in no way am I claiming that this is likely to lead to repatriation!
Organizations that are late cloud adopters were already more hesitant about going to the cloud in the first place. They tend to have less of a belief that IT can help drive business success, have more technical debt, and tend to have lesser-skilled people (with less up-to-date skills). They may have been the recipient of many people who fled early cloud-adopting organizations because those people didn’t want to re-skill, so they face significantly harder internal pushback and potentially internal sandbagging of cloud projects. When they do manage to successfully train people, those people often leave within a year for both better pay and a more congenial, faster-moving environment. Late adopters may simply not be able to generate enough internal competence to even safely and successfully use outsourced assistance.
However, even organizations that are not late adopters often have different parts of the business at different paces of adoption. Notably, they may have digital business divisions, or more ambitious fast-moving business units in general, that have substantial cloud adoption, while other parts of the organization lag behind. Those that are charging ahead may remain successful and continue to expand in the cloud, while the rest of the organization remains unable to beg borrow or steal enough skills from those other successful outposts to overcome the on-premises inertia.
This may lead to an exacerbation of existing market patterns where the digitally ambitious have had outsized and potentially disruptive success… and where other organizations are unable to imitate those successes, leading not just to failures of IT projects, but also meaningful negative business impacts. This, in turn, has a follow-on effect on the cloud providers. As enterprise bets on the cloud grow bigger, one might begin to see these projects, especially mass migration and transformation, as gambles more so than realistically-executable plans. Any plan that is predicated on if we can get the people who can do this stuff is fraught with nontrivial probability of failure.
(As always, my Gartner colleagues and I are happy to advise on inquiry, but there’s only so much we can help you with your skills gap if your organization has the deadly triplet of not offering good pay, not providing a good working environment, and and not making people feel like they’re doing something valuable with their lives. However, I also spend some significant percentage of my inquiry time listening to people vent, so I’m happy to sympathize with your tale of woe, too, and in most cases reassure you that what you’re trying to get your organization to do would be the right thing…)
Cloud self-service doesn’t need to invite the orc apocalypse
I spend quite a bit of time talking to clients about developer self-service, largely in the context of public cloud governance and cloud operations. There are still lots of infrastructure and operations (I&O) executives who instinctively cringe at the notion of developer self-service, as if self-service would open formerly well-defended gates onto a pristine plain of well-controlled infrastructure, and allow a horde of unwashed orcs to overrun the concrete landscape in a veritable explosion of Lego structures, dot-matrix printouts, Snickers wrappers and lost whiteboard marker caps… never to be clean and orderly again.
It doesn’t have to be that way.
Self-service — and more broadly, developer control over infrastructure — isn’t an all-or-nothing proposition. Responsibility can be divided across the application life cycle, so that you can get benefits from “You build it, you run it” without necessarily parachuting your developers into an untamed and unknown wilderness and wishing them luck in surviving because it’s not an Infrastructure & Operations (I&O) team problem any more.
So we ask, instead:
- Will developers design their own infrastructure?
- Will developers control their dev/test environments?
- How much autonomy will developers have in building production environments?
- How much autonomy will developers have for production deployments?
- To what extent are developers responsible for day-to-day production maintenance (patching, OS updates, infrastructure rightsizing, etc.)?
- To what extent are developers responsible for incident management?
- How much help will developers receive for the things they’re responsible for?
I talk to far too many IT leaders who say, “We can’t give developers cloud self-service because we’re not ready for You build it, you run it!” whereupon I need to gently but firmly remind them that it’s perfectly okay to allow your developers full self-service access to development and testing environments, and the ability to build infrastructure as code (IaC) templates for production, without making them fully responsible for production.
This is the subject of my new research note, “How to Empower Technical Teams Through Self-Service Public Cloud IaaS and PaaS“. (Gartner for Technical Professionals paywall)
This is a step along the way to a deeper exploration of finding the right balance between “Dev” and “Ops” in DevOps, which is an organization-specific thing. This is not just a cloud thing; it also impacts the structure of operations on-premises. Every discussion of SRE, platform ops, etc. ultimately revolves around the questions of autonomy, governance, and collaboration, and no two organizations are likely to arrive at the exact same balance. (And don’t get me started on how many orgs rename their I&O teams to SRE teams without actually implementing much if anything from the principles of SRE.)
Resilience: Cloudy without a chance of meatballs
In the wake of AWS’s major US-East-1 incident of December 7th, 2021, I’ve fielded plenty of panicked client inquiry about whether anyone can trust any cloud provider, whether the availability zone model actually works, and whether or not the customer’s current architecture offers adequate resilience for their needs.
I’ve also dealt with more than a handful of journalists who have wanted to push a narrative that AWS customers are fleeing in droves and/or are going multicloud as a result of that outage. Every story I’ve read on that subject has tried its darnedest to imply something which just isn’t true. Yes, many organizations use multiple cloud providers. No, they don’t do so for resilience, but rather, because differing preferences within the organization have led to adopting more than one provider.
The fact that it’s now more than two months since the outage and I’m still talking about it with clients (and my colleagues are too) does reflect how large it looms in the mind of customers — including customers of other cloud providers — though. Indeed, it looms large in the mind of many AWS customers who were not affected, either because they don’t run in US-East-1 or because their failover to another region worked as planned.
At this point, not only have my colleagues and I talked to quite a few organizations but we’ve also talked to providers of disaster recovery software and services. Thus far, it appears that customers that had problems with cross-region recovery during the 12/7/21 incident either violated AWS best practices for such, or violated their vendor’s advice.
That’s not to say that there weren’t two important unpleasant surprises in terms of US-East-1 dependencies:
- The global console URL was pointing to US-East-1 alone (rather than being geo load balanced or the like, which most people would probably have assumed). Customers could get around this by going to a regional console URL instead. I believe (but haven’t confirmed) AWS has now introduced a truly global endpoint with the introduction of the new console experience.
- Route 53 and Cloudfront’s control plane APIs are hosted solely in US-East-1. People reasonably expect to be able to make DNS changes during an outage, even though AWS advises that you use health checks for your failover instead.
Either of those two things could have thrown a wrench into cross-region recovery, along with needing to create new S3 buckets (the global namespace conflict checks are done against US-East-1), needing very specific instance types in short supply in the target region, needing to create new IAM roles (which are first created in US-East-1 and then replicated to other regions), and depending on the legacy STS global namespace (also US-East-1 dependent). But by and large, cross-region recovery worked as expected.
Now, there are certainly plenty of people who can’t do fast failover into another region and who therefore sat tight and suffered through the incident, and there’s a nontrivial number of customers who haven’t laid foundations for disaster recovery (however slowly) into another region. I get it — being able to do this kind of recovery requires an investment. You want cloud providers to be so resilient that you don’t need to make that investment yourself. But hope is not a strategy, here.
But the sky did not fall, and the sky is not falling. Cloud has not suddenly become less attractive or significantly more risky. AZ architectures work, but as always, problems with regional services (which are already designed to be multi-AZ) mean that multi-AZ might not be enough for the most critical applications. Cross-region failover works, when properly architected. (Fast and seamless failover and failback are critical, though; major cloud incidents to date have generally been multi-hour, but not multi-day. If you can’t fail over easily and fail back without a lot of effort, you tend to just wait out the outage and hope it’s short.)
Yes, there were significant problems for many customers in US-East-1. API Gateway was essentially down, and many people are dependent on API Gateway to invoke Lambda, and tons of customers use Lambda in a mission-critical fashion. Amazon Connect also depends on API Gateway, and it was also affected. (Other casualties of the backend network issues: ELB launches, S3 private endpoints, Fargate APIs impacting container launches, STS for EKS, and the support APIs.) But EC2 virtual machines continued to function just fine (although you couldn’t launch new ones). The overwhelming majority of AWS services in the region continued to operate unimpacted, and customers who did not have dependencies on affected services were able to continue operating in the region.
In a way, this was a stark demonstration of how much cloud outages are usually confined to specific services… but if a down service is critical to your application, you’re probably boned unless you have a workaround or you can failover into another region. Unfortunately, far too many customers persist in planning as if physical data center failure was the most likely event. (AWS had one of those in December, too — a power outage in a single data center, thus impacting a percentage of infrastructure in one of the six US-East-1 AZs.)
Yes, the incident was a wake-up call for a lot of cloud customers, and it was a rallying cry for on-premises server-huggers. However, not only is the sky not falling, but there should be no anticipation that it will rain meatballs.
I wrote a number of blog posts months before this outage:
- The cloud is NOT just someone else’s computer
- Multicloud failover is almost always a terrible idea
- Improving cloud resilience through stuff that works
and I still firmly stand by those posts now. (Importantly, I still believe multicloud for resilience is almost always impractical. Successful implementations are vanishingly rare and have horrible drawbacks.) Indeed, I’d been working on a piece of Gartner research with my colleagues Kevin Matheny (who covers application architecture), Stanton Cole and Fintan Quinn (who cover backup and DR), which I’m glad to say has finally published:
Designing Availability and Resilience for Applications in Public Cloud IaaS and PaaS (Gartner for Technical Professionals paywall)
In this, you’ll find what I hope is a pragmatic set of guidance advising that you figure out how critical an application is, and then choose your availability approach and your failover approach accordingly — and not forget the critical importance of designing and implementing resilience within your application. It’s got a lengthy dissection of all the things that can go wrong in the cloud, and what you should be thinking about when you architect. It also contains a sample architectural standard that cloud governance teams can provide to application architects to help them make these decisions. (The main doc runs 65 pages. The impatient will probably find the architectural standard, which is fairly short, to be easier reading.)
The cloud budget overrun rainbow of flavors
Cloud budget overruns don’t have a singular cause. Instead, they come in a bright rainbow of jelly belly flavors (the Bertie Botts ones, especially, will combine into a non-mouthwatering delight). Each needs different forms of response.
Ungoverned costs. This is the black licorice of FinOps problems. The organization has no idea what it’s spending, really, much less where the money is going, other than the big bills (or often, many little credit card bills) that they pay each month. This requires basic cost hygiene: analyze your cloud bills, get a cost management tool into place and make it useful through some tagging or partitioning discipline.
Unanticipated usage. This is the sour watermelon flavor of cost overruns — deliciously sweet yet mouth-puckering. In this situation, the organization is the victim of its own cloud success. Cloud has been such a great thing for the organization that more and more unanticipated cloud projects are showing up, blowing out the original budget estimates for cloud resources. Those cloud projects are delivering business value and it doesn’t make sense to say no to them (and even if central IT says no, the cloud costs can usually be paid for out of a line-of-business budget). Nevertheless, it’s causing a lot of organizational angst because central IT or the sourcing team didn’t anticipate this spending. This organization needs to learn to shift its budgeting processes for the digital future, and cloud chargeback will help support future decision-making.
No commitments. This is the minty wrongness of Bertie Botts toothpaste. The organization could get discounts by using public discounting mechanisms for commits (like AWS Savings Plans and Azure Reserved Instances) as well making a contractual commitment for a negotiated discount. But because the organization feels like they can’t perfectly predict their use and aren’t sure if they’ll use all of what they’re using today, they commit to nothing, therefore ensuring that they spend grotesquely more than they could be. This is universally a terrible idea. Organizations that aren’t in early pilot stage have long-term production applications and some predictability of usage; commit to the stuff you know you’re not killing off.
Dev/test waste. This is the mundane bleah-ness of Bertie Botts earwax. Developers are provisioning the biggest things they can get away with (or at least being overaggressive in their estimates of what they need), there are lots of abandoned resources idling away, and dev/test infrastructure that isn’t used outside of business hours isn’t being suspended when unused. This is what cloud cost management tools are great at doing — identifying obvious waste so that it can be eliminated, largely by shutting it down or suspending it, preferably via automation.
Too much production headroom. This is the mild weirdness of the Bertie Botts grass flavor. Application teams haven’t implemented autoscaling for applications that can scale horizontally, or they’ve overestimated how much production headroom an application with variable usage needs (which may result in oversizing compute units, or being overly aggressive with autoscaling). This requires implementing autoscaling with some thoughtful tuning of parameters, and possibly a business value conversation on the cost/benefit tradeoff of having higher application performance on a consistent basis.
Wrongsizing production. This is the awful lingering terribleness of Bertie Botts vomit, whose taste you cannot get out of your mouth. Production environments are statically overprovisioned and therefore overly costly. On-prem, 30% utilization is common, but it’s all capex and as long as it’s within budget, no one really cares about the waste. But in the cloud, you pay for that excess resource monthly, forcing you to confront the ongoing cost of the waste.
However, anyone who tells you to “just” rightsize has never actually tried to do this in practice within an enterprise. The problem is that applications that scale vertically typically can’t be easily rightsized. It’s likely difficult-to-impossible to do automatically, due to complicated application installation. The application is fragile and may be mission-critical, so you are cautious about maintenance downtime. And the application team — the only people who really understand how this thing works — is likely busy with other priorities.
If this is your situation, your cloud cost management tool may cause you to cry hopeless tears, because you can see the waste but taking remediation actions is a complicated cross-functional war dance and delicate negotiation that leaves everyone wondering if it wouldn’t have been easier to just keep paying a larger bill.
Suboptimal design and implementation. The controversial popcorn flavor. Architects are sometimes cost-oblivious when they design cloud solutions. They may make bad design choices, or changes in application features and behavior over time may have turned out to make a design choice unexpectedly expensive. Developers may write poorly-performing code that consumes a lot of infrastructure resources, or code that makes excessive (and, cumulatively, expensive) calls to cloud services. Your cloud cost management tools are unlikely to be of any use for detecting these situations. This needs to be addressed through performance engineering, with attention paid to the business value of the time/effort/money necessary to do so — and for many organizations may require bringing in third-party expertise to diagnose the problems and offer recommendations.
Notably, the answer to most of these issues is not “implement a cloud cost management tool”. The challenges aren’t really as simple as a lot of vendors (and talking heads) make them out to be.
Improving cloud resilience through stuff that works
As I noted in a previous blog post, multicloud failover is almost always a terrible idea. While the notion that an entire cloud provider can go dark for a lengthy period of time (let’s say a day or more) is not entirely impossible, it’s the least probable of the many ways that an application can experience failure. Humans tend to over-index on catastrophic but low-probability events, so it’s not especially shocking that people fixate on the possibility, but before you spend precious people-effort (not to mention money) on multicloud failover, you should first properly resource all the other things you could be doing to improve your resilience in the cloud.
As I noted previously, five core things impact cloud resilience: physical design, logical (software) design, implementation quality, deployment processes, and operational processes. So you should select your cloud provider carefully. Some providers have a better track record of reliability than others — often related directly in differences in the five core resilience factors. I’m not suggesting that this be a primary selection criterion, but the less reliable your provider, the more you’re going to have to pour effort into resilience, knowing that the provider’s failures are going to test you in the real world. You should care most about the failure of global dependencies (identity, security certificates, NTP, DNS, etc.) that can affect all services worldwide, followed by multi-region failures (especially those that affect an entire geography).
However, those things aren’t just important for cloud providers. They also affect you, the application owner, and the way you should design, implement, update, and operate your application — whether that application is on-premises or in the cloud. Before you resort to multicloud failover, you should have done all of the below and concluded that you’ve already maximized your resilience via these techniques and still need more.
Start with local HA. When architecting a mission-critical application, design it to use whatever HA capabilities are available to you within an availability zone (AZ). Use a clustered (and preferably scale-out) architecture for the stuff you build yourself. Ensure you maximize the resilience options available from the cloud services.
Build good error-handling into your application. Your application should besmart about the way it handles errors, either from other application components or from cloud services (or other third-party components). It should exhibit polite retry behavior and implement circuit breakers to try to limit cascading failures. It should implement load-shedding, in recognition of the fact that rejecting excessive requests so that the requests that can be served receive decent performance is better than just collapsing into non-responsiveness. It should have fallback mechanisms for graceful degradation, to limit impact on users.
Architect the application’s internals for resilience. Techniques such as partitions and bulkheads are likely going to be reserved for larger-scale applications, but are vital for limiting the blast radius of failures. (If you have no idea what any of this terminology means, read Michael Nygard’s “Release It!” — in my personal opinion, if you read one book about mission-critical app design, that should probably be the one.)
Use multiple AZs. Run your application active-active across at least two, and preferably three, AZs within each region that you use. (Note that three can be considerably harder than two because most cloud provider services natively support running in two AZs simultaneously but not three. But that’s a far easier problem than multicloud failover.)
Use multiple regions. Run your application active-active across at least two, and preferably three regions. (Again, two is definitely much easier than three, due to a cloud service’s cross-region support generally being two regions.) If you can’t do that, do fast fully-automated regional failover.
Implement chaos engineering. Not only do you need to thoroughly test in your dev/QA environment to determine what happens under expected failure conditions, but you also need to experiment with fault injection in your production environment where there are complex unpredictable conditions that may cause unexpected failures. If this sounds scary and you expect it’ll blow up in your face, then you need to do a better job in the design and implementation of your application. Forcing constant failures into production systems (ala Netflix’s famed Chaos Monkey) helps you identify all the weak spots, builds resilience, and should help give you confidence that things will continue to work when cloud issues arise.
It’s really important to treat resilience as a systems concern, not purely an infrastructure concern. Your application architecture and implementation need to be resilient. If your developers can’t be trusted to write continuously available applications, imposing multicloud portability requirements (and attendant complexity) upon them will probably add to your operational risks.
And I’m not kidding about the chaos engineering. If you’re not mature enough for chaos engineering, you’re not mature enough to successfully implement multicloud failover. If you don’t routinely shoot your own AZs and regions, kill access to services, kill application components, make your container hosts die, deliberately screw up your permissions and fail-closed, etc. and survive that all without worrying, you need to go address your probable risks of failure that have solutions of reasonable complexity, before you tackle the giant complex beast of multicloud failover to address the enormously unlikely event of total provider failure.
Remember that we’re trying to achieve continuity of our business processes and not continuity of particular applications. If you’ve done all of the above and you’re still worried about the miniscule probability of total provider failure, consider building simple alternative applications in another cloud provider (or on-premises, or in colo/hosting). Such applications might simply display cached data, or queue transactions for later processing. This is almost always easier than maintaining full cross-cloud portability for a complex application. Plus, don’t forget that there might be (gasp) paper alternatives for some processes.
(And yes, I already have a giant brick of a research note written on this topic, slated for publication at the end of this year. Stay tuned…)
Multicloud failover is almost always a terrible idea
Most people — and notably, almost all regulators — are entirely wrong about addressing cloud resilience through the belief that they should do multicloud failover because, as I noted in a previous blog post, the cloud is NOT just someone else’s computer. (I have been particularly aghast at a recent Reuters article about the Bank of England’s stance.)
Regulators, risk managers, and plenty of IT management largely think of AWS, Azure, etc. as monolithic entities, where “the cloud” can just break for them, and then kaboom, everything is dead everywhere worldwide. They imagine one gargantuan amorphous data center, subject to all the problems that can afflict single data centers, or single systems. But that’s not how it works, that’s not the most effective way to address risk, and testing the “resilience of the provider” (as a generic whole) is both impossible and meaningless.
I mean, yes, there’s the possibility of the catastrophic failure of practically any software technology. There could be, for instance, a bug in the control systems of airplanes from fill-in-the-blank manufacturer that could be simultaneously triggered at a particular time and cause all their airplanes to drop out of the sky simultaneously. But we don’t plan to make commercial airlines maintain backup planes from some other manufacturer in case it happens. Instead, we try to ensure that each plane is resilient in many ways — which importantly addresses the most probable forms of failure, which will be electrical or mechanical failures of particular components.
Hyperscale cloud providers are full of moving parts — lots of components, assembled together into something that looks and feels like a cohesive whole. Each of those components has its own form of resilience, and some of those components are more fragile than others. Some of those components are typically operating well within engineered tolerances. Some of those components might be operating at the edge of those tolerances in certain circumstances — likely due to unexpected pressures from scale — and might be extra-scary if the provider isn’t aware that they’re operating at that edge. In addition to fault-tolerance within each component, there are many mechanisms for fault-tolerance built into the interaction between those components.
Every provider also has its own equivalent of “maintenance” (returning to the plane analogy). The quality of the “mechanics” and the operations will also impact how well the system as a whole operates. (See my previous blog post, “The multi-headed hydra of cloud resilience” for the factors that go into provider resilience.)
It’s not impossible for a provider to have a worldwide outage that effectively impacts all services (rather than just a single service). Such outages are all typically rooted in something that prevents components from communicating with each other, or customers from connecting to the services — global network issues, DNS, security certificates, or identity. The first major incident of this type was the 2012 Azure leap year outage. The 2019 Google “Chubby” outage had global network impact, including on GCP. There have been multiple Azure AD outages with broad impact across Microsoft’s cloud portfolio, most recently the 2021 Azure Active Directory outage. (But there are certainly other possibilities. As recently as yesterday, there was a global Azure Windows VM outage that impacted all Windows VM-dependent services.)
Provider architectural and operational differences do clearly make a difference. AWS, notably, has never had a full regional failure or a global outage. The unique nature of GCP’s global network has both benefits and drawbacks. Azure has been improving steadily in reliability over the years as Microsoft addresses both service architecture and deployment (and other operations) processes.
Note that while these outages can be multi-hour, they have generally been short enough that — given typical enterprise recovery-time objectives for disaster recovery, which are often lengthy — customers typically don’t activate a traditional DR plan. (Customers may take other mitigation actions, i.e. failover to another region, failover to an alternative application for a business process, and so forth.)
Multicloud failover requires that you maintain full portability between two providers, which is a massive burden on your application developers. The basic compute runtime (whether VMs or containers) is not the problem, so OpenShift, Anthos, or other “I can move my containers” solutions won’t really help you. The problem is all the differentiators — the different network architectures and features, the different storage capabilities, the proprietary PaaS capabilities, the wildly different security capabilities, etc. Sure, you can run all open source in VMs, but at that point, why are you bothering with the cloud at all? Plus, even in a DR situation, you need some operational capabilities on the other cloud (monitoring, logging, etc.), even if not your full toolset.
Moreover, the huge cost and complexity of a multicloud implementation is effectively a negative distraction from what you should actually be doing that would improve your uptime and reduce your risks, which is making your applications resilient to the types of failure that are actually probable. More on that in a future blog post.
The cloud is NOT just someone else’s computer
I recently wrote a Twitter thread about cloud risk and resilience that drew a lot of interest, so I figured I’d expand on it in a blog post. I’ve been thinking about cloud resilience a lot recently, given that clients have been asking about how they manage their risks.
Inquiries about this historically come in waves, almost always triggered by incidents that raise awareness (unfortunately often because the customer has been directly impacted). A wave generally spans a multi-week period, causing waves to bleed into one another. Three distinct sets come to mind over the course of 2021:
- The Azure AD outages earlier this year had a huge impact on client thinking about concentration risks and critical service dependencies — often more related to M365 than Azure, though (and exacerbated by the critical dependency that many organizations have on Teams during this pandemic). Azure AD is core to SSO for many organizations, making its resilience enormously impactful. These impacts are still very top of mind for many clients, months later.
- The Akamai outage (and other CDN outages with hidden dependencies) this summer raised application and infrastructure dependency awareness, and came as a shock to many customers, as Akamai has generally been seen as a bedrock of dependability.
- The near-daily IBM Cloud “Severity 1” outages over the last month have drawn selective client mentions, rather than a wave, but add to the broader pattern of cloud risk concerns. (To my knowledge, there has been no public communication from IBM regarding root cause of these issues. Notifications indicate the outages are multi-service and multi-regional, often impacting all Gen 2 multizone regions. Kubernetes may be something of a common factor, to guess from the impact scope.)
Media amplification of outage awareness appears to have a lot to do with how seriously they’re taken by customers — or non-customers. Affecting stuff that’s consumed by end-users — i.e. office suites, consumer websites, etc. — gets vastly more attention than things that are “just” a really bad day for enterprise ops people. And there’s a negative halo effect — i.e. if Provider X fails, it tends to raise worries about all their competitors too. But even good media explanations and excellent RCAs tend to be misunderstood by readers — and even by smart IT people. This leads, in turn, to misunderstanding why cloud services fail and what the real risks are.
I recently completed my writing on a note about HA and failover (DR) patterns in cloud IaaS and PaaS, with a light touch on application design patterns for resilience. However, concerns about cloud resilience applies just as much — if not more so — to SaaS, especially API SaaS, which creates complicated and deep webs of dependencies.
You can buy T-shirts, stickers, and all manner of swag that says, “The cloud is just somebody else’s computer.” Cute slogan, but not true. Cloud services — especially at massive scale — are incredibly complex software systems. Complex software systems don’t fail the way a “computer” fails. The cloud exemplifies the failure principles laid out by Richard Cook in his classic “How Complex Systems Fail“.
As humans, we are really bad at figuring out the risk of complex systems, especially because the good ones are heavily defended against failure. And we tend to over-index on rare but dramatic risks (a plane crash) versus more commonplace risks (a car crash).
If you think about “my application hosted on AWS” as “well, it’s just sitting on a server in an AWS data center rather than mine”, then at some point in time, the nature of a failure is going to shock you, because you are wrong.
Cloud services fail after all of the resiliency mechanisms have failed (or sometimes, gone wrong in ways that contribute to the failure). Cloud services tend to go boom because of one or more software bugs, likely combined with either a configuration error or some kind of human error (often related to the deployment process for new configs and software versions). They are only rarely related to a physical failure — and generally the physical failure only became apparent to customers because the software intended to provide resilience against it failed in some fashion.
Far too many customers still think about cloud failure as a simple, fundamentally physical thing. Servers fail, so we should use more than one. Data centers fail, so we should be able to DR into another. Etc. But that model is wrong for cloud and for the digital age. We want to strive for continuous availability and resilience (including graceful degradation and other ways to continue business functionality when the application fails). And we have to plan for individual services failures rather than total cloud failure (whether in an AZ, region, or globally). Such failures can be small-scale, and effectively merely “instability”, rather than an “outage” — and therefore demands apps that are resilient to service errors.
So as cloud buyers, we have to think about our risks differently, and we need to architect and operate differently. But we also need to trust our providers — and trust smartly. To that end, cloud providers need to support us with transparency, so we can make more informed decisions. Key elements of that include:
- Publicly-documented engineering service-level objectives (SLOs), which are usually distinct from the financially-backed SLAs. This is what cloud providers design to internally and measure themselves against, and knowing that helps inform our own designs and internal SLOs for our apps.
- Service architecture documentation that helps us understand the ways a service is and isn’t resilient, so we can design accordingly.
- Documented service dependency maps, which allow us to see the chain of dependencies for each of the services we use, allowing us to think about if Service X is really the best fallback alternative if Service Y goes down, as well as inform our troubleshooting.
- Public status dashboards, clearly indicating the status of services, with solid historical data that allows us to see the track record of service operations. This helps with our troubleshooting and user communication.
- Public outage root-cause analysis (RCA), which allow us to understand why outages occurred, and receive a public pledge as to what will be done to prevent similar failures in the future. A historical archive of these is also a valuable resource.
- Change transparency that could help predict stability concerns. Because so many outages end up being related to new deployments / config changes, and the use of SRE principles, including error budgets, is pretty pervasive amongst cloud providers, there is often an interesting pattern to outages. Changes tend to freeze when the error budget is exceeded, leading to an on-and-off pattern of outages; instability can resume at intervals unpredictable to the customer.
Mission-critical cloud applications are becoming commonplace — both in the pervasive use of SaaS, along with widespread production use of IaaS and PaaS. It’s past time to modernize thinking about cloud operations, cloud resilience, and cloud BC/DR. Cloud risk management needs to be about intelligent mitigation and not avoidance, as forward-thinking businesses are will not accept simply avoiding the cloud at this point.
I am interested in your experiences with resilience as well as cloud instability and outages. Feel free to DM me on Twitter to chat about it.
The multicloud gelatinous cube
Pondering the care and feeding of your multicloud gelatinous cube. (Which engulfs everything in its path, and digests everything organic.)
Most organizations end up multicloud, rather than intending to be multicloud in a deliberate and structured way. So typical tales go like this: The org started doing digital business-related new applications on AWS and now AWS has become the center of gravity for all new cloud-native apps and cloud-related skills. Then the org decided to migrate “boring” LOB Windows-based COTS to the cloud for cost-savings, and lifted-and-shifted them onto Azure (thereby not actually saving money, but that’s a post for another day). Now the org has a data science team that thinks that GCP is unbearably sexy. And there’s a floating island out there of Oracle business applications where OCI is being contemplated. And don’t forget about the division in China, that hosts on Alibaba Cloud…
Multicloud is inevitable in almost all organizations. Cloud IaaS+PaaS spans such a wide swathe of IT functions that it’s impractical and unrealistic to assume that the organization will be single-vendor over the long term. Just like the enterprise tends to have at least three of everything (if not ten of everything), the enterprise is similarly not going to resist the temptation of being multicloud, even if it’s complex and challenging to manage, and significantly increases management costs. It is a rare organization that both has diverse business needs, and can exercise the discipline to use a single provider.
Despite recognizing the giant ooze that we see squelching our way, along with our unavoidable doom, there are things we can do to prepare, govern, and ensure that we retain some of our sanity.
For starters, we can actively choose our multicloud strategy and stance. We can classify providers into tiers, decide what providers are approved for use and under what circumstances, and decide what providers are preferred and/or strategic.
We can then determine the level of support that the organization is going to have for each tier — decide, for instance, that we’ll provide full governance and operations for our primary strategic provider, a lighter-weight approach that leans on an MSP to support our secondary strategic provider, and less support (or no support beyond basic risk management) for other providers.
After that, we can build an explicit workload placement policy that has an algorithm that guides application owners/architects in deciding where particular applications live, based on integration affinities, good technical fit, etc.
Note that cost-based provider selection and cost-based long-term workload placement are both terrible ideas. This is a constant fight between cloud architects and procurement managers. It is rooted in the erroneous idea that IaaS is a commodity, and that provider pricing advantages are long-term rather than short-lived. Using cost-based placement often leads to higher long-term TCO, not to mention a grand mess with data gravity and thus data management, and fragile application integrations.
See my new research note, “Comparing Cloud Workload Placement Strategies” (Gartner paywall) for a guide to multicloud IaaS / IaaS+PaaS strategies (including when you should pursue a single-cloud approach). In a few weeks, you’ll see the follow-up doc “Designing a Cloud Workload Placement Policy” publish, which provides a guide to writing such policies, with an analysis of different placement factors and their priorities.
Tiering self-service by user competence
A nontrivial chunk of my client conversations are centered on the topic of cloud IaaS/PaaS self-service, and how to deal with development teams (and other technical end-user teams, i.e. data scientists, researchers, hardware engineers, etc.) that use these services. These teams, and the individuals within those teams, often have different levels of competence with the clouds, operations, security, etc. but pretty much all of them want unfettered access.
Responsible governance requires appropriate guidelines (policies) and guardrails, and some managers and architects feel that there should be one universal policy, and everyone — from the highly competent digital business team, to the data scientists with a bit of ad-hoc infrastructure knowledge — should be treated identically for the sake of “fairness”. This tends to be a point of particular sensitivity if there are numerous application development teams with similar needs, but different levels of cloud competence. In these situations, applying a single approach is deadly — either for agility or your crisis-induced ulcer.
Creating a structured, tiered approach, with different levels of self-service and associated governance guidelines and guardrails, is the most flexible approach. Furthermore, teams that deploy primarily using a CI/CD pipeline have different needs from teams working manually in the cloud provider portal, which in turn are different from teams that would benefit from having an easy-vend template that gets provisioned out of a ServiceNow request.
The degree to which each team can reasonably create its own configurations is related to the team’s competence with cloud solution architecture, cloud engineering, and cloud security. Not every person on the team may have a high level of competence; in fact, that will generally not be the case. However, the very least, for full self-service there needs to be at least one person with strong competencies in each of those areas, who has oversight responsibilities, acts an expert (provides assistance/mentorship within the team), and does any necessary code review.
If you use CI/CD, you also want automation of such review in your pipeline, that includes your infrastructure-as-code (IaC) and cloud configs, not just the app code; i.e. a tool like Concourse Labs). Even if your whole pipeline isn’t automated, review of IaC during the dev stage, and not just when it triggers a cloud security posture management tool (like Palo Alto’s Prisma Cloud or Turbot), whether in dev, test, or production.
Who determines “competence”? To avoid nasty internal politics, it’s best to set this standard objectively. Certifications are a reasonable approach, but if your org isn’t the sort that tends to pay for internal certifications or the external certifications (AWS/Azure Solution Architect, DevOps Engineer, Security Engineer, etc.) seem like too high a bar, you can develop an internal training course and certification. It’s not a bad idea for all of your coders (whether app developers, data scientists, etc.) that use the cloud to get some formal training on creating good and secure cloud configurations, anyway.
(For Gartner clients: I’m happy to have a deeper discussion in inquiry. And yes, a formal research note on this is currently going through our editing process and will be published soon.)