This is the time of year when AR professionals ask analysts what they’re planning for next year. I don’t plan a year in advance. I tend not to even plan a quarter in advance. I write when the mood seizes me, which is probably unfortunate, but given that I write a lot it’s… okay-ish?
But I have a bunch of things drafted (either fully or partially) and that should get released in Q1 of next year. I have a general goal of trying to ensure that I back the advice I write for cloud architects with something for the CIO and other executive leaders that provides a bottom-line strategic summary, and/or material for the other teams the architects work with, so that I publish stuff as part of a set, either alone or in collaboration with analysts in other areas.
Cloud resilience (January): It’s increasingly common for clients to ask about architectural standards for HA/DR in the cloud. This note dissects why cloud services break, how to set architectural standards for HA and DR/failover (i.e. when to be multi-AZ, when to do cross-region failover, etc.), and some basic guidance on stability patterns (use of partitioning, bulkheads, backpressure, etc.)
Cloud self-service (January): Thankfully, most organizations are moving away from a service catalog-driven approach to cloud self-service in favor of cloud-native self-service. This note is about how to empower technical teams with self-service, while still providing appropriate governance.
The cloud operating model (February): Many clients are asking about how to organize for the cloud. This will be a triple-note set — one on designing a cloud operating model, one on implementing the operating model, and a colorful infographic summarizing the concept for the CIO and other executive leaders. It combines my previous guidance on Cloud Center of Excellence (CCOE), structuring FinOps and cloud sourcing, etc. with some new work on program management, and takes a deeper look at all the ways you can put this stuff together.
Cloud concentration risk (March): Concentration risk is a hot topic right now, especially in regulated industries. This concern spans IaaS, PaaS, and SaaS, and the dependencies are not always clear, so many organizations have concentration risk they’re not aware of. I intend to write a baseline note that other analysts have committed to contextualizing for audiences in different industries, as well as for cloud managed and professional services providers. While the sourcing risk of concentration remains minimal, the availability risks of concentration can be meaningful. An organization’s risk appetite and the business benefits of concentration should determine what, if any, steps they take to address concentration risk.
IaaS+PaaS provider evaluation update (April): Getting updated vendor evaluation research out in April basically means spending a good chunk of the first quarter doing that evaluation. (My January notes have already been written. And the February ones are mostly complete, so the schedule above isn’t implausible.) We are not currently discussing the form that this evaluation will take. Gartner management will communicate appropriately when the time comes (i.e. please don’t ask me, as I’m not at liberty to discuss it).
Cloud budget overruns don’t have a singular cause. Instead, they come in a bright rainbow of jelly belly flavors (the Bertie Botts ones, especially, will combine into a non-mouthwatering delight). Each needs different forms of response.
Ungoverned costs. This is the black licorice of FinOps problems. The organization has no idea what it’s spending, really, much less where the money is going, other than the big bills (or often, many little credit card bills) that they pay each month. This requires basic cost hygiene: analyze your cloud bills, get a cost management tool into place and make it useful through some tagging or partitioning discipline.
Unanticipated usage. This is the sour watermelon flavor of cost overruns — deliciously sweet yet mouth-puckering. In this situation, the organization is the victim of its own cloud success. Cloud has been such a great thing for the organization that more and more unanticipated cloud projects are showing up, blowing out the original budget estimates for cloud resources. Those cloud projects are delivering business value and it doesn’t make sense to say no to them (and even if central IT says no, the cloud costs can usually be paid for out of a line-of-business budget). Nevertheless, it’s causing a lot of organizational angst because central IT or the sourcing team didn’t anticipate this spending. This organization needs to learn to shift its budgeting processes for the digital future, and cloud chargeback will help support future decision-making.
No commitments. This is the minty wrongness of Bertie Botts toothpaste. The organization could get discounts by using public discounting mechanisms for commits (like AWS Savings Plans and Azure Reserved Instances) as well making a contractual commitment for a negotiated discount. But because the organization feels like they can’t perfectly predict their use and aren’t sure if they’ll use all of what they’re using today, they commit to nothing, therefore ensuring that they spend grotesquely more than they could be. This is universally a terrible idea. Organizations that aren’t in early pilot stage have long-term production applications and some predictability of usage; commit to the stuff you know you’re not killing off.
Dev/test waste. This is the mundane bleah-ness of Bertie Botts earwax. Developers are provisioning the biggest things they can get away with (or at least being overaggressive in their estimates of what they need), there are lots of abandoned resources idling away, and dev/test infrastructure that isn’t used outside of business hours isn’t being suspended when unused. This is what cloud cost management tools are great at doing — identifying obvious waste so that it can be eliminated, largely by shutting it down or suspending it, preferably via automation.
Too much production headroom. This is the mild weirdness of the Bertie Botts grass flavor. Application teams haven’t implemented autoscaling for applications that can scale horizontally, or they’ve overestimated how much production headroom an application with variable usage needs (which may result in oversizing compute units, or being overly aggressive with autoscaling). This requires implementing autoscaling with some thoughtful tuning of parameters, and possibly a business value conversation on the cost/benefit tradeoff of having higher application performance on a consistent basis.
Wrongsizing production. This is the awful lingering terribleness of Bertie Botts vomit, whose taste you cannot get out of your mouth. Production environments are statically overprovisioned and therefore overly costly. On-prem, 30% utilization is common, but it’s all capex and as long as it’s within budget, no one really cares about the waste. But in the cloud, you pay for that excess resource monthly, forcing you to confront the ongoing cost of the waste.
However, anyone who tells you to “just” rightsize has never actually tried to do this in practice within an enterprise. The problem is that applications that scale vertically typically can’t be easily rightsized. It’s likely difficult-to-impossible to do automatically, due to complicated application installation. The application is fragile and may be mission-critical, so you are cautious about maintenance downtime. And the application team — the only people who really understand how this thing works — is likely busy with other priorities.
If this is your situation, your cloud cost management tool may cause you to cry hopeless tears, because you can see the waste but taking remediation actions is a complicated cross-functional war dance and delicate negotiation that leaves everyone wondering if it wouldn’t have been easier to just keep paying a larger bill.
Suboptimal design and implementation. The controversial popcorn flavor. Architects are sometimes cost-oblivious when they design cloud solutions. They may make bad design choices, or changes in application features and behavior over time may have turned out to make a design choice unexpectedly expensive. Developers may write poorly-performing code that consumes a lot of infrastructure resources, or code that makes excessive (and, cumulatively, expensive) calls to cloud services. Your cloud cost management tools are unlikely to be of any use for detecting these situations. This needs to be addressed through performance engineering, with attention paid to the business value of the time/effort/money necessary to do so — and for many organizations may require bringing in third-party expertise to diagnose the problems and offer recommendations.
Notably, the answer to most of these issues is not “implement a cloud cost management tool”. The challenges aren’t really as simple as a lot of vendors (and talking heads) make them out to be.
One of the problems in doing “root cause analysis” within complex systems is that there’s almost never “one bad thing” that’s truly at the root of the problem, and talking about the incident as if there’s One True Root is probably not productive. It’s important to identify the full range of contributing factors, so that you can do something about those elements individually as well as de-risking the system as a whole.
I recently heard someone talk about struggling to shift the language in their org around root cause, and it occurred to me that adapting Macneil’s Five P factors model from medicine/psychology would be very useful in SRE “blameless postmortems” (or traditional ITIL problem management RCAs). I’ve never seen anything about using this model in IT, and a casual Google search turned up nothing, so I figured I’d write a blog post about it.
The Five Ps (described in IT terms) — well, really six Ps, a problem and five P factors — are as follows:
- The presenting problem is not only the core impact, but also its broader consequences, which all should be examined and addressed. For instance, “The FizzBots service was down” becomes “Our network was unstable, resulting in FizzBots service failure. Our call center was overwhelmed, our customers are mad at us, and we need to pay out on our SLAs.”
- The precipitating factors are the things that triggered the incident. There might not be a single trigger, and the trigger might not be a one-time event (i.e. it could be a rising issue that eventually crossed a threshold, such as exhaustion of a connection pool or running out of server capacity). For example, “A network engineer made a typo in a router configuration.”
- The perpetuating factors are the things that resulted in the incident continuing (or becoming worse), once triggered. For instance, “When the network was down, application components queued requests, ran out of memory, crashed, and had to be manually recovered.”
- The predisposing factors are the long-standing things that made it more likely that a bad situation would result. For instance, “We do not have automation that checks for bad configurations and prevents their propagation.” or “We are running outdated software on our load-balancers that contains a known bug that results in sometimes sending requests to unresponsive backends.”
- The protective factors are things that helped to limit the impact and scope (essentially, your resilience mechanisms). For instance, “We have automation that detected the problem and reverted the configuration change, so the network outage duration was brief.”
- The present factors are other factors that were relevant to the outcome (including “where we got lucky”). For instance, “A new version of an application component had just been pushed shortly before the network outage, complicating problem diagnosis,” or “The incident began at noon, when much of the ops team was out having lunch, delaying response.”
If you think about the October 2021 Facebook outage in these terms, the presenting problem was the outage of multiple major Facebook properties and their attendant consequences. The precipitating factor was the bad network config change, but it’s clearly not truly the “root cause”. (If your conclusion is “they should fire the careless engineer who made a typo”, your thinking is Wrong.) There were tons of contributing factors, all of which should be addressed. “Blame” can’t be laid at the feet of anyone in particular, though some of the predisposing and perpetuating factors clearly had more impact than others (and therefore should be addressed with higher priority).
I like this terminology because it’s a clean classification that encompasses a lot of different sorts of contributing factors, and it’s intended to be used in situations that have a fair amount of uncertainty to them. I think it could be useful to structure incident postmortems, and I’d be keen to know how it works for you, if you try it out.
Of late, I’ve been talking to a lot of organizations that have learned cloud lessons the hard way — and even more organizations who are newer cloud adopters who seem absolutely determined to make the same mistakes. (Note: Those waving little cloud-repatriation flags shouldn’t be hopeful. Organizations are fixing their errors and moving on successfully with their cloud adoption.)
If your leadership adopts the adage, “Move fast and break things!” then no one should be surprised when things break. If you don’t adequately manage your risks, sometimes things will break in spectacularly public ways, and result in your CIO and/or CISO getting fired.
Many organizations that adopt that philosophy (often with the corresponding imposition of “You build it, you run it!” upon application teams) not only abdicate responsibility to the application teams, but they lose all visibility into what’s going on at the application team level. So they’re not even aware of the risks that are out there, much less whether those risks are being adequately managed. The first time central risk teams become aware of the cracks in the foundation might be when the building collapses in an impressive plume of dust.
(Note that boldness and the willingness to experiment are different from recklessness. Trying out new business ideas that end up failing, attempting different innovative paths for implementing solutions that end up not working out, or rapidly trying a bunch of different things to see which works well — these are calculated risks. They’re absolutely things you should do if you can. That’s different from just doing everything at maximum speed and not worrying about the consequences.)
Just like cloud cost optimization might not be a business priority, broader risk management (especially security risk management) might not be a business priority. If adding new features is more important than address security vulnerabilities, no one should be shocked when vulnerabilities are left in a state of “busy – fix later”. (This is quite possibly worse than “drunk – fix later“, as that at least implies that the fix will be coming as soon as the writer sobers up, whereas busy-ness is essentially a state that tends to persist until death).
It’s faster to build applications that don’t have much if any resilience. It’s faster to build applications if you don’t have to worry about application security (or any other form of security). It’s faster to build applications if you don’t have to worry about performance or cost. It’s faster to build applications if you only need to think about the here-and-now and not any kind of future. It is, in short, faster if you are willing to accumulate meaningful technical debt that will be someone else’s problem to deal with later. (It’s especially convenient if you plan to take your money and run by switching jobs, ensuring you’re free of the consequences.)
“We hope the business and/or dev teams will behave responsibly” is a nice thought, but hope is not a strategy. This is especially true when you do little to nothing to ensure that those teams have the skills to behave responsibly, are usefully incentivized to behave responsibly, and receive enough governance to verify that they are behaving responsibly.
When it all goes pear-shaped, the C-level IT executives (especially the CIO, chief information security officer, and the chief risk officer) are going to be the ones to be held accountable and forced to resign under humiliating circumstances. Even if it’s just because “You should have known better than to let these risks go ungoverned”.
(This usually holds true even if business leaders insisted that they needed to move too quickly to allow risk to be appropriately managed, and those leaders were allowed to override the CIO/CISO/CRO, business leaders pretty much always escape accountability here, because they aren’t expected to have known better. Even when risk folks have made business leaders sign letters that say, “I have been made aware of the risks, and I agree to be personally responsible for them” it’s generally the risk leaders who get held accountable. The business leaders usually get off scott-free even with the written evidence.)
Risk management doesn’t entail never letting things break. Rather, it entails a consideration of risk impacts and probabilities, and thinking intelligently about how to deal with the risks (including implementing compensating controls when you’re doing something that you know is quite risky). But one little crack can, in combination with other little cracks (that you might or might or might not be aware of), result in big breaches. Things rarely break because of black swan events. Rather, they break because you ignored basic hygiene, like “patch known vulnerabilities”. (This can even impact big cloud providers, i.e. the recent Azurescape vulnerability, where Microsoft continued to use 2017-era known-vulnerable open-source code in production.)
However, even in organizations with central governance of risk, it’s all too common to have vulnerability management teams inform you-build-it-you-run-it dev teams that they need to fix Known Issue X. A busy developer will look at their warning, which gives them, say, 30 days to fix the vulnerability, which is within the time bounds of good practice. Then on day 30, the developer will request an extension, and it will probably be granted, giving them, say, another 30 days. When that runs out, the developer will request another extension, and they will repeat this until they run out the extension clock, whereupon usually 90 days or more have elapsed. At that point there will probably be a further delay for the security team to get involved in an enforcement action and actually fix the thing.
There are no magic solutions for this, especially in organizations where teams are so overwhelmed and overworked that anything that might possibly be construed as optional or lower-priority gets dropped on the floor, where it is trampled, forgotten, and covered in old chewing gum. (There are non-magical solutions that require work — more on that in future research notes.)
Moving fast and breaking things takes a toll. And note that sometimes what breaks are people, as the sheer number of things they need to cope with overload their coping mechanisms and they burn out (either in impressive pillars or flame, or quiet extinguishment into ashes).
You shouldn’t relegate cloud cost governance, management and optimization to a dedicated FinOps team. Effective management of cloud economics requires cross-functional collaboration and the establishment of cloud economics as a pervasive cultural practice.
Cloud economics is a practice that goes beyond cloud cost management. It is focused on maximizing the value of cloud computing to the business, rather than minimizing cloud expenses. For example, business leaders may reasonably make the decision to spend more to deliver a better user experience, or to ignore cost-related technical debt so application teams can focus on delivering more features.
You can’t effectively manage your cloud providers or the consumption of cloud within your organization without a solid collaboration between cloud architects, cloud operations, developers, the sourcing team, and your business leadership. Indeed, the business leadership is absolutely vital, as I’ve noted in a previous blog post (“Cloud cost overruns may be a business leadership failure“), and a new research note titled “Is FinOps the Answer to Cloud Cost Governance?” (Gartner executive leaders paywall).
In fact, that’s the first of a just-published a trio of notes that I’ve been wanting to write for the last five years but hadn’t found the right collaborator. In almost 15 years of covering cloud computing at Gartner, I’ve spent giant amounts of time with IT management (up through the CIO level), cloud architects, and sourcing managers, reviewing cloud contracts, hearing cloud success stories and hearing cost-management woes (the two are certainly not mutually exclusive). I’ve moderated more than a few fights between sourcing managers and cloud architects over topics like “should we choose the cheapest provider” and “who’s responsible for controlling our cloud costs”. Probably unsurprisingly given my technical biases, I’ve generally sided with the cloud architects, even though I’ve spent sufficient time with sourcing managers to be sympathetic to their goals.
In the meantime, my sourcing-analyst colleague Tobi Bet (and the rest of her team) had seen those same fights, but primarily from the perspective of the sourcing team. So I roped Tobi into doing a paired set of research notes with me. They’ve now published under the title “Managing Cloud Economics: A Role‘s Guide to Productive Relationships With Other_Role“. There’s a huge note for cloud architects (Gartner for Technical Professionals paywall) and a concise note for sourcing leaders (Gartner for IT Leaders paywall).
The purpose of these notes is to provide a unified perspective on questions like:
- Who should decide what cloud providers we use?
- Who should “own” the relationship with cloud vendors?
- Who should be responsible for cost management in the cloud?
- How should we resolve battles over cloud costs?
- How should we deal with cloud vendor lock-in?
It provides guidance for how to think about cloud economics (i.e. core principles), priorities, and responsibilities. The lengthy note for cloud architects has a giant pile of responsibility matrices for the specific things you have to do, for varying levels of cloud self-service, and across IaaS, PaaS, and SaaS. Ideally, if your various functions are arguing about cost management or cloud provider management, this note has an answer for you.
So group hug time: Everyone’s got to collaborate together to make this work. (And everyone’s got to have some accountability for doing their part.)
As I noted in a previous blog post, multicloud failover is almost always a terrible idea. While the notion that an entire cloud provider can go dark for a lengthy period of time (let’s say a day or more) is not entirely impossible, it’s the least probable of the many ways that an application can experience failure. Humans tend to over-index on catastrophic but low-probability events, so it’s not especially shocking that people fixate on the possibility, but before you spend precious people-effort (not to mention money) on multicloud failover, you should first properly resource all the other things you could be doing to improve your resilience in the cloud.
As I noted previously, five core things impact cloud resilience: physical design, logical (software) design, implementation quality, deployment processes, and operational processes. So you should select your cloud provider carefully. Some providers have a better track record of reliability than others — often related directly in differences in the five core resilience factors. I’m not suggesting that this be a primary selection criterion, but the less reliable your provider, the more you’re going to have to pour effort into resilience, knowing that the provider’s failures are going to test you in the real world. You should care most about the failure of global dependencies (identity, security certificates, NTP, DNS, etc.) that can affect all services worldwide, followed by multi-region failures (especially those that affect an entire geography).
However, those things aren’t just important for cloud providers. They also affect you, the application owner, and the way you should design, implement, update, and operate your application — whether that application is on-premises or in the cloud. Before you resort to multicloud failover, you should have done all of the below and concluded that you’ve already maximized your resilience via these techniques and still need more.
Start with local HA. When architecting a mission-critical application, design it to use whatever HA capabilities are available to you within an availability zone (AZ). Use a clustered (and preferably scale-out) architecture for the stuff you build yourself. Ensure you maximize the resilience options available from the cloud services.
Build good error-handling into your application. Your application should besmart about the way it handles errors, either from other application components or from cloud services (or other third-party components). It should exhibit polite retry behavior and implement circuit breakers to try to limit cascading failures. It should implement load-shedding, in recognition of the fact that rejecting excessive requests so that the requests that can be served receive decent performance is better than just collapsing into non-responsiveness. It should have fallback mechanisms for graceful degradation, to limit impact on users.
Architect the application’s internals for resilience. Techniques such as partitions and bulkheads are likely going to be reserved for larger-scale applications, but are vital for limiting the blast radius of failures. (If you have no idea what any of this terminology means, read Michael Nygard’s “Release It!” — in my personal opinion, if you read one book about mission-critical app design, that should probably be the one.)
Use multiple AZs. Run your application active-active across at least two, and preferably three, AZs within each region that you use. (Note that three can be considerably harder than two because most cloud provider services natively support running in two AZs simultaneously but not three. But that’s a far easier problem than multicloud failover.)
Use multiple regions. Run your application active-active across at least two, and preferably three regions. (Again, two is definitely much easier than three, due to a cloud service’s cross-region support generally being two regions.) If you can’t do that, do fast fully-automated regional failover.
Implement chaos engineering. Not only do you need to thoroughly test in your dev/QA environment to determine what happens under expected failure conditions, but you also need to experiment with fault injection in your production environment where there are complex unpredictable conditions that may cause unexpected failures. If this sounds scary and you expect it’ll blow up in your face, then you need to do a better job in the design and implementation of your application. Forcing constant failures into production systems (ala Netflix’s famed Chaos Monkey) helps you identify all the weak spots, builds resilience, and should help give you confidence that things will continue to work when cloud issues arise.
It’s really important to treat resilience as a systems concern, not purely an infrastructure concern. Your application architecture and implementation need to be resilient. If your developers can’t be trusted to write continuously available applications, imposing multicloud portability requirements (and attendant complexity) upon them will probably add to your operational risks.
And I’m not kidding about the chaos engineering. If you’re not mature enough for chaos engineering, you’re not mature enough to successfully implement multicloud failover. If you don’t routinely shoot your own AZs and regions, kill access to services, kill application components, make your container hosts die, deliberately screw up your permissions and fail-closed, etc. and survive that all without worrying, you need to go address your probable risks of failure that have solutions of reasonable complexity, before you tackle the giant complex beast of multicloud failover to address the enormously unlikely event of total provider failure.
Remember that we’re trying to achieve continuity of our business processes and not continuity of particular applications. If you’ve done all of the above and you’re still worried about the miniscule probability of total provider failure, consider building simple alternative applications in another cloud provider (or on-premises, or in colo/hosting). Such applications might simply display cached data, or queue transactions for later processing. This is almost always easier than maintaining full cross-cloud portability for a complex application. Plus, don’t forget that there might be (gasp) paper alternatives for some processes.
(And yes, I already have a giant brick of a research note written on this topic, slated for publication at the end of this year. Stay tuned…)
A couple of months back, some smart folks at VC firm Andreesen Horowitz wrote a blog post called “The Cost of Cloud, a Trillion Dollar Paradox“. Among other things, the blog made a big splash because it claimed, quote: “[W]hile cloud clearly delivers on its promise early on in a company’s journey, the pressure it puts on margins can start to outweigh the benefits, as a company scales and growth slows.” It claimed that cloud overspending was resulting in huge loss of market value, and that developers needed incentives to reduce spending.
The blog post is pretty sane, but plenty of people misinterpreted it, or took away only its most sensationalistic aspects. I think it’s critical to keep in mind the following:
Decisions about cloud expenditures are ultimately business decisions. Unnecessarily high cloud costs are the result of business decisions about priorities — specifically, about the time that developers and engineers devote to cost optimization versus other priorities.
For example, when developer time is at a premium, and pushing out features as fast as possible is the highest priority, business leadership can choose to allow the following things that are terrible for cloud cost:
- Developers can ignore all annoying administrative tasks, like rightsizing the infrastructure or turning off stuff that isn’t in active use.
- Architects can choose suboptimal designs that are easier and faster to implement, but which will cost more to run.
- Developers can implement crude algorithms and inefficient code in order to more rapidly deliver a feature, without thinking about performance optimizations that would result in less resource consumption.
- Developers can skip implementing support for more efficient consumption patterns, such as autoscaling.
- Developers can skip implementing deployment automation that would make it easier to automatically rightsize — potentially compounded by implementing the application in ways that are fragile and make it too risky and effortful to manually rightsize.
All of the above is effectively a form of technical debt. In the pursuit of speed, developers can consume infrastructure more aggressively themselves — not bothering to shut down unused infrastructure, running more CI jobs (or other QA tests), running multiple CI jobs in parallel, allocating bigger faster dev/test servers, etc. — but that’s short-term, not an ongoing cost burden the way that the technical debt is. (Note that the same prioritization issues also impact the extent to which developers cooperate in implementing security directives. That’s a tale for another day.)
The more those things are combined — bad designs, poorly implemented, that you can’t easily rightsize or scale — the more that you have a mess that you can’t untangle without significant expenditure of development time.
Now, some organizations will go put together a “FinOps” team to play whack-a-mole with infrastructure — killing/parking stuff that is idle and rightsizing the waste. And that might help short-term, but until you can automate that basic cost hygiene, this is non-value-added people-intensive work. And woe betide you if your implementations are fragile enough that rightsizing is operationally risky.
Once you’ve got your whack-a-mole down to a nice quick automated cadence, you’ve got to address the application design and implementation technical debt — and invest in the discipline of performance engineering — or you’ll continue paying unnecessarily high bills month after month. (You’d also be oversizing on-prem infrastructure, but people are used to that, and the capital expenditure is money spent, versus the grind of a monthly cloud bill.)
Business leaders have to step up to prioritize cloud cost optimization — or acknowledge that it isn’t a priority, and that it’s okay to waste money on resources as long as the top line is increasing faster. As long that’s a conscious, articulated decision, that’s fine. But we shouldn’t pretend that developers are inherently irresponsible. Developers, like other employees, respond to incentives, and if they’re evaluated on their velocity of feature delivery, they’re going to optimize their work efforts towards that end.
For more details, check out my new research note called “Is FinOps the Answer to Cloud Cost Governance?” which is paywalled and targeted at Gartner’s executive leader clients — a combination of CxOs and business leaders.
Most people — and notably, almost all regulators — are entirely wrong about addressing cloud resilience through the belief that they should do multicloud failover because, as I noted in a previous blog post, the cloud is NOT just someone else’s computer. (I have been particularly aghast at a recent Reuters article about the Bank of England’s stance.)
Regulators, risk managers, and plenty of IT management largely think of AWS, Azure, etc. as monolithic entities, where “the cloud” can just break for them, and then kaboom, everything is dead everywhere worldwide. They imagine one gargantuan amorphous data center, subject to all the problems that can afflict single data centers, or single systems. But that’s not how it works, that’s not the most effective way to address risk, and testing the “resilience of the provider” (as a generic whole) is both impossible and meaningless.
I mean, yes, there’s the possibility of the catastrophic failure of practically any software technology. There could be, for instance, a bug in the control systems of airplanes from fill-in-the-blank manufacturer that could be simultaneously triggered at a particular time and cause all their airplanes to drop out of the sky simultaneously. But we don’t plan to make commercial airlines maintain backup planes from some other manufacturer in case it happens. Instead, we try to ensure that each plane is resilient in many ways — which importantly addresses the most probable forms of failure, which will be electrical or mechanical failures of particular components.
Hyperscale cloud providers are full of moving parts — lots of components, assembled together into something that looks and feels like a cohesive whole. Each of those components has its own form of resilience, and some of those components are more fragile than others. Some of those components are typically operating well within engineered tolerances. Some of those components might be operating at the edge of those tolerances in certain circumstances — likely due to unexpected pressures from scale — and might be extra-scary if the provider isn’t aware that they’re operating at that edge. In addition to fault-tolerance within each component, there are many mechanisms for fault-tolerance built into the interaction between those components.
Every provider also has its own equivalent of “maintenance” (returning to the plane analogy). The quality of the “mechanics” and the operations will also impact how well the system as a whole operates. (See my previous blog post, “The multi-headed hydra of cloud resilience” for the factors that go into provider resilience.)
It’s not impossible for a provider to have a worldwide outage that effectively impacts all services (rather than just a single service). Such outages are all typically rooted in something that prevents components from communicating with each other, or customers from connecting to the services — global network issues, DNS, security certificates, or identity. The first major incident of this type was the 2012 Azure leap year outage. The 2019 Google “Chubby” outage had global network impact, including on GCP. There have been multiple Azure AD outages with broad impact across Microsoft’s cloud portfolio, most recently the 2021 Azure Active Directory outage. (But there are certainly other possibilities. As recently as yesterday, there was a global Azure Windows VM outage that impacted all Windows VM-dependent services.)
Provider architectural and operational differences do clearly make a difference. AWS, notably, has never had a full regional failure or a global outage. The unique nature of GCP’s global network has both benefits and drawbacks. Azure has been improving steadily in reliability over the years as Microsoft addresses both service architecture and deployment (and other operations) processes.
Note that while these outages can be multi-hour, they have generally been short enough that — given typical enterprise recovery-time objectives for disaster recovery, which are often lengthy — customers typically don’t activate a traditional DR plan. (Customers may take other mitigation actions, i.e. failover to another region, failover to an alternative application for a business process, and so forth.)
Multicloud failover requires that you maintain full portability between two providers, which is a massive burden on your application developers. The basic compute runtime (whether VMs or containers) is not the problem, so OpenShift, Anthos, or other “I can move my containers” solutions won’t really help you. The problem is all the differentiators — the different network architectures and features, the different storage capabilities, the proprietary PaaS capabilities, the wildly different security capabilities, etc. Sure, you can run all open source in VMs, but at that point, why are you bothering with the cloud at all? Plus, even in a DR situation, you need some operational capabilities on the other cloud (monitoring, logging, etc.), even if not your full toolset.
Moreover, the huge cost and complexity of a multicloud implementation is effectively a negative distraction from what you should actually be doing that would improve your uptime and reduce your risks, which is making your applications resilient to the types of failure that are actually probable. More on that in a future blog post.
In the past couple of months, I have talked to the majority of the world’s largest banks about what is necessary to drive successful cloud adoption at enterprise scale. These conversations have a lot of things in common with one another, and I often send the same research notes as a follow-up to our conversations. Here are those notes, with some context. The notes are all behind the Gartner paywall, in most cases Gartner for Technical Professionals, but some of these are available to IT Leaders clients, or Executive Programs clients.
Banks are indeed really moving core banking to the cloud. The long-held adage that “banks might put new systems of innovation or systems of engagement in the cloud, but they’ll never move core banking”, is crumbling. Gartner has statistics supporting this, which you can find in “Core Banking Hot Spot: Moving the Core Into the Cloud“.
Banks cite application modernization as a critical driver for cloud adoption. An increasing number of banks are migrating a substantial percentage of their existing application estate to public cloud IaaS (and PaaS). Supporting survey data can be found in “Application Modernization Is the Most Common Identified Priority for End-User Cloud Adoption in Banking and Investment Services” (but other priorities are closely clustered in importance).
Banks are striving to mature their cloud adoption. Some banks have had a lot of ad hoc adoption over the years, while other banks have been more cautious (venturing into a bit of SaaS but sometimes zero IaaS or PaaS). But we’ve hit the inflection point (starting about two years ago) where banks became comfortable with cloud provider security and then seemingly all of a sudden went to a “go go go!” mode in which cloud was viewed as a critical accelerator of digital banking initiatives. (See “Advance Through Public Cloud Adoption Maturity” for a view of typical journeys.)
Central cloud governance is the norm for banks. Banks generally like the Gartner-style cloud center of excellence (CCOE) model where an enterprise architecture function provides cloud governance, brokerage, and transformation assistance. (See “How to Build a Cloud Center of Excellence“.) However, their CCOE model is likely to be federated to empower different business units or regions to take charge of their own destinies (especially when the cloud strategy is more regional than global). And many banks are splitting off a separate cloud IT unit under a deputy CIO, which is effectively a self-contained organization with hundreds of people devoted to the cloud migration and transformation effort.
While banks still do detailed technical evaluation of cloud providers, strategic selection is based on alignment to the IT strategy. Banks still really care about nitpicky technical details, but ultimately, their selection of strategic providers is based on broader IT priorities, just like most other cloud customers these days. (See “How to Initiate the Selection of Strategic Cloud IaaS Providers“.) Sometimes there’s a certain degree of hope for some kind of innovation partnership. (I am cynical about such “partnerships”, especially when they come in the form of vague sales platitudes without contractual guarantees or a close business development relationship.)
Banks tend to be multicloud. The larger the bank, the more likely it is to adopt a multicloud strategy, similar to other enterprises (see “Comparing Cloud Workload Placement Strategies“). However, this does not mean that all cloud providers are treated equally. My anecdotal impression is that in terms of primary strategic provider, AWS dominates the the top end of the market (the largest banks) but that Azure captures the middle of the pack (from the US midmarket banks that tend to outsource their processing, to the banks that are important at the country/region level but not highly global).
Banks are making the transition to a more systematic approach to multicloud. Like many large distributed enterprises, banks often have pockets of cloud adoption, each aligned to a different cloud provider. With the maturation of their cloud journeys, they are becoming more systematic, building workload placement policies to guide where workloads should go. (See “Designing a Cloud Workload Placement Policy Document“.)
Banks worry about cloud concentration risks. Many banks face regulatory regimes that require them to address concentration risk. Regulators tend not to provide prescriptive guidance for what they must do, though. Banks have told me that attempting to maintain multicloud portability for applications essentially destroys the business case for cloud. Portability significantly impacts application development time, thus reducing the agility benefits. Without the ability to exploit the unique differentiated capabilities of a cloud provider, there’s little compelling reason not to just do it on-premises — which might actually be more risky than doing it in the cloud. There are effective practical risk-reduction approaches that don’t involve “maintain constant portability of all my apps”, though. (See “How to Create a Public Cloud Integrated IaaS and PaaS Exit Plan“.)
I hope to collaborate with a Gartner colleague to write bank-targeted research in the future. If you’re a cloud architect at a bank, I’d love to speak with you in client inquiry.
I recently wrote a Twitter thread about cloud risk and resilience that drew a lot of interest, so I figured I’d expand on it in a blog post. I’ve been thinking about cloud resilience a lot recently, given that clients have been asking about how they manage their risks.
Inquiries about this historically come in waves, almost always triggered by incidents that raise awareness (unfortunately often because the customer has been directly impacted). A wave generally spans a multi-week period, causing waves to bleed into one another. Three distinct sets come to mind over the course of 2021:
- The Azure AD outages earlier this year had a huge impact on client thinking about concentration risks and critical service dependencies — often more related to M365 than Azure, though (and exacerbated by the critical dependency that many organizations have on Teams during this pandemic). Azure AD is core to SSO for many organizations, making its resilience enormously impactful. These impacts are still very top of mind for many clients, months later.
- The Akamai outage (and other CDN outages with hidden dependencies) this summer raised application and infrastructure dependency awareness, and came as a shock to many customers, as Akamai has generally been seen as a bedrock of dependability.
- The near-daily IBM Cloud “Severity 1” outages over the last month have drawn selective client mentions, rather than a wave, but add to the broader pattern of cloud risk concerns. (To my knowledge, there has been no public communication from IBM regarding root cause of these issues. Notifications indicate the outages are multi-service and multi-regional, often impacting all Gen 2 multizone regions. Kubernetes may be something of a common factor, to guess from the impact scope.)
Media amplification of outage awareness appears to have a lot to do with how seriously they’re taken by customers — or non-customers. Affecting stuff that’s consumed by end-users — i.e. office suites, consumer websites, etc. — gets vastly more attention than things that are “just” a really bad day for enterprise ops people. And there’s a negative halo effect — i.e. if Provider X fails, it tends to raise worries about all their competitors too. But even good media explanations and excellent RCAs tend to be misunderstood by readers — and even by smart IT people. This leads, in turn, to misunderstanding why cloud services fail and what the real risks are.
I recently completed my writing on a note about HA and failover (DR) patterns in cloud IaaS and PaaS, with a light touch on application design patterns for resilience. However, concerns about cloud resilience applies just as much — if not more so — to SaaS, especially API SaaS, which creates complicated and deep webs of dependencies.
You can buy T-shirts, stickers, and all manner of swag that says, “The cloud is just somebody else’s computer.” Cute slogan, but not true. Cloud services — especially at massive scale — are incredibly complex software systems. Complex software systems don’t fail the way a “computer” fails. The cloud exemplifies the failure principles laid out by Richard Cook in his classic “How Complex Systems Fail“.
As humans, we are really bad at figuring out the risk of complex systems, especially because the good ones are heavily defended against failure. And we tend to over-index on rare but dramatic risks (a plane crash) versus more commonplace risks (a car crash).
If you think about “my application hosted on AWS” as “well, it’s just sitting on a server in an AWS data center rather than mine”, then at some point in time, the nature of a failure is going to shock you, because you are wrong.
Cloud services fail after all of the resiliency mechanisms have failed (or sometimes, gone wrong in ways that contribute to the failure). Cloud services tend to go boom because of one or more software bugs, likely combined with either a configuration error or some kind of human error (often related to the deployment process for new configs and software versions). They are only rarely related to a physical failure — and generally the physical failure only became apparent to customers because the software intended to provide resilience against it failed in some fashion.
Far too many customers still think about cloud failure as a simple, fundamentally physical thing. Servers fail, so we should use more than one. Data centers fail, so we should be able to DR into another. Etc. But that model is wrong for cloud and for the digital age. We want to strive for continuous availability and resilience (including graceful degradation and other ways to continue business functionality when the application fails). And we have to plan for individual services failures rather than total cloud failure (whether in an AZ, region, or globally). Such failures can be small-scale, and effectively merely “instability”, rather than an “outage” — and therefore demands apps that are resilient to service errors.
So as cloud buyers, we have to think about our risks differently, and we need to architect and operate differently. But we also need to trust our providers — and trust smartly. To that end, cloud providers need to support us with transparency, so we can make more informed decisions. Key elements of that include:
- Publicly-documented engineering service-level objectives (SLOs), which are usually distinct from the financially-backed SLAs. This is what cloud providers design to internally and measure themselves against, and knowing that helps inform our own designs and internal SLOs for our apps.
- Service architecture documentation that helps us understand the ways a service is and isn’t resilient, so we can design accordingly.
- Documented service dependency maps, which allow us to see the chain of dependencies for each of the services we use, allowing us to think about if Service X is really the best fallback alternative if Service Y goes down, as well as inform our troubleshooting.
- Public status dashboards, clearly indicating the status of services, with solid historical data that allows us to see the track record of service operations. This helps with our troubleshooting and user communication.
- Public outage root-cause analysis (RCA), which allow us to understand why outages occurred, and receive a public pledge as to what will be done to prevent similar failures in the future. A historical archive of these is also a valuable resource.
- Change transparency that could help predict stability concerns. Because so many outages end up being related to new deployments / config changes, and the use of SRE principles, including error budgets, is pretty pervasive amongst cloud providers, there is often an interesting pattern to outages. Changes tend to freeze when the error budget is exceeded, leading to an on-and-off pattern of outages; instability can resume at intervals unpredictable to the customer.
Mission-critical cloud applications are becoming commonplace — both in the pervasive use of SaaS, along with widespread production use of IaaS and PaaS. It’s past time to modernize thinking about cloud operations, cloud resilience, and cloud BC/DR. Cloud risk management needs to be about intelligent mitigation and not avoidance, as forward-thinking businesses are will not accept simply avoiding the cloud at this point.
I am interested in your experiences with resilience as well as cloud instability and outages. Feel free to DM me on Twitter to chat about it.