Author Archives: Lydia Leong
Group hugs for managing cloud economics
You shouldn’t relegate cloud cost governance, management and optimization to a dedicated FinOps team. Effective management of cloud economics requires cross-functional collaboration and the establishment of cloud economics as a pervasive cultural practice.
Cloud economics is a practice that goes beyond cloud cost management. It is focused on maximizing the value of cloud computing to the business, rather than minimizing cloud expenses. For example, business leaders may reasonably make the decision to spend more to deliver a better user experience, or to ignore cost-related technical debt so application teams can focus on delivering more features.
You can’t effectively manage your cloud providers or the consumption of cloud within your organization without a solid collaboration between cloud architects, cloud operations, developers, the sourcing team, and your business leadership. Indeed, the business leadership is absolutely vital, as I’ve noted in a previous blog post (“Cloud cost overruns may be a business leadership failure“), and a new research note titled “Is FinOps the Answer to Cloud Cost Governance?” (Gartner executive leaders paywall).
In fact, that’s the first of a just-published a trio of notes that I’ve been wanting to write for the last five years but hadn’t found the right collaborator. In almost 15 years of covering cloud computing at Gartner, I’ve spent giant amounts of time with IT management (up through the CIO level), cloud architects, and sourcing managers, reviewing cloud contracts, hearing cloud success stories and hearing cost-management woes (the two are certainly not mutually exclusive). I’ve moderated more than a few fights between sourcing managers and cloud architects over topics like “should we choose the cheapest provider” and “who’s responsible for controlling our cloud costs”. Probably unsurprisingly given my technical biases, I’ve generally sided with the cloud architects, even though I’ve spent sufficient time with sourcing managers to be sympathetic to their goals.
In the meantime, my sourcing-analyst colleague Tobi Bet (and the rest of her team) had seen those same fights, but primarily from the perspective of the sourcing team. So I roped Tobi into doing a paired set of research notes with me. They’ve now published under the title “Managing Cloud Economics: A Role‘s Guide to Productive Relationships With Other_Role“. There’s a huge note for cloud architects (Gartner for Technical Professionals paywall) and a concise note for sourcing leaders (Gartner for IT Leaders paywall).
The purpose of these notes is to provide a unified perspective on questions like:
- Who should decide what cloud providers we use?
- Who should “own” the relationship with cloud vendors?
- Who should be responsible for cost management in the cloud?
- How should we resolve battles over cloud costs?
- How should we deal with cloud vendor lock-in?
It provides guidance for how to think about cloud economics (i.e. core principles), priorities, and responsibilities. The lengthy note for cloud architects has a giant pile of responsibility matrices for the specific things you have to do, for varying levels of cloud self-service, and across IaaS, PaaS, and SaaS. Ideally, if your various functions are arguing about cost management or cloud provider management, this note has an answer for you.
So group hug time: Everyone’s got to collaborate together to make this work. (And everyone’s got to have some accountability for doing their part.)
Improving cloud resilience through stuff that works
As I noted in a previous blog post, multicloud failover is almost always a terrible idea. While the notion that an entire cloud provider can go dark for a lengthy period of time (let’s say a day or more) is not entirely impossible, it’s the least probable of the many ways that an application can experience failure. Humans tend to over-index on catastrophic but low-probability events, so it’s not especially shocking that people fixate on the possibility, but before you spend precious people-effort (not to mention money) on multicloud failover, you should first properly resource all the other things you could be doing to improve your resilience in the cloud.
As I noted previously, five core things impact cloud resilience: physical design, logical (software) design, implementation quality, deployment processes, and operational processes. So you should select your cloud provider carefully. Some providers have a better track record of reliability than others — often related directly in differences in the five core resilience factors. I’m not suggesting that this be a primary selection criterion, but the less reliable your provider, the more you’re going to have to pour effort into resilience, knowing that the provider’s failures are going to test you in the real world. You should care most about the failure of global dependencies (identity, security certificates, NTP, DNS, etc.) that can affect all services worldwide, followed by multi-region failures (especially those that affect an entire geography).
However, those things aren’t just important for cloud providers. They also affect you, the application owner, and the way you should design, implement, update, and operate your application — whether that application is on-premises or in the cloud. Before you resort to multicloud failover, you should have done all of the below and concluded that you’ve already maximized your resilience via these techniques and still need more.
Start with local HA. When architecting a mission-critical application, design it to use whatever HA capabilities are available to you within an availability zone (AZ). Use a clustered (and preferably scale-out) architecture for the stuff you build yourself. Ensure you maximize the resilience options available from the cloud services.
Build good error-handling into your application. Your application should besmart about the way it handles errors, either from other application components or from cloud services (or other third-party components). It should exhibit polite retry behavior and implement circuit breakers to try to limit cascading failures. It should implement load-shedding, in recognition of the fact that rejecting excessive requests so that the requests that can be served receive decent performance is better than just collapsing into non-responsiveness. It should have fallback mechanisms for graceful degradation, to limit impact on users.
Architect the application’s internals for resilience. Techniques such as partitions and bulkheads are likely going to be reserved for larger-scale applications, but are vital for limiting the blast radius of failures. (If you have no idea what any of this terminology means, read Michael Nygard’s “Release It!” — in my personal opinion, if you read one book about mission-critical app design, that should probably be the one.)
Use multiple AZs. Run your application active-active across at least two, and preferably three, AZs within each region that you use. (Note that three can be considerably harder than two because most cloud provider services natively support running in two AZs simultaneously but not three. But that’s a far easier problem than multicloud failover.)
Use multiple regions. Run your application active-active across at least two, and preferably three regions. (Again, two is definitely much easier than three, due to a cloud service’s cross-region support generally being two regions.) If you can’t do that, do fast fully-automated regional failover.
Implement chaos engineering. Not only do you need to thoroughly test in your dev/QA environment to determine what happens under expected failure conditions, but you also need to experiment with fault injection in your production environment where there are complex unpredictable conditions that may cause unexpected failures. If this sounds scary and you expect it’ll blow up in your face, then you need to do a better job in the design and implementation of your application. Forcing constant failures into production systems (ala Netflix’s famed Chaos Monkey) helps you identify all the weak spots, builds resilience, and should help give you confidence that things will continue to work when cloud issues arise.
It’s really important to treat resilience as a systems concern, not purely an infrastructure concern. Your application architecture and implementation need to be resilient. If your developers can’t be trusted to write continuously available applications, imposing multicloud portability requirements (and attendant complexity) upon them will probably add to your operational risks.
And I’m not kidding about the chaos engineering. If you’re not mature enough for chaos engineering, you’re not mature enough to successfully implement multicloud failover. If you don’t routinely shoot your own AZs and regions, kill access to services, kill application components, make your container hosts die, deliberately screw up your permissions and fail-closed, etc. and survive that all without worrying, you need to go address your probable risks of failure that have solutions of reasonable complexity, before you tackle the giant complex beast of multicloud failover to address the enormously unlikely event of total provider failure.
Remember that we’re trying to achieve continuity of our business processes and not continuity of particular applications. If you’ve done all of the above and you’re still worried about the miniscule probability of total provider failure, consider building simple alternative applications in another cloud provider (or on-premises, or in colo/hosting). Such applications might simply display cached data, or queue transactions for later processing. This is almost always easier than maintaining full cross-cloud portability for a complex application. Plus, don’t forget that there might be (gasp) paper alternatives for some processes.
(And yes, I already have a giant brick of a research note written on this topic, slated for publication at the end of this year. Stay tuned…)
Cloud cost overruns may be a business leadership failure
A couple of months back, some smart folks at VC firm Andreesen Horowitz wrote a blog post called “The Cost of Cloud, a Trillion Dollar Paradox“. Among other things, the blog made a big splash because it claimed, quote: “[W]hile cloud clearly delivers on its promise early on in a company’s journey, the pressure it puts on margins can start to outweigh the benefits, as a company scales and growth slows.” It claimed that cloud overspending was resulting in huge loss of market value, and that developers needed incentives to reduce spending.
The blog post is pretty sane, but plenty of people misinterpreted it, or took away only its most sensationalistic aspects. I think it’s critical to keep in mind the following:
Decisions about cloud expenditures are ultimately business decisions. Unnecessarily high cloud costs are the result of business decisions about priorities — specifically, about the time that developers and engineers devote to cost optimization versus other priorities.
For example, when developer time is at a premium, and pushing out features as fast as possible is the highest priority, business leadership can choose to allow the following things that are terrible for cloud cost:
- Developers can ignore all annoying administrative tasks, like rightsizing the infrastructure or turning off stuff that isn’t in active use.
- Architects can choose suboptimal designs that are easier and faster to implement, but which will cost more to run.
- Developers can implement crude algorithms and inefficient code in order to more rapidly deliver a feature, without thinking about performance optimizations that would result in less resource consumption.
- Developers can skip implementing support for more efficient consumption patterns, such as autoscaling.
- Developers can skip implementing deployment automation that would make it easier to automatically rightsize — potentially compounded by implementing the application in ways that are fragile and make it too risky and effortful to manually rightsize.
All of the above is effectively a form of technical debt. In the pursuit of speed, developers can consume infrastructure more aggressively themselves — not bothering to shut down unused infrastructure, running more CI jobs (or other QA tests), running multiple CI jobs in parallel, allocating bigger faster dev/test servers, etc. — but that’s short-term, not an ongoing cost burden the way that the technical debt is. (Note that the same prioritization issues also impact the extent to which developers cooperate in implementing security directives. That’s a tale for another day.)
The more those things are combined — bad designs, poorly implemented, that you can’t easily rightsize or scale — the more that you have a mess that you can’t untangle without significant expenditure of development time.
Now, some organizations will go put together a “FinOps” team to play whack-a-mole with infrastructure — killing/parking stuff that is idle and rightsizing the waste. And that might help short-term, but until you can automate that basic cost hygiene, this is non-value-added people-intensive work. And woe betide you if your implementations are fragile enough that rightsizing is operationally risky.
Once you’ve got your whack-a-mole down to a nice quick automated cadence, you’ve got to address the application design and implementation technical debt — and invest in the discipline of performance engineering — or you’ll continue paying unnecessarily high bills month after month. (You’d also be oversizing on-prem infrastructure, but people are used to that, and the capital expenditure is money spent, versus the grind of a monthly cloud bill.)
Business leaders have to step up to prioritize cloud cost optimization — or acknowledge that it isn’t a priority, and that it’s okay to waste money on resources as long as the top line is increasing faster. As long that’s a conscious, articulated decision, that’s fine. But we shouldn’t pretend that developers are inherently irresponsible. Developers, like other employees, respond to incentives, and if they’re evaluated on their velocity of feature delivery, they’re going to optimize their work efforts towards that end.
For more details, check out my new research note called “Is FinOps the Answer to Cloud Cost Governance?” which is paywalled and targeted at Gartner’s executive leader clients — a combination of CxOs and business leaders.
Multicloud failover is almost always a terrible idea
Most people — and notably, almost all regulators — are entirely wrong about addressing cloud resilience through the belief that they should do multicloud failover because, as I noted in a previous blog post, the cloud is NOT just someone else’s computer. (I have been particularly aghast at a recent Reuters article about the Bank of England’s stance.)
Regulators, risk managers, and plenty of IT management largely think of AWS, Azure, etc. as monolithic entities, where “the cloud” can just break for them, and then kaboom, everything is dead everywhere worldwide. They imagine one gargantuan amorphous data center, subject to all the problems that can afflict single data centers, or single systems. But that’s not how it works, that’s not the most effective way to address risk, and testing the “resilience of the provider” (as a generic whole) is both impossible and meaningless.
I mean, yes, there’s the possibility of the catastrophic failure of practically any software technology. There could be, for instance, a bug in the control systems of airplanes from fill-in-the-blank manufacturer that could be simultaneously triggered at a particular time and cause all their airplanes to drop out of the sky simultaneously. But we don’t plan to make commercial airlines maintain backup planes from some other manufacturer in case it happens. Instead, we try to ensure that each plane is resilient in many ways — which importantly addresses the most probable forms of failure, which will be electrical or mechanical failures of particular components.
Hyperscale cloud providers are full of moving parts — lots of components, assembled together into something that looks and feels like a cohesive whole. Each of those components has its own form of resilience, and some of those components are more fragile than others. Some of those components are typically operating well within engineered tolerances. Some of those components might be operating at the edge of those tolerances in certain circumstances — likely due to unexpected pressures from scale — and might be extra-scary if the provider isn’t aware that they’re operating at that edge. In addition to fault-tolerance within each component, there are many mechanisms for fault-tolerance built into the interaction between those components.
Every provider also has its own equivalent of “maintenance” (returning to the plane analogy). The quality of the “mechanics” and the operations will also impact how well the system as a whole operates. (See my previous blog post, “The multi-headed hydra of cloud resilience” for the factors that go into provider resilience.)
It’s not impossible for a provider to have a worldwide outage that effectively impacts all services (rather than just a single service). Such outages are all typically rooted in something that prevents components from communicating with each other, or customers from connecting to the services — global network issues, DNS, security certificates, or identity. The first major incident of this type was the 2012 Azure leap year outage. The 2019 Google “Chubby” outage had global network impact, including on GCP. There have been multiple Azure AD outages with broad impact across Microsoft’s cloud portfolio, most recently the 2021 Azure Active Directory outage. (But there are certainly other possibilities. As recently as yesterday, there was a global Azure Windows VM outage that impacted all Windows VM-dependent services.)
Provider architectural and operational differences do clearly make a difference. AWS, notably, has never had a full regional failure or a global outage. The unique nature of GCP’s global network has both benefits and drawbacks. Azure has been improving steadily in reliability over the years as Microsoft addresses both service architecture and deployment (and other operations) processes.
Note that while these outages can be multi-hour, they have generally been short enough that — given typical enterprise recovery-time objectives for disaster recovery, which are often lengthy — customers typically don’t activate a traditional DR plan. (Customers may take other mitigation actions, i.e. failover to another region, failover to an alternative application for a business process, and so forth.)
Multicloud failover requires that you maintain full portability between two providers, which is a massive burden on your application developers. The basic compute runtime (whether VMs or containers) is not the problem, so OpenShift, Anthos, or other “I can move my containers” solutions won’t really help you. The problem is all the differentiators — the different network architectures and features, the different storage capabilities, the proprietary PaaS capabilities, the wildly different security capabilities, etc. Sure, you can run all open source in VMs, but at that point, why are you bothering with the cloud at all? Plus, even in a DR situation, you need some operational capabilities on the other cloud (monitoring, logging, etc.), even if not your full toolset.
Moreover, the huge cost and complexity of a multicloud implementation is effectively a negative distraction from what you should actually be doing that would improve your uptime and reduce your risks, which is making your applications resilient to the types of failure that are actually probable. More on that in a future blog post.
Banks are accelerating their cloud journeys
In the past couple of months, I have talked to the majority of the world’s largest banks about what is necessary to drive successful cloud adoption at enterprise scale. These conversations have a lot of things in common with one another, and I often send the same research notes as a follow-up to our conversations. Here are those notes, with some context. The notes are all behind the Gartner paywall, in most cases Gartner for Technical Professionals, but some of these are available to IT Leaders clients, or Executive Programs clients.
Banks are indeed really moving core banking to the cloud. The long-held adage that “banks might put new systems of innovation or systems of engagement in the cloud, but they’ll never move core banking”, is crumbling. Gartner has statistics supporting this, which you can find in “Core Banking Hot Spot: Moving the Core Into the Cloud“.
Banks cite application modernization as a critical driver for cloud adoption. An increasing number of banks are migrating a substantial percentage of their existing application estate to public cloud IaaS (and PaaS). Supporting survey data can be found in “Application Modernization Is the Most Common Identified Priority for End-User Cloud Adoption in Banking and Investment Services” (but other priorities are closely clustered in importance).
Banks are striving to mature their cloud adoption. Some banks have had a lot of ad hoc adoption over the years, while other banks have been more cautious (venturing into a bit of SaaS but sometimes zero IaaS or PaaS). But we’ve hit the inflection point (starting about two years ago) where banks became comfortable with cloud provider security and then seemingly all of a sudden went to a “go go go!” mode in which cloud was viewed as a critical accelerator of digital banking initiatives. (See “Advance Through Public Cloud Adoption Maturity” for a view of typical journeys.)
Central cloud governance is the norm for banks. Banks generally like the Gartner-style cloud center of excellence (CCOE) model where an enterprise architecture function provides cloud governance, brokerage, and transformation assistance. (See “How to Build a Cloud Center of Excellence“.) However, their CCOE model is likely to be federated to empower different business units or regions to take charge of their own destinies (especially when the cloud strategy is more regional than global). And many banks are splitting off a separate cloud IT unit under a deputy CIO, which is effectively a self-contained organization with hundreds of people devoted to the cloud migration and transformation effort.
While banks still do detailed technical evaluation of cloud providers, strategic selection is based on alignment to the IT strategy. Banks still really care about nitpicky technical details, but ultimately, their selection of strategic providers is based on broader IT priorities, just like most other cloud customers these days. (See “How to Initiate the Selection of Strategic Cloud IaaS Providers“.) Sometimes there’s a certain degree of hope for some kind of innovation partnership. (I am cynical about such “partnerships”, especially when they come in the form of vague sales platitudes without contractual guarantees or a close business development relationship.)
Banks tend to be multicloud. The larger the bank, the more likely it is to adopt a multicloud strategy, similar to other enterprises (see “Comparing Cloud Workload Placement Strategies“). However, this does not mean that all cloud providers are treated equally. My anecdotal impression is that in terms of primary strategic provider, AWS dominates the the top end of the market (the largest banks) but that Azure captures the middle of the pack (from the US midmarket banks that tend to outsource their processing, to the banks that are important at the country/region level but not highly global).
Banks are making the transition to a more systematic approach to multicloud. Like many large distributed enterprises, banks often have pockets of cloud adoption, each aligned to a different cloud provider. With the maturation of their cloud journeys, they are becoming more systematic, building workload placement policies to guide where workloads should go. (See “Designing a Cloud Workload Placement Policy Document“.)
Banks worry about cloud concentration risks. Many banks face regulatory regimes that require them to address concentration risk. Regulators tend not to provide prescriptive guidance for what they must do, though. Banks have told me that attempting to maintain multicloud portability for applications essentially destroys the business case for cloud. Portability significantly impacts application development time, thus reducing the agility benefits. Without the ability to exploit the unique differentiated capabilities of a cloud provider, there’s little compelling reason not to just do it on-premises — which might actually be more risky than doing it in the cloud. There are effective practical risk-reduction approaches that don’t involve “maintain constant portability of all my apps”, though. (See “How to Create a Public Cloud Integrated IaaS and PaaS Exit Plan“.)
I hope to collaborate with a Gartner colleague to write bank-targeted research in the future. If you’re a cloud architect at a bank, I’d love to speak with you in client inquiry.
The cloud is NOT just someone else’s computer
I recently wrote a Twitter thread about cloud risk and resilience that drew a lot of interest, so I figured I’d expand on it in a blog post. I’ve been thinking about cloud resilience a lot recently, given that clients have been asking about how they manage their risks.
Inquiries about this historically come in waves, almost always triggered by incidents that raise awareness (unfortunately often because the customer has been directly impacted). A wave generally spans a multi-week period, causing waves to bleed into one another. Three distinct sets come to mind over the course of 2021:
- The Azure AD outages earlier this year had a huge impact on client thinking about concentration risks and critical service dependencies — often more related to M365 than Azure, though (and exacerbated by the critical dependency that many organizations have on Teams during this pandemic). Azure AD is core to SSO for many organizations, making its resilience enormously impactful. These impacts are still very top of mind for many clients, months later.
- The Akamai outage (and other CDN outages with hidden dependencies) this summer raised application and infrastructure dependency awareness, and came as a shock to many customers, as Akamai has generally been seen as a bedrock of dependability.
- The near-daily IBM Cloud “Severity 1” outages over the last month have drawn selective client mentions, rather than a wave, but add to the broader pattern of cloud risk concerns. (To my knowledge, there has been no public communication from IBM regarding root cause of these issues. Notifications indicate the outages are multi-service and multi-regional, often impacting all Gen 2 multizone regions. Kubernetes may be something of a common factor, to guess from the impact scope.)
Media amplification of outage awareness appears to have a lot to do with how seriously they’re taken by customers — or non-customers. Affecting stuff that’s consumed by end-users — i.e. office suites, consumer websites, etc. — gets vastly more attention than things that are “just” a really bad day for enterprise ops people. And there’s a negative halo effect — i.e. if Provider X fails, it tends to raise worries about all their competitors too. But even good media explanations and excellent RCAs tend to be misunderstood by readers — and even by smart IT people. This leads, in turn, to misunderstanding why cloud services fail and what the real risks are.
I recently completed my writing on a note about HA and failover (DR) patterns in cloud IaaS and PaaS, with a light touch on application design patterns for resilience. However, concerns about cloud resilience applies just as much — if not more so — to SaaS, especially API SaaS, which creates complicated and deep webs of dependencies.
You can buy T-shirts, stickers, and all manner of swag that says, “The cloud is just somebody else’s computer.” Cute slogan, but not true. Cloud services — especially at massive scale — are incredibly complex software systems. Complex software systems don’t fail the way a “computer” fails. The cloud exemplifies the failure principles laid out by Richard Cook in his classic “How Complex Systems Fail“.
As humans, we are really bad at figuring out the risk of complex systems, especially because the good ones are heavily defended against failure. And we tend to over-index on rare but dramatic risks (a plane crash) versus more commonplace risks (a car crash).
If you think about “my application hosted on AWS” as “well, it’s just sitting on a server in an AWS data center rather than mine”, then at some point in time, the nature of a failure is going to shock you, because you are wrong.
Cloud services fail after all of the resiliency mechanisms have failed (or sometimes, gone wrong in ways that contribute to the failure). Cloud services tend to go boom because of one or more software bugs, likely combined with either a configuration error or some kind of human error (often related to the deployment process for new configs and software versions). They are only rarely related to a physical failure — and generally the physical failure only became apparent to customers because the software intended to provide resilience against it failed in some fashion.
Far too many customers still think about cloud failure as a simple, fundamentally physical thing. Servers fail, so we should use more than one. Data centers fail, so we should be able to DR into another. Etc. But that model is wrong for cloud and for the digital age. We want to strive for continuous availability and resilience (including graceful degradation and other ways to continue business functionality when the application fails). And we have to plan for individual services failures rather than total cloud failure (whether in an AZ, region, or globally). Such failures can be small-scale, and effectively merely “instability”, rather than an “outage” — and therefore demands apps that are resilient to service errors.
So as cloud buyers, we have to think about our risks differently, and we need to architect and operate differently. But we also need to trust our providers — and trust smartly. To that end, cloud providers need to support us with transparency, so we can make more informed decisions. Key elements of that include:
- Publicly-documented engineering service-level objectives (SLOs), which are usually distinct from the financially-backed SLAs. This is what cloud providers design to internally and measure themselves against, and knowing that helps inform our own designs and internal SLOs for our apps.
- Service architecture documentation that helps us understand the ways a service is and isn’t resilient, so we can design accordingly.
- Documented service dependency maps, which allow us to see the chain of dependencies for each of the services we use, allowing us to think about if Service X is really the best fallback alternative if Service Y goes down, as well as inform our troubleshooting.
- Public status dashboards, clearly indicating the status of services, with solid historical data that allows us to see the track record of service operations. This helps with our troubleshooting and user communication.
- Public outage root-cause analysis (RCA), which allow us to understand why outages occurred, and receive a public pledge as to what will be done to prevent similar failures in the future. A historical archive of these is also a valuable resource.
- Change transparency that could help predict stability concerns. Because so many outages end up being related to new deployments / config changes, and the use of SRE principles, including error budgets, is pretty pervasive amongst cloud providers, there is often an interesting pattern to outages. Changes tend to freeze when the error budget is exceeded, leading to an on-and-off pattern of outages; instability can resume at intervals unpredictable to the customer.
Mission-critical cloud applications are becoming commonplace — both in the pervasive use of SaaS, along with widespread production use of IaaS and PaaS. It’s past time to modernize thinking about cloud operations, cloud resilience, and cloud BC/DR. Cloud risk management needs to be about intelligent mitigation and not avoidance, as forward-thinking businesses are will not accept simply avoiding the cloud at this point.
I am interested in your experiences with resilience as well as cloud instability and outages. Feel free to DM me on Twitter to chat about it.
An analyst day in the life: Plague Year edition
I was chatting with someone in vendor analyst relations the other day, who was jokingly asking me about a “day in the life” glimpse at what I do. I thought that might make an entertaining post from pandemic-land. So this is what that day looked like.
8:30 am: Woken up by munchkin, who wants a snuggle, presumably to generate enough cuteness to have videos unlocked on the iPad. (It’s spring break at his preschool, but my husband and I are working, and therefore anything that buys silence is worthwhile.) Get up, get dressed, view the munchkin’s “Dino Center” which has been erected outside the master bedroom door as a set of plastic dinosaur landmines waiting to be stepped on by unsuspecting parents.
8:45 am: Starting-the-day tasks, which occupy all the time until my first meeting of the day.
- Work email: Respond to inquiry-routing requests. Contribute to various research community discussions, including consensus on the recent Azure Active Directory outage. Answer various other internal questions. Do some back-and-forth with vendors about research projects I’m working on.
- Catch up on Teams messages: Interesting stuff on what’s been seen the previous day on cloud contracts, cross-team discussion on the right HA/DR patterns in cloud architectures, miscellaneous salespeople wanting help with a client.
- Prep for the day’s client calls: Read request, supporting docs, company background, past inquiry history, etc.
- Personal stuff: Look at personal email, glance at Facebook, check in on social Slack, do daily tasks on an iPad RPG game (gotta hand it to those guys for ensuring daily addictive habits are maintained).
10 am: Weekly one-hour team meeting. Various administrative matters. Eat breakfast on call.
11 am: One-hour vendor briefing on new product release.
12 pm: 30 minutes of “free” time, which will be my only respite until the end of the work day. Order lunch. Answer email/Teams, check in on Twitter. Work on responding to peer review comments on a doc.
12:30 pm – 4 pm: Back-to-back inquiries with 6 different clients. This runs from everything from a 30-minute chat with the CTO of a major global outsourcer (who wants to know about the latest trends impacting cloud adoption) to spending an hour with a cloud architect who’s dealing with the merger of two companies (one of whom is all-in AWS and the other is all-in Azure) and needs to untangle wide swathe of multicloud issues. While on the phone:
- I feed lunch to the munchkin (who eats his fries, but none of his grilled cheese sandwich, it will turn out later), and get a few bites of lunch myself.
- Cope with munchkin bombarding me with drawings of invented “Pokemon” with increasingly weird names (and evolutions that look like they belong in Pacific Rim rather than cute anime), done in the style of his dog-eared Pokedex encyclopedia, complete with type, region, and diet.
- Navigate minor crisis created by munchkin’s uncertainty over how to spell “region” (thank you, mute button).
4 pm: 30-minute break so I can drag munchkin onto a Zoom call for his Suzuki violin lesson. Bribe him to cooperate with the promise of one forkful of pie if he also stays quiet until I’m done with calls for the day. Yay, bio break.
4:30 pm: Another inquiry.
5 pm: Talk to reporter in-depth regarding breaking news plus a longer-form article they’re working on.
5:30 pm: Check back in on Teams. Discuss breaking news, load-testing in the cloud, pricing comparison between cloud providers, data protectionism, the role-change of a colleague, and other miscellany.
6 pm: Faceplant. (Count for the day: Seven inquiries, plus three other meetings.) Husband works from the master bedroom, so there is background grumbling at his code. Munchkin reminds me that I still owe him pie.
7 pm: Order dinner. Promise munchkin he can also have ice cream if he cooperates with violin practice. Painstakingly teach munchkin to memorize exactly two lines of a new piece of music. Reward munchkin with precisely one bite of pie. Read various violin-related things for myself.
8 pm: Family time. Which is:
- Dinner (and ice cream). Insist munchkin consume some protein.
- Cleanup. Munchkin protests mightily at being told to clean up the floor, which is absolutely blanketed in faux-Pokemon drawings (sometimes cut out and taped to popsicle sticks to make “puppets”). Floor largely remains Study in Deconstructed Japanese Monsters.
- Bedtime negotiation. Munchkin points out that since he went to bed properly the previous night (rather than staying up reading surreptitiously in his closet out of the view of the camera), he has earned the fifth reward-chart star necessary to get the next book in the “Bad Guys” series, and that therefore I must order it from Amazon (despite the abject failure to actually get the reward system to produce a good habit). This is book #10 in the series; I admire the ability of children’s chapter book authors to turn out infinite content. Also, while he’s at it, he wants books on calligraphy and making “realistic” drawings rather than cartoons.
- I place the next of what has been a seemingly unending stream of Amazon orders for books and household supplies. This reminds me of doing a series of executive dinners all over the US years ago, where the icebreaker was “tell us the last thing you ordered from Amazon”. That turned out to be an unexpectedly intriguing glimpse into lives in different parts of the US, especially differing gender-role expectations.
- Ordinarily this would be (virtual Zoom) Scottish fiddle jam night for me, but a shoulder injury has left me largely unable to play, so PT exercises instead.
9:30 pm: Post-Munchkin bedtime, where the adults stare wearily at the TV and then do more work. So:
- Episode of The Good Doctor while multi-screening other stuff. Match-3 games on the iPad. Facebook, other social media, and reading. Chat on LinkedIn with someone interested in working at Gartner.
- Respond to more work email. Work on responding to Editing’s edits of a doc in the publication process. Do paperwork for previous day’s inquiries; send followup emails and research to clients. Prep for next day’s client calls.
- Write a letter to the board of directors for the community orchestra where I’m the concertmaster, inquiring as to post-pandemic plans.
- Write this blog post.
1:30 am: Bedtime.
Are B2B cloud service agreements safe?
I’m seeing various bits of angst around “Is it safe to use cloud services, if my business can be suspended or terminated at any time?” and I thought I’d take some time to explain how cloud providers (and other Internet ecosystem providers, collectively “service providers” [SPs] in this blog post) enforce their Terms of Service (ToS) and Acceptable Use Policies (AUPs).
The TL;DR: Service providers like money, and will strive to avoid terminating customers over policy violations. However, providers do routinely (and sometimes automatically) enforce these policies, although they vary in how much grace and assistance they offer with issues. You don’t have to be a “bad guy” to occasionally run afoul of the policies. But if your business is permanently unwilling or unable to comply with a particular provider’s policies, you cannot use that provider.
AUP enforcement actions are rarely publicized. The information in this post is drawn from personal experience running an ISP abuse department; 20 years of reviewing multiple ISP, hosting, CDN, and cloud IaaS contracts on a daily basis; many years of dialogue with Gartner clients about their experiences with these policies; and conversations with service providers about these topics. Note that I am not a lawyer, and this post is not legal advice. I would encourage you to work with your corporate counsel to understand service provider contract language and its implications for your business, whenever you contract for any form of digital service, whether cloud or noncloud.
The information in this post is intended to be factual; it is not advice and there is a minimum of opinion. Gartner clients interested in understanding how to negotiate terms of service with cloud providers are encouraged to consult our advice for negotiating with Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), or with SaaS providers. My colleagues will cheerfully review your contracts and provide tailored advice in the context of client inquiry.
Click-thrus, negotiated contracts, ToS, and AUPs
Business-to-business (B2B) service provider agreements have taken two different forms for more than 20 years. There are “click-through agreements” (CTAs) that present you with a online contract that you click to sign; consequently, they are as-is, “take it or leave it” legal documents that tend to favor the provider in terms of business risk mitigation. Then there are negotiated contracts — “enterprise agreements” (EAs) that typically begin with more generous terms and conditions (T&Cs) that better balance the interests of the customer and the provider. EAs are typically negotiated between the provider’s lawyers and the customer’s procurement (sourcing & vendor management) team, as well as the customer’s lawyers (“counsel”).
Some service providers operate exclusively on either CTAs or EAs. But most cloud providers offer both. Not all customers may be eligible to sign an EA; that’s a business decision. Providers may set a minimum account size, minimum spend, minimum creditworthiness, etc., and these thresholds may be different in different countries. Providers are under no obligation to either publicize the circumstances under which an EA is offered, or to offer an EA to a particular customer (or prospective customer).
While in general, EAs would logically be negotiated by all customers who can qualify, providers do not necessarily proactively offer EAs. Furthermore, some companies — especially startups without cloud-knowledgeable sourcing managers — aren’t aware of the existence of EAs and therefore don’t pursue them. And many businesses that are new to cloud services don’t initially negotiate an EA, or take months to do that negotiation, operating on a CTA in the meantime. Therefore, there are certainly businesses that spend a lot of money with a provider, yet only have a CTA.
Terms of service are typically baked directly into both CTAs and EAs — they are one element of the T&Cs. As a result, a business on an EA benefits both from the greater “default” generosity of the EA’s T&Cs over the provider’s CTA (if the provider offers both), as well as whatever clauses they’ve been able to negotiate in their favor. In general, the bigger the deal, the more leverage the customer has to negotiate favorable T&Cs, which may include greater “cure time” for contract breaches, greater time to pay the bill, more notice of service changes, etc.
AUPs, on the other hand, are separate documents incorporated by reference into the T&Cs. They are usually a superset or expansion/clarification of the things mentioned directly in the ToS. For instance, the ToS may say “no illegal activity allowed”, and the AUP will give examples of prohibited activities (important since what is legal varies by country). AUPs routinely restrict valid use, even if such use is entirely legal. Service providers usually stipulate that an AUP can change with no notice (which essentially allows a provider to respond rapidly to a change in the regulatory or threat environment).
Unlike the EA T&Cs, an AUP is non-negotiable. However, an EA can be negotiated to clarify an AUP interpretation, especially if the customer is in a “grey area” that might be covered by the AUP even if the use is totally legitimate (i.e. a security vendor that performs penetration testing on other businesses at their request, may nevertheless ask for an explicit EA statement that such testing doesn’t violate the AUP).
Prospective customers of a service provider can’t safely make assumptions about the AUP intent. For example, some providers might not exclude even a fully white-hat pen-testing security vendor from the relevant portion of the AUP. Some providers with a gambling-excluding AUP may not choose to do business with an organization that has, for instance, anything to do with gambling, even if that gambling is not online (which can get into grey areas like, “Is running a state lottery a form of gambling?”). Some providers operating data centers in countries without full freedom of the press may be obliged to enforce restrictions on what content a media company can host in those regions. Anyone who could conceivably violate the AUP as part of the routine course of business would therefore want to gain clarity on interpretation up front — and get it in writing in an EA.
What does AUP enforcement look like?
If you’re not familiar with AUPs or why they exist and must be enforced, I encourage you to read my post “Terms of Service: from anti-spam to content takedown” first.
AUP enforcement is generally handled by a “fraud and abuse” department within a service provider, although in recent years, some service providers have adopted friendlier names, like “trust and safety”. When an enforcement action is taken, the customer is typically given a clear statement of what the violation is, any actions taken (or that will be taken within X amount of time if the violation isn’t fixed), and how to contact the provider regarding this issue. There is normally no ambiguity, although less technically-savvy customers can sometimes have difficulties understanding why what they did wrong — and in the case of automatic enforcement actions, the customer may be entirely puzzled by what they did to trigger this.
There is almost always a split in the way that enforcement is handled for customers on a CTA, vs customers on an EA. Because customers on a CTA undergo zero or minimal verification, there is no presumption that those customers are legit good actors. Indeed, some providers may assume, until proven otherwise, that such customers exist specifically to perpetuate fraud and/or abuse. Customers on an EA have effectively gone through more vetting — the account team has probably done the homework to figure out likely revenue opportunity, business model and drivers for the sale, etc. — and they also have better T&Cs, so they get the benefit of the doubt.
Consequently, CTA customers are often subject to more stringent policies and much harsher, immediate enforcement of those policies. Immediate suspension or termination is certainly possible, with relatively minimal communication. (To take a public GCP example: GCP would terminate without means to protest as recently as 2018, though that has changed. Its suspension guidelines and CTA restrictions offer clear statements of swift and significantly automated enforcement, including prevention of cryptocurrency mining for CTA customers who aren’t on invoicing, even though it’s perfectly legal.) The watchword for the cloud providers is “business risk management” when it comes to CTA customers.
Customers that are on a CTA but are spending a lot of money — and seem to be legit businesses with an established history on the platform — generally get a little more latitude in enforcement. (And if enforcement is automated, there may be a sliding threshold for automated actions based on spend history.) Similarly, customers on a CTA but who are actively negotiating an EA or engaged in the enterprise sales process may get more latitude.
Often-contrary to the handling of CTA customers, providers usually assume an EA customer who has breached the AUP has done so unintentionally. (For instance, one of the customer’s salespeople may have sent spam, or a customer VM may have been compromised and is now being used as part of a botnet.) Consequently, the provider tends to believe that what the customer needs is notification that something is wrong, education on why it’s problematic, and help in addressing the issue. EA customers are often completely spared from any automated form of policy enforcement. While business risk management is still a factor for the service provider, this is often couched politely as helping the customer avoid risk exposure for the customer’s own business.
Providers do, however, generally firmly hold the line on “the customer needs to deal with the problem”. For instance, I’ve encountered cloud customers who have said to me, “Well, my security breach isn’t so bad, and I don’t have time/resources to address this compromised VM, so I’d like more than 30 days grace to do so, how do I make my provider agree?” when the service provider has reasonably taken the stance that the breach potentially endangers others, and mandated that the customer promptly (immediately) address the breach. In many cases, the provider will offer technical assistance if necessary. Service providers vary in their response to this sort of recalcitrance and the extent of their enforcement actions. For instance, AWS normally takes actions against the narrowest feasible scope — i.e. against only the infrastructure involved in the policy violation. Broadly, cloud providers don’t punish customers for violations, but customers must do something about such violations.
Some providers have some form of variant of a “three strikes” policy, or escalating enforcement. For instance, if a customer has repeated issues — for example, it seems unable implement effective anti-spam compliance for itself, or it constantly fails to maintain effective security in a way that could impact other customers or the cloud provider’s services, or it can’t effectively moderate content it offers to the public, or it can’t prevent its employees from distributing illegally copied music using corporate cloud resources — then repeated warnings and limited enforcement actions can turn into suspensions or termination. Thus, even EA customers are essentially obliged treat every policy violation as something that they need to strive to ensure will not recur in the future. Resolution of a given violation is not evidence that the customer is in effective compliance with the agreement.
It’s not unusual for entirely legitimate, well-intentioned businesses to breach the ToS or AUP, but this is normally rare; a business might do this once or twice over the course of many years. New cloud customers on a CTA may also innocently exhibit behaviors that trigger automated enforcement actions that use algorithms to look for usage patterns that may be indicative of fraud or abuse. Service providers will take enforcement actions based on the customer history, the contractual agreement, and other business-risk and customer-relationship factors.
Intent matters. Accidental breaches are likely to be treated with a great deal more kindness. If breaches recur, though, the provider is likely to want to see evidence that the business has an effective plan for preventing further such issues. Even if the customer is willing to absorb the technical, legal, or business risks associated with a violation, the service provider is likely to insist that the issue be addressed — to protect their own services, their own customers, and for the customer’s own good.
(Update: Gartner clients, I have published a research note: “What is the risk of actually losing your cloud provider?“)
Terms of Service: From anti-spam to content takedown
Many consumers are familiar with the terms of service (ToS) that govern their use of consumer platforms that contain user-generated content (UGC), such as Facebook, Twitter, and YouTube. But many people are less familiar with the terms of service and acceptable use policy (AUP) that governs the relationships between businesses and their service providers.
In light of the recent decision undertaken by Amazon Web Services (AWS) to suspend service to Parler, a Twitter-like social network that was used to plan the January 6th insurrection at the US Capitol, there have been numerous falsehoods circulating on social media that seem related to a lack of understanding of service provider behavior, ToS and AUPs: claims of “free speech” violations, a coordinated conspiracy between big tech companies, and the like. This post is intended to examine, without judgment, how service providers — cloud, hosting and colo, CDN and other Internet infrastructure providers, and the ISPs that connect everyone to the Internet — came to a place of industry “standards” for behavior, and how enforcement is handled in B2B situations.
The TL;DR summary: The global service provider community, as a result of anti-spam efforts dating back to the mid-90s, enforces extremely similar policies governing content, including user-generated content, through a combination of B2B peer pressure and contractual obligations. Business customers who contravene these norms have very few options.
These norms will greatly limit Parler’s options for a new home. Many sites with far-right and similarly controversial content have ultimately ended up using a provider in a supply chain that relies on Russian connectivity, thus dodging the Internet norms that prevail in the rest of the world.
Internet Architecture and Service Provider Dependencies
While the Internet is a collection of loosely federated networks that are in theory independent from one another, it is also an interdependent web of interconnections between those networks. There are two ways that ISPs connect with one another — through “settlement-free peering” (essentially an exchange of traffic between two ISPs that view themselves as equals) and through the purchase of “transit” (making a “downstream ISP” the customer of an “upstream ISP”).
This results in a three-tier model for ISPs. The Tier 1 ISPs are big global carriers of network connectivity — companies like AT&T, BT and NTT — who have settlement-free peers with each other, and sell transit to smaller ISPs. Tier 2 ISPs are usually regional, and have settlement-free peers with others in and around their region, but also are reliant on transit from the Tier 1s. Tier 3 ISPs are entirely dependent on purchasing transit. ISPs at all three tiers also sell connectivity directly to businesses and/or consumers.
In practice, this means that ISPs are generally contractually bound to other ISPs. All transit contracts are governed by terms of service that normally incorporate, by reference, an AUP. Even settlement-free peering agreements are legal contracts, which normally includes the mutual agreement to maintain and enforce some form of AUP. (In the earlier days of the Internet, peering was done on a handshake, but anything of that sort is basically a legacy that can come to an abrupt end should one party suddenly decide to behave badly.)
AUP documents are interesting because they are deliberately created as living documents, allowing AUPs to be adapted to changing circumstances — unlike standard contract terms, which apply for the length of what is usually a multiyear contract. AUPs are also normally ironclad; it’s usually difficult to impossible for a business to get any form of AUP exemption written into their contract. Most contracts provide minimal or no notice for AUP changes. Businesses tend to simply agree to them because most businesses do not plan to engage in the kind of behavior that violates an AUP — and because they don’t have much choice.
The existence of ISP tiering means that upstream providers have significant influence over the behavior of their downstream. Upstream ISPs normally mandate that their downstream ISPs — and other service providers that use their connectivity, like hosting providers — enforce an AUP that enables the downstream provider to be compliant with the upstream’s terms of service. Downstream providers that fail to do so can have their connectivity temporary suspended or their contract terminated. And between the Tier 1 providers, peer pressure ensures a common global understanding and enforcement of acceptable behavior on the Internet.
Note that this has all occurred in the absence of regulation. ISPs have come to these arrangements through decisions about what’s good for their individual businesses first and foremost, with the general agreement that these community standards for AUPs are good for the community of network operators as a whole.
We’re Here Because Nobody Likes Spammers
So how did we arrive at this state in the first place?
In the mid-90s, as the Internet was growing rapidly, in the near-total absence of regulation, spam was a growing problem. Spam came from both legitimate businesses who simply weren’t aware of or didn’t especially care about Internet etiquette, as well as commercial spammers (bad actors with deceptive or fraudulent ads, and/or illegal/grey-market products).
Many B2B ISPs did not feel that it was necessarily their responsibility to intervene, despite general distaste for spammers — and, sometimes, a flood of consumer complaints. Some percentage of spammers were otherwise “good customers” — i.e. they paid their bills on time and bought a lot of bandwidth. Many more, however, obtained services under fraudulent pretenses, didn’t pay their bills, or tended not to pay on time.
Gradually, the community of network operators came to a common understanding that spammers were generally bad for business, whether they were your own customers, or whether they were the customers of, say, a web hosting company that you provided Internet connectivity for.
This resulted in upstream ISPs exerting pressure on downstream ISPs. Downstream ISPs, in turn, exerted pressure on their customers — kicking spammers off their networks and pushing hosters to kick spammers out of hosting environments. ISPs formalized AUPs. AUP enforcement took longer. Many ISPs were initially pretty shoddy and inconsistent in their enforcement — either because they needed the revenue they were getting from spammers, or due to unwillingness or inability to fund a staff to deal with abuse, or corporate lawyers who urged caution. It took years, but ISPs eventually arrived at AUPs that were contractually enforceable, processes for handling complaints, and relatively consistent enforcement. Legislation like the CAN-SPAM act in the US didn’t hurt, but by the time CAN-SPAM was passed (in 2003), ISPs had already arrived at a fairly successful commercial resolution to the problem.
Because anti-spam efforts were largely fueled by agreements enshrined in B2B contracts, and not in government regulation, there was never full consistency across the industry. Different ISPs created different AUPs — some stricter and some looser. Different ISPs wrote different terms of service into their contracts, with different “cure” periods (a period of time that a party in the contract is given to come into compliance with a contractual requirement). Different ISPs had different attitudes towards balancing “customer service” versus their responsibilities to their upstream providers and to the broader community of network operators.
Consequently, there’s nothing that says “We need to receive X number of spam complaints before we take action,” for instance. Some providers may have internal process standards for this. A lot of enforcement simply takes place via automated algorithms; i.e. if a certain threshold of users reports something as spam, enforcement actions take place. Providers effectively establish, through peer norms, what constitutes “effective” enforcement in accordance with terms of service obligations. Providers don’t need to threaten each other with network disconnection, because a norm has been established. But the implicit threat — and the contractual teeth behind that threat — always remains.
Nobody really likes terminating customers. So there are often fairly long cure periods, recognizing that it can take a while for a customer to properly comply with an AUP. In the suspension letter that AWS sent Parler, AWS cites communications “over the past several weeks”. Usually the providers look for their customers to demonstrate good-faith efforts, but may take suspension or termination efforts if it does not look like a good-faith effort to comply is being made, or if it appears that the effort, no matter how seemingly earnest, does not seem likely to bring compliance within a reasonable time period. 30 days is a common timeframe specified as a cure period in contracts (and is the cure period in the AWS standard Enterprise Agreement), but cloud provider click-through agreements (such as the AWS Customer Agreement) do not normally have a cure period, allowing immediate action to be taken at the provider’s discretion.
What Does This Have to Do With Policing Users on Social Media?
When providers established anti-spam AUPs, they also added a whole laundry list of offenses beyond spamming. Think of that list, “Everything a good corporate lawyer thought an ISP might ever want to terminate a customer for doing.” Illegal behavior, harassment, behavior that disrupts provider operations, behavior that threatens the safety/security/operations of other businesses, etc. are all prohibited.
Hosting companies — eventually followed by cloud providers like AWS, Microsoft Azure, and Google Cloud Platform, as well as companies that hold key roles in the Internet ecosystem (domain registrars and the companies that operate the DNS; content delivery networks like Akamai and Cloudflare, etc.) — were essentially obliged to incorporate their upstream ISP usage policies into their own terms of service and AUPs, and to enforce those policies on their users if they wanted to stay connected to the Internet. Some such providers have also explicitly chosen not to sell to customers in certain business segments — for instance, no gambling, or no pornography, even if the business is fully legitimate and legal (for instance, like MGM Resorts or Playboy) — through limiting what’s allowed in their terms of service. An AUP may restrict activities that are perfectly legal in a given jurisdiction.
Even extremely large tech companies that have their own data centers, like Facebook and Apple, are ultimately beholden to ISPs. (Google is something of an odd case because in addition to owning their own data centers, they are one of the largest global network operators. Google owns extensive fiber routes and peers with Tier 1 ISPs as an equal.) And even though AWS has, to some degree, a network of its own, it is effectively a Tier 2 ISP, making it beholden to the AUPs of its upstream. Other cloud providers are typically mostly or fully transit-dependent, and are thus entirely beholden to their upstream.
In short: Everyone who hosts content companies, and the content companies themselves, is essentially trapped, by the chain of AUP obligations, to policing content to ensure that it is not illegal, harassing, or otherwise seen as commercially problematic.
You have to go outside the normal Internet supply chain — for instance, to the Russian service providers — before you escape the commercial arrangements that bound notions of good business behavior on the Internet. It doesn’t matter what a provider’s philosophical alignment is. Commercially, they simply can’t really push back on the established order. And because these agreements are global, regulation at a single-country level can’t really force these agreements to be significantly more or less restrictive, because of the globalized nature of peering/transit; providers generally interconnect in multiple countries.
It also means that these aren’t just “Silicon Valley” standards. These are global norms for behavior, which means they are not influenced solely by the relatively laissez-faire content standards of the United States, but by the more stringent European and APAC environments.
It’s an interesting result of what happens when businesses police themselves. Even without formal industry-association “rules” or regulatory obligations, a fairly ironclad order can emerge that exerts extremely effective downstream pressure (as we saw in the cases of 8Chan and the Daily Stormer back in 2019).
Does being multicloud help with terms of service violations?
Some people will undoubtedly ask, “Would it have helped Parler to have been multicloud?” Parler has already said that they are merely bare-metal customers of AWS, reducing technical dependencies / improving portability. But their situation is such that they would almost certainly have had the exact same issue if they had been running on Microsoft Azure, Google Cloud Platform, or even Oracle Cloud Infrastructure as well (even though the three companies have top executives with political views spanning the spectrum). A multicloud strategy won’t help any business that violates AUP norms.
AWS and its cloud/hosting competitors are usually pretty generous when working with business customers that unintentionally violate their AUPs. But a business that chooses not to comply is unlikely to find itself welcome anywhere, which makes multicloud deployment largely useless as a defensive strategy.
The multi-headed hydra of cloud resilience
Clients have recently been asking a lot more questions about the comparative resilience of cloud providers.
Identity services are a particular point of concern (for instance, the Azure AD outage of October 1st and Google Cloud IAM outage of March 26th) since when identity is down, the customer can’t access the cloud provider’s control plane (and it may impact service use in general) — plus there’s generally no way for the customer to work around such issues.
The good news is, hyperscale cloud providers do a pretty good job of being robust. However, the risk of smaller, more hosting-like providers can be much higher — and there are notable differences between the hyperscalers, too.
Operations folks know: Everything breaks. Physical stuff fails, software is buggy, and people screw up (a lot). A provider can try its best to reduce the number of failures, limit the “blast radius” of a problem, limit the possibility of “cascading failures”, and find ways to mitigate the impact on users. But you can’t avoid failure entirely. Systems that are resilient recover quickly from failure.
If you chop off the head of a hydra, it grows back — quickly. We can think about five key factors — heads of the hydra — that influence the robustness, resilience, and observed (“real world”) availability of cloud services:
- Physical design: The design of physical things, such as the data center and the hardware used to deliver services.
- Logical (software) design: The design of non-physical things, especially software — all aspects of the service architecture that is not related to a physical element.
- Implementation quality: The robustness of the actual implementation, encompassing implementation skill, care and meticulousness, and the effectiveness of quality-assurance (QA) efforts.
- Deployment processes: The rollout of service changes is the single largest cause of operational failures in cloud services. The quality of these processes, the automation used in the processes, and the degree to which humans are given latitude to use good judgment (or poor judgment) thus have a material impact on availability.
- Operational processes: Other operational processes, such as monitoring, incident management — and, most importantly, problem management — impact the cloud provider’s ability to react quickly to problems, mitigate issues, and ensure that the root causes of incidents are addressed. Both proactive and reactive maintenance efforts can have an impact on availability.
A sixth factor, Transparency, isn’t directly related to keeping the hydra alive, but matters to customers as they plan for their own application architectures and risk management — contributing to customer resilience. Transparency includes making architectural information to customers, as well as delivering outage-related visibility and insight to customers. Customers need real-world info — like current and historical outage reports and the root-cause-analysis port-mortems that offer insight into what went wrong and why (and what the provider is doing about it).
When you think about cloud service resilience (or the resilience of your own systems), think about it in terms of those factors. Don’t think about it like you think about on-premises systems, where people often think primarily about hardware failures or a fire in the data center. Rather, you’re dealing with systems where software issues are almost always the root cause. Physical robustness still matters, but the other four factors are largely about software.