I was chatting with someone in vendor analyst relations the other day, who was jokingly asking me about a “day in the life” glimpse at what I do. I thought that might make an entertaining post from pandemic-land. So this is what that day looked like.
8:30 am: Woken up by munchkin, who wants a snuggle, presumably to generate enough cuteness to have videos unlocked on the iPad. (It’s spring break at his preschool, but my husband and I are working, and therefore anything that buys silence is worthwhile.) Get up, get dressed, view the munchkin’s “Dino Center” which has been erected outside the master bedroom door as a set of plastic dinosaur landmines waiting to be stepped on by unsuspecting parents.
8:45 am: Starting-the-day tasks, which occupy all the time until my first meeting of the day.
- Work email: Respond to inquiry-routing requests. Contribute to various research community discussions, including consensus on the recent Azure Active Directory outage. Answer various other internal questions. Do some back-and-forth with vendors about research projects I’m working on.
- Catch up on Teams messages: Interesting stuff on what’s been seen the previous day on cloud contracts, cross-team discussion on the right HA/DR patterns in cloud architectures, miscellaneous salespeople wanting help with a client.
- Prep for the day’s client calls: Read request, supporting docs, company background, past inquiry history, etc.
- Personal stuff: Look at personal email, glance at Facebook, check in on social Slack, do daily tasks on an iPad RPG game (gotta hand it to those guys for ensuring daily addictive habits are maintained).
10 am: Weekly one-hour team meeting. Various administrative matters. Eat breakfast on call.
11 am: One-hour vendor briefing on new product release.
12 pm: 30 minutes of “free” time, which will be my only respite until the end of the work day. Order lunch. Answer email/Teams, check in on Twitter. Work on responding to peer review comments on a doc.
12:30 pm – 4 pm: Back-to-back inquiries with 6 different clients. This runs from everything from a 30-minute chat with the CTO of a major global outsourcer (who wants to know about the latest trends impacting cloud adoption) to spending an hour with a cloud architect who’s dealing with the merger of two companies (one of whom is all-in AWS and the other is all-in Azure) and needs to untangle wide swathe of multicloud issues. While on the phone:
- I feed lunch to the munchkin (who eats his fries, but none of his grilled cheese sandwich, it will turn out later), and get a few bites of lunch myself.
- Cope with munchkin bombarding me with drawings of invented “Pokemon” with increasingly weird names (and evolutions that look like they belong in Pacific Rim rather than cute anime), done in the style of his dog-eared Pokedex encyclopedia, complete with type, region, and diet.
- Navigate minor crisis created by munchkin’s uncertainty over how to spell “region” (thank you, mute button).
4 pm: 30-minute break so I can drag munchkin onto a Zoom call for his Suzuki violin lesson. Bribe him to cooperate with the promise of one forkful of pie if he also stays quiet until I’m done with calls for the day. Yay, bio break.
4:30 pm: Another inquiry.
5 pm: Talk to reporter in-depth regarding breaking news plus a longer-form article they’re working on.
5:30 pm: Check back in on Teams. Discuss breaking news, load-testing in the cloud, pricing comparison between cloud providers, data protectionism, the role-change of a colleague, and other miscellany.
6 pm: Faceplant. (Count for the day: Seven inquiries, plus three other meetings.) Husband works from the master bedroom, so there is background grumbling at his code. Munchkin reminds me that I still owe him pie.
7 pm: Order dinner. Promise munchkin he can also have ice cream if he cooperates with violin practice. Painstakingly teach munchkin to memorize exactly two lines of a new piece of music. Reward munchkin with precisely one bite of pie. Read various violin-related things for myself.
8 pm: Family time. Which is:
- Dinner (and ice cream). Insist munchkin consume some protein.
- Cleanup. Munchkin protests mightily at being told to clean up the floor, which is absolutely blanketed in faux-Pokemon drawings (sometimes cut out and taped to popsicle sticks to make “puppets”). Floor largely remains Study in Deconstructed Japanese Monsters.
- Bedtime negotiation. Munchkin points out that since he went to bed properly the previous night (rather than staying up reading surreptitiously in his closet out of the view of the camera), he has earned the fifth reward-chart star necessary to get the next book in the “Bad Guys” series, and that therefore I must order it from Amazon (despite the abject failure to actually get the reward system to produce a good habit). This is book #10 in the series; I admire the ability of children’s chapter book authors to turn out infinite content. Also, while he’s at it, he wants books on calligraphy and making “realistic” drawings rather than cartoons.
- I place the next of what has been a seemingly unending stream of Amazon orders for books and household supplies. This reminds me of doing a series of executive dinners all over the US years ago, where the icebreaker was “tell us the last thing you ordered from Amazon”. That turned out to be an unexpectedly intriguing glimpse into lives in different parts of the US, especially differing gender-role expectations.
- Ordinarily this would be (virtual Zoom) Scottish fiddle jam night for me, but a shoulder injury has left me largely unable to play, so PT exercises instead.
9:30 pm: Post-Munchkin bedtime, where the adults stare wearily at the TV and then do more work. So:
- Episode of The Good Doctor while multi-screening other stuff. Match-3 games on the iPad. Facebook, other social media, and reading. Chat on LinkedIn with someone interested in working at Gartner.
- Respond to more work email. Work on responding to Editing’s edits of a doc in the publication process. Do paperwork for previous day’s inquiries; send followup emails and research to clients. Prep for next day’s client calls.
- Write a letter to the board of directors for the community orchestra where I’m the concertmaster, inquiring as to post-pandemic plans.
- Write this blog post.
1:30 am: Bedtime.
I’m seeing various bits of angst around “Is it safe to use cloud services, if my business can be suspended or terminated at any time?” and I thought I’d take some time to explain how cloud providers (and other Internet ecosystem providers, collectively “service providers” [SPs] in this blog post) enforce their Terms of Service (ToS) and Acceptable Use Policies (AUPs).
The TL;DR: Service providers like money, and will strive to avoid terminating customers over policy violations. However, providers do routinely (and sometimes automatically) enforce these policies, although they vary in how much grace and assistance they offer with issues. You don’t have to be a “bad guy” to occasionally run afoul of the policies. But if your business is permanently unwilling or unable to comply with a particular provider’s policies, you cannot use that provider.
AUP enforcement actions are rarely publicized. The information in this post is drawn from personal experience running an ISP abuse department; 20 years of reviewing multiple ISP, hosting, CDN, and cloud IaaS contracts on a daily basis; many years of dialogue with Gartner clients about their experiences with these policies; and conversations with service providers about these topics. Note that I am not a lawyer, and this post is not legal advice. I would encourage you to work with your corporate counsel to understand service provider contract language and its implications for your business, whenever you contract for any form of digital service, whether cloud or noncloud.
The information in this post is intended to be factual; it is not advice and there is a minimum of opinion. Gartner clients interested in understanding how to negotiate terms of service with cloud providers are encouraged to consult our advice for negotiating with Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), or with SaaS providers. My colleagues will cheerfully review your contracts and provide tailored advice in the context of client inquiry.
Click-thrus, negotiated contracts, ToS, and AUPs
Business-to-business (B2B) service provider agreements have taken two different forms for more than 20 years. There are “click-through agreements” (CTAs) that present you with a online contract that you click to sign; consequently, they are as-is, “take it or leave it” legal documents that tend to favor the provider in terms of business risk mitigation. Then there are negotiated contracts — “enterprise agreements” (EAs) that typically begin with more generous terms and conditions (T&Cs) that better balance the interests of the customer and the provider. EAs are typically negotiated between the provider’s lawyers and the customer’s procurement (sourcing & vendor management) team, as well as the customer’s lawyers (“counsel”).
Some service providers operate exclusively on either CTAs or EAs. But most cloud providers offer both. Not all customers may be eligible to sign an EA; that’s a business decision. Providers may set a minimum account size, minimum spend, minimum creditworthiness, etc., and these thresholds may be different in different countries. Providers are under no obligation to either publicize the circumstances under which an EA is offered, or to offer an EA to a particular customer (or prospective customer).
While in general, EAs would logically be negotiated by all customers who can qualify, providers do not necessarily proactively offer EAs. Furthermore, some companies — especially startups without cloud-knowledgeable sourcing managers — aren’t aware of the existence of EAs and therefore don’t pursue them. And many businesses that are new to cloud services don’t initially negotiate an EA, or take months to do that negotiation, operating on a CTA in the meantime. Therefore, there are certainly businesses that spend a lot of money with a provider, yet only have a CTA.
Terms of service are typically baked directly into both CTAs and EAs — they are one element of the T&Cs. As a result, a business on an EA benefits both from the greater “default” generosity of the EA’s T&Cs over the provider’s CTA (if the provider offers both), as well as whatever clauses they’ve been able to negotiate in their favor. In general, the bigger the deal, the more leverage the customer has to negotiate favorable T&Cs, which may include greater “cure time” for contract breaches, greater time to pay the bill, more notice of service changes, etc.
AUPs, on the other hand, are separate documents incorporated by reference into the T&Cs. They are usually a superset or expansion/clarification of the things mentioned directly in the ToS. For instance, the ToS may say “no illegal activity allowed”, and the AUP will give examples of prohibited activities (important since what is legal varies by country). AUPs routinely restrict valid use, even if such use is entirely legal. Service providers usually stipulate that an AUP can change with no notice (which essentially allows a provider to respond rapidly to a change in the regulatory or threat environment).
Unlike the EA T&Cs, an AUP is non-negotiable. However, an EA can be negotiated to clarify an AUP interpretation, especially if the customer is in a “grey area” that might be covered by the AUP even if the use is totally legitimate (i.e. a security vendor that performs penetration testing on other businesses at their request, may nevertheless ask for an explicit EA statement that such testing doesn’t violate the AUP).
Prospective customers of a service provider can’t safely make assumptions about the AUP intent. For example, some providers might not exclude even a fully white-hat pen-testing security vendor from the relevant portion of the AUP. Some providers with a gambling-excluding AUP may not choose to do business with an organization that has, for instance, anything to do with gambling, even if that gambling is not online (which can get into grey areas like, “Is running a state lottery a form of gambling?”). Some providers operating data centers in countries without full freedom of the press may be obliged to enforce restrictions on what content a media company can host in those regions. Anyone who could conceivably violate the AUP as part of the routine course of business would therefore want to gain clarity on interpretation up front — and get it in writing in an EA.
What does AUP enforcement look like?
If you’re not familiar with AUPs or why they exist and must be enforced, I encourage you to read my post “Terms of Service: from anti-spam to content takedown” first.
AUP enforcement is generally handled by a “fraud and abuse” department within a service provider, although in recent years, some service providers have adopted friendlier names, like “trust and safety”. When an enforcement action is taken, the customer is typically given a clear statement of what the violation is, any actions taken (or that will be taken within X amount of time if the violation isn’t fixed), and how to contact the provider regarding this issue. There is normally no ambiguity, although less technically-savvy customers can sometimes have difficulties understanding why what they did wrong — and in the case of automatic enforcement actions, the customer may be entirely puzzled by what they did to trigger this.
There is almost always a split in the way that enforcement is handled for customers on a CTA, vs customers on an EA. Because customers on a CTA undergo zero or minimal verification, there is no presumption that those customers are legit good actors. Indeed, some providers may assume, until proven otherwise, that such customers exist specifically to perpetuate fraud and/or abuse. Customers on an EA have effectively gone through more vetting — the account team has probably done the homework to figure out likely revenue opportunity, business model and drivers for the sale, etc. — and they also have better T&Cs, so they get the benefit of the doubt.
Consequently, CTA customers are often subject to more stringent policies and much harsher, immediate enforcement of those policies. Immediate suspension or termination is certainly possible, with relatively minimal communication. (To take a public GCP example: GCP would terminate without means to protest as recently as 2018, though that has changed. Its suspension guidelines and CTA restrictions offer clear statements of swift and significantly automated enforcement, including prevention of cryptocurrency mining for CTA customers who aren’t on invoicing, even though it’s perfectly legal.) The watchword for the cloud providers is “business risk management” when it comes to CTA customers.
Customers that are on a CTA but are spending a lot of money — and seem to be legit businesses with an established history on the platform — generally get a little more latitude in enforcement. (And if enforcement is automated, there may be a sliding threshold for automated actions based on spend history.) Similarly, customers on a CTA but who are actively negotiating an EA or engaged in the enterprise sales process may get more latitude.
Often-contrary to the handling of CTA customers, providers usually assume an EA customer who has breached the AUP has done so unintentionally. (For instance, one of the customer’s salespeople may have sent spam, or a customer VM may have been compromised and is now being used as part of a botnet.) Consequently, the provider tends to believe that what the customer needs is notification that something is wrong, education on why it’s problematic, and help in addressing the issue. EA customers are often completely spared from any automated form of policy enforcement. While business risk management is still a factor for the service provider, this is often couched politely as helping the customer avoid risk exposure for the customer’s own business.
Providers do, however, generally firmly hold the line on “the customer needs to deal with the problem”. For instance, I’ve encountered cloud customers who have said to me, “Well, my security breach isn’t so bad, and I don’t have time/resources to address this compromised VM, so I’d like more than 30 days grace to do so, how do I make my provider agree?” when the service provider has reasonably taken the stance that the breach potentially endangers others, and mandated that the customer promptly (immediately) address the breach. In many cases, the provider will offer technical assistance if necessary. Service providers vary in their response to this sort of recalcitrance and the extent of their enforcement actions. For instance, AWS normally takes actions against the narrowest feasible scope — i.e. against only the infrastructure involved in the policy violation. Broadly, cloud providers don’t punish customers for violations, but customers must do something about such violations.
Some providers have some form of variant of a “three strikes” policy, or escalating enforcement. For instance, if a customer has repeated issues — for example, it seems unable implement effective anti-spam compliance for itself, or it constantly fails to maintain effective security in a way that could impact other customers or the cloud provider’s services, or it can’t effectively moderate content it offers to the public, or it can’t prevent its employees from distributing illegally copied music using corporate cloud resources — then repeated warnings and limited enforcement actions can turn into suspensions or termination. Thus, even EA customers are essentially obliged treat every policy violation as something that they need to strive to ensure will not recur in the future. Resolution of a given violation is not evidence that the customer is in effective compliance with the agreement.
It’s not unusual for entirely legitimate, well-intentioned businesses to breach the ToS or AUP, but this is normally rare; a business might do this once or twice over the course of many years. New cloud customers on a CTA may also innocently exhibit behaviors that trigger automated enforcement actions that use algorithms to look for usage patterns that may be indicative of fraud or abuse. Service providers will take enforcement actions based on the customer history, the contractual agreement, and other business-risk and customer-relationship factors.
Intent matters. Accidental breaches are likely to be treated with a great deal more kindness. If breaches recur, though, the provider is likely to want to see evidence that the business has an effective plan for preventing further such issues. Even if the customer is willing to absorb the technical, legal, or business risks associated with a violation, the service provider is likely to insist that the issue be addressed — to protect their own services, their own customers, and for the customer’s own good.
(Update: Gartner clients, I have published a research note: “What is the risk of actually losing your cloud provider?“)
Many consumers are familiar with the terms of service (ToS) that govern their use of consumer platforms that contain user-generated content (UGC), such as Facebook, Twitter, and YouTube. But many people are less familiar with the terms of service and acceptable use policy (AUP) that governs the relationships between businesses and their service providers.
In light of the recent decision undertaken by Amazon Web Services (AWS) to suspend service to Parler, a Twitter-like social network that was used to plan the January 6th insurrection at the US Capitol, there have been numerous falsehoods circulating on social media that seem related to a lack of understanding of service provider behavior, ToS and AUPs: claims of “free speech” violations, a coordinated conspiracy between big tech companies, and the like. This post is intended to examine, without judgment, how service providers — cloud, hosting and colo, CDN and other Internet infrastructure providers, and the ISPs that connect everyone to the Internet — came to a place of industry “standards” for behavior, and how enforcement is handled in B2B situations.
The TL;DR summary: The global service provider community, as a result of anti-spam efforts dating back to the mid-90s, enforces extremely similar policies governing content, including user-generated content, through a combination of B2B peer pressure and contractual obligations. Business customers who contravene these norms have very few options.
These norms will greatly limit Parler’s options for a new home. Many sites with far-right and similarly controversial content have ultimately ended up using a provider in a supply chain that relies on Russian connectivity, thus dodging the Internet norms that prevail in the rest of the world.
Internet Architecture and Service Provider Dependencies
While the Internet is a collection of loosely federated networks that are in theory independent from one another, it is also an interdependent web of interconnections between those networks. There are two ways that ISPs connect with one another — through “settlement-free peering” (essentially an exchange of traffic between two ISPs that view themselves as equals) and through the purchase of “transit” (making a “downstream ISP” the customer of an “upstream ISP”).
This results in a three-tier model for ISPs. The Tier 1 ISPs are big global carriers of network connectivity — companies like AT&T, BT and NTT — who have settlement-free peers with each other, and sell transit to smaller ISPs. Tier 2 ISPs are usually regional, and have settlement-free peers with others in and around their region, but also are reliant on transit from the Tier 1s. Tier 3 ISPs are entirely dependent on purchasing transit. ISPs at all three tiers also sell connectivity directly to businesses and/or consumers.
In practice, this means that ISPs are generally contractually bound to other ISPs. All transit contracts are governed by terms of service that normally incorporate, by reference, an AUP. Even settlement-free peering agreements are legal contracts, which normally includes the mutual agreement to maintain and enforce some form of AUP. (In the earlier days of the Internet, peering was done on a handshake, but anything of that sort is basically a legacy that can come to an abrupt end should one party suddenly decide to behave badly.)
AUP documents are interesting because they are deliberately created as living documents, allowing AUPs to be adapted to changing circumstances — unlike standard contract terms, which apply for the length of what is usually a multiyear contract. AUPs are also normally ironclad; it’s usually difficult to impossible for a business to get any form of AUP exemption written into their contract. Most contracts provide minimal or no notice for AUP changes. Businesses tend to simply agree to them because most businesses do not plan to engage in the kind of behavior that violates an AUP — and because they don’t have much choice.
The existence of ISP tiering means that upstream providers have significant influence over the behavior of their downstream. Upstream ISPs normally mandate that their downstream ISPs — and other service providers that use their connectivity, like hosting providers — enforce an AUP that enables the downstream provider to be compliant with the upstream’s terms of service. Downstream providers that fail to do so can have their connectivity temporary suspended or their contract terminated. And between the Tier 1 providers, peer pressure ensures a common global understanding and enforcement of acceptable behavior on the Internet.
Note that this has all occurred in the absence of regulation. ISPs have come to these arrangements through decisions about what’s good for their individual businesses first and foremost, with the general agreement that these community standards for AUPs are good for the community of network operators as a whole.
We’re Here Because Nobody Likes Spammers
So how did we arrive at this state in the first place?
In the mid-90s, as the Internet was growing rapidly, in the near-total absence of regulation, spam was a growing problem. Spam came from both legitimate businesses who simply weren’t aware of or didn’t especially care about Internet etiquette, as well as commercial spammers (bad actors with deceptive or fraudulent ads, and/or illegal/grey-market products).
Many B2B ISPs did not feel that it was necessarily their responsibility to intervene, despite general distaste for spammers — and, sometimes, a flood of consumer complaints. Some percentage of spammers were otherwise “good customers” — i.e. they paid their bills on time and bought a lot of bandwidth. Many more, however, obtained services under fraudulent pretenses, didn’t pay their bills, or tended not to pay on time.
Gradually, the community of network operators came to a common understanding that spammers were generally bad for business, whether they were your own customers, or whether they were the customers of, say, a web hosting company that you provided Internet connectivity for.
This resulted in upstream ISPs exerting pressure on downstream ISPs. Downstream ISPs, in turn, exerted pressure on their customers — kicking spammers off their networks and pushing hosters to kick spammers out of hosting environments. ISPs formalized AUPs. AUP enforcement took longer. Many ISPs were initially pretty shoddy and inconsistent in their enforcement — either because they needed the revenue they were getting from spammers, or due to unwillingness or inability to fund a staff to deal with abuse, or corporate lawyers who urged caution. It took years, but ISPs eventually arrived at AUPs that were contractually enforceable, processes for handling complaints, and relatively consistent enforcement. Legislation like the CAN-SPAM act in the US didn’t hurt, but by the time CAN-SPAM was passed (in 2003), ISPs had already arrived at a fairly successful commercial resolution to the problem.
Because anti-spam efforts were largely fueled by agreements enshrined in B2B contracts, and not in government regulation, there was never full consistency across the industry. Different ISPs created different AUPs — some stricter and some looser. Different ISPs wrote different terms of service into their contracts, with different “cure” periods (a period of time that a party in the contract is given to come into compliance with a contractual requirement). Different ISPs had different attitudes towards balancing “customer service” versus their responsibilities to their upstream providers and to the broader community of network operators.
Consequently, there’s nothing that says “We need to receive X number of spam complaints before we take action,” for instance. Some providers may have internal process standards for this. A lot of enforcement simply takes place via automated algorithms; i.e. if a certain threshold of users reports something as spam, enforcement actions take place. Providers effectively establish, through peer norms, what constitutes “effective” enforcement in accordance with terms of service obligations. Providers don’t need to threaten each other with network disconnection, because a norm has been established. But the implicit threat — and the contractual teeth behind that threat — always remains.
Nobody really likes terminating customers. So there are often fairly long cure periods, recognizing that it can take a while for a customer to properly comply with an AUP. In the suspension letter that AWS sent Parler, AWS cites communications “over the past several weeks”. Usually the providers look for their customers to demonstrate good-faith efforts, but may take suspension or termination efforts if it does not look like a good-faith effort to comply is being made, or if it appears that the effort, no matter how seemingly earnest, does not seem likely to bring compliance within a reasonable time period. 30 days is a common timeframe specified as a cure period in contracts (and is the cure period in the AWS standard Enterprise Agreement), but cloud provider click-through agreements (such as the AWS Customer Agreement) do not normally have a cure period, allowing immediate action to be taken at the provider’s discretion.
What Does This Have to Do With Policing Users on Social Media?
When providers established anti-spam AUPs, they also added a whole laundry list of offenses beyond spamming. Think of that list, “Everything a good corporate lawyer thought an ISP might ever want to terminate a customer for doing.” Illegal behavior, harassment, behavior that disrupts provider operations, behavior that threatens the safety/security/operations of other businesses, etc. are all prohibited.
Hosting companies — eventually followed by cloud providers like AWS, Microsoft Azure, and Google Cloud Platform, as well as companies that hold key roles in the Internet ecosystem (domain registrars and the companies that operate the DNS; content delivery networks like Akamai and Cloudflare, etc.) — were essentially obliged to incorporate their upstream ISP usage policies into their own terms of service and AUPs, and to enforce those policies on their users if they wanted to stay connected to the Internet. Some such providers have also explicitly chosen not to sell to customers in certain business segments — for instance, no gambling, or no pornography, even if the business is fully legitimate and legal (for instance, like MGM Resorts or Playboy) — through limiting what’s allowed in their terms of service. An AUP may restrict activities that are perfectly legal in a given jurisdiction.
Even extremely large tech companies that have their own data centers, like Facebook and Apple, are ultimately beholden to ISPs. (Google is something of an odd case because in addition to owning their own data centers, they are one of the largest global network operators. Google owns extensive fiber routes and peers with Tier 1 ISPs as an equal.) And even though AWS has, to some degree, a network of its own, it is effectively a Tier 2 ISP, making it beholden to the AUPs of its upstream. Other cloud providers are typically mostly or fully transit-dependent, and are thus entirely beholden to their upstream.
In short: Everyone who hosts content companies, and the content companies themselves, is essentially trapped, by the chain of AUP obligations, to policing content to ensure that it is not illegal, harassing, or otherwise seen as commercially problematic.
You have to go outside the normal Internet supply chain — for instance, to the Russian service providers — before you escape the commercial arrangements that bound notions of good business behavior on the Internet. It doesn’t matter what a provider’s philosophical alignment is. Commercially, they simply can’t really push back on the established order. And because these agreements are global, regulation at a single-country level can’t really force these agreements to be significantly more or less restrictive, because of the globalized nature of peering/transit; providers generally interconnect in multiple countries.
It also means that these aren’t just “Silicon Valley” standards. These are global norms for behavior, which means they are not influenced solely by the relatively laissez-faire content standards of the United States, but by the more stringent European and APAC environments.
It’s an interesting result of what happens when businesses police themselves. Even without formal industry-association “rules” or regulatory obligations, a fairly ironclad order can emerge that exerts extremely effective downstream pressure (as we saw in the cases of 8Chan and the Daily Stormer back in 2019).
Does being multicloud help with terms of service violations?
Some people will undoubtedly ask, “Would it have helped Parler to have been multicloud?” Parler has already said that they are merely bare-metal customers of AWS, reducing technical dependencies / improving portability. But their situation is such that they would almost certainly have had the exact same issue if they had been running on Microsoft Azure, Google Cloud Platform, or even Oracle Cloud Infrastructure as well (even though the three companies have top executives with political views spanning the spectrum). A multicloud strategy won’t help any business that violates AUP norms.
AWS and its cloud/hosting competitors are usually pretty generous when working with business customers that unintentionally violate their AUPs. But a business that chooses not to comply is unlikely to find itself welcome anywhere, which makes multicloud deployment largely useless as a defensive strategy.
Clients have recently been asking a lot more questions about the comparative resilience of cloud providers.
Identity services are a particular point of concern (for instance, the Azure AD outage of October 1st and Google Cloud IAM outage of March 26th) since when identity is down, the customer can’t access the cloud provider’s control plane (and it may impact service use in general) — plus there’s generally no way for the customer to work around such issues.
The good news is, hyperscale cloud providers do a pretty good job of being robust. However, the risk of smaller, more hosting-like providers can be much higher — and there are notable differences between the hyperscalers, too.
Operations folks know: Everything breaks. Physical stuff fails, software is buggy, and people screw up (a lot). A provider can try its best to reduce the number of failures, limit the “blast radius” of a problem, limit the possibility of “cascading failures”, and find ways to mitigate the impact on users. But you can’t avoid failure entirely. Systems that are resilient recover quickly from failure.
If you chop off the head of a hydra, it grows back — quickly. We can think about five key factors — heads of the hydra — that influence the robustness, resilience, and observed (“real world”) availability of cloud services:
- Physical design: The design of physical things, such as the data center and the hardware used to deliver services.
- Logical (software) design: The design of non-physical things, especially software — all aspects of the service architecture that is not related to a physical element.
- Implementation quality: The robustness of the actual implementation, encompassing implementation skill, care and meticulousness, and the effectiveness of quality-assurance (QA) efforts.
- Deployment processes: The rollout of service changes is the single largest cause of operational failures in cloud services. The quality of these processes, the automation used in the processes, and the degree to which humans are given latitude to use good judgment (or poor judgment) thus have a material impact on availability.
- Operational processes: Other operational processes, such as monitoring, incident management — and, most importantly, problem management — impact the cloud provider’s ability to react quickly to problems, mitigate issues, and ensure that the root causes of incidents are addressed. Both proactive and reactive maintenance efforts can have an impact on availability.
A sixth factor, Transparency, isn’t directly related to keeping the hydra alive, but matters to customers as they plan for their own application architectures and risk management — contributing to customer resilience. Transparency includes making architectural information to customers, as well as delivering outage-related visibility and insight to customers. Customers need real-world info — like current and historical outage reports and the root-cause-analysis port-mortems that offer insight into what went wrong and why (and what the provider is doing about it).
When you think about cloud service resilience (or the resilience of your own systems), think about it in terms of those factors. Don’t think about it like you think about on-premises systems, where people often think primarily about hardware failures or a fire in the data center. Rather, you’re dealing with systems where software issues are almost always the root cause. Physical robustness still matters, but the other four factors are largely about software.
“It is a lovely place, my house,” said the Queen. “I am sure you would like it. There are whole rooms full of Turkish Delight, and what’s more, I have no children of my own. I want a nice boy whom I could bring up as a Prince and who would be King of Narnia when I am gone. While he was Prince he would wear a gold crown and eat Turkish Delight all day long; and you are much the cleverest and handsomest young man I’ve ever met. I think I would like to make you the Prince—some day, when you bring the others to visit me.” — The White Witch (C.S. Lewis; The Lion, The Witch, and the Wardrobe)
When most people read the Narnia novels as children, they have no idea what Turkish Delight is. Its obscurity in recent decades has allowed everyone to imagine it as an entirely wonderful substance, carrying all their hopes and dreams of the perfect candy.
So, too, do people pour all of their business hopes and dreams into a nebulously-defined future of “digital transformation”.
Because the cloud is such a key enabling technology for digital business, I have plenty of discussions with clients who have been promised grand “digital transformation” outcomes by cloud providers and cloud MSPs. But it certainly not a phenomenon limited to the cloud. Hardware vendors and ISVs, outsourcers, consultancies, etc. are all selling this dream. While I can think of vendors who are more guilty of this than others, it’s a cross-IT-industry phenomenon.
Beware all digital transformation promises. Especially the ones where the vendor promises to partner with you to change the future of your industry or reinvent/revolutionize/disrupt X (where X is what your industry does).
I’ve quietly watched a string of broken transformation promises over the last few years, gently privately warning clients in inquiry conversations that you generally can’t trust these sorts of vendor promises. These behaviors have become much more prominent recently, though. And a colleague recently told me about a conversation that seemed like just a bridge too far: a large tech vendor promising to partner with a small Midwestern industrial manufacturer (tech laggards not doing anything exciting) to create transformative products, as part of a sales negotiation. (This vendor had not previously exhibited such behavior, so it was doubly startling.)
Clients come to us with tales of vendors who, in the course of sales discussions, promises to partner with them — possibly even dangling the possibility of a joint venture — to launch a transformational digital business, revolutionize the future of their industry, or the like. (Note that this is distinct from companies that sell transformation consulting. They promise to help you figure out the future, not form a business partnership to create that future — i.e. McKinsey, Deloitte, etc.)
Usually, neither the customer nor the vendor have a concrete idea of what that looks like. Usually, the vendor refuses to put this partnership notion in writing as a formal contract. On the rare occasion that there is a contract, it is pretty vague, does not oblige the vendor to put forth any business ideas, and allows the vendor to refuse any business idea and investment. In other words, it has zero teeth. Because it’s so open-ended, the customer can fill the void with all their Turkish Delight dreams.
Moreover, the vendor may sometimes dangle samples of transformation-oriented services and consulting during the sales process. The customer gobbles down these sweet nuggets, and then stares mournfully at the empty box of transformation candy. For the promise of more, they’ll cheerfully betray their enterprise procurement standards, while the sourcing managers stand on the sidelines frantically waving contract-related warnings.
Listen to your sourcing managers when they warn you that the proposed “partnership” is a fiction. The White Witch probably doesn’t have your best interests at heart. Good digital transformation promises — ones that are likely to actually be kept — have concrete outcomes. They specify what the partnership will be developing, together with timelines, budgets, and the legal entity (such as a JV) that will be used to deliver the products/services. Or they specify the specific consulting services that will be provided — workshops, deliverables from those workshops, work-for-hire agreements with specific costs (and discounts, if applicable), and so forth.
Without concrete contractual outcomes, the vendor can vanish the candy into thin air with no repercussions. Sure, in a concrete transformation proposal, the end result will probably not be your Turkish Delight dreams. It might resemble a bowl of ordinary M&Ms. Or maybe a tasty grab-bag of Lindt truffles. (You’d have to get particularly lucky for it get much beyond the realm of grocery-store candy, though.) But you’re much more likely to actually get a good outcome.
Off-hand, I can think of one public example where a prominent “change the industry” vendor partnership with an enterprise, seems to have resulted in a credible product: Microsoft’s Connected Vehicle Platform. There, Microsoft signed a deal with a collection of automakers, each of whom had specific outcomes they wished to achieve — outcomes which could be realistically achieved in a reasonable amount of time, and representing industry advancement but not anything truly revolutionary. Microsoft built upon those individual projects to deliver a platform which would move the industry forward, which was announced with a clear mission and a timeframe for launch. Sure, it didn’t “change the future of cars”, but it brought tangible benefits to the customers.
Vendors often try to sell to who you hope to be, rather than who you are now. Your aspirations aren’t bad. Just make sure that your aspirations are well-defined and there’s a realistic roadmap to achieve them. Hope is not a strategy. The vendor may have little incentive not to promise everything you could dream of, in order to get you to sign a large purchase agreement.
Many of my client inquiries deal with the seemingly overwhelming complexity of maturing cloud adoption — especially with the current wave of pandemic-driven late adopters, who are frequently facing business directives to move fast but see only an immense tidal wave of insurmountably complex tasks.
A lot of my advice is focused on starting small — or at least tackling reasonably-scoped projects. The following is specifically applicable to IaaS / IaaS+PaaS:
Build a cloud center of excellence. You can start a CCOE with just a single person designated as a cloud architect. Standing up a CCOE is probably going to take you a year of incremental work, during which cloud adoption, especially pilot projects, can move along. You might have to go back and retroactively apply governance and good practices to some projects. That’s usually okay.
Start with one cloud. Don’t go multicloud from the start. Do one. Get good at it (or at least get a reasonable way into a successful implementation). Then add another. If there’s immediate business demand (with solid business-case justifications) for more than one, get an MSP to deal with the additional clouds.
Don’t build a complex governance and ops structure based on theory. Don’t delay adoption while you work out everything you think you’ll need to govern and manage it. If you’ve never used cloud before, the reality may be quite different than you have in your head. Run a sequence of increasingly complex pilot projects to gain practical experience while you do preparatory work in the background. Take the lessons learned and apply them to that work.
Don’t build massive RFPs to choose a provider. Almost all organizations are better off considering their strategic priorities and then matching a cloud provider to those priorities. (If priorities are bifurcated between running the legacy and building new digital capabilities, this might encourage two strategic providers, which is fine and commonplace.) Massive RFPs are a lot of work and are rarely optimal. (Government folks might have no choice, unfortunately.)
Don’t try to evaluate every service. Hyperscale cloud providers have dozens upon dozens of services. You won’t use all of them. Don’t bother to evaluate all of them. If you think you might use a service in the future, and you want to compare that service across providers… well, by the time you get around to implementing it, all of the providers will have radically updated that service, so any work you do now will be functionally useless. Look just at the services that you are certain you will use immediately and in the very near (no more than one year) future. Validate a subset of services for use, and add new validations as needed later on.
Focus on thoughtful workload placement. Decide who your approved and preferred providers are, and build a workload placement policy. Look for “good technical fit” and not necessarily ideal technical fit; integration affinities and similar factors are more important. The time to do a detailed comparison of an individual service’s technical capabilities is when deciding workload placement, not during the RFP phase.
Accept the limits of cloud portability. Cloud providers don’t and will probably never offer commoditized services. Even when infrastructure resources seem superficially similar, there are still meaningful differences, and the management capabilities wrapped around those resources are highly differentiated. You’re buying into ecosystems that have the long-term stickiness of middleware and management software. Don’t waste time on single-pane-of-glass no-lock-in fantasies, no matter how glossily pretty the vendor marketing material is. And no, containers aren’t magic in this regard.
Links are to Gartner research and are paywalled.
Pondering the care and feeding of your multicloud gelatinous cube. (Which engulfs everything in its path, and digests everything organic.)
Most organizations end up multicloud, rather than intending to be multicloud in a deliberate and structured way. So typical tales go like this: The org started doing digital business-related new applications on AWS and now AWS has become the center of gravity for all new cloud-native apps and cloud-related skills. Then the org decided to migrate “boring” LOB Windows-based COTS to the cloud for cost-savings, and lifted-and-shifted them onto Azure (thereby not actually saving money, but that’s a post for another day). Now the org has a data science team that thinks that GCP is unbearably sexy. And there’s a floating island out there of Oracle business applications where OCI is being contemplated. And don’t forget about the division in China, that hosts on Alibaba Cloud…
Multicloud is inevitable in almost all organizations. Cloud IaaS+PaaS spans such a wide swathe of IT functions that it’s impractical and unrealistic to assume that the organization will be single-vendor over the long term. Just like the enterprise tends to have at least three of everything (if not ten of everything), the enterprise is similarly not going to resist the temptation of being multicloud, even if it’s complex and challenging to manage, and significantly increases management costs. It is a rare organization that both has diverse business needs, and can exercise the discipline to use a single provider.
Despite recognizing the giant ooze that we see squelching our way, along with our unavoidable doom, there are things we can do to prepare, govern, and ensure that we retain some of our sanity.
For starters, we can actively choose our multicloud strategy and stance. We can classify providers into tiers, decide what providers are approved for use and under what circumstances, and decide what providers are preferred and/or strategic.
We can then determine the level of support that the organization is going to have for each tier — decide, for instance, that we’ll provide full governance and operations for our primary strategic provider, a lighter-weight approach that leans on an MSP to support our secondary strategic provider, and less support (or no support beyond basic risk management) for other providers.
After that, we can build an explicit workload placement policy that has an algorithm that guides application owners/architects in deciding where particular applications live, based on integration affinities, good technical fit, etc.
Note that cost-based provider selection and cost-based long-term workload placement are both terrible ideas. This is a constant fight between cloud architects and procurement managers. It is rooted in the erroneous idea that IaaS is a commodity, and that provider pricing advantages are long-term rather than short-lived. Using cost-based placement often leads to higher long-term TCO, not to mention a grand mess with data gravity and thus data management, and fragile application integrations.
See my new research note, “Comparing Cloud Workload Placement Strategies” (Gartner paywall) for a guide to multicloud IaaS / IaaS+PaaS strategies (including when you should pursue a single-cloud approach). In a few weeks, you’ll see the follow-up doc “Designing a Cloud Workload Placement Policy” publish, which provides a guide to writing such policies, with an analysis of different placement factors and their priorities.
Responsibility for cloud operations is often a political football in enterprises. Sometimes nobody wants it; it’s a toxic hot potato that’s apparently coated in developer cooties. Sometimes everybody wants it, and some executives think that control over it are going to ensure their next promotion / a handsome bonus / attractiveness for their next job. Frequently, developers and the infrastructure & operations (I&O) orgs clash over it. Sometimes, CIOs decide to just stuff it into a Cloud Center of Excellence team which started out doing architecture and governance, and then finds itself saddled with everything else, too.
Lots of arguments are made for it to live in particular places and to be executed in various ways. There’s inevitably a clash between the “boring” stuff that is basically lifted-and-shifted and rarely changes, and the fast-moving agile stuff. And different approaches to IaaS, PaaS, and SaaS. And and and…
Well, the fact of the matter is that multiple people are probably right. You don’t actually want to take a one-size-fits-all approach. You want to fit operational approaches to your business needs. And you maybe even want to have specialized teams for each major hyperscale provider, even if you adopt some common approaches across a multicloud environment. (Azure vs. non-Azure, i.e. Azure vs. AWS, is a common split, often correlated closely to Windows-based application environments vs Linux-based application environments.)
Ideally, you’re going to be highly automated, agile, cloud-native, and collaborative between developers and operators (i.e. DevOps). But maybe not for everything (i.e. not all apps are under active development).
Plus, once you’ve chosen your basic operations approach (or approaches), you have to figure out how you’re going to handle cloud configuration, release engineering, and security responsibilities. (And all the upskilling necessary to do that well!)
That’s where people tend to really get hung up. How much responsibility can I realistically push to my development teams? How much responsibility do they want? How do I phase in new operational approaches over time? How do I hook this into existing CI/CD, agile, and DevOps initiatives?
There’s no one right answer. However, there’s one answer that is almost always wrong, and that’s splitting cloud operations across the I&O functional silos — i.e., the server team deals with your EC2 VMs, your NetApp storage admin deals with your Azure Blobs, your F5 specialist configures your Google Load Balancers, your firewall team fights with your network team over who controls the VPC config (often settled, badly, by buying firewall virtual appliances), etc.
When that approach is taken, the admins almost always treat the cloud portals like they’re the latest pointy-clicky interface for a piece of hardware. This pretty much guarantees incompetence, lack of coordination, and gross inefficiency. It’s usually terrible at regardless of what scale you’re at. Unfortunately, it’s also the first thing that most people try (closely followed by massively overburdening some poor cloud architect with Absolutely Everything Cloud-Related.)
What works for most orgs: Some form of cloud platform operations, where cloud management is treated like a “product”. It’s almost an internal cloud MSP approach, where the cloud platform ops team delivers a CMP suite, cloud-enabled CI/CD pipeline integrations, templates and automation, other cloud engineering, and where necessary, consultative assistance to coders and to application management teams. That team is usually on call for incident response, but the first line for incidents is usually the NOC or the like, and the org’s usual incident management team.
But there are lots of options. Gartner clients: Want a methodical dissection of pros and cons; cloud engineering, operating, and administration tasks; job roles; coder responsibilities; security integration; and other issues? Read my new note, “Comparing Cloud Operations Approaches“, which looks at eleven core patterns along with guidance for choosing between them, andmaking a range of accompanying decisions.
A nontrivial chunk of my client conversations are centered on the topic of cloud IaaS/PaaS self-service, and how to deal with development teams (and other technical end-user teams, i.e. data scientists, researchers, hardware engineers, etc.) that use these services. These teams, and the individuals within those teams, often have different levels of competence with the clouds, operations, security, etc. but pretty much all of them want unfettered access.
Responsible governance requires appropriate guidelines (policies) and guardrails, and some managers and architects feel that there should be one universal policy, and everyone — from the highly competent digital business team, to the data scientists with a bit of ad-hoc infrastructure knowledge — should be treated identically for the sake of “fairness”. This tends to be a point of particular sensitivity if there are numerous application development teams with similar needs, but different levels of cloud competence. In these situations, applying a single approach is deadly — either for agility or your crisis-induced ulcer.
Creating a structured, tiered approach, with different levels of self-service and associated governance guidelines and guardrails, is the most flexible approach. Furthermore, teams that deploy primarily using a CI/CD pipeline have different needs from teams working manually in the cloud provider portal, which in turn are different from teams that would benefit from having an easy-vend template that gets provisioned out of a ServiceNow request.
The degree to which each team can reasonably create its own configurations is related to the team’s competence with cloud solution architecture, cloud engineering, and cloud security. Not every person on the team may have a high level of competence; in fact, that will generally not be the case. However, the very least, for full self-service there needs to be at least one person with strong competencies in each of those areas, who has oversight responsibilities, acts an expert (provides assistance/mentorship within the team), and does any necessary code review.
If you use CI/CD, you also want automation of such review in your pipeline, that includes your infrastructure-as-code (IaC) and cloud configs, not just the app code; i.e. a tool like Concourse Labs). Even if your whole pipeline isn’t automated, review of IaC during the dev stage, and not just when it triggers a cloud security posture management tool (like Palo Alto’s Prisma Cloud or Turbot), whether in dev, test, or production.
Who determines “competence”? To avoid nasty internal politics, it’s best to set this standard objectively. Certifications are a reasonable approach, but if your org isn’t the sort that tends to pay for internal certifications or the external certifications (AWS/Azure Solution Architect, DevOps Engineer, Security Engineer, etc.) seem like too high a bar, you can develop an internal training course and certification. It’s not a bad idea for all of your coders (whether app developers, data scientists, etc.) that use the cloud to get some formal training on creating good and secure cloud configurations, anyway.
(For Gartner clients: I’m happy to have a deeper discussion in inquiry. And yes, a formal research note on this is currently going through our editing process and will be published soon.)
(Confused by the title of this post? Read this brief anecdote.)
The myth of cloud repatriation refuses to die, and a good chunk of the problem is that users (and poll respondents) use “repatriation” is a wild array of ways, but non-cloud vendors want you to believe that “repatriation” means enterprises packing up all their stuff in the cloud and moving it back into their internal data centers — which occurs so infrequently that it’s like a sasquatch sighting.
A non-comprehensive list of the ways that clients use the term “repatriation” that have little to nothing to do with what non-cloud vendors (or “hybrid”) would like you to believe:
Outsourcing takeback. The origin of the term comes from orgs that are coming back from traditional IT outsourcing. However, we also hear cloud architects say they are “repatriating” when they gradually take back management of cloud workloads from a cloud MSP; the workloads stay in the cloud, though.
Migration pause. Some migrations to IaaS/IaaS+PaaS do not go well. This is often the result of choosing a low-quality MSP for migration assistance, or rethinking the wisdom of a lift-and-shift. Orgs will pause, switch MSPs and/or switch migration approaches (usually to lift-and-optimize), and then resume. Some workloads might be temporarily returned on-premise while this occurs.
SaaS portfolio rationalization. Sprawling adoption of SaaS, at the individual, team, department or business-unit level, can result in one or more SaaS applications being replaced with other, official, corporate SaaS (for instance, replacing individual use of Dropbox with an org-wide Google Drive implementation as part of G-Suite). Sometimes, the org might choose to build on-premises functionality instead (for instance, replacing ad-hoc SaaS analytics with an on-prem data warehouse and enterprise BI solution). This is overwhelmingly the most common form of “cloud repatriation”.
Development in the cloud, production on premises. While the dev/prod split of environments is much less common than it used to be, some organizations still develop in cloud IaaS and then run the app in an on-prem data center in production. Orgs like this will sometimes say they “repatriate” the apps for production.
The Oops. Sometimes organizations attempt to put an application in the cloud and it Just Doesn’t Go Well. Sometimes the workload isn’t a good match for cloud services in general. Sometimes the workload is just a bad match for the particular provider chosen. Sometimes they make a bad integrator choice, or their internal cloud skills are inadequate to the task. Whatever it is, people might hit the “abort” button and either rethink and retry in the cloud, or give up and put it on premises (either until they can put together a better plan, or for the long term).
Of course, there are the sasquatch sightings, too, like the Dropbox migration from AWS (also see the five-year followup), but those stories rarely represent enterprise-comparable use cases. If you’re one of the largest purchasers of storage on the planet, and you want custom hardware, absolutely, DIY makes sense. (And Dropbox continues to do some things on AWS.)
Customers also engage in broader strategic application portfolio rationalizations that sometimes result in groups of applications being shifted around, based on changing needs. While the broader movement is towards the cloud, applications do sometimes come back on-premises, often to align to data gravity considerations for application and data integration.
None of these things are in any way equivalent to the notion that there’s a broad or even common movement of workloads from the cloud back on-premises, though, especially for those customers who have migrated entire data centers or the vast majority of their IT estate to the cloud.
(Updated with research: In my note for Gartner clients, “Moving Beyond the Myth of Repatriation: How to Handle Cloud Projects Failures”, I provide detailed guidance on why cloud projects fail, how to reduce the risks of such projects, and how — or if — to rescue troubled cloud projects.)