Blog Archives
Multicloud failover is almost always a terrible idea
Most people — and notably, almost all regulators — are wrong about how to address cloud resilience: they believe they should do multicloud failover, because they assume the cloud is just someone else’s computer. As I noted in a previous blog post, the cloud is NOT just someone else’s computer. (I have been particularly aghast at a recent Reuters article about the Bank of England’s stance.)
Regulators, risk managers, and plenty of IT management largely think of AWS, Azure, etc. as monolithic entities, where “the cloud” can just break for them, and then kaboom, everything is dead everywhere worldwide. They imagine one gargantuan amorphous data center, subject to all the problems that can afflict single data centers, or single systems. But that’s not how it works, that’s not the most effective way to address risk, and testing the “resilience of the provider” (as a generic whole) is both impossible and meaningless.
I mean, yes, there’s the possibility of the catastrophic failure of practically any software technology. There could be, for instance, a bug in the control systems of airplanes from fill-in-the-blank manufacturer that could be triggered at a particular time and cause all their airplanes to drop out of the sky simultaneously. But we don’t plan to make commercial airlines maintain backup planes from some other manufacturer in case it happens. Instead, we try to ensure that each plane is resilient in many ways — which, importantly, addresses the most probable forms of failure: electrical or mechanical failures of particular components.
Hyperscale cloud providers are full of moving parts — lots of components, assembled together into something that looks and feels like a cohesive whole. Each of those components has its own form of resilience, and some of those components are more fragile than others. Some of those components are typically operating well within engineered tolerances. Some of those components might be operating at the edge of those tolerances in certain circumstances — likely due to unexpected pressures from scale — and might be extra-scary if the provider isn’t aware that they’re operating at that edge. In addition to fault-tolerance within each component, there are many mechanisms for fault-tolerance built into the interaction between those components.
Every provider also has its own equivalent of “maintenance” (returning to the plane analogy). The quality of the “mechanics” and the operations will also impact how well the system as a whole operates. (See my previous blog post, “The multi-headed hydra of cloud resilience” for the factors that go into provider resilience.)
It’s not impossible for a provider to have a worldwide outage that effectively impacts all services (rather than just a single service). Such outages are all typically rooted in something that prevents components from communicating with each other, or customers from connecting to the services — global network issues, DNS, security certificates, or identity. The first major incident of this type was the 2012 Azure leap year outage. The 2019 Google “Chubby” outage had global network impact, including on GCP. There have been multiple Azure AD outages with broad impact across Microsoft’s cloud portfolio, most recently the 2021 Azure Active Directory outage. (But there are certainly other possibilities. As recently as yesterday, there was a global Azure Windows VM outage that impacted all Windows VM-dependent services.)
Provider architectural and operational differences do clearly make a difference. AWS, notably, has never had a full regional failure or a global outage. The unique nature of GCP’s global network has both benefits and drawbacks. Azure has been improving steadily in reliability over the years as Microsoft addresses both service architecture and deployment (and other operations) processes.
Note that while these outages can be multi-hour, they have generally been short enough that — given typical enterprise recovery-time objectives for disaster recovery, which are often lengthy — customers typically don’t activate a traditional DR plan. (Customers may take other mitigation actions, e.g. failover to another region, failover to an alternative application for a business process, and so forth.)
Multicloud failover requires that you maintain full portability between two providers, which is a massive burden on your application developers. The basic compute runtime (whether VMs or containers) is not the problem, so OpenShift, Anthos, or other “I can move my containers” solutions won’t really help you. The problem is all the differentiators — the different network architectures and features, the different storage capabilities, the proprietary PaaS capabilities, the wildly different security capabilities, etc. Sure, you can run all open source in VMs, but at that point, why are you bothering with the cloud at all? Plus, even in a DR situation, you need some operational capabilities on the other cloud (monitoring, logging, etc.), even if not your full toolset.
Moreover, the huge cost and complexity of a multicloud implementation is effectively a distraction from what you should actually be doing to improve your uptime and reduce your risk: making your applications resilient to the types of failure that are actually probable. More on that in a future blog post.
Banks are accelerating their cloud journeys
In the past couple of months, I have talked to the majority of the world’s largest banks about what is necessary to drive successful cloud adoption at enterprise scale. These conversations have a lot of things in common with one another, and I often send the same research notes as a follow-up to our conversations. Here are those notes, with some context. The notes are all behind the Gartner paywall, in most cases Gartner for Technical Professionals, but some of these are available to IT Leaders clients, or Executive Programs clients.
Banks are indeed really moving core banking to the cloud. The long-held adage that “banks might put new systems of innovation or systems of engagement in the cloud, but they’ll never move core banking” is crumbling. Gartner has statistics supporting this, which you can find in “Core Banking Hot Spot: Moving the Core Into the Cloud”.
Banks cite application modernization as a critical driver for cloud adoption. An increasing number of banks are migrating a substantial percentage of their existing application estate to public cloud IaaS (and PaaS). Supporting survey data can be found in “Application Modernization Is the Most Common Identified Priority for End-User Cloud Adoption in Banking and Investment Services” (but other priorities are closely clustered in importance).
Banks are striving to mature their cloud adoption. Some banks have had a lot of ad hoc adoption over the years, while other banks have been more cautious (venturing into a bit of SaaS but sometimes zero IaaS or PaaS). But we’ve hit the inflection point (starting about two years ago) where banks became comfortable with cloud provider security and then seemingly all of a sudden went to a “go go go!” mode in which cloud was viewed as a critical accelerator of digital banking initiatives. (See “Advance Through Public Cloud Adoption Maturity” for a view of typical journeys.)
Central cloud governance is the norm for banks. Banks generally like the Gartner-style cloud center of excellence (CCOE) model where an enterprise architecture function provides cloud governance, brokerage, and transformation assistance. (See “How to Build a Cloud Center of Excellence“.) However, their CCOE model is likely to be federated to empower different business units or regions to take charge of their own destinies (especially when the cloud strategy is more regional than global). And many banks are splitting off a separate cloud IT unit under a deputy CIO, which is effectively a self-contained organization with hundreds of people devoted to the cloud migration and transformation effort.
While banks still do detailed technical evaluation of cloud providers, strategic selection is based on alignment to the IT strategy. Banks still really care about nitpicky technical details, but ultimately, their selection of strategic providers is based on broader IT priorities, just like most other cloud customers these days. (See “How to Initiate the Selection of Strategic Cloud IaaS Providers“.) Sometimes there’s a certain degree of hope for some kind of innovation partnership. (I am cynical about such “partnerships”, especially when they come in the form of vague sales platitudes without contractual guarantees or a close business development relationship.)
Banks tend to be multicloud. The larger the bank, the more likely it is to adopt a multicloud strategy, similar to other enterprises (see “Comparing Cloud Workload Placement Strategies”). However, this does not mean that all cloud providers are treated equally. My anecdotal impression is that in terms of primary strategic provider, AWS dominates the top end of the market (the largest banks) but that Azure captures the middle of the pack (from the US midmarket banks that tend to outsource their processing, to the banks that are important at the country/region level but not highly global).
Banks are making the transition to a more systematic approach to multicloud. Like many large distributed enterprises, banks often have pockets of cloud adoption, each aligned to a different cloud provider. With the maturation of their cloud journeys, they are becoming more systematic, building workload placement policies to guide where workloads should go. (See “Designing a Cloud Workload Placement Policy Document“.)
Banks worry about cloud concentration risks. Many banks face regulatory regimes that require them to address concentration risk. Regulators tend not to provide prescriptive guidance for what they must do, though. Banks have told me that attempting to maintain multicloud portability for applications essentially destroys the business case for cloud. Portability significantly impacts application development time, thus reducing the agility benefits. Without the ability to exploit the unique differentiated capabilities of a cloud provider, there’s little compelling reason not to just do it on-premises — which might actually be more risky than doing it in the cloud. There are effective practical risk-reduction approaches that don’t involve “maintain constant portability of all my apps”, though. (See “How to Create a Public Cloud Integrated IaaS and PaaS Exit Plan“.)
I hope to collaborate with a Gartner colleague to write bank-targeted research in the future. If you’re a cloud architect at a bank, I’d love to speak with you in client inquiry.
Are B2B cloud service agreements safe?
I’m seeing various bits of angst around “Is it safe to use cloud services, if my business can be suspended or terminated at any time?” and I thought I’d take some time to explain how cloud providers (and other Internet ecosystem providers, collectively “service providers” [SPs] in this blog post) enforce their Terms of Service (ToS) and Acceptable Use Policies (AUPs).
The TL;DR: Service providers like money, and will strive to avoid terminating customers over policy violations. However, providers do routinely (and sometimes automatically) enforce these policies, although they vary in how much grace and assistance they offer with issues. You don’t have to be a “bad guy” to occasionally run afoul of the policies. But if your business is permanently unwilling or unable to comply with a particular provider’s policies, you cannot use that provider.
AUP enforcement actions are rarely publicized. The information in this post is drawn from personal experience running an ISP abuse department; 20 years of reviewing multiple ISP, hosting, CDN, and cloud IaaS contracts on a daily basis; many years of dialogue with Gartner clients about their experiences with these policies; and conversations with service providers about these topics. Note that I am not a lawyer, and this post is not legal advice. I would encourage you to work with your corporate counsel to understand service provider contract language and its implications for your business, whenever you contract for any form of digital service, whether cloud or noncloud.
The information in this post is intended to be factual; it is not advice and there is a minimum of opinion. Gartner clients interested in understanding how to negotiate terms of service with cloud providers are encouraged to consult our advice for negotiating with Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), or with SaaS providers. My colleagues will cheerfully review your contracts and provide tailored advice in the context of client inquiry.
Click-thrus, negotiated contracts, ToS, and AUPs
Business-to-business (B2B) service provider agreements have taken two different forms for more than 20 years. There are “click-through agreements” (CTAs) that present you with an online contract that you click to sign; consequently, they are as-is, “take it or leave it” legal documents that tend to favor the provider in terms of business risk mitigation. Then there are negotiated contracts — “enterprise agreements” (EAs) that typically begin with more generous terms and conditions (T&Cs) that better balance the interests of the customer and the provider. EAs are typically negotiated between the provider’s lawyers and the customer’s procurement (sourcing & vendor management) team, as well as the customer’s lawyers (“counsel”).
Some service providers operate exclusively on either CTAs or EAs. But most cloud providers offer both. Not all customers may be eligible to sign an EA; that’s a business decision. Providers may set a minimum account size, minimum spend, minimum creditworthiness, etc., and these thresholds may be different in different countries. Providers are under no obligation to either publicize the circumstances under which an EA is offered, or to offer an EA to a particular customer (or prospective customer).
While in general, EAs would logically be negotiated by all customers who can qualify, providers do not necessarily proactively offer EAs. Furthermore, some companies — especially startups without cloud-knowledgeable sourcing managers — aren’t aware of the existence of EAs and therefore don’t pursue them. And many businesses that are new to cloud services don’t initially negotiate an EA, or take months to do that negotiation, operating on a CTA in the meantime. Therefore, there are certainly businesses that spend a lot of money with a provider, yet only have a CTA.
Terms of service are typically baked directly into both CTAs and EAs — they are one element of the T&Cs. As a result, a business on an EA benefits both from the greater “default” generosity of the EA’s T&Cs over the provider’s CTA (if the provider offers both), as well as whatever clauses they’ve been able to negotiate in their favor. In general, the bigger the deal, the more leverage the customer has to negotiate favorable T&Cs, which may include greater “cure time” for contract breaches, greater time to pay the bill, more notice of service changes, etc.
AUPs, on the other hand, are separate documents incorporated by reference into the T&Cs. They are usually a superset or expansion/clarification of the things mentioned directly in the ToS. For instance, the ToS may say “no illegal activity allowed”, and the AUP will give examples of prohibited activities (important since what is legal varies by country). AUPs routinely restrict valid use, even if such use is entirely legal. Service providers usually stipulate that an AUP can change with no notice (which essentially allows a provider to respond rapidly to a change in the regulatory or threat environment).
Unlike the EA T&Cs, an AUP is non-negotiable. However, an EA can be negotiated to clarify an AUP interpretation, especially if the customer is in a “grey area” that might be covered by the AUP even if the use is totally legitimate (e.g. a security vendor that performs penetration testing on other businesses at their request may nevertheless ask for an explicit EA statement that such testing doesn’t violate the AUP).
Prospective customers of a service provider can’t safely make assumptions about the AUP intent. For example, some providers might not exclude even a fully white-hat pen-testing security vendor from the relevant portion of the AUP. Some providers with a gambling-excluding AUP may not choose to do business with an organization that has, for instance, anything to do with gambling, even if that gambling is not online (which can get into grey areas like, “Is running a state lottery a form of gambling?”). Some providers operating data centers in countries without full freedom of the press may be obliged to enforce restrictions on what content a media company can host in those regions. Anyone who could conceivably violate the AUP as part of the routine course of business would therefore want to gain clarity on interpretation up front — and get it in writing in an EA.
What does AUP enforcement look like?
If you’re not familiar with AUPs or why they exist and must be enforced, I encourage you to read my post “Terms of Service: from anti-spam to content takedown” first.
AUP enforcement is generally handled by a “fraud and abuse” department within a service provider, although in recent years, some service providers have adopted friendlier names, like “trust and safety”. When an enforcement action is taken, the customer is typically given a clear statement of what the violation is, any actions taken (or that will be taken within X amount of time if the violation isn’t fixed), and how to contact the provider regarding the issue. There is normally no ambiguity, although less technically-savvy customers can sometimes have difficulty understanding what they did wrong — and in the case of automatic enforcement actions, the customer may be entirely puzzled by what they did to trigger this.
There is almost always a split in the way that enforcement is handled for customers on a CTA, vs customers on an EA. Because customers on a CTA undergo zero or minimal verification, there is no presumption that those customers are legit good actors. Indeed, some providers may assume, until proven otherwise, that such customers exist specifically to perpetuate fraud and/or abuse. Customers on an EA have effectively gone through more vetting — the account team has probably done the homework to figure out likely revenue opportunity, business model and drivers for the sale, etc. — and they also have better T&Cs, so they get the benefit of the doubt.
Consequently, CTA customers are often subject to more stringent policies and much harsher, immediate enforcement of those policies. Immediate suspension or termination is certainly possible, with relatively minimal communication. (To take a public GCP example: GCP would terminate without means to protest as recently as 2018, though that has changed. Its suspension guidelines and CTA restrictions offer clear statements of swift and significantly automated enforcement, including prevention of cryptocurrency mining for CTA customers who aren’t on invoicing, even though it’s perfectly legal.) The watchword for the cloud providers is “business risk management” when it comes to CTA customers.
Customers that are on a CTA but are spending a lot of money — and seem to be legit businesses with an established history on the platform — generally get a little more latitude in enforcement. (And if enforcement is automated, there may be a sliding threshold for automated actions based on spend history.) Similarly, customers on a CTA but who are actively negotiating an EA or engaged in the enterprise sales process may get more latitude.
In contrast to their handling of CTA customers, providers usually assume that an EA customer who has breached the AUP has done so unintentionally. (For instance, one of the customer’s salespeople may have sent spam, or a customer VM may have been compromised and is now being used as part of a botnet.) Consequently, the provider tends to believe that what the customer needs is notification that something is wrong, education on why it’s problematic, and help in addressing the issue. EA customers are often completely spared from any automated form of policy enforcement. While business risk management is still a factor for the service provider, this is often couched politely as helping the customer avoid risk exposure for the customer’s own business.
Providers do, however, generally firmly hold the line on “the customer needs to deal with the problem”. For instance, I’ve encountered cloud customers who have said to me, “Well, my security breach isn’t so bad, and I don’t have time/resources to address this compromised VM, so I’d like more than 30 days grace to do so, how do I make my provider agree?” when the service provider has reasonably taken the stance that the breach potentially endangers others, and mandated that the customer promptly (immediately) address the breach. In many cases, the provider will offer technical assistance if necessary. Service providers vary in their response to this sort of recalcitrance and the extent of their enforcement actions. For instance, AWS normally takes actions against the narrowest feasible scope — i.e. against only the infrastructure involved in the policy violation. Broadly, cloud providers don’t punish customers for violations, but customers must do something about such violations.
Some providers have some variant of a “three strikes” policy, or escalating enforcement. For instance, if a customer has repeated issues — it seems unable to implement effective anti-spam compliance for itself, or it constantly fails to maintain effective security in a way that could impact other customers or the cloud provider’s services, or it can’t effectively moderate content it offers to the public, or it can’t prevent its employees from distributing illegally copied music using corporate cloud resources — then repeated warnings and limited enforcement actions can turn into suspensions or termination. Thus, even EA customers are essentially obliged to treat every policy violation as something that they need to strive to ensure will not recur. Resolution of a given violation is not evidence that the customer is in effective compliance with the agreement.
Bottom Line
Entirely legitimate, well-intentioned businesses do sometimes breach the ToS or AUP, but it’s rare; a business might do this once or twice over the course of many years. New cloud customers on a CTA may also innocently exhibit behaviors that trigger automated enforcement, since those systems use algorithms to look for usage patterns that may be indicative of fraud or abuse. Service providers will take enforcement actions based on the customer history, the contractual agreement, and other business-risk and customer-relationship factors.
Intent matters. Accidental breaches are likely to be treated with a great deal more kindness. If breaches recur, though, the provider is likely to want to see evidence that the business has an effective plan for preventing further such issues. Even if the customer is willing to absorb the technical, legal, or business risks associated with a violation, the service provider is likely to insist that the issue be addressed — to protect their own services, their own customers, and for the customer’s own good.
(Update: Gartner clients, I have published a research note: “What is the risk of actually losing your cloud provider?“)
Terms of Service: From anti-spam to content takedown
Many consumers are familiar with the terms of service (ToS) that govern their use of consumer platforms that contain user-generated content (UGC), such as Facebook, Twitter, and YouTube. But many people are less familiar with the terms of service and acceptable use policy (AUP) that governs the relationships between businesses and their service providers.
In light of the recent decision undertaken by Amazon Web Services (AWS) to suspend service to Parler, a Twitter-like social network that was used to plan the January 6th insurrection at the US Capitol, there have been numerous falsehoods circulating on social media that seem related to a lack of understanding of service provider behavior, ToS and AUPs: claims of “free speech” violations, a coordinated conspiracy between big tech companies, and the like. This post is intended to examine, without judgment, how service providers — cloud, hosting and colo, CDN and other Internet infrastructure providers, and the ISPs that connect everyone to the Internet — came to a place of industry “standards” for behavior, and how enforcement is handled in B2B situations.
The TL;DR summary: The global service provider community, as a result of anti-spam efforts dating back to the mid-90s, enforces extremely similar policies governing content, including user-generated content, through a combination of B2B peer pressure and contractual obligations. Business customers who contravene these norms have very few options.
These norms will greatly limit Parler’s options for a new home. Many sites with far-right and similarly controversial content have ultimately ended up using a provider in a supply chain that relies on Russian connectivity, thus dodging the Internet norms that prevail in the rest of the world.
Internet Architecture and Service Provider Dependencies
While the Internet is a collection of loosely federated networks that are in theory independent from one another, it is also an interdependent web of interconnections between those networks. There are two ways that ISPs connect with one another — through “settlement-free peering” (essentially an exchange of traffic between two ISPs that view themselves as equals) and through the purchase of “transit” (making a “downstream ISP” the customer of an “upstream ISP”).
This results in a three-tier model for ISPs. The Tier 1 ISPs are big global carriers of network connectivity — companies like AT&T, BT and NTT — who have settlement-free peers with each other, and sell transit to smaller ISPs. Tier 2 ISPs are usually regional, and have settlement-free peers with others in and around their region, but also are reliant on transit from the Tier 1s. Tier 3 ISPs are entirely dependent on purchasing transit. ISPs at all three tiers also sell connectivity directly to businesses and/or consumers.
In practice, this means that ISPs are generally contractually bound to other ISPs. All transit contracts are governed by terms of service that normally incorporate, by reference, an AUP. Even settlement-free peering agreements are legal contracts, which normally include the mutual agreement to maintain and enforce some form of AUP. (In the earlier days of the Internet, peering was done on a handshake, but anything of that sort is basically a legacy that can come to an abrupt end should one party suddenly decide to behave badly.)
AUP documents are interesting because they are deliberately created as living documents, allowing AUPs to be adapted to changing circumstances — unlike standard contract terms, which apply for the length of what is usually a multiyear contract. AUPs are also normally ironclad; it’s usually difficult to impossible for a business to get any form of AUP exemption written into their contract. Most contracts provide minimal or no notice for AUP changes. Businesses tend to simply agree to them because most businesses do not plan to engage in the kind of behavior that violates an AUP — and because they don’t have much choice.
The existence of ISP tiering means that upstream providers have significant influence over the behavior of their downstream. Upstream ISPs normally mandate that their downstream ISPs — and other service providers that use their connectivity, like hosting providers — enforce an AUP that enables the downstream provider to be compliant with the upstream’s terms of service. Downstream providers that fail to do so can have their connectivity temporarily suspended or their contract terminated. And between the Tier 1 providers, peer pressure ensures a common global understanding and enforcement of acceptable behavior on the Internet.
Note that this has all occurred in the absence of regulation. ISPs have come to these arrangements through decisions about what’s good for their individual businesses first and foremost, with the general agreement that these community standards for AUPs are good for the community of network operators as a whole.
We’re Here Because Nobody Likes Spammers
So how did we arrive at this state in the first place?
In the mid-90s, as the Internet was growing rapidly, in the near-total absence of regulation, spam was a growing problem. Spam came from both legitimate businesses who simply weren’t aware of or didn’t especially care about Internet etiquette, as well as commercial spammers (bad actors with deceptive or fraudulent ads, and/or illegal/grey-market products).
Many B2B ISPs did not feel that it was necessarily their responsibility to intervene, despite general distaste for spammers — and, sometimes, a flood of consumer complaints. Some percentage of spammers were otherwise “good customers” — i.e. they paid their bills on time and bought a lot of bandwidth. Many more, however, obtained services under fraudulent pretenses, didn’t pay their bills, or tended not to pay on time.
Gradually, the community of network operators came to a common understanding that spammers were generally bad for business, whether they were your own customers, or whether they were the customers of, say, a web hosting company that you provided Internet connectivity for.
This resulted in upstream ISPs exerting pressure on downstream ISPs. Downstream ISPs, in turn, exerted pressure on their customers — kicking spammers off their networks and pushing hosters to kick spammers out of hosting environments. ISPs formalized AUPs. AUP enforcement took longer. Many ISPs were initially pretty shoddy and inconsistent in their enforcement — either because they needed the revenue they were getting from spammers, or due to unwillingness or inability to fund a staff to deal with abuse, or corporate lawyers who urged caution. It took years, but ISPs eventually arrived at AUPs that were contractually enforceable, processes for handling complaints, and relatively consistent enforcement. Legislation like the CAN-SPAM act in the US didn’t hurt, but by the time CAN-SPAM was passed (in 2003), ISPs had already arrived at a fairly successful commercial resolution to the problem.
Because anti-spam efforts were largely fueled by agreements enshrined in B2B contracts, and not in government regulation, there was never full consistency across the industry. Different ISPs created different AUPs — some stricter and some looser. Different ISPs wrote different terms of service into their contracts, with different “cure” periods (a period of time that a party in the contract is given to come into compliance with a contractual requirement). Different ISPs had different attitudes towards balancing “customer service” versus their responsibilities to their upstream providers and to the broader community of network operators.
Consequently, there’s nothing that says “We need to receive X number of spam complaints before we take action,” for instance. Some providers may have internal process standards for this. A lot of enforcement simply takes place via automated algorithms; i.e. if a certain threshold of users reports something as spam, enforcement actions take place. Providers effectively establish, through peer norms, what constitutes “effective” enforcement in accordance with terms of service obligations. Providers don’t need to threaten each other with network disconnection, because a norm has been established. But the implicit threat — and the contractual teeth behind that threat — always remains.
Nobody really likes terminating customers. So there are often fairly long cure periods, recognizing that it can take a while for a customer to properly comply with an AUP. In the suspension letter that AWS sent Parler, AWS cites communications “over the past several weeks”. Usually the providers look for their customers to demonstrate good-faith efforts, but may take suspension or termination efforts if it does not look like a good-faith effort to comply is being made, or if it appears that the effort, no matter how seemingly earnest, does not seem likely to bring compliance within a reasonable time period. 30 days is a common timeframe specified as a cure period in contracts (and is the cure period in the AWS standard Enterprise Agreement), but cloud provider click-through agreements (such as the AWS Customer Agreement) do not normally have a cure period, allowing immediate action to be taken at the provider’s discretion.
What Does This Have to Do With Policing Users on Social Media?
When providers established anti-spam AUPs, they also added a whole laundry list of offenses beyond spamming. Think of that list, “Everything a good corporate lawyer thought an ISP might ever want to terminate a customer for doing.” Illegal behavior, harassment, behavior that disrupts provider operations, behavior that threatens the safety/security/operations of other businesses, etc. are all prohibited.
Hosting companies — eventually followed by cloud providers like AWS, Microsoft Azure, and Google Cloud Platform, as well as companies that hold key roles in the Internet ecosystem (domain registrars and the companies that operate the DNS; content delivery networks like Akamai and Cloudflare, etc.) — were essentially obliged to incorporate their upstream ISP usage policies into their own terms of service and AUPs, and to enforce those policies on their users if they wanted to stay connected to the Internet. Some such providers have also explicitly chosen not to sell to customers in certain business segments — for instance, no gambling, or no pornography, even if the business is fully legitimate and legal (for instance, like MGM Resorts or Playboy) — through limiting what’s allowed in their terms of service. An AUP may restrict activities that are perfectly legal in a given jurisdiction.
Even extremely large tech companies that have their own data centers, like Facebook and Apple, are ultimately beholden to ISPs. (Google is something of an odd case because in addition to owning their own data centers, they are one of the largest global network operators. Google owns extensive fiber routes and peers with Tier 1 ISPs as an equal.) And even though AWS has, to some degree, a network of its own, it is effectively a Tier 2 ISP, making it beholden to the AUPs of its upstream. Other cloud providers are typically mostly or fully transit-dependent, and are thus entirely beholden to their upstream.
In short: Everyone who hosts content companies — and the content companies themselves — is essentially bound, by the chain of AUP obligations, to police content and ensure that it is not illegal, harassing, or otherwise seen as commercially problematic.
You have to go outside the normal Internet supply chain — for instance, to the Russian service providers — before you escape the commercial arrangements that define the bounds of good business behavior on the Internet. It doesn’t matter what a provider’s philosophical alignment is. Commercially, they simply can’t really push back on the established order. And regulation at a single-country level can’t really force these agreements to be significantly more or less restrictive, because of the globalized nature of peering/transit; providers generally interconnect in multiple countries.
It also means that these aren’t just “Silicon Valley” standards. These are global norms for behavior, which means they are not influenced solely by the relatively laissez-faire content standards of the United States, but by the more stringent European and APAC environments.
It’s an interesting result of what happens when businesses police themselves. Even without formal industry-association “rules” or regulatory obligations, a fairly ironclad order can emerge that exerts extremely effective downstream pressure (as we saw in the cases of 8Chan and the Daily Stormer back in 2019).
Does being multicloud help with terms of service violations?
Some people will undoubtedly ask, “Would it have helped Parler to have been multicloud?” Parler has already said that they are merely bare-metal customers of AWS, reducing technical dependencies / improving portability. But their situation is such that they would almost certainly have had the exact same issue if they had been running on Microsoft Azure, Google Cloud Platform, or even Oracle Cloud Infrastructure as well (even though the three companies have top executives with political views spanning the spectrum). A multicloud strategy won’t help any business that violates AUP norms.
AWS and its cloud/hosting competitors are usually pretty generous when working with business customers that unintentionally violate their AUPs. But a business that chooses not to comply is unlikely to find itself welcome anywhere, which makes multicloud deployment largely useless as a defensive strategy.
The multi-headed hydra of cloud resilience
Clients have recently been asking a lot more questions about the comparative resilience of cloud providers.
Identity services are a particular point of concern (for instance, the Azure AD outage of October 1st and Google Cloud IAM outage of March 26th) since when identity is down, the customer can’t access the cloud provider’s control plane (and it may impact service use in general) — plus there’s generally no way for the customer to work around such issues.
The good news is, hyperscale cloud providers do a pretty good job of being robust. However, the risk of smaller, more hosting-like providers can be much higher — and there are notable differences between the hyperscalers, too.
Operations folks know: Everything breaks. Physical stuff fails, software is buggy, and people screw up (a lot). A provider can try its best to reduce the number of failures, limit the “blast radius” of a problem, limit the possibility of “cascading failures”, and find ways to mitigate the impact on users. But you can’t avoid failure entirely. Systems that are resilient recover quickly from failure.
If you chop off the head of a hydra, it grows back — quickly. We can think about five key factors — heads of the hydra — that influence the robustness, resilience, and observed (“real world”) availability of cloud services:
- Physical design: The design of physical things, such as the data center and the hardware used to deliver services.
- Logical (software) design: The design of non-physical things, especially software — all aspects of the service architecture that is not related to a physical element.
- Implementation quality: The robustness of the actual implementation, encompassing implementation skill, care and meticulousness, and the effectiveness of quality-assurance (QA) efforts.
- Deployment processes: The rollout of service changes is the single largest cause of operational failures in cloud services. The quality of these processes, the automation used in the processes, and the degree to which humans are given latitude to use good judgment (or poor judgment) thus have a material impact on availability.
- Operational processes: Other operational processes, such as monitoring, incident management — and, most importantly, problem management — impact the cloud provider’s ability to react quickly to problems, mitigate issues, and ensure that the root causes of incidents are addressed. Both proactive and reactive maintenance efforts can have an impact on availability.
A sixth factor, Transparency, isn’t directly related to keeping the hydra alive, but matters to customers as they plan for their own application architectures and risk management — contributing to customer resilience. Transparency includes making architectural information available to customers, as well as delivering outage-related visibility and insight. Customers need real-world info — like current and historical outage reports, and the root-cause-analysis post-mortems that offer insight into what went wrong and why (and what the provider is doing about it).
When you think about cloud service resilience (or the resilience of your own systems), think about it in terms of those factors. Don’t think about it like you think about on-premises systems, where people often think primarily about hardware failures or a fire in the data center. Rather, you’re dealing with systems where software issues are almost always the root cause. Physical robustness still matters, but the other four factors are largely about software.
Beware of vendors bearing transformation Turkish Delight
“It is a lovely place, my house,” said the Queen. “I am sure you would like it. There are whole rooms full of Turkish Delight, and what’s more, I have no children of my own. I want a nice boy whom I could bring up as a Prince and who would be King of Narnia when I am gone. While he was Prince he would wear a gold crown and eat Turkish Delight all day long; and you are much the cleverest and handsomest young man I’ve ever met. I think I would like to make you the Prince—some day, when you bring the others to visit me.” — The White Witch (C.S. Lewis; The Lion, The Witch, and the Wardrobe)
When most people read the Narnia novels as children, they have no idea what Turkish Delight is. Its obscurity in recent decades has allowed everyone to imagine it as an entirely wonderful substance, carrying all their hopes and dreams of the perfect candy.
So, too, do people pour all of their business hopes and dreams into a nebulously-defined future of “digital transformation”.
Because the cloud is such a key enabling technology for digital business, I have plenty of discussions with clients who have been promised grand “digital transformation” outcomes by cloud providers and cloud MSPs. But it is certainly not a phenomenon limited to the cloud. Hardware vendors and ISVs, outsourcers, consultancies, etc. are all selling this dream. While I can think of vendors who are more guilty of this than others, it’s a cross-IT-industry phenomenon.
Beware all digital transformation promises. Especially the ones where the vendor promises to partner with you to change the future of your industry or reinvent/revolutionize/disrupt X (where X is what your industry does).
I’ve quietly watched a string of broken transformation promises over the last few years, gently privately warning clients in inquiry conversations that you generally can’t trust these sorts of vendor promises. These behaviors have become much more prominent recently, though. And a colleague recently told me about a conversation that seemed like just a bridge too far: a large tech vendor promising to partner with a small Midwestern industrial manufacturer (tech laggards not doing anything exciting) to create transformative products, as part of a sales negotiation. (This vendor had not previously exhibited such behavior, so it was doubly startling.)
Clients come to us with tales of vendors who, in the course of sales discussions, promise to partner with them — possibly even dangling the possibility of a joint venture — to launch a transformational digital business, revolutionize the future of their industry, or the like. (Note that this is distinct from companies that sell transformation consulting, such as McKinsey or Deloitte. They promise to help you figure out the future, not form a business partnership to create that future.)
Usually, neither the customer nor the vendor has a concrete idea of what that looks like. Usually, the vendor refuses to put this partnership notion in writing as a formal contract. On the rare occasion that there is a contract, it is pretty vague, does not oblige the vendor to put forth any business ideas, and allows the vendor to refuse any business idea and investment. In other words, it has zero teeth. Because it’s so open-ended, the customer can fill the void with all their Turkish Delight dreams.
Moreover, the vendor may sometimes dangle samples of transformation-oriented services and consulting during the sales process. The customer gobbles down these sweet nuggets, and then stares mournfully at the empty box of transformation candy. For the promise of more, they’ll cheerfully betray their enterprise procurement standards, while the sourcing managers stand on the sidelines frantically waving contract-related warnings.
Listen to your sourcing managers when they warn you that the proposed “partnership” is a fiction. The White Witch probably doesn’t have your best interests at heart. Good digital transformation promises — ones that are likely to actually be kept — have concrete outcomes. They specify what the partnership will be developing, together with timelines, budgets, and the legal entity (such as a JV) that will be used to deliver the products/services. Or they specify the specific consulting services that will be provided — workshops, deliverables from those workshops, work-for-hire agreements with specific costs (and discounts, if applicable), and so forth.
Without concrete contractual outcomes, the vendor can vanish the candy into thin air with no repercussions. Sure, in a concrete transformation proposal, the end result will probably not be your Turkish Delight dreams. It might resemble a bowl of ordinary M&Ms. Or maybe a tasty grab-bag of Lindt truffles. (You’d have to get particularly lucky for it to get much beyond the realm of grocery-store candy, though.) But you’re much more likely to actually get a good outcome.
Off-hand, I can think of one public example where a prominent “change the industry” vendor partnership with an enterprise seems to have resulted in a credible product: Microsoft’s Connected Vehicle Platform. There, Microsoft signed a deal with a collection of automakers, each of whom had specific outcomes they wished to achieve — outcomes which could be realistically achieved in a reasonable amount of time, and representing industry advancement but not anything truly revolutionary. Microsoft built upon those individual projects to deliver a platform that would move the industry forward, announced with a clear mission and a timeframe for launch. Sure, it didn’t “change the future of cars”, but it brought tangible benefits to the customers.
Vendors often try to sell to who you hope to be, rather than who you are now. Your aspirations aren’t bad. Just make sure that your aspirations are well-defined and there’s a realistic roadmap to achieve them. Hope is not a strategy. The vendor may have little incentive not to promise everything you could dream of, in order to get you to sign a large purchase agreement.
Don’t boil the ocean to create your cloud
Many of my client inquiries deal with the seemingly overwhelming complexity of maturing cloud adoption — especially with the current wave of pandemic-driven late adopters, who are frequently facing business directives to move fast but see only an immense tidal wave of insurmountably complex tasks.
A lot of my advice is focused on starting small — or at least tackling reasonably-scoped projects. The following is specifically applicable to IaaS / IaaS+PaaS:
Build a cloud center of excellence. You can start a CCOE with just a single person designated as a cloud architect. Standing up a CCOE is probably going to take you a year of incremental work, during which cloud adoption, especially pilot projects, can move along. You might have to go back and retroactively apply governance and good practices to some projects. That’s usually okay.
Start with one cloud. Don’t go multicloud from the start. Do one. Get good at it (or at least get a reasonable way into a successful implementation). Then add another. If there’s immediate business demand (with solid business-case justifications) for more than one, get an MSP to deal with the additional clouds.
Don’t build a complex governance and ops structure based on theory. Don’t delay adoption while you work out everything you think you’ll need to govern and manage it. If you’ve never used cloud before, the reality may be quite different from what you have in your head. Run a sequence of increasingly complex pilot projects to gain practical experience while you do preparatory work in the background. Take the lessons learned and apply them to that work.
Don’t build massive RFPs to choose a provider. Almost all organizations are better off considering their strategic priorities and then matching a cloud provider to those priorities. (If priorities are bifurcated between running the legacy and building new digital capabilities, this might encourage two strategic providers, which is fine and commonplace.) Massive RFPs are a lot of work and are rarely optimal. (Government folks might have no choice, unfortunately.)
Don’t try to evaluate every service. Hyperscale cloud providers have dozens upon dozens of services. You won’t use all of them. Don’t bother to evaluate all of them. If you think you might use a service in the future, and you want to compare that service across providers… well, by the time you get around to implementing it, all of the providers will have radically updated that service, so any work you do now will be functionally useless. Look just at the services that you are certain you will use immediately and in the very near (no more than one year) future. Validate a subset of services for use, and add new validations as needed later on.
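To illustrate the “validate a subset of services” point: a CCOE might keep a simple register of which services have been validated for what kind of use, and check proposed architectures against it. The following is a hypothetical sketch in Python; the service names, statuses, and structure are invented for illustration, not a description of any real catalog or tool.

```python
# Hypothetical approved-services register maintained by the CCOE.
# Service names and statuses are illustrative only.
SERVICE_REGISTER = {
    "compute/vm": "approved",
    "storage/object": "approved",
    "database/managed-sql": "approved",
    "analytics/warehouse": "pilot-only",  # validated for pilots, not yet for production
}

def review_architecture(requested_services: list[str]) -> dict[str, str]:
    """Map each requested service to its validation status.

    Anything not in the register gets flagged for a future validation
    effort rather than being evaluated speculatively up front.
    """
    return {
        service: SERVICE_REGISTER.get(service, "not-yet-validated")
        for service in requested_services
    }

if __name__ == "__main__":
    print(review_architecture(["compute/vm", "ml/speech-to-text"]))
    # {'compute/vm': 'approved', 'ml/speech-to-text': 'not-yet-validated'}
```

The register grows as real demand appears, which is the point: validation effort follows actual usage rather than attempting to cover the provider’s entire catalog in advance.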
Focus on thoughtful workload placement. Decide who your approved and preferred providers are, and build a workload placement policy. Look for “good technical fit” and not necessarily ideal technical fit; integration affinities and similar factors are more important. The time to do a detailed comparison of an individual service’s technical capabilities is when deciding workload placement, not during the RFP phase.
Accept the limits of cloud portability. Cloud providers don’t and will probably never offer commoditized services. Even when infrastructure resources seem superficially similar, there are still meaningful differences, and the management capabilities wrapped around those resources are highly differentiated. You’re buying into ecosystems that have the long-term stickiness of middleware and management software. Don’t waste time on single-pane-of-glass no-lock-in fantasies, no matter how glossily pretty the vendor marketing material is. And no, containers aren’t magic in this regard.
Links are to Gartner research and are paywalled.
The multicloud gelatinous cube
Pondering the care and feeding of your multicloud gelatinous cube. (Which engulfs everything in its path, and digests everything organic.)
Most organizations end up multicloud, rather than intending to be multicloud in a deliberate and structured way. So typical tales go like this: The org started doing digital business-related new applications on AWS and now AWS has become the center of gravity for all new cloud-native apps and cloud-related skills. Then the org decided to migrate “boring” LOB Windows-based COTS to the cloud for cost-savings, and lifted-and-shifted them onto Azure (thereby not actually saving money, but that’s a post for another day). Now the org has a data science team that thinks that GCP is unbearably sexy. And there’s a floating island out there of Oracle business applications where OCI is being contemplated. And don’t forget about the division in China, that hosts on Alibaba Cloud…
Multicloud is inevitable in almost all organizations. Cloud IaaS+PaaS spans such a wide swathe of IT functions that it’s impractical and unrealistic to assume that the organization will be single-vendor over the long term. Just like the enterprise tends to have at least three of everything (if not ten of everything), the enterprise is similarly not going to resist the temptation of being multicloud, even if it’s complex and challenging to manage, and significantly increases management costs. It is a rare organization that both has diverse business needs, and can exercise the discipline to use a single provider.
Despite recognizing the giant ooze that we see squelching our way, along with our unavoidable doom, there are things we can do to prepare, govern, and ensure that we retain some of our sanity.
For starters, we can actively choose our multicloud strategy and stance. We can classify providers into tiers, decide what providers are approved for use and under what circumstances, and decide what providers are preferred and/or strategic.
We can then determine the level of support that the organization is going to have for each tier — decide, for instance, that we’ll provide full governance and operations for our primary strategic provider, a lighter-weight approach that leans on an MSP to support our secondary strategic provider, and less support (or no support beyond basic risk management) for other providers.
After that, we can build an explicit workload placement policy that has an algorithm that guides application owners/architects in deciding where particular applications live, based on integration affinities, good technical fit, etc.
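To make that concrete, here is a minimal, hypothetical sketch (in Python) of what the decision logic behind such a policy might look like. The provider names, placement factors, weights, and scoring scale are all invented for illustration; a real policy would reflect your own approved providers, support tiers, and priorities, and might well take the form of a questionnaire or decision tree rather than code.

```python
from dataclasses import dataclass

# Hypothetical provider tiers from the multicloud strategy: "strategic"
# providers get full governance and ops support, "approved" providers get
# lighter-weight support; anything else is out of policy.
PROVIDER_TIERS = {
    "ProviderA": "strategic",
    "ProviderB": "strategic",
    "ProviderC": "approved",
}

# Illustrative weights: integration affinity and data gravity outweigh raw
# technical fit. Cost is deliberately absent as a long-term placement factor
# (see the note on cost-based placement below).
WEIGHTS = {
    "integration_affinity": 0.4,
    "data_gravity": 0.3,
    "technical_fit": 0.2,
    "team_skills": 0.1,
}

@dataclass
class PlacementScores:
    """Scores (0-10) that an application architect assigns per provider."""
    integration_affinity: float
    data_gravity: float
    technical_fit: float
    team_skills: float

def score_provider(scores: PlacementScores) -> float:
    """Weighted 'good enough fit' score for one candidate provider."""
    return sum(weight * getattr(scores, factor) for factor, weight in WEIGHTS.items())

def place_workload(candidates: dict[str, PlacementScores]) -> str:
    """Pick the highest-scoring candidate that is on the approved list."""
    allowed = {p: s for p, s in candidates.items() if p in PROVIDER_TIERS}
    if not allowed:
        raise ValueError("No candidate provider is on the approved list")
    return max(allowed, key=lambda p: score_provider(allowed[p]))

if __name__ == "__main__":
    # Example: an app with strong integration affinity to ProviderB's ecosystem.
    decision = place_workload({
        "ProviderA": PlacementScores(4, 5, 8, 6),
        "ProviderB": PlacementScores(9, 8, 7, 7),
    })
    print(f"Place workload on: {decision}")
```

The point is not the arithmetic; it is that the placement factors and their relative priorities get written down and applied consistently, instead of being re-litigated for every application.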
Note that cost-based provider selection and cost-based long-term workload placement are both terrible ideas. This is a constant fight between cloud architects and procurement managers. It is rooted in the erroneous idea that IaaS is a commodity, and that provider pricing advantages are long-term rather than short-lived. Using cost-based placement often leads to higher long-term TCO, not to mention a grand mess with data gravity and thus data management, and fragile application integrations.
See my new research note, “Comparing Cloud Workload Placement Strategies” (Gartner paywall) for a guide to multicloud IaaS / IaaS+PaaS strategies (including when you should pursue a single-cloud approach). In a few weeks, you’ll see the follow-up doc “Designing a Cloud Workload Placement Policy” publish, which provides a guide to writing such policies, with an analysis of different placement factors and their priorities.
The messy dilemma of cloud operations
Responsibility for cloud operations is often a political football in enterprises. Sometimes nobody wants it; it’s a toxic hot potato that’s apparently coated in developer cooties. Sometimes everybody wants it, and some executives think that control over it is going to ensure their next promotion / a handsome bonus / attractiveness for their next job. Frequently, developers and the infrastructure & operations (I&O) orgs clash over it. Sometimes, CIOs decide to just stuff it into a Cloud Center of Excellence team which started out doing architecture and governance, and then finds itself saddled with everything else, too.
Lots of arguments are made for it to live in particular places and to be executed in various ways. There’s inevitably a clash between the “boring” stuff that is basically lifted-and-shifted and rarely changes, and the fast-moving agile stuff. And different approaches to IaaS, PaaS, and SaaS. And and and…
Well, the fact of the matter is that multiple people are probably right. You don’t actually want to take a one-size-fits-all approach. You want to fit operational approaches to your business needs. And you maybe even want to have specialized teams for each major hyperscale provider, even if you adopt some common approaches across a multicloud environment. (Azure vs. non-Azure, i.e. Azure vs. AWS, is a common split, often correlated closely to Windows-based application environments vs Linux-based application environments.)
Ideally, you’re going to be highly automated, agile, cloud-native, and collaborative between developers and operators (i.e. DevOps). But maybe not for everything (i.e. not all apps are under active development).
Plus, once you’ve chosen your basic operations approach (or approaches), you have to figure out how you’re going to handle cloud configuration, release engineering, and security responsibilities. (And all the upskilling necessary to do that well!)
That’s where people tend to really get hung up. How much responsibility can I realistically push to my development teams? How much responsibility do they want? How do I phase in new operational approaches over time? How do I hook this into existing CI/CD, agile, and DevOps initiatives?
There’s no one right answer. However, there’s one answer that is almost always wrong, and that’s splitting cloud operations across the I&O functional silos — i.e., the server team deals with your EC2 VMs, your NetApp storage admin deals with your Azure Blobs, your F5 specialist configures your Google Load Balancers, your firewall team fights with your network team over who controls the VPC config (often settled, badly, by buying firewall virtual appliances), etc.
When that approach is taken, the admins almost always treat the cloud portals like they’re the latest pointy-clicky interface for a piece of hardware. This pretty much guarantees incompetence, lack of coordination, and gross inefficiency. It’s usually terrible regardless of what scale you’re at. Unfortunately, it’s also the first thing that most people try (closely followed by massively overburdening some poor cloud architect with Absolutely Everything Cloud-Related).
What works for most orgs: Some form of cloud platform operations, where cloud management is treated like a “product”. It’s almost an internal cloud MSP approach, where the cloud platform ops team delivers a CMP suite, cloud-enabled CI/CD pipeline integrations, templates and automation, other cloud engineering, and where necessary, consultative assistance to coders and to application management teams. That team is usually on call for incident response, but the first line for incidents is usually the NOC or the like, and the org’s usual incident management team.
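As a rough illustration of the “templates and automation” aspect of that product mindset, here is a hypothetical sketch of a guardrail check that a cloud platform ops team might publish for application teams’ CI/CD pipelines to run before deployment. The tag names, approved sizes, manifest format, and function names are all invented for this sketch; they are not a description of any particular team’s tooling or any provider’s API.

```python
# Hypothetical pipeline guardrail published by the cloud platform ops team.
# Application teams' CI/CD pipelines run it against their (also hypothetical)
# parsed infrastructure template before deploying.

REQUIRED_TAGS = {"owner", "cost-center", "data-classification"}
APPROVED_INSTANCE_SIZES = {"small", "medium", "large"}  # illustrative names

def validate_deployment(manifest: dict) -> list[str]:
    """Return a list of policy violations for a proposed deployment."""
    violations = []

    missing_tags = REQUIRED_TAGS - set(manifest.get("tags", {}))
    if missing_tags:
        violations.append(f"Missing required tags: {sorted(missing_tags)}")

    size = manifest.get("instance_size")
    if size not in APPROVED_INSTANCE_SIZES:
        violations.append(f"Instance size {size!r} is not on the approved list")

    if not manifest.get("backup_enabled", False):
        violations.append("Backups must be enabled for production workloads")

    return violations

if __name__ == "__main__":
    proposed = {
        "tags": {"owner": "team-payments", "cost-center": "1234"},
        "instance_size": "xxl",
    }
    for problem in validate_deployment(proposed):
        print("POLICY VIOLATION:", problem)
```

The specific checks matter less than the delivery model: the platform team ships reusable, versioned automation that application teams consume, rather than admins hand-configuring resources through provider consoles.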
But there are lots of options. Gartner clients: Want a methodical dissection of pros and cons; cloud engineering, operating, and administration tasks; job roles; coder responsibilities; security integration; and other issues? Read my new note, “Comparing Cloud Operations Approaches“, which looks at eleven core patterns along with guidance for choosing between them, and making a range of accompanying decisions.
Hunting the Dread Gazebo of Repatriation
(Confused by the title of this post? Read this brief anecdote.)
The myth of cloud repatriation refuses to die, and a good chunk of the problem is that users (and poll respondents) use “repatriation” in a wild array of ways, while non-cloud vendors want you to believe that “repatriation” means enterprises packing up all their stuff in the cloud and moving it back into their internal data centers — which occurs so infrequently that it’s like a sasquatch sighting.
A non-comprehensive list of the ways that clients use the term “repatriation” that have little to nothing to do with what non-cloud (or “hybrid”) vendors would like you to believe:
Outsourcing takeback. The origin of the term comes from orgs that are coming back from traditional IT outsourcing. However, we also hear cloud architects say they are “repatriating” when they gradually take back management of cloud workloads from a cloud MSP; the workloads stay in the cloud, though.
Migration pause. Some migrations to IaaS/IaaS+PaaS do not go well. This is often the result of choosing a low-quality MSP for migration assistance, or rethinking the wisdom of a lift-and-shift. Orgs will pause, switch MSPs and/or switch migration approaches (usually to lift-and-optimize), and then resume. Some workloads might be temporarily returned on-premises while this occurs.
SaaS portfolio rationalization. Sprawling adoption of SaaS, at the individual, team, department or business-unit level, can result in one or more SaaS applications being replaced with other, official, corporate SaaS (for instance, replacing individual use of Dropbox with an org-wide Google Drive implementation as part of G-Suite). Sometimes, the org might choose to build on-premises functionality instead (for instance, replacing ad-hoc SaaS analytics with an on-prem data warehouse and enterprise BI solution). This is overwhelmingly the most common form of “cloud repatriation”.
Development in the cloud, production on premises. While the dev/prod split of environments is much less common than it used to be, some organizations still develop in cloud IaaS and then run the app in an on-prem data center in production. Orgs like this will sometimes say they “repatriate” the apps for production.
The Oops. Sometimes organizations attempt to put an application in the cloud and it Just Doesn’t Go Well. Sometimes the workload isn’t a good match for cloud services in general. Sometimes the workload is just a bad match for the particular provider chosen. Sometimes they make a bad integrator choice, or their internal cloud skills are inadequate to the task. Whatever it is, people might hit the “abort” button and either rethink and retry in the cloud, or give up and put it on premises (either until they can put together a better plan, or for the long term).
Of course, there are the sasquatch sightings, too, like the Dropbox migration from AWS (also see the five-year followup), but those stories rarely represent enterprise-comparable use cases. If you’re one of the largest purchasers of storage on the planet, and you want custom hardware, absolutely, DIY makes sense. (And Dropbox continues to do some things on AWS.)
Customers also engage in broader strategic application portfolio rationalizations that sometimes result in groups of applications being shifted around, based on changing needs. While the broader movement is towards the cloud, applications do sometimes come back on-premises, often to align to data gravity considerations for application and data integration.
None of these things are in any way equivalent to the notion that there’s a broad or even common movement of workloads from the cloud back on-premises, though, especially for those customers who have migrated entire data centers or the vast majority of their IT estate to the cloud.
(Updated with research: In my note for Gartner clients, “Moving Beyond the Myth of Repatriation: How to Handle Cloud Projects Failures”, I provide detailed guidance on why cloud projects fail, how to reduce the risks of such projects, and how — or if — to rescue troubled cloud projects.)