Multicloud failover is almost always a terrible idea
Most people — and notably, almost all regulators — are entirely wrong about addressing cloud resilience through the belief that they should do multicloud failover because, as I noted in a previous blog post, the cloud is NOT just someone else’s computer. (I have been particularly aghast at a recent Reuters article about the Bank of England’s stance.)
Regulators, risk managers, and plenty of IT management largely think of AWS, Azure, etc. as monolithic entities, where “the cloud” can just break for them, and then kaboom, everything is dead everywhere worldwide. They imagine one gargantuan amorphous data center, subject to all the problems that can afflict single data centers, or single systems. But that’s not how it works, that’s not the most effective way to address risk, and testing the “resilience of the provider” (as a generic whole) is both impossible and meaningless.
I mean, yes, there’s the possibility of the catastrophic failure of practically any software technology. There could be, for instance, a bug in the control systems of airplanes from fill-in-the-blank manufacturer that could be simultaneously triggered at a particular time and cause all their airplanes to drop out of the sky simultaneously. But we don’t plan to make commercial airlines maintain backup planes from some other manufacturer in case it happens. Instead, we try to ensure that each plane is resilient in many ways — which importantly addresses the most probable forms of failure, which will be electrical or mechanical failures of particular components.
Hyperscale cloud providers are full of moving parts — lots of components, assembled together into something that looks and feels like a cohesive whole. Each of those components has its own form of resilience, and some of those components are more fragile than others. Some of those components are typically operating well within engineered tolerances. Some of those components might be operating at the edge of those tolerances in certain circumstances — likely due to unexpected pressures from scale — and might be extra-scary if the provider isn’t aware that they’re operating at that edge. In addition to fault-tolerance within each component, there are many mechanisms for fault-tolerance built into the interaction between those components.
Every provider also has its own equivalent of “maintenance” (returning to the plane analogy). The quality of the “mechanics” and the operations will also impact how well the system as a whole operates. (See my previous blog post, “The multi-headed hydra of cloud resilience” for the factors that go into provider resilience.)
It’s not impossible for a provider to have a worldwide outage that effectively impacts all services (rather than just a single service). Such outages are all typically rooted in something that prevents components from communicating with each other, or customers from connecting to the services — global network issues, DNS, security certificates, or identity. The first major incident of this type was the 2012 Azure leap year outage. The 2019 Google “Chubby” outage had global network impact, including on GCP. There have been multiple Azure AD outages with broad impact across Microsoft’s cloud portfolio, most recently the 2021 Azure Active Directory outage. (But there are certainly other possibilities. As recently as yesterday, there was a global Azure Windows VM outage that impacted all Windows VM-dependent services.)
Provider architectural and operational differences do clearly make a difference. AWS, notably, has never had a full regional failure or a global outage. The unique nature of GCP’s global network has both benefits and drawbacks. Azure has been improving steadily in reliability over the years as Microsoft addresses both service architecture and deployment (and other operations) processes.
Note that while these outages can be multi-hour, they have generally been short enough that — given typical enterprise recovery-time objectives for disaster recovery, which are often lengthy — customers typically don’t activate a traditional DR plan. (Customers may take other mitigation actions, i.e. failover to another region, failover to an alternative application for a business process, and so forth.)
Multicloud failover requires that you maintain full portability between two providers, which is a massive burden on your application developers. The basic compute runtime (whether VMs or containers) is not the problem, so OpenShift, Anthos, or other “I can move my containers” solutions won’t really help you. The problem is all the differentiators — the different network architectures and features, the different storage capabilities, the proprietary PaaS capabilities, the wildly different security capabilities, etc. Sure, you can run all open source in VMs, but at that point, why are you bothering with the cloud at all? Plus, even in a DR situation, you need some operational capabilities on the other cloud (monitoring, logging, etc.), even if not your full toolset.
Moreover, the huge cost and complexity of a multicloud implementation is effectively a negative distraction from what you should actually be doing that would improve your uptime and reduce your risks, which is making your applications resilient to the types of failure that are actually probable. More on that in a future blog post.
Terms of Service: From anti-spam to content takedown
Many consumers are familiar with the terms of service (ToS) that govern their use of consumer platforms that contain user-generated content (UGC), such as Facebook, Twitter, and YouTube. But many people are less familiar with the terms of service and acceptable use policy (AUP) that governs the relationships between businesses and their service providers.
In light of the recent decision undertaken by Amazon Web Services (AWS) to suspend service to Parler, a Twitter-like social network that was used to plan the January 6th insurrection at the US Capitol, there have been numerous falsehoods circulating on social media that seem related to a lack of understanding of service provider behavior, ToS and AUPs: claims of “free speech” violations, a coordinated conspiracy between big tech companies, and the like. This post is intended to examine, without judgment, how service providers — cloud, hosting and colo, CDN and other Internet infrastructure providers, and the ISPs that connect everyone to the Internet — came to a place of industry “standards” for behavior, and how enforcement is handled in B2B situations.
The TL;DR summary: The global service provider community, as a result of anti-spam efforts dating back to the mid-90s, enforces extremely similar policies governing content, including user-generated content, through a combination of B2B peer pressure and contractual obligations. Business customers who contravene these norms have very few options.
These norms will greatly limit Parler’s options for a new home. Many sites with far-right and similarly controversial content have ultimately ended up using a provider in a supply chain that relies on Russian connectivity, thus dodging the Internet norms that prevail in the rest of the world.
Internet Architecture and Service Provider Dependencies
While the Internet is a collection of loosely federated networks that are in theory independent from one another, it is also an interdependent web of interconnections between those networks. There are two ways that ISPs connect with one another — through “settlement-free peering” (essentially an exchange of traffic between two ISPs that view themselves as equals) and through the purchase of “transit” (making a “downstream ISP” the customer of an “upstream ISP”).
This results in a three-tier model for ISPs. The Tier 1 ISPs are big global carriers of network connectivity — companies like AT&T, BT and NTT — who have settlement-free peers with each other, and sell transit to smaller ISPs. Tier 2 ISPs are usually regional, and have settlement-free peers with others in and around their region, but also are reliant on transit from the Tier 1s. Tier 3 ISPs are entirely dependent on purchasing transit. ISPs at all three tiers also sell connectivity directly to businesses and/or consumers.
In practice, this means that ISPs are generally contractually bound to other ISPs. All transit contracts are governed by terms of service that normally incorporate, by reference, an AUP. Even settlement-free peering agreements are legal contracts, which normally includes the mutual agreement to maintain and enforce some form of AUP. (In the earlier days of the Internet, peering was done on a handshake, but anything of that sort is basically a legacy that can come to an abrupt end should one party suddenly decide to behave badly.)
AUP documents are interesting because they are deliberately created as living documents, allowing AUPs to be adapted to changing circumstances — unlike standard contract terms, which apply for the length of what is usually a multiyear contract. AUPs are also normally ironclad; it’s usually difficult to impossible for a business to get any form of AUP exemption written into their contract. Most contracts provide minimal or no notice for AUP changes. Businesses tend to simply agree to them because most businesses do not plan to engage in the kind of behavior that violates an AUP — and because they don’t have much choice.
The existence of ISP tiering means that upstream providers have significant influence over the behavior of their downstream. Upstream ISPs normally mandate that their downstream ISPs — and other service providers that use their connectivity, like hosting providers — enforce an AUP that enables the downstream provider to be compliant with the upstream’s terms of service. Downstream providers that fail to do so can have their connectivity temporary suspended or their contract terminated. And between the Tier 1 providers, peer pressure ensures a common global understanding and enforcement of acceptable behavior on the Internet.
Note that this has all occurred in the absence of regulation. ISPs have come to these arrangements through decisions about what’s good for their individual businesses first and foremost, with the general agreement that these community standards for AUPs are good for the community of network operators as a whole.
We’re Here Because Nobody Likes Spammers
So how did we arrive at this state in the first place?
In the mid-90s, as the Internet was growing rapidly, in the near-total absence of regulation, spam was a growing problem. Spam came from both legitimate businesses who simply weren’t aware of or didn’t especially care about Internet etiquette, as well as commercial spammers (bad actors with deceptive or fraudulent ads, and/or illegal/grey-market products).
Many B2B ISPs did not feel that it was necessarily their responsibility to intervene, despite general distaste for spammers — and, sometimes, a flood of consumer complaints. Some percentage of spammers were otherwise “good customers” — i.e. they paid their bills on time and bought a lot of bandwidth. Many more, however, obtained services under fraudulent pretenses, didn’t pay their bills, or tended not to pay on time.
Gradually, the community of network operators came to a common understanding that spammers were generally bad for business, whether they were your own customers, or whether they were the customers of, say, a web hosting company that you provided Internet connectivity for.
This resulted in upstream ISPs exerting pressure on downstream ISPs. Downstream ISPs, in turn, exerted pressure on their customers — kicking spammers off their networks and pushing hosters to kick spammers out of hosting environments. ISPs formalized AUPs. AUP enforcement took longer. Many ISPs were initially pretty shoddy and inconsistent in their enforcement — either because they needed the revenue they were getting from spammers, or due to unwillingness or inability to fund a staff to deal with abuse, or corporate lawyers who urged caution. It took years, but ISPs eventually arrived at AUPs that were contractually enforceable, processes for handling complaints, and relatively consistent enforcement. Legislation like the CAN-SPAM act in the US didn’t hurt, but by the time CAN-SPAM was passed (in 2003), ISPs had already arrived at a fairly successful commercial resolution to the problem.
Because anti-spam efforts were largely fueled by agreements enshrined in B2B contracts, and not in government regulation, there was never full consistency across the industry. Different ISPs created different AUPs — some stricter and some looser. Different ISPs wrote different terms of service into their contracts, with different “cure” periods (a period of time that a party in the contract is given to come into compliance with a contractual requirement). Different ISPs had different attitudes towards balancing “customer service” versus their responsibilities to their upstream providers and to the broader community of network operators.
Consequently, there’s nothing that says “We need to receive X number of spam complaints before we take action,” for instance. Some providers may have internal process standards for this. A lot of enforcement simply takes place via automated algorithms; i.e. if a certain threshold of users reports something as spam, enforcement actions take place. Providers effectively establish, through peer norms, what constitutes “effective” enforcement in accordance with terms of service obligations. Providers don’t need to threaten each other with network disconnection, because a norm has been established. But the implicit threat — and the contractual teeth behind that threat — always remains.
Nobody really likes terminating customers. So there are often fairly long cure periods, recognizing that it can take a while for a customer to properly comply with an AUP. In the suspension letter that AWS sent Parler, AWS cites communications “over the past several weeks”. Usually the providers look for their customers to demonstrate good-faith efforts, but may take suspension or termination efforts if it does not look like a good-faith effort to comply is being made, or if it appears that the effort, no matter how seemingly earnest, does not seem likely to bring compliance within a reasonable time period. 30 days is a common timeframe specified as a cure period in contracts (and is the cure period in the AWS standard Enterprise Agreement), but cloud provider click-through agreements (such as the AWS Customer Agreement) do not normally have a cure period, allowing immediate action to be taken at the provider’s discretion.
What Does This Have to Do With Policing Users on Social Media?
When providers established anti-spam AUPs, they also added a whole laundry list of offenses beyond spamming. Think of that list, “Everything a good corporate lawyer thought an ISP might ever want to terminate a customer for doing.” Illegal behavior, harassment, behavior that disrupts provider operations, behavior that threatens the safety/security/operations of other businesses, etc. are all prohibited.
Hosting companies — eventually followed by cloud providers like AWS, Microsoft Azure, and Google Cloud Platform, as well as companies that hold key roles in the Internet ecosystem (domain registrars and the companies that operate the DNS; content delivery networks like Akamai and Cloudflare, etc.) — were essentially obliged to incorporate their upstream ISP usage policies into their own terms of service and AUPs, and to enforce those policies on their users if they wanted to stay connected to the Internet. Some such providers have also explicitly chosen not to sell to customers in certain business segments — for instance, no gambling, or no pornography, even if the business is fully legitimate and legal (for instance, like MGM Resorts or Playboy) — through limiting what’s allowed in their terms of service. An AUP may restrict activities that are perfectly legal in a given jurisdiction.
Even extremely large tech companies that have their own data centers, like Facebook and Apple, are ultimately beholden to ISPs. (Google is something of an odd case because in addition to owning their own data centers, they are one of the largest global network operators. Google owns extensive fiber routes and peers with Tier 1 ISPs as an equal.) And even though AWS has, to some degree, a network of its own, it is effectively a Tier 2 ISP, making it beholden to the AUPs of its upstream. Other cloud providers are typically mostly or fully transit-dependent, and are thus entirely beholden to their upstream.
In short: Everyone who hosts content companies, and the content companies themselves, is essentially trapped, by the chain of AUP obligations, to policing content to ensure that it is not illegal, harassing, or otherwise seen as commercially problematic.
You have to go outside the normal Internet supply chain — for instance, to the Russian service providers — before you escape the commercial arrangements that bound notions of good business behavior on the Internet. It doesn’t matter what a provider’s philosophical alignment is. Commercially, they simply can’t really push back on the established order. And because these agreements are global, regulation at a single-country level can’t really force these agreements to be significantly more or less restrictive, because of the globalized nature of peering/transit; providers generally interconnect in multiple countries.
It also means that these aren’t just “Silicon Valley” standards. These are global norms for behavior, which means they are not influenced solely by the relatively laissez-faire content standards of the United States, but by the more stringent European and APAC environments.
It’s an interesting result of what happens when businesses police themselves. Even without formal industry-association “rules” or regulatory obligations, a fairly ironclad order can emerge that exerts extremely effective downstream pressure (as we saw in the cases of 8Chan and the Daily Stormer back in 2019).
Does being multicloud help with terms of service violations?
Some people will undoubtedly ask, “Would it have helped Parler to have been multicloud?” Parler has already said that they are merely bare-metal customers of AWS, reducing technical dependencies / improving portability. But their situation is such that they would almost certainly have had the exact same issue if they had been running on Microsoft Azure, Google Cloud Platform, or even Oracle Cloud Infrastructure as well (even though the three companies have top executives with political views spanning the spectrum). A multicloud strategy won’t help any business that violates AUP norms.
AWS and its cloud/hosting competitors are usually pretty generous when working with business customers that unintentionally violate their AUPs. But a business that chooses not to comply is unlikely to find itself welcome anywhere, which makes multicloud deployment largely useless as a defensive strategy.
Gmail, Macquarie, and US regulation
Google continues to successfully push Gmail into higher education, in an Australian deal with Macquarie University. (Microsoft is its primary competitor in this market, but for Microsoft, most such Live@edu represent cannibalization of their higher ed Exchange base.)
That, by itself, isn’t a particularly interesting announcement. Email SaaS is a huge trend, and the low-cost .edu offerings have been gaining particular momentum. What caught my eye was this:
The university was hesitant to move staff members on to Gmail due to regulatory and cost factors. They were concerned that their email messages would be subject to draconian US law. In particular, they were worried about protecting their intellectual property under the Patriot Act and Digital Millennium Copyright Act, Mr. Bailey said. “In the end, Google agreed to store that data under EU jurisdiction, which we accepted,” he said.
That tells us that Google can divide their data storage into zones if need be, as one would expect, but it also tells us that they can do so for particular customers (presumably, given Google’s approach to the world, as a configurable, automated thing, and not as a one-off).
However, the remark about the Patriot Act and DMCA is what really caught my attention. DMCA is a worry for universities (due to the high likelihood of pirated media), but USA PATRIOT is a significant worry for a lot of the non-US clients that I talk to about cloud computing, especially those in Europe — to the point where I speak with clients who won’t use US-based vendors, even if the infrastructure itself is in Europe. (Australian clients are more likely to end up with a vendor that has somewhat local infrastructure to begin with, due to the latency issues.)
Cross-border issues are a serious barrier to cloud adoption in Europe in general, often due to regulatory requirements to keep data within-country (or sometimes less stringently, within the EU). That will make it more difficult for European cloud computing vendors to gain significant operational scale. (Whether this will also be the case in Asia remains to be seen.)
But if you’re in the US, it’s worth thinking about how the Patriot Act is perceived outside the US, and how it and any similar measures will limit the desire to use US-based cloud vendors. A lot of US-based folks tell me that they don’t understand why anyone would worry about it, but the “you should just trust that the US government won’t abuse it” story plays considerably less well elsewhere in the world.
Domain names and Kentucky gambling
Last month, the state of Kentucky issued a seizure order for 141 domain names that it claimed were being used in connection with illegal gambling. (Full text of the order here.)
It’s a remarkable order. It asserts that probable cause exists to believe that the domain names are being used in connection with illegal gambling (despite the fact that some are parked domains, which would clearly indicate otherwise), and that as such, Kentucky is entitled to require the registrars to immediately transfer the registration for those domains to Kentucky or some other entity that it designates.
WebProNews has published statements from Governor Steve Beshear and his deputy communications director Jill Midkiff. The governor essentially claimed that illegal online gambling harms Kentucky’s legal gambling businesses, particularly the lottery and horse racing. But regardless of why it was done, it’s still a chilling potential precedent.
Yesterday, the judge in the case denied a dismissal, setting a forfeiture hearing for next month. He also stated that the sites would have 30 days to voluntarily block access by Kentucky users to avoid further legal problems. MarkMonitor (a provider of managed domain name and brand protection solutions) has posted the full text of the opinion, along with the key relevant questions raised by this case.
This case gets right to the heart of the question, “Who controls the Internet?” If Kentucky succeeds, it will fundamentally change our understanding of jurisdiction with regarding to domain names, with broad ramifications both within the United States and internationally.