Multicloud failover is almost always a terrible idea

Most people — and notably, almost all regulators — get cloud resilience wrong by insisting on multicloud failover, because, as I noted in a previous blog post, the cloud is NOT just someone else’s computer. (I have been particularly aghast at a recent Reuters article about the Bank of England’s stance.)

Regulators, risk managers, and plenty of IT management largely think of AWS, Azure, etc. as monolithic entities, where “the cloud” can just break for them, and then kaboom, everything is dead everywhere worldwide. They imagine one gargantuan amorphous data center, subject to all the problems that can afflict single data centers, or single systems. But that’s not how it works, that’s not the most effective way to address risk, and testing the “resilience of the provider” (as a generic whole) is both impossible and meaningless.

I mean, yes, there’s the possibility of the catastrophic failure of practically any software technology. There could be, for instance, a bug in the control systems of airplanes from fill-in-the-blank manufacturer that could be triggered at a particular time and cause all their airplanes to drop out of the sky simultaneously. But we don’t plan to make commercial airlines maintain backup planes from some other manufacturer in case it happens. Instead, we try to ensure that each plane is resilient in many ways — which importantly addresses the most probable forms of failure, which will be electrical or mechanical failures of particular components.

Hyperscale cloud providers are full of moving parts — lots of components, assembled together into something that looks and feels like a cohesive whole. Each of those components has its own form of resilience, and some of those components are more fragile than others. Some of those components are typically operating well within engineered tolerances. Some of those components might be operating at the edge of those tolerances in certain circumstances — likely due to unexpected pressures from scale — and might be extra-scary if the provider isn’t aware that they’re operating at that edge. In addition to fault-tolerance within each component, there are many mechanisms for fault-tolerance built into the interaction between those components.
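
To make the “fault-tolerance built into the interaction between components” point a bit more concrete, here is a minimal sketch of one such mechanism, a circuit breaker that stops hammering a failing dependency and serves a degraded response instead. This is purely illustrative, not any provider’s actual implementation, and all names are hypothetical.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a
    cool-down period instead of letting its failures cascade upstream."""

    def __init__(self, max_failures=5, reset_after_seconds=30.0):
        self.max_failures = max_failures
        self.reset_after_seconds = reset_after_seconds
        self.failure_count = 0
        self.opened_at = None  # time the breaker tripped, or None if closed

    def call(self, dependency, fallback):
        # While the breaker is open, skip the dependency and return the
        # degraded (fallback) result until the cool-down period has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                return fallback()
            self.opened_at = None      # half-open: give the dependency another try
            self.failure_count = 0
        try:
            result = dependency()
            self.failure_count = 0     # success resets the failure counter
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()


# Hypothetical usage: a metadata lookup that falls back to a cached value.
breaker = CircuitBreaker()
# result = breaker.call(lambda: lookup_metadata("vm-123"),
#                       fallback=lambda: CACHED_METADATA.get("vm-123"))
```

The real mechanisms inside a hyperscale provider are far more sophisticated (cell-based isolation, backpressure, load shedding, and so on), but the principle is the same: contain a component failure rather than letting it propagate.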

Every provider also has its own equivalent of “maintenance” (returning to the plane analogy). The quality of the “mechanics” and the operations will also impact how well the system as a whole operates.  (See my previous blog post, “The multi-headed hydra of cloud resilience” for the factors that go into provider resilience.)

It’s not impossible for a provider to have a worldwide outage that effectively impacts all services (rather than just a single service). Such outages are typically rooted in something that prevents components from communicating with each other, or customers from connecting to the services — global network issues, DNS, security certificates, or identity. The first major incident of this type was the 2012 Azure leap year outage. The 2019 Google “Chubby” outage had global network impact, including on GCP. There have been multiple Azure AD outages with broad impact across Microsoft’s cloud portfolio, most recently the 2021 Azure Active Directory outage. (But there are certainly other possibilities. As recently as yesterday, there was a global Azure Windows VM outage that impacted all Windows VM-dependent services.)
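
As a rough illustration of why these cross-cutting dependencies matter, the sketch below probes the kind of shared services (name resolution, TLS certificate validity) whose failure tends to have global blast radius. The hostnames are placeholders I made up, not real endpoints, and the checks are deliberately simplistic.

```python
import socket
import ssl
import time

# Placeholder endpoints; substitute whatever your workloads actually depend on.
DNS_NAME = "api.example-cloud.test"
IDENTITY_ENDPOINT = ("login.example-cloud.test", 443)


def check_dns(name):
    """A name-resolution failure breaks almost everything at once."""
    try:
        socket.getaddrinfo(name, 443)
        return True
    except socket.gaierror:
        return False


def check_tls_expiry(host, port, warn_days=14):
    """An expired certificate on a shared endpoint is another global failure mode."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    days_left = (expires - time.time()) / 86400
    return days_left > warn_days


if __name__ == "__main__":
    print("DNS ok:", check_dns(DNS_NAME))
    # Only meaningful against an endpoint that actually exists:
    # print("Cert ok:", check_tls_expiry(*IDENTITY_ENDPOINT))
```

The point is not that you should write this exact script, but that the services worth worrying about are the shared ones that everything else depends on.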

Provider architectural and operational choices clearly make a difference. AWS, notably, has never had a full regional failure or a global outage. The unique nature of GCP’s global network has both benefits and drawbacks. Azure has been improving steadily in reliability over the years as Microsoft addresses both service architecture and deployment (and other operations) processes.

Note that while these outages can be multi-hour, they have generally been short enough that — given typical enterprise recovery-time objectives for disaster recovery, which are often lengthy — customers typically don’t activate a traditional DR plan. (Customers may take other mitigation actions, e.g. failing over to another region, failing over to an alternative application for a business process, and so forth.)
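
A rough sketch of why such outages rarely trigger a formal DR declaration: you only come out ahead if riding out the outage would blow your RTO and failing over would actually restore service sooner than simply waiting for the provider to recover. The function and numbers below are purely illustrative.

```python
def should_declare_dr(expected_remaining_outage_hours, failover_hours, rto_hours):
    """Illustrative-only decision helper: declare DR only when waiting out the
    outage would exceed the RTO *and* failing over is actually faster than
    waiting. (A real decision also weighs failback/reconciliation cost and
    potential data loss.)"""
    blows_rto = expected_remaining_outage_hours > rto_hours
    failover_is_faster = failover_hours < expected_remaining_outage_hours
    return blows_rto and failover_is_faster


# A three-hour provider incident against a typical 24-hour enterprise RTO:
print(should_declare_dr(expected_remaining_outage_hours=3,
                        failover_hours=4,
                        rto_hours=24))   # -> False: ride it out
```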

Multicloud failover requires that you maintain full portability between two providers, which is a massive burden on your application developers. The basic compute runtime (whether VMs or containers) is not the problem, so OpenShift, Anthos, or other “I can move my containers” solutions won’t really help you. The problem is all the differentiators — the different network architectures and features, the different storage capabilities, the proprietary PaaS capabilities, the wildly different security capabilities, etc. Sure, you can run all open source in VMs, but at that point, why are you bothering with the cloud at all? Plus, even in a DR situation, you need some operational capabilities on the other cloud (monitoring, logging, etc.), even if not your full toolset.
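
To make the portability burden concrete, here is a minimal sketch of what even the simplest lowest-common-denominator abstraction looks like for just one differentiator, object storage. The bucket and container names are placeholders, and the SDK calls (boto3 and azure-storage-blob here) are written from memory and may need adjusting for current SDK versions; every differentiated capability you actually use needs a layer like this, kept working on both clouds forever.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Lowest-common-denominator interface: anything provider-specific
    (storage classes, lifecycle rules, event notifications, fine-grained IAM)
    is either dropped or has to be re-abstracted here as well."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class S3Store(ObjectStore):
    def __init__(self, bucket: str):
        import boto3                     # AWS SDK
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def put(self, key, data):
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key):
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()


class AzureBlobStore(ObjectStore):
    def __init__(self, connection_string: str, container: str):
        from azure.storage.blob import BlobServiceClient   # Azure SDK
        self._svc = BlobServiceClient.from_connection_string(connection_string)
        self._container = container

    def put(self, key, data):
        blob = self._svc.get_blob_client(container=self._container, blob=key)
        blob.upload_blob(data, overwrite=True)

    def get(self, key):
        blob = self._svc.get_blob_client(container=self._container, blob=key)
        return blob.download_blob().readall()


# Application code targets ObjectStore and (in theory) doesn't care which cloud
# is underneath -- but someone now owns and tests this layer on both clouds.
# store: ObjectStore = S3Store("my-placeholder-bucket")
# store.put("invoices/2021-10.json", b"{}")
```

And object storage is the easy case; there is no comparably thin shim for things like identity models, network constructs, or managed database services.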

Moreover, the huge cost and complexity of a multicloud implementation are effectively a distraction from what you should actually be doing to improve your uptime and reduce your risk: making your applications resilient to the types of failure that are actually probable. More on that in a future blog post.
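
As a small preview of that theme, here is a sketch of the humbler kind of work that actually pays off: retrying idempotent calls on transient errors with capped exponential backoff and jitter, which addresses the throttling responses and brief blips that are far more common than a provider-wide collapse. The exception class and function names are placeholders.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for throttling responses, brief network blips, etc."""


def call_with_retries(operation, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry an idempotent operation on transient failure, using exponential
    backoff with full jitter so retries from many clients don't synchronize."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))   # full jitter


# Hypothetical usage against some flaky dependency:
# result = call_with_retries(lambda: read_customer_record("cust-42"))
```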

Posted on October 14, 2021, in Infrastructure.

  1. Facebook and its many services went down for hours. Can’t imagine the implications if that happened to mission-critical payment systems. The ‘clouds are not monoliths’ argument misses the point. Distributed systems have catastrophic failures too, e.g. if common/shared services go down, or if a bug or config error is introduced into them.

  2. In response to the above comment:

    “Facebook and its many services went down for hours. Can’t imagine the implications if that happened to mission-critical payment systems. The ‘clouds are not monoliths’ argument misses the point. Distributed systems have catastrophic failures too, e.g. if common/shared services go down, or if a bug or config error is introduced into them.”

    It’s true that distributed systems have failures too, which is exactly why the point in this article about spending time to “make your applications resilient to the types of failure that are actually probable” makes sense.

    Facebook does not even use any of the public cloud providers; it uses its own datacenters. Because of this failure, no one is hounding Facebook to move to public cloud or multiple public clouds as a solution to all of its problems, but rather to put better policies and systems in place to make sure it does not happen again.

    I think this is also true for any system: a bank is not immune to self-inflicted application or design problems just because it uses its own datacenters, and the same is true for banks using public cloud providers. If anything, such a self-inflicted problem has a bigger blast radius when you are on multiple cloud providers. The mistakes that you could previously only make in one public cloud provider, instead of being fixed, are now being made in two.

  3. I am not sure there is any BoE policy saying that, if you are using private datacenters, you should not buy all your machines from the same vendor (e.g. Dell), you should not use the same load balancers (e.g. F5) across your datacenters, you should not use the same anti-virus everywhere, and so on, just because all the Dell machines could malfunction at the same time.

    So taking that concept and trying to make that argument for cloud providers seems like a stretch.

