Improving cloud resilience through stuff that works
As I noted in a previous blog post, multicloud failover is almost always a terrible idea. While the notion that an entire cloud provider can go dark for a lengthy period of time (let’s say a day or more) is not entirely impossible, it’s the least probable of the many ways that an application can experience failure. Humans tend to over-index on catastrophic but low-probability events, so it’s not especially shocking that people fixate on the possibility, but before you spend precious people-effort (not to mention money) on multicloud failover, you should first properly resource all the other things you could be doing to improve your resilience in the cloud.
As I noted previously, five core things impact cloud resilience: physical design, logical (software) design, implementation quality, deployment processes, and operational processes. So you should select your cloud provider carefully. Some providers have a better track record of reliability than others — often related directly in differences in the five core resilience factors. I’m not suggesting that this be a primary selection criterion, but the less reliable your provider, the more you’re going to have to pour effort into resilience, knowing that the provider’s failures are going to test you in the real world. You should care most about the failure of global dependencies (identity, security certificates, NTP, DNS, etc.) that can affect all services worldwide, followed by multi-region failures (especially those that affect an entire geography).
However, those things aren’t just important for cloud providers. They also affect you, the application owner, and the way you should design, implement, update, and operate your application — whether that application is on-premises or in the cloud. Before you resort to multicloud failover, you should have done all of the below and concluded that you’ve already maximized your resilience via these techniques and still need more.
Start with local HA. When architecting a mission-critical application, design it to use whatever HA capabilities are available to you within an availability zone (AZ). Use a clustered (and preferably scale-out) architecture for the stuff you build yourself. Ensure you maximize the resilience options available from the cloud services.
Build good error-handling into your application. Your application should besmart about the way it handles errors, either from other application components or from cloud services (or other third-party components). It should exhibit polite retry behavior and implement circuit breakers to try to limit cascading failures. It should implement load-shedding, in recognition of the fact that rejecting excessive requests so that the requests that can be served receive decent performance is better than just collapsing into non-responsiveness. It should have fallback mechanisms for graceful degradation, to limit impact on users.
Architect the application’s internals for resilience. Techniques such as partitions and bulkheads are likely going to be reserved for larger-scale applications, but are vital for limiting the blast radius of failures. (If you have no idea what any of this terminology means, read Michael Nygard’s “Release It!” — in my personal opinion, if you read one book about mission-critical app design, that should probably be the one.)
Use multiple AZs. Run your application active-active across at least two, and preferably three, AZs within each region that you use. (Note that three can be considerably harder than two because most cloud provider services natively support running in two AZs simultaneously but not three. But that’s a far easier problem than multicloud failover.)
Use multiple regions. Run your application active-active across at least two, and preferably three regions. (Again, two is definitely much easier than three, due to a cloud service’s cross-region support generally being two regions.) If you can’t do that, do fast fully-automated regional failover.
Implement chaos engineering. Not only do you need to thoroughly test in your dev/QA environment to determine what happens under expected failure conditions, but you also need to experiment with fault injection in your production environment where there are complex unpredictable conditions that may cause unexpected failures. If this sounds scary and you expect it’ll blow up in your face, then you need to do a better job in the design and implementation of your application. Forcing constant failures into production systems (ala Netflix’s famed Chaos Monkey) helps you identify all the weak spots, builds resilience, and should help give you confidence that things will continue to work when cloud issues arise.
It’s really important to treat resilience as a systems concern, not purely an infrastructure concern. Your application architecture and implementation need to be resilient. If your developers can’t be trusted to write continuously available applications, imposing multicloud portability requirements (and attendant complexity) upon them will probably add to your operational risks.
And I’m not kidding about the chaos engineering. If you’re not mature enough for chaos engineering, you’re not mature enough to successfully implement multicloud failover. If you don’t routinely shoot your own AZs and regions, kill access to services, kill application components, make your container hosts die, deliberately screw up your permissions and fail-closed, etc. and survive that all without worrying, you need to go address your probable risks of failure that have solutions of reasonable complexity, before you tackle the giant complex beast of multicloud failover to address the enormously unlikely event of total provider failure.
Remember that we’re trying to achieve continuity of our business processes and not continuity of particular applications. If you’ve done all of the above and you’re still worried about the miniscule probability of total provider failure, consider building simple alternative applications in another cloud provider (or on-premises, or in colo/hosting). Such applications might simply display cached data, or queue transactions for later processing. This is almost always easier than maintaining full cross-cloud portability for a complex application. Plus, don’t forget that there might be (gasp) paper alternatives for some processes.
(And yes, I already have a giant brick of a research note written on this topic, slated for publication at the end of this year. Stay tuned…)