Clients have recently been asking a lot more questions about the comparative resilience of cloud providers.
Identity services are a particular point of concern (for instance, the Azure AD outage of October 1st and Google Cloud IAM outage of March 26th) since when identity is down, the customer can’t access the cloud provider’s control plane (and it may impact service use in general) — plus there’s generally no way for the customer to work around such issues.
The good news is, hyperscale cloud providers do a pretty good job of being robust. However, the risk of smaller, more hosting-like providers can be much higher — and there are notable differences between the hyperscalers, too.
Operations folks know: Everything breaks. Physical stuff fails, software is buggy, and people screw up (a lot). A provider can try its best to reduce the number of failures, limit the “blast radius” of a problem, limit the possibility of “cascading failures”, and find ways to mitigate the impact on users. But you can’t avoid failure entirely. Systems that are resilient recover quickly from failure.
If you chop off the head of a hydra, it grows back — quickly. We can think about five key factors — heads of the hydra — that influence the robustness, resilience, and observed (“real world”) availability of cloud services:
- Physical design: The design of physical things, such as the data center and the hardware used to deliver services.
- Logical (software) design: The design of non-physical things, especially software — all aspects of the service architecture that is not related to a physical element.
- Implementation quality: The robustness of the actual implementation, encompassing implementation skill, care and meticulousness, and the effectiveness of quality-assurance (QA) efforts.
- Deployment processes: The rollout of service changes is the single largest cause of operational failures in cloud services. The quality of these processes, the automation used in the processes, and the degree to which humans are given latitude to use good judgment (or poor judgment) thus have a material impact on availability.
- Operational processes: Other operational processes, such as monitoring, incident management — and, most importantly, problem management — impact the cloud provider’s ability to react quickly to problems, mitigate issues, and ensure that the root causes of incidents are addressed. Both proactive and reactive maintenance efforts can have an impact on availability.
A sixth factor, Transparency, isn’t directly related to keeping the hydra alive, but matters to customers as they plan for their own application architectures and risk management — contributing to customer resilience. Transparency includes making architectural information to customers, as well as delivering outage-related visibility and insight to customers. Customers need real-world info — like current and historical outage reports and the root-cause-analysis port-mortems that offer insight into what went wrong and why (and what the provider is doing about it).
When you think about cloud service resilience (or the resilience of your own systems), think about it in terms of those factors. Don’t think about it like you think about on-premises systems, where people often think primarily about hardware failures or a fire in the data center. Rather, you’re dealing with systems where software issues are almost always the root cause. Physical robustness still matters, but the other four factors are largely about software.
Responsibility for cloud operations is often a political football in enterprises. Sometimes nobody wants it; it’s a toxic hot potato that’s apparently coated in developer cooties. Sometimes everybody wants it, and some executives think that control over it are going to ensure their next promotion / a handsome bonus / attractiveness for their next job. Frequently, developers and the infrastructure & operations (I&O) orgs clash over it. Sometimes, CIOs decide to just stuff it into a Cloud Center of Excellence team which started out doing architecture and governance, and then finds itself saddled with everything else, too.
Lots of arguments are made for it to live in particular places and to be executed in various ways. There’s inevitably a clash between the “boring” stuff that is basically lifted-and-shifted and rarely changes, and the fast-moving agile stuff. And different approaches to IaaS, PaaS, and SaaS. And and and…
Well, the fact of the matter is that multiple people are probably right. You don’t actually want to take a one-size-fits-all approach. You want to fit operational approaches to your business needs. And you maybe even want to have specialized teams for each major hyperscale provider, even if you adopt some common approaches across a multicloud environment. (Azure vs. non-Azure, i.e. Azure vs. AWS, is a common split, often correlated closely to Windows-based application environments vs Linux-based application environments.)
Ideally, you’re going to be highly automated, agile, cloud-native, and collaborative between developers and operators (i.e. DevOps). But maybe not for everything (i.e. not all apps are under active development).
Plus, once you’ve chosen your basic operations approach (or approaches), you have to figure out how you’re going to handle cloud configuration, release engineering, and security responsibilities. (And all the upskilling necessary to do that well!)
That’s where people tend to really get hung up. How much responsibility can I realistically push to my development teams? How much responsibility do they want? How do I phase in new operational approaches over time? How do I hook this into existing CI/CD, agile, and DevOps initiatives?
There’s no one right answer. However, there’s one answer that is almost always wrong, and that’s splitting cloud operations across the I&O functional silos — i.e., the server team deals with your EC2 VMs, your NetApp storage admin deals with your Azure Blobs, your F5 specialist configures your Google Load Balancers, your firewall team fights with your network team over who controls the VPC config (often settled, badly, by buying firewall virtual appliances), etc.
When that approach is taken, the admins almost always treat the cloud portals like they’re the latest pointy-clicky interface for a piece of hardware. This pretty much guarantees incompetence, lack of coordination, and gross inefficiency. It’s usually terrible at regardless of what scale you’re at. Unfortunately, it’s also the first thing that most people try (closely followed by massively overburdening some poor cloud architect with Absolutely Everything Cloud-Related.)
What works for most orgs: Some form of cloud platform operations, where cloud management is treated like a “product”. It’s almost an internal cloud MSP approach, where the cloud platform ops team delivers a CMP suite, cloud-enabled CI/CD pipeline integrations, templates and automation, other cloud engineering, and where necessary, consultative assistance to coders and to application management teams. That team is usually on call for incident response, but the first line for incidents is usually the NOC or the like, and the org’s usual incident management team.
But there are lots of options. Gartner clients: Want a methodical dissection of pros and cons; cloud engineering, operating, and administration tasks; job roles; coder responsibilities; security integration; and other issues? Read my new note, “Comparing Cloud Operations Approaches“, which looks at eleven core patterns along with guidance for choosing between them, andmaking a range of accompanying decisions.
A few days ago, an unexpected side-effect of some new code caused a major Gmail outage. Last year, a small bug triggered a series of cascading failures that resulted in a major Amazon outage. These are not the first cloud failures, nor will they be the last.
Cloud failures are as complex as the underlying software that powers them. No longer do you have isolated systems; you have complex, interwoven ecosystems, delicately orchestrated by a swarm of software programs. In presenting simplicity to the user, the cloud provider takes on the burden of dealing with that complexity themselves.
People sometimes say that these clouds aren’t built to enterprise standards. In one sense, they aren’t — most aren’t intended to meet enterprise requirements in terms of feature-set. In another sense, though, they are engineered to far exceed anything that the enterprise would ever think of attempting themselves. Massive-scale clouds are designed to never, ever, fail in a user-visible way. The fact that they do fail nonetheless should not be a surprise, given the potential for human error encoded in software. It is, in fact, surprising that they don’t visibly fail more often. Every day, within these clouds, a whole host of small errors that would be outages if they occurred within the enterprise — server hardware failures, storage failures, network failures, even some software failures — are handled invisibly by the back-end. Most of the time, the self-healing works the way it’s supposed to. Sometimes it doesn’t. The irony in both the Gmail outage and the S3 outage is that both appear to have been caused by the very software components that were actively trying to create resiliency.
To run infrastructure on a massive scale, you are utterly dependent upon automation. Automation, in turn, depends on software, and no matter how intensively you QA your software, you will have bugs. It is extremely hard to test complex multi-factor failures. There is nothing that indicates that either Google or Amazon are careless about their software development processes or their safeguards against failure. They undoubtedly hate failure as much as, and possibly more than, their customers do. Every failure means sleepless nights, painful internal post-mortems, lost revenue, angry partners, and embarrassing press. I believe that these companies do, in fact, diligently seek to seamlessly handle every error condition they can, and that they generally possess sufficient quantity and quality of engineering talent to do it well.
But the nature of the cloud — the one homogenous fabric — magnifies problems. Still, that’s not isolated to the cloud alone. Let’s not forget VMware’s license bug from last year. People who normally booted up their VMs at the beginning of the day were pretty much screwed. It took VMware the better part of a day to produce a patch — and their original announced timeframe was 36 hours. I’m not picking on VMware — certainly you could find yourself with a similar problem with any kind of widely deployed software that was vulnerable to a bug that caused it all to fail.
Enterprise-quality software produced the SQL Slammer worm, after all. In the cloud, we ain’t seen nothing yet…