I recently wrote a Twitter thread about cloud risk and resilience that drew a lot of interest, so I figured I’d expand on it in a blog post. I’ve been thinking about cloud resilience a lot recently, given that clients have been asking about how they manage their risks.
Inquiries about this historically come in waves, almost always triggered by incidents that raise awareness (unfortunately often because the customer has been directly impacted). A wave generally spans a multi-week period, causing waves to bleed into one another. Three distinct sets come to mind over the course of 2021:
- The Azure AD outages earlier this year had a huge impact on client thinking about concentration risks and critical service dependencies — often more related to M365 than Azure, though (and exacerbated by the critical dependency that many organizations have on Teams during this pandemic). Azure AD is core to SSO for many organizations, making its resilience enormously impactful. These impacts are still very top of mind for many clients, months later.
- The Akamai outage (and other CDN outages with hidden dependencies) this summer raised application and infrastructure dependency awareness, and came as a shock to many customers, as Akamai has generally been seen as a bedrock of dependability.
- The near-daily IBM Cloud “Severity 1” outages over the last month have drawn selective client mentions, rather than a wave, but add to the broader pattern of cloud risk concerns. (To my knowledge, there has been no public communication from IBM regarding root cause of these issues. Notifications indicate the outages are multi-service and multi-regional, often impacting all Gen 2 multizone regions. Kubernetes may be something of a common factor, to guess from the impact scope.)
Media amplification of outage awareness appears to have a lot to do with how seriously they’re taken by customers — or non-customers. Affecting stuff that’s consumed by end-users — i.e. office suites, consumer websites, etc. — gets vastly more attention than things that are “just” a really bad day for enterprise ops people. And there’s a negative halo effect — i.e. if Provider X fails, it tends to raise worries about all their competitors too. But even good media explanations and excellent RCAs tend to be misunderstood by readers — and even by smart IT people. This leads, in turn, to misunderstanding why cloud services fail and what the real risks are.
I recently completed my writing on a note about HA and failover (DR) patterns in cloud IaaS and PaaS, with a light touch on application design patterns for resilience. However, concerns about cloud resilience applies just as much — if not more so — to SaaS, especially API SaaS, which creates complicated and deep webs of dependencies.
You can buy T-shirts, stickers, and all manner of swag that says, “The cloud is just somebody else’s computer.” Cute slogan, but not true. Cloud services — especially at massive scale — are incredibly complex software systems. Complex software systems don’t fail the way a “computer” fails. The cloud exemplifies the failure principles laid out by Richard Cook in his classic “How Complex Systems Fail“.
As humans, we are really bad at figuring out the risk of complex systems, especially because the good ones are heavily defended against failure. And we tend to over-index on rare but dramatic risks (a plane crash) versus more commonplace risks (a car crash).
If you think about “my application hosted on AWS” as “well, it’s just sitting on a server in an AWS data center rather than mine”, then at some point in time, the nature of a failure is going to shock you, because you are wrong.
Cloud services fail after all of the resiliency mechanisms have failed (or sometimes, gone wrong in ways that contribute to the failure). Cloud services tend to go boom because of one or more software bugs, likely combined with either a configuration error or some kind of human error (often related to the deployment process for new configs and software versions). They are only rarely related to a physical failure — and generally the physical failure only became apparent to customers because the software intended to provide resilience against it failed in some fashion.
Far too many customers still think about cloud failure as a simple, fundamentally physical thing. Servers fail, so we should use more than one. Data centers fail, so we should be able to DR into another. Etc. But that model is wrong for cloud and for the digital age. We want to strive for continuous availability and resilience (including graceful degradation and other ways to continue business functionality when the application fails). And we have to plan for individual services failures rather than total cloud failure (whether in an AZ, region, or globally). Such failures can be small-scale, and effectively merely “instability”, rather than an “outage” — and therefore demands apps that are resilient to service errors.
So as cloud buyers, we have to think about our risks differently, and we need to architect and operate differently. But we also need to trust our providers — and trust smartly. To that end, cloud providers need to support us with transparency, so we can make more informed decisions. Key elements of that include:
- Publicly-documented engineering service-level objectives (SLOs), which are usually distinct from the financially-backed SLAs. This is what cloud providers design to internally and measure themselves against, and knowing that helps inform our own designs and internal SLOs for our apps.
- Service architecture documentation that helps us understand the ways a service is and isn’t resilient, so we can design accordingly.
- Documented service dependency maps, which allow us to see the chain of dependencies for each of the services we use, allowing us to think about if Service X is really the best fallback alternative if Service Y goes down, as well as inform our troubleshooting.
- Public status dashboards, clearly indicating the status of services, with solid historical data that allows us to see the track record of service operations. This helps with our troubleshooting and user communication.
- Public outage root-cause analysis (RCA), which allow us to understand why outages occurred, and receive a public pledge as to what will be done to prevent similar failures in the future. A historical archive of these is also a valuable resource.
- Change transparency that could help predict stability concerns. Because so many outages end up being related to new deployments / config changes, and the use of SRE principles, including error budgets, is pretty pervasive amongst cloud providers, there is often an interesting pattern to outages. Changes tend to freeze when the error budget is exceeded, leading to an on-and-off pattern of outages; instability can resume at intervals unpredictable to the customer.
Mission-critical cloud applications are becoming commonplace — both in the pervasive use of SaaS, along with widespread production use of IaaS and PaaS. It’s past time to modernize thinking about cloud operations, cloud resilience, and cloud BC/DR. Cloud risk management needs to be about intelligent mitigation and not avoidance, as forward-thinking businesses are will not accept simply avoiding the cloud at this point.
I am interested in your experiences with resilience as well as cloud instability and outages. Feel free to DM me on Twitter to chat about it.
Pondering the care and feeding of your multicloud gelatinous cube. (Which engulfs everything in its path, and digests everything organic.)
Most organizations end up multicloud, rather than intending to be multicloud in a deliberate and structured way. So typical tales go like this: The org started doing digital business-related new applications on AWS and now AWS has become the center of gravity for all new cloud-native apps and cloud-related skills. Then the org decided to migrate “boring” LOB Windows-based COTS to the cloud for cost-savings, and lifted-and-shifted them onto Azure (thereby not actually saving money, but that’s a post for another day). Now the org has a data science team that thinks that GCP is unbearably sexy. And there’s a floating island out there of Oracle business applications where OCI is being contemplated. And don’t forget about the division in China, that hosts on Alibaba Cloud…
Multicloud is inevitable in almost all organizations. Cloud IaaS+PaaS spans such a wide swathe of IT functions that it’s impractical and unrealistic to assume that the organization will be single-vendor over the long term. Just like the enterprise tends to have at least three of everything (if not ten of everything), the enterprise is similarly not going to resist the temptation of being multicloud, even if it’s complex and challenging to manage, and significantly increases management costs. It is a rare organization that both has diverse business needs, and can exercise the discipline to use a single provider.
Despite recognizing the giant ooze that we see squelching our way, along with our unavoidable doom, there are things we can do to prepare, govern, and ensure that we retain some of our sanity.
For starters, we can actively choose our multicloud strategy and stance. We can classify providers into tiers, decide what providers are approved for use and under what circumstances, and decide what providers are preferred and/or strategic.
We can then determine the level of support that the organization is going to have for each tier — decide, for instance, that we’ll provide full governance and operations for our primary strategic provider, a lighter-weight approach that leans on an MSP to support our secondary strategic provider, and less support (or no support beyond basic risk management) for other providers.
After that, we can build an explicit workload placement policy that has an algorithm that guides application owners/architects in deciding where particular applications live, based on integration affinities, good technical fit, etc.
Note that cost-based provider selection and cost-based long-term workload placement are both terrible ideas. This is a constant fight between cloud architects and procurement managers. It is rooted in the erroneous idea that IaaS is a commodity, and that provider pricing advantages are long-term rather than short-lived. Using cost-based placement often leads to higher long-term TCO, not to mention a grand mess with data gravity and thus data management, and fragile application integrations.
See my new research note, “Comparing Cloud Workload Placement Strategies” (Gartner paywall) for a guide to multicloud IaaS / IaaS+PaaS strategies (including when you should pursue a single-cloud approach). In a few weeks, you’ll see the follow-up doc “Designing a Cloud Workload Placement Policy” publish, which provides a guide to writing such policies, with an analysis of different placement factors and their priorities.
A nontrivial chunk of my client conversations are centered on the topic of cloud IaaS/PaaS self-service, and how to deal with development teams (and other technical end-user teams, i.e. data scientists, researchers, hardware engineers, etc.) that use these services. These teams, and the individuals within those teams, often have different levels of competence with the clouds, operations, security, etc. but pretty much all of them want unfettered access.
Responsible governance requires appropriate guidelines (policies) and guardrails, and some managers and architects feel that there should be one universal policy, and everyone — from the highly competent digital business team, to the data scientists with a bit of ad-hoc infrastructure knowledge — should be treated identically for the sake of “fairness”. This tends to be a point of particular sensitivity if there are numerous application development teams with similar needs, but different levels of cloud competence. In these situations, applying a single approach is deadly — either for agility or your crisis-induced ulcer.
Creating a structured, tiered approach, with different levels of self-service and associated governance guidelines and guardrails, is the most flexible approach. Furthermore, teams that deploy primarily using a CI/CD pipeline have different needs from teams working manually in the cloud provider portal, which in turn are different from teams that would benefit from having an easy-vend template that gets provisioned out of a ServiceNow request.
The degree to which each team can reasonably create its own configurations is related to the team’s competence with cloud solution architecture, cloud engineering, and cloud security. Not every person on the team may have a high level of competence; in fact, that will generally not be the case. However, the very least, for full self-service there needs to be at least one person with strong competencies in each of those areas, who has oversight responsibilities, acts an expert (provides assistance/mentorship within the team), and does any necessary code review.
If you use CI/CD, you also want automation of such review in your pipeline, that includes your infrastructure-as-code (IaC) and cloud configs, not just the app code; i.e. a tool like Concourse Labs). Even if your whole pipeline isn’t automated, review of IaC during the dev stage, and not just when it triggers a cloud security posture management tool (like Palo Alto’s Prisma Cloud or Turbot), whether in dev, test, or production.
Who determines “competence”? To avoid nasty internal politics, it’s best to set this standard objectively. Certifications are a reasonable approach, but if your org isn’t the sort that tends to pay for internal certifications or the external certifications (AWS/Azure Solution Architect, DevOps Engineer, Security Engineer, etc.) seem like too high a bar, you can develop an internal training course and certification. It’s not a bad idea for all of your coders (whether app developers, data scientists, etc.) that use the cloud to get some formal training on creating good and secure cloud configurations, anyway.
(For Gartner clients: I’m happy to have a deeper discussion in inquiry. And yes, a formal research note on this is currently going through our editing process and will be published soon.)
What sort of org structures work well for helping to drive successful cloud adoption? Every day I talk to businesses and public-sector entities about this topic. Some have been successful. Others are struggling. And the late-adopters are just starting out and want to get it right from the start.
Back in 2014, I started giving conference talks about an emerging industry best practice — the “Cloud Center of Excellence” (CCOE) concept. I published a research note at the start of 2019 distilling a whole bunch of advice on how to build a CCOE, and I’ve spent a significant chunk of the last year and a half talking to customers about it. Now I’ve revised that research, turning it into a hefty two-part note on How to Build a Cloud Center of Excellence: part 1 (organizational design) and part 2 (Year 1 tasks).
Gartner’s approach to the CCOE is fundamentally one that is rooted in the discipline of enterprise architecture and the role of EA in driving business success through the adoption of innovative technologies. We advocate a CCOE based on three core pillars — governance (cost management, risk management, etc.), brokerage (solution architecture and vendor management), and community (driving organizational collaboration, knowledge-sharing, and cloud best practices surfaced organically).
Note that it is vital for the CCOE to be focused on governance rather than on control. Organizations who remain focused on control are less likely to deliver effective self-service, or fully unlock key cloud benefits such as agility, flexibility and access to innovation. Indeed, IT organizations that attempt to tighten their grip on cloud control often face rebellion from the business that actually decreases the power of the CIO and the IT organization.
Also importantly, we do not think that the single-vendor CCOE approaches (which are currently heavily advocated by the professional services organizations of the hyperscalers) are the right long-term solution for most customers. A CCOE should ideally be vendor-neutral and span IaaS, PaaS, and SaaS in a multicloud world, with a focus on finding the right solutions to business problems (which may be cloud or noncloud). And a CCOE is not an IaaS/PaaS operations organization — cloud engineering/operations is a separate set of organizational decisions (I’ll have a research note out on that soon, too).
Please dive into the research (Gartner paywall) if you are interested in reading all the details. I have discussed this topic with literally thousands of clients over the last half-dozen years. If you’re a Gartner for Technical Professionals client, I’d be happy to talk to you about your own unique situation.
IBM has launched the beta of BlueMix, its Cloud Foundry-based PaaS. Understanding what BlueMix, and IBM, do and don’t bring to the table means a bit of a digression into how Cloud Foundry works as a PaaS. Since my blog is usually pretty infrastructure-oriented, I’m guessing that a significant percentage of readers won’t know very much about Cloud Foundry (which I’ll abbreviate as CF).
In CF, users write application code, which they deploy onto CF runtime environments (defined by “buildpacks”) — i.e., programming languages and associated frameworks. When CF is deployed as a PaaS, it will normally have some built-in buildpacks, but users can also add additional ones through a mechanism called buildpacks (which originated at Heroku, a PaaS provider that is not CF-based). CF runs applications in its own “Warden” containers (which are OS-independent), staging the runtime and app code into what it calls “droplets”. These application instances are of a size controlled by the user (developer), and the user chooses how many of them there are. Cloud Foundry does not have native auto-scaling currently.
CF can also expose a catalog of services; these services might or might not be built on top of Cloud Foundry. These services are called “Managed Services”, and they support CF’s Service Broker API, allowing CF to provision those services and bind them to applications. Users can also bind their own service instances, supplying credentials for services that exist outside of CF and that aren’t directly integrated via the Service Broker API. Users of CF can also bind external services that don’t support CF explicitly.
IBM has built its own UI for BlueMix. IBM has said at Pulse that it’s got a new focus on design, and BlueMix shows it — the interface is modern and attractive, and its entire look-and-feel and usability are in stark contrast to, say, its previous SmartCloud Application Services offering. Interacting with the UI is pleasant enough. Most users will probably use the CF command-line tool (CLI), though. Apps are normally deployed using the CLI, unless the customer is using JazzHub (a developer service created out of IBM UrbanCode).
For the BlueMix beta, IBM has created two buildpacks of its own, for Liberty (Java) and Node.js, which it says it has hardened and instrumented. They also supply two community buildpacks, for Ruby on Rails and Ruby Sinatra. As with normal CF, users can supply their own buildpacks, and the open-source CF buildpacks appear to work fine, IBM calls these “runtimes” in the BlueMix portal.
IBM also has a bunch of CF services — “Managed Services” in CF parlance. Some of these are IBM-created, like the DataCache (which is WebSphere eXtreme Scale) and Elastic MQ (WebSphere MQ). Others are labeled “community” and are likely open-source CF service implementations of popular packages like MySQL and MongoDB. As is true with all CF services, the implementation of a service is not necessarily on Cloud Foundry — for instance, one of the services is Cloudant, which is entirely external.
Finally, IBM provides what it calls “boilerplates”, which you can click to create an application with a runtime plus a number of additional services that are bound to the app. The most notable is the “mobile backend starter”, which combines Node.js with a number of mobile-oriented services, like a mobile data store and push notifications.
All in all, the BlueMix beta is a showcase for IBM middleware and other IBM software of interest to developers. IBM has essentially had to SaaS-ify (or PaaS-ify, if you prefer that term) its enterprise software assets to achieve this. Obviously, this is only a sliver of its portfolio, but bringing more software assets into BlueMix is clearly key to its strategy — BlueMix is as much a service catalog as a PaaS in this case.
Broadly, though, it’s very clear that IBM is targeting the enterprise developer, especially the enterprise developer who is currently developing in Java on WebSphere technologies. It’s bringing those developers to the cloud — not targeting cloud-native developers, who are more likely to be drawn to something like AppFog if they’re looking for a CF service. Given that IBM says that it will provide strong support for integrating with existing on-premise applications, this is a strategy that makes sense.
Standard CF constraints apply — limited RAM per application instance (and tight resource limitations in general in BlueMix beta), no writes to the local filesystem, and so forth. Other features that would be value-added, like monitoring and automatic caching of static content, are missing at present.
The short-form way to think of BlueMix beta is “Cloud Foundry with some IBM middleware as a service”. It’s hosted in SoftLayer data centers. Presumably at some point IBM will introduce SLAs for at least portions of the service. It’s certainly worth checking out if you’re a WebSphere shop, and if you’re checking out Cloud Foundry in general, this seems to be a perfectly decent way to do it. There’s solid promise here, and my expectation is that at this stage of the game, PaaS might well be a much stronger play for IBM than IaaS, at least in terms of the ability to articulate the overall value of the IBM ecosystem and make an argument for making a strategic bet on IBM in the cloud.