Monthly Archives: October 2020

The multi-headed hydra of cloud resilience

Clients have recently been asking a lot more questions about the comparative resilience of cloud providers.

Identity services are a particular point of concern (for instance, the Azure AD outage of October 1st and the Google Cloud IAM outage of March 26th). When identity is down, the customer can’t access the cloud provider’s control plane, and service use in general may be affected. Worse, there’s generally no way for the customer to work around such issues.

The good news is that hyperscale cloud providers do a pretty good job of being robust. However, the risk with smaller, more hosting-like providers can be much higher, and there are notable differences between the hyperscalers, too.

Operations folks know: Everything breaks. Physical stuff fails, software is buggy, and people screw up (a lot). A provider can try its best to reduce the number of failures, limit the “blast radius” of a problem, limit the possibility of “cascading failures”, and find ways to mitigate the impact on users. But you can’t avoid failure entirely. Systems that are resilient recover quickly from failure.

If you chop off the head of a hydra, it grows back — quickly. We can think about five key factors — heads of the hydra — that influence the robustness, resilience, and observed (“real world”) availability of cloud services:

  • Physical design: The design of physical things, such as the data center and the hardware used to deliver services.
  • Logical (software) design: The design of non-physical things, especially software, covering all aspects of the service architecture that are not related to a physical element.
  • Implementation quality: The robustness of the actual implementation, encompassing implementation skill, care and meticulousness, and the effectiveness of quality-assurance (QA) efforts.
  • Deployment processes: The rollout of service changes is the single largest cause of operational failures in cloud services. The quality of these processes, the automation used in the processes, and the degree to which humans are given latitude to use good judgment (or poor judgment) thus have a material impact on availability.
  • Operational processes: Other operational processes, such as monitoring, incident management — and, most importantly, problem management — impact the cloud provider’s ability to react quickly to problems, mitigate issues, and ensure that the root causes of incidents are addressed. Both proactive and reactive maintenance efforts can have an impact on availability.

A sixth factor, Transparency, isn’t directly related to keeping the hydra alive, but matters to customers as they plan for their own application architectures and risk management, contributing to customer resilience. Transparency includes making architectural information available to customers, as well as delivering outage-related visibility and insight to customers. Customers need real-world info, like current and historical outage reports and the root-cause-analysis post-mortems that offer insight into what went wrong and why (and what the provider is doing about it).

When you think about cloud service resilience (or the resilience of your own systems), think about it in terms of those factors. Don’t think about it like you think about on-premises systems, where people often think primarily about hardware failures or a fire in the data center. Rather, you’re dealing with systems where software issues are almost always the root cause. Physical robustness still matters, but the other four factors are largely about software.

Beware of vendors bearing transformation Turkish Delight

“It is a lovely place, my house,” said the Queen. “I am sure you would like it. There are whole rooms full of Turkish Delight, and what’s more, I have no children of my own. I want a nice boy whom I could bring up as a Prince and who would be King of Narnia when I am gone. While he was Prince he would wear a gold crown and eat Turkish Delight all day long; and you are much the cleverest and handsomest young man I’ve ever met. I think I would like to make you the Prince—some day, when you bring the others to visit me.” — The White Witch (C.S. Lewis; The Lion, The Witch, and the Wardrobe)

When most people read the Narnia novels as children, they have no idea what Turkish Delight is. Its obscurity in recent decades has allowed everyone to imagine it as an entirely wonderful substance, carrying all their hopes and dreams of the perfect candy.

So, too, do people pour all of their business hopes and dreams into a nebulously-defined future of “digital transformation”.

Because the cloud is such a key enabling technology for digital business, I have plenty of discussions with clients who have been promised grand “digital transformation” outcomes by cloud providers and cloud MSPs. But it is certainly not a phenomenon limited to the cloud. Hardware vendors and ISVs, outsourcers, consultancies, etc. are all selling this dream. While I can think of vendors who are more guilty of this than others, it’s a cross-IT-industry phenomenon.

Beware all digital transformation promises. Especially the ones where the vendor promises to partner with you to change the future of your industry or reinvent/revolutionize/disrupt X (where X is what your industry does).

I’ve quietly watched a string of broken transformation promises over the last few years, gently and privately warning clients in inquiry conversations that you generally can’t trust these sorts of vendor promises. These behaviors have become much more prominent recently, though. And a colleague recently told me about a conversation that seemed like just a bridge too far: a large tech vendor promising to partner with a small Midwestern industrial manufacturer (tech laggards not doing anything exciting) to create transformative products, as part of a sales negotiation. (This vendor had not previously exhibited such behavior, so it was doubly startling.)

Clients come to us with tales of vendors who, in the course of sales discussions, promise to partner with them (possibly even dangling the possibility of a joint venture) to launch a transformational digital business, revolutionize the future of their industry, or the like. (Note that this is distinct from companies that sell transformation consulting, e.g. McKinsey, Deloitte, etc. They promise to help you figure out the future, not form a business partnership to create that future.)

Usually, neither the customer nor the vendor has a concrete idea of what that looks like. Usually, the vendor refuses to put this partnership notion in writing as a formal contract. On the rare occasion that there is a contract, it is pretty vague, does not oblige the vendor to put forth any business ideas, and allows the vendor to refuse any business idea and investment. In other words, it has zero teeth. Because it’s so open-ended, the customer can fill the void with all their Turkish Delight dreams.

Moreover, the vendor may sometimes dangle samples of transformation-oriented services and consulting during the sales process. The customer gobbles down these sweet nuggets, and then stares mournfully at the empty box of transformation candy. For the promise of more, they’ll cheerfully betray their enterprise procurement standards, while the sourcing managers stand on the sidelines frantically waving contract-related warnings.

Listen to your sourcing managers when they warn you that the proposed “partnership” is a fiction. The White Witch probably doesn’t have your best interests at heart. Good digital transformation promises (ones that are likely to actually be kept) have concrete outcomes. They specify what the partnership will be developing, together with timelines, budgets, and the legal entity (such as a JV) that will be used to deliver the products/services. Or they spell out the specific consulting services that will be provided: workshops, deliverables from those workshops, work-for-hire agreements with specific costs (and discounts, if applicable), and so forth.

Without concrete contractual outcomes, the vendor can vanish the candy into thin air with no repercussions. Sure, in a concrete transformation proposal, the end result will probably not be your Turkish Delight dreams. It might resemble a bowl of ordinary M&Ms. Or maybe a tasty grab-bag of Lindt truffles. (You’d have to get particularly lucky for it to get much beyond the realm of grocery-store candy, though.) But you’re much more likely to actually get a good outcome.

Off-hand, I can think of one public example where a prominent “change the industry” vendor partnership with an enterprise seems to have resulted in a credible product: Microsoft’s Connected Vehicle Platform. There, Microsoft signed a deal with a collection of automakers, each of whom had specific outcomes they wished to achieve. Those outcomes could be realistically achieved in a reasonable amount of time, and represented industry advancement but not anything truly revolutionary. Microsoft built upon those individual projects to deliver a platform which would move the industry forward, which was announced with a clear mission and a timeframe for launch. Sure, it didn’t “change the future of cars”, but it brought tangible benefits to the customers.

Vendors often try to sell to who you hope to be, rather than who you are now. Your aspirations aren’t bad. Just make sure that your aspirations are well-defined and there’s a realistic roadmap to achieve them. Hope is not a strategy. The vendor may have little incentive not to promise everything you could dream of, in order to get you to sign a large purchase agreement.