Blog Archives

The cloud budget overrun rainbow of flavors

Cloud budget overruns don’t have a singular cause. Instead, they come in a bright rainbow of jelly belly flavors (the Bertie Botts ones, especially, will combine into a non-mouthwatering delight). Each needs different forms of response.

Ungoverned costs. This is the black licorice of FinOps problems. The organization has no idea what it’s spending, really, much less where the money is going, other than the big bills (or often, many little credit card bills) that they pay each month. This requires basic cost hygiene: analyze your cloud bills, get a cost management tool into place and make it useful through some tagging or partitioning discipline.

Unanticipated usage. This is the sour watermelon flavor of cost overruns — deliciously sweet yet mouth-puckering. In this situation, the organization is the victim of its own cloud success. Cloud has been such a great thing for the organization that more and more unanticipated cloud projects are showing up, blowing out the original budget estimates for cloud resources. Those cloud projects are delivering business value and it doesn’t make sense to say no to them (and even if central IT says no, the cloud costs can usually be paid for out of a line-of-business budget). Nevertheless, it’s causing a lot of organizational angst because central IT or the sourcing team didn’t anticipate this spending. This organization needs to learn to shift its budgeting processes for the digital future, and cloud chargeback will help support future decision-making.

No commitments. This is the minty wrongness of Bertie Botts toothpaste. The organization could get discounts by using public discounting mechanisms for commits (like AWS Savings Plans and Azure Reserved Instances) as well making a contractual commitment for a negotiated discount. But because the organization feels like they can’t perfectly predict their use and aren’t sure if they’ll use all of what they’re using today, they commit to nothing, therefore ensuring that they spend grotesquely more than they could be. This is universally a terrible idea. Organizations that aren’t in early pilot stage have long-term production applications and some predictability of usage; commit to the stuff you know you’re not killing off.

Dev/test waste. This is the mundane bleah-ness of Bertie Botts earwax. Developers are provisioning the biggest things they can get away with (or at least being overaggressive in their estimates of what they need), there are lots of abandoned resources idling away, and dev/test infrastructure that isn’t used outside of business hours isn’t being suspended when unused. This is what cloud cost management tools are great at doing — identifying obvious waste so that it can be eliminated, largely by shutting it down or suspending it, preferably via automation.

Too much production headroom. This is the mild weirdness of the Bertie Botts grass flavor. Application teams haven’t implemented autoscaling for applications that can scale horizontally, or they’ve overestimated how much production headroom an application with variable usage needs (which may result in oversizing compute units, or being overly aggressive with autoscaling). This requires implementing autoscaling with some thoughtful tuning of parameters, and possibly a business value conversation on the cost/benefit tradeoff of having higher application performance on a consistent basis.

Wrongsizing production. This is the awful lingering terribleness of Bertie Botts vomit, whose taste you cannot get out of your mouth. Production environments are statically overprovisioned and therefore overly costly. On-prem, 30% utilization is common, but it’s all capex and as long as it’s within budget, no one really cares about the waste. But in the cloud, you pay for that excess resource monthly, forcing you to confront the ongoing cost of the waste.

However, anyone who tells you to “just” rightsize has never actually tried to do this in practice within an enterprise. The problem is that applications that scale vertically typically can’t be easily rightsized. It’s likely difficult-to-impossible to do automatically, due to complicated application installation. The application is fragile and may be mission-critical, so you are cautious about maintenance downtime. And the application team — the only people who really understand how this thing works — is likely busy with other priorities.

If this is your situation, your cloud cost management tool may cause you to cry hopeless tears, because you can see the waste but taking remediation actions is a complicated cross-functional war dance and delicate negotiation that leaves everyone wondering if it wouldn’t have been easier to just keep paying a larger bill.

Suboptimal design and implementation. The controversial popcorn flavor. Architects are sometimes cost-oblivious when they design cloud solutions. They may make bad design choices, or changes in application features and behavior over time may have turned out to make a design choice unexpectedly expensive. Developers may write poorly-performing code that consumes a lot of infrastructure resources, or code that makes excessive (and, cumulatively, expensive) calls to cloud services. Your cloud cost management tools are unlikely to be of any use for detecting these situations. This needs to be addressed through performance engineering, with attention paid to the business value of the time/effort/money necessary to do so — and for many organizations may require bringing in third-party expertise to diagnose the problems and offer recommendations.

Notably, the answer to most of these issues is not “implement a cloud cost management tool”. The challenges aren’t really as simple as a lot of vendors (and talking heads) make them out to be.

Don’t be surprised when “move fast and break things” results in broken stuff

Of late, I’ve been talking to a lot of organizations that have learned cloud lessons the hard way — and even more organizations who are newer cloud adopters who seem absolutely determined to make the same mistakes. (Note: Those waving little cloud-repatriation flags shouldn’t be hopeful. Organizations are fixing their errors and moving on successfully with their cloud adoption.)

If your leadership adopts the adage, “Move fast and break things!” then no one should be surprised when things break. If you don’t adequately manage your risks, sometimes things will break in spectacularly public ways, and result in your CIO and/or CISO getting fired.

Many organizations that adopt that philosophy (often with the corresponding imposition of “You build it, you run it!” upon application teams) not only abdicate responsibility to the application teams, but they lose all visibility into what’s going on at the application team level. So they’re not even aware of the risks that are out there, much less whether those risks are being adequately managed. The first time central risk teams become aware of the cracks in the foundation might be when the building collapses in an impressive plume of dust.

(Note that boldness and the willingness to experiment are different from recklessness. Trying out new business ideas that end up failing, attempting different innovative paths for implementing solutions that end up not working out, or rapidly trying a bunch of different things to see which works well — these are calculated risks. They’re absolutely things you should do if you can. That’s different from just doing everything at maximum speed and not worrying about the consequences.)

Just like cloud cost optimization might not be a business priority, broader risk management (especially security risk management) might not be a business priority. If adding new features is more important than address security vulnerabilities, no one should be shocked when vulnerabilities are left in a state of “busy – fix later”. (This is quite possibly worse than “drunk – fix later“, as that at least implies that the fix will be coming as soon as the writer sobers up, whereas busy-ness is essentially a state that tends to persist until death).

It’s faster to build applications that don’t have much if any resilience. It’s faster to build applications if you don’t have to worry about application security (or any other form of security). It’s faster to build applications if you don’t have to worry about performance or cost. It’s faster to build applications if you only need to think about the here-and-now and not any kind of future. It is, in short, faster if you are willing to accumulate meaningful technical debt that will be someone else’s problem to deal with later. (It’s especially convenient if you plan to take your money and run by switching jobs, ensuring you’re free of the consequences.)

“We hope the business and/or dev teams will behave responsibly” is a nice thought, but hope is not a strategy. This is especially true when you do little to nothing to ensure that those teams have the skills to behave responsibly, are usefully incentivized to behave responsibly, and receive enough governance to verify that they are behaving responsibly.

When it all goes pear-shaped, the C-level IT executives (especially the CIO, chief information security officer, and the chief risk officer) are going to be the ones to be held accountable and forced to resign under humiliating circumstances. Even if it’s just because “You should have known better than to let these risks go ungoverned”.

(This usually holds true even if business leaders insisted that they needed to move too quickly to allow risk to be appropriately managed, and those leaders were allowed to override the CIO/CISO/CRO, business leaders pretty much always escape accountability here, because they aren’t expected to have known better. Even when risk folks have made business leaders sign letters that say, “I have been made aware of the risks, and I agree to be personally responsible for them” it’s generally the risk leaders who get held accountable. The business leaders usually get off scott-free even with the written evidence.)

Risk management doesn’t entail never letting things break. Rather, it entails a consideration of risk impacts and probabilities, and thinking intelligently about how to deal with the risks (including implementing compensating controls when you’re doing something that you know is quite risky). But one little crack can, in combination with other little cracks (that you might or might or might not be aware of), result in big breaches. Things rarely break because of black swan events. Rather, they break because you ignored basic hygiene, like “patch known vulnerabilities”. (This can even impact big cloud providers, i.e. the recent Azurescape vulnerability, where Microsoft continued to use 2017-era known-vulnerable open-source code in production.)

However, even in organizations with central governance of risk, it’s all too common to have vulnerability management teams inform you-build-it-you-run-it dev teams that they need to fix Known Issue X. A busy developer will look at their warning, which gives them, say, 30 days to fix the vulnerability, which is within the time bounds of good practice. Then on day 30, the developer will request an extension, and it will probably be granted, giving them, say, another 30 days. When that runs out, the developer will request another extension, and they will repeat this until they run out the extension clock, whereupon usually 90 days or more have elapsed. At that point there will probably be a further delay for the security team to get involved in an enforcement action and actually fix the thing.

There are no magic solutions for this, especially in organizations where teams are so overwhelmed and overworked that anything that might possibly be construed as optional or lower-priority gets dropped on the floor, where it is trampled, forgotten, and covered in old chewing gum. (There are non-magical solutions that require work — more on that in future research notes.)

Moving fast and breaking things takes a toll. And note that sometimes what breaks are people, as the sheer number of things they need to cope with overload their coping mechanisms and they burn out (either in impressive pillars or flame, or quiet extinguishment into ashes).

Improving cloud resilience through stuff that works

As I noted in a previous blog post, multicloud failover is almost always a terrible idea. While the notion that an entire cloud provider can go dark for a lengthy period of time (let’s say a day or more) is not entirely impossible, it’s the least probable of the many ways that an application can experience failure. Humans tend to over-index on catastrophic but low-probability events, so it’s not especially shocking that people fixate on the possibility, but before you spend precious people-effort (not to mention money) on multicloud failover, you should first properly resource all the other things you could be doing to improve your resilience in the cloud.

As I noted previously, five core things impact cloud resilience: physical design, logical (software) design, implementation quality, deployment processes, and operational processes. So you should select your cloud provider carefully. Some providers have a better track record of reliability than others — often related directly in differences in the five core resilience factors. I’m not suggesting that this be a primary selection criterion, but the less reliable your provider, the more you’re going to have to pour effort into resilience, knowing that the provider’s failures are going to test you in the real world. You should care most about the failure of global dependencies (identity, security certificates, NTP, DNS, etc.) that can affect all services worldwide, followed by multi-region failures (especially those that affect an entire geography).

However, those things aren’t just important for cloud providers. They also affect you, the application owner, and the way you should design, implement, update, and operate your application  — whether that application is on-premises or in the cloud. Before you resort to multicloud failover, you should have done all of the below and concluded that you’ve already maximized your resilience via these techniques and still need more.

Start with local HA. When architecting a mission-critical application, design it to use whatever HA capabilities are available to you within an availability zone (AZ). Use a clustered (and preferably scale-out) architecture for the stuff you build yourself. Ensure you maximize the resilience options available from the cloud services.

Build good error-handling into your application. Your application should besmart about the way it handles errors, either from other application components or from cloud services (or other third-party components). It should exhibit polite retry behavior and implement circuit breakers to try to limit cascading failures. It should implement load-shedding, in recognition of the fact that rejecting excessive requests so that the requests that can be served receive decent performance is better than just collapsing into non-responsiveness. It should have fallback mechanisms for graceful degradation, to limit impact on users.

Architect the application’s internals for resilience. Techniques such as partitions and bulkheads are likely going to be reserved for larger-scale applications, but are vital for limiting the blast radius of failures. (If you have no idea what any of this terminology means, read Michael Nygard’s “Release It!” — in my personal opinion, if you read one book about mission-critical app design, that should probably be the one.)

Use multiple AZs. Run your application active-active across at least two, and preferably three, AZs within each region that you use. (Note that three can be considerably harder than two because most cloud provider services natively support running in two AZs simultaneously but not three. But that’s a far easier problem than multicloud failover.)

Use multiple regions. Run your application active-active across at least two, and preferably three regions. (Again, two is definitely much easier than three, due to a cloud service’s cross-region support generally being two regions.) If you can’t do that, do fast fully-automated regional failover.

Implement chaos engineering. Not only do you need to thoroughly test in your dev/QA environment to determine what happens under expected failure conditions, but you also need to experiment with fault injection in your production environment where there are complex unpredictable conditions that may cause unexpected failures. If this sounds scary and you expect it’ll blow up in your face, then you need to do a better job in the design and implementation of your application. Forcing constant failures into production systems (ala Netflix’s famed Chaos Monkey) helps you identify all the weak spots, builds resilience, and should help give you confidence that things will continue to work when cloud issues arise.

It’s really important to treat resilience as a systems concern, not purely an infrastructure concern. Your application architecture and implementation need to be resilient. If your developers can’t be trusted to write continuously available applications, imposing multicloud portability requirements (and attendant complexity) upon them will probably add to your operational risks.

And I’m not kidding about the chaos engineering. If you’re not mature enough for chaos engineering, you’re not mature enough to successfully implement multicloud failover. If you don’t routinely shoot your own AZs and regions, kill access to services, kill application components, make your container hosts die, deliberately screw up your permissions and fail-closed, etc. and survive that all without worrying, you need to go address your probable risks of failure that have solutions of reasonable complexity, before you tackle the giant complex beast of multicloud failover to address the enormously unlikely event of total provider failure.

Remember that we’re trying to achieve continuity of our business processes and not continuity of particular applications. If you’ve done all of the above and you’re still worried about the miniscule probability of total provider failure, consider building simple alternative applications in another cloud provider (or on-premises, or in colo/hosting). Such applications might simply display cached data, or queue transactions for later processing. This is almost always easier than maintaining full cross-cloud portability for a complex application. Plus, don’t forget that there might be (gasp) paper alternatives for some processes.

(And yes, I already have a giant brick of a research note written on this topic, slated for publication at the end of this year. Stay tuned…)

Cloud cost overruns may be a business leadership failure

A couple of months back, some smart folks at VC firm Andreesen Horowitz wrote a blog post called “The Cost of Cloud, a Trillion Dollar Paradox“. Among other things, the blog made a big splash because it claimed, quote: “[W]hile cloud clearly delivers on its promise early on in a company’s journey, the pressure it puts on margins can start to outweigh the benefits, as a company scales and growth slows.” It claimed that cloud overspending was resulting in huge loss of market value, and that developers needed incentives to reduce spending.

The blog post is pretty sane, but plenty of people misinterpreted it, or took away only its most sensationalistic aspects. I think it’s critical to keep in mind the following:

Decisions about cloud expenditures are ultimately business decisions. Unnecessarily high cloud costs are the result of business decisions about priorities — specifically, about the time that developers and engineers devote to cost optimization versus other priorities.

For example, when developer time is at a premium, and pushing out features as fast as possible is the highest priority, business leadership can choose to allow the following things that are terrible for cloud cost:

  • Developers can ignore all annoying administrative tasks, like rightsizing the infrastructure or turning off stuff that isn’t in active use.
  • Architects can choose suboptimal designs that are easier and faster to implement, but which will cost more to run.
  • Developers can implement crude algorithms and inefficient code in order to more rapidly deliver a feature, without thinking about performance optimizations that would result in less resource consumption.
  • Developers can skip implementing support for more efficient consumption patterns, such as autoscaling.
  • Developers can skip implementing deployment automation that would make it easier to automatically rightsize — potentially compounded by implementing the application in ways that are fragile and make it too risky and effortful to manually rightsize.

All of the above is effectively a form of technical debt. In the pursuit of speed, developers can consume infrastructure more aggressively themselves — not bothering to shut down unused infrastructure, running more CI jobs (or other QA tests), running multiple CI jobs in parallel, allocating bigger faster dev/test servers, etc. — but that’s short-term, not an ongoing cost burden the way that the technical debt is. (Note that the same prioritization issues also impact the extent to which developers cooperate in implementing security directives. That’s a tale for another day.)

The more those things are combined — bad designs, poorly implemented, that you can’t easily rightsize or scale — the more that you have a mess that you can’t untangle without significant expenditure of development time.

Now, some organizations will go put together a “FinOps” team to play whack-a-mole with infrastructure — killing/parking stuff that is idle and rightsizing the waste. And that might help short-term, but until you can automate that basic cost hygiene, this is non-value-added people-intensive work. And woe betide you if your implementations are fragile enough that rightsizing is operationally risky.

Once you’ve got your whack-a-mole down to a nice quick automated cadence, you’ve got to address the application design and implementation technical debt — and invest in the discipline of performance engineering — or you’ll continue paying unnecessarily high bills month after month. (You’d also be oversizing on-prem infrastructure, but people are used to that, and the capital expenditure is money spent, versus the grind of a monthly cloud bill.)

Business leaders have to step up to prioritize cloud cost optimization — or acknowledge that it isn’t a priority, and that it’s okay to waste money on resources as long as the top line is increasing faster. As long that’s a conscious, articulated decision, that’s fine. But we shouldn’t pretend that developers are inherently irresponsible. Developers, like other employees, respond to incentives, and if they’re evaluated on their velocity of feature delivery, they’re going to optimize their work efforts towards that end.

For more details, check out my new research note called “Is FinOps the Answer to Cloud Cost Governance?” which is paywalled and targeted at Gartner’s executive leader clients — a combination of CxOs and business leaders.

Multicloud failover is almost always a terrible idea

Most people — and notably, almost all regulators — are entirely wrong about addressing cloud resilience through the belief that they should do multicloud failover because, as I noted in a previous blog post,  the cloud is NOT just someone else’s computer. (I have been particularly aghast at a recent Reuters article about the Bank of England’s stance.)

Regulators, risk managers, and plenty of IT management largely think of AWS, Azure, etc. as monolithic entities, where “the cloud” can just break for them, and then kaboom, everything is dead everywhere worldwide. They imagine one gargantuan amorphous data center, subject to all the problems that can afflict single data centers, or single systems. But that’s not how it works, that’s not the most effective way to address risk, and testing the “resilience of the provider” (as a generic whole) is both impossible and meaningless.

I mean, yes, there’s the possibility of the catastrophic failure of practically any software technology. There could be, for instance, a bug in the control systems of airplanes from fill-in-the-blank manufacturer that could be simultaneously triggered at a particular time and cause all their airplanes to drop out of the sky simultaneously. But we don’t plan to make commercial airlines maintain backup planes from some other manufacturer in case it happens. Instead, we try to ensure that each plane is resilient in many ways — which importantly addresses the most probable forms of failure, which will be electrical or mechanical failures of particular components.

Hyperscale cloud providers are full of moving parts — lots of components, assembled together into something that looks and feels like a cohesive whole. Each of those components has its own form of resilience, and some of those components are more fragile than others. Some of those components are typically operating well within engineered tolerances. Some of those components might be operating at the edge of those tolerances in certain circumstances — likely due to unexpected pressures from scale — and might be extra-scary if the provider isn’t aware that they’re operating at that edge. In addition to fault-tolerance within each component, there are many mechanisms for fault-tolerance built into the interaction between those components.

Every provider also has its own equivalent of “maintenance” (returning to the plane analogy). The quality of the “mechanics” and the operations will also impact how well the system as a whole operates.  (See my previous blog post, “The multi-headed hydra of cloud resilience” for the factors that go into provider resilience.)

It’s not impossible for a provider to have a worldwide outage that effectively impacts all services (rather than just a single service).  Such outages are all typically rooted in something that prevents components from communicating with each other, or customers from connecting to the services — global network issues, DNS, security certificates, or identity. The first major incident of this type was the 2012 Azure leap year outage. The 2019 Google “Chubby” outage had global network impact, including on GCP. There have been multiple Azure AD outages with broad impact across Microsoft’s cloud portfolio, most recently the 2021 Azure Active Directory outage. (But there are certainly other possibilities. As recently as yesterday, there was a global Azure Windows VM outage that impacted all Windows VM-dependent services.)

Provider architectural and operational differences do clearly make a difference. AWS, notably, has never had a full regional failure or a global outage. The unique nature of GCP’s global network has both benefits and drawbacks. Azure has been improving steadily in reliability over the years as Microsoft addresses both service architecture and deployment (and other operations) processes.

Note that while these outages can be multi-hour, they have generally been short enough that — given typical enterprise recovery-time objectives for disaster recovery, which are often lengthy — customers typically don’t activate a traditional DR plan. (Customers may take other mitigation actions, i.e. failover to another region, failover to an alternative application for a business process, and so forth.)

Multicloud failover requires that you maintain full portability between two providers, which is a massive burden on your application developers. The basic compute runtime (whether VMs or containers) is not the problem, so OpenShift, Anthos, or other “I can move my containers” solutions won’t really help you. The problem is all the differentiators — the different network architectures and features, the different storage capabilities, the proprietary PaaS capabilities, the wildly different security capabilities, etc. Sure, you can run all open source in VMs, but at that point, why are you bothering with the cloud at all? Plus, even in a DR situation, you need some operational capabilities on the other cloud (monitoring, logging, etc.), even if not your full toolset.

Moreover, the huge cost and complexity of a multicloud implementation is effectively a negative distraction from what you should actually be doing that would improve your uptime and reduce your risks, which is making your applications resilient to the types of failure that are actually probable. More on that in a future blog post.

Banks are accelerating their cloud journeys

In the past couple of months, I have talked to the majority of the world’s largest banks about what is necessary to drive successful cloud adoption at enterprise scale. These conversations have a lot of things in common with one another, and I often send the same research notes as a follow-up to our conversations. Here are those notes, with some context. The notes are all behind the Gartner paywall, in most cases Gartner for Technical Professionals, but some of these are available to IT Leaders clients, or Executive Programs clients.

Banks are indeed really moving core banking to the cloud. The long-held adage that “banks might put new systems of innovation or systems of engagement in the cloud, but they’ll never move core banking”, is crumbling. Gartner has statistics supporting this, which you can find in “Core Banking Hot Spot: Moving the Core Into the Cloud“.

Banks cite application modernization as a critical driver for cloud adoption. An increasing number of banks are migrating a substantial percentage of their existing application estate to public cloud IaaS (and PaaS). Supporting survey data can be found in “Application Modernization Is the Most Common Identified Priority for End-User Cloud Adoption in Banking and Investment Services” (but other priorities are closely clustered in importance).

Banks are striving to mature their cloud adoption. Some banks have had a lot of ad hoc adoption over the years, while other banks have been more cautious (venturing into a bit of SaaS but sometimes zero IaaS or PaaS). But we’ve hit the inflection point (starting about two years ago) where banks became comfortable with cloud provider security and then seemingly all of a sudden went to a “go go go!” mode in which cloud was viewed as a critical accelerator of digital banking initiatives. (See “Advance Through Public Cloud Adoption Maturity” for a view of typical journeys.)

Central cloud governance is the norm for banks. Banks generally like the Gartner-style cloud center of excellence (CCOE) model where an enterprise architecture function provides cloud governance, brokerage, and transformation assistance. (See “How to Build a Cloud Center of Excellence“.) However, their CCOE model is likely to be federated to empower different business units or regions to take charge of their own destinies (especially when the cloud strategy is more regional than global). And many banks are splitting off a separate cloud IT unit under a deputy CIO, which is effectively a self-contained organization with hundreds of people devoted to the cloud migration and transformation effort.

While banks still do detailed technical evaluation of cloud providers, strategic selection is based on alignment to the IT strategy. Banks still really care about nitpicky technical details, but ultimately, their selection of strategic providers is based on broader IT priorities, just like most other cloud customers these days. (See “How to Initiate the Selection of Strategic Cloud IaaS Providers“.) Sometimes there’s a certain degree of hope for some kind of innovation partnership. (I am cynical about such “partnerships”, especially when they come in the form of vague sales platitudes without contractual guarantees or a close business development relationship.)

Banks tend to be multicloud. The larger the bank, the more likely it is to adopt a multicloud strategy, similar to other enterprises (see “Comparing Cloud Workload Placement Strategies“). However, this does not mean that all cloud providers are treated equally. My anecdotal impression is that in terms of primary strategic provider, AWS dominates the the top end of the market (the largest banks) but that Azure captures the middle of the pack (from the US midmarket banks that tend to outsource their processing, to the banks that are important at the country/region level but not highly global).

Banks are making the transition to a more systematic approach to multicloud. Like many large distributed enterprises, banks often have pockets of cloud adoption, each aligned to a different cloud provider. With the maturation of their cloud journeys, they are becoming more systematic, building workload placement policies to guide where workloads should go. (See “Designing a Cloud Workload Placement Policy Document“.)

Banks worry about cloud concentration risks. Many banks face regulatory regimes that require them to address concentration risk. Regulators tend not to provide prescriptive guidance for what they must do, though. Banks have told me that attempting to maintain multicloud portability for applications essentially destroys the business case for cloud. Portability significantly impacts application development time, thus reducing the agility benefits. Without the ability to exploit the unique differentiated capabilities of a cloud provider, there’s little compelling reason not to just do it on-premises — which might actually be more risky than doing it in the cloud.  There are effective practical risk-reduction approaches that don’t involve “maintain constant portability of all my apps”, though. (See “How to Create a Public Cloud Integrated IaaS and PaaS Exit Plan“.)

I hope to collaborate with a Gartner colleague to write bank-targeted research in the future. If you’re a cloud architect at a bank, I’d love to speak with you in client inquiry.

Are B2B cloud service agreements safe?

I’m seeing various bits of angst around “Is it safe to use cloud services, if my business can be suspended or terminated at any time?” and I thought I’d take some time to explain how cloud providers (and other Internet ecosystem providers, collectively “service providers” [SPs] in this blog post) enforce their Terms of Service (ToS) and Acceptable Use Policies (AUPs).

The TL;DR: Service providers like money, and will strive to avoid terminating customers over policy violations. However, providers do routinely (and sometimes automatically) enforce these policies, although they vary in how much grace and assistance they offer with issues. You don’t have to be a “bad guy” to occasionally run afoul of the policies. But if your business is permanently unwilling or unable to comply with a particular provider’s policies, you cannot use that provider.

AUP enforcement actions are rarely publicized. The information in this post is drawn from personal experience running an ISP abuse department; 20 years of reviewing multiple ISP, hosting, CDN, and cloud IaaS contracts on a daily basis; many years of dialogue with Gartner clients about their experiences with these policies; and conversations with service providers about these topics. Note that I am not a lawyer, and this post is not legal advice. I would encourage you to work with your corporate counsel to understand service provider contract language and its implications for your business, whenever you contract for any form of digital service, whether cloud or noncloud.

The information in this post is intended to be factual; it is not advice and there is a minimum of opinion. Gartner clients interested in understanding how to negotiate terms of service with cloud providers are encouraged to consult our advice for negotiating with Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), or with SaaS providers. My colleagues will cheerfully review your contracts and provide tailored advice in the context of client inquiry.

Click-thrus, negotiated contracts, ToS, and AUPs

Business-to-business (B2B) service provider agreements have taken two different forms for more than 20 years. There are “click-through agreements” (CTAs) that present you with a online contract that you click to sign; consequently, they are as-is, “take it or leave it” legal documents that tend to favor the provider in terms of business risk mitigation. Then there are negotiated contracts — “enterprise agreements” (EAs) that typically begin with more generous terms and conditions (T&Cs) that better balance the interests of the customer and the provider. EAs are typically negotiated between the provider’s lawyers and the customer’s procurement (sourcing & vendor management) team, as well as the customer’s lawyers (“counsel”).

Some service providers operate exclusively on either CTAs or EAs. But most cloud providers offer both. Not all customers may be eligible to sign an EA; that’s a business decision. Providers may set a minimum account size, minimum spend, minimum creditworthiness, etc., and these thresholds may be different in different countries. Providers are under no obligation to either publicize the circumstances under which an EA is offered, or to offer an EA to a particular customer (or prospective customer).

While in general, EAs would logically be negotiated by all customers who can qualify, providers do not necessarily proactively offer EAs. Furthermore, some companies — especially startups without cloud-knowledgeable sourcing managers — aren’t aware of the existence of EAs and therefore don’t pursue them. And many businesses that are new to cloud services don’t initially negotiate an EA, or take months to do that negotiation, operating on a CTA in the meantime. Therefore, there are certainly businesses that spend a lot of money with a provider, yet only have a CTA.

Terms of service are typically baked directly into both CTAs and EAs — they are one element of the T&Cs. As a result, a business on an EA benefits both from the greater “default” generosity of the EA’s T&Cs over the provider’s CTA (if the provider offers both), as well as whatever clauses they’ve been able to negotiate in their favor. In general, the bigger the deal, the more leverage the customer has to negotiate favorable T&Cs, which may include greater “cure time” for contract breaches, greater time to pay the bill, more notice of service changes, etc.

AUPs, on the other hand, are separate documents incorporated by reference into the T&Cs. They are usually a superset or expansion/clarification of the things mentioned directly in the ToS. For instance, the ToS may say “no illegal activity allowed”, and the AUP will give examples of prohibited activities (important since what is legal varies by country). AUPs routinely restrict valid use, even if such use is entirely legal. Service providers usually stipulate that an AUP can change with no notice (which essentially allows a provider to respond rapidly to a change in the regulatory or threat environment).

Unlike the EA T&Cs, an AUP is non-negotiable. However, an EA can be negotiated to clarify an AUP interpretation, especially if the customer is in a “grey area” that might be covered by the AUP even if the use is totally legitimate (i.e. a security vendor that performs penetration testing on other businesses at their request, may nevertheless ask for an explicit EA statement that such testing doesn’t violate the AUP).

Prospective customers of a service provider can’t safely make assumptions about the AUP intent. For example, some providers might not exclude even a fully white-hat pen-testing security vendor from the relevant portion of the AUP. Some providers with a gambling-excluding AUP may not choose to do business with an organization that has, for instance, anything to do with gambling, even if that gambling is not online (which can get into grey areas like, “Is running a state lottery a form of gambling?”). Some providers operating data centers in countries without full freedom of the press may be obliged to enforce restrictions on what content a media company can host in those regions. Anyone who could conceivably violate the AUP as part of the routine course of business would therefore want to gain clarity on interpretation up front — and get it in writing in an EA.

What does AUP enforcement look like?

If you’re not familiar with AUPs or why they exist and must be enforced, I encourage you to read my post “Terms of Service: from anti-spam to content takedown” first.

AUP enforcement is generally handled by a “fraud and abuse” department within a service provider, although in recent years, some service providers have adopted friendlier names, like “trust and safety”. When an enforcement action is taken, the customer is typically given a clear statement of what the violation is, any actions taken (or that will be taken within X amount of time if the violation isn’t fixed), and how to contact the provider regarding this issue. There is normally no ambiguity, although less technically-savvy customers can sometimes have difficulties understanding why what they did wrong — and in the case of automatic enforcement actions, the customer may be entirely puzzled by what they did to trigger this.

There is almost always a split in the way that enforcement is handled for customers on a CTA, vs customers on an EA. Because customers on a CTA undergo zero or minimal verification, there is no presumption that those customers are legit good actors. Indeed, some providers may assume, until proven otherwise, that such customers exist specifically to perpetuate fraud and/or abuse. Customers on an EA have effectively gone through more vetting — the account team has probably done the homework to figure out likely revenue opportunity, business model and drivers for the sale, etc. — and they also have better T&Cs, so they get the benefit of the doubt.

Consequently, CTA customers are often subject to more stringent policies and much harsher, immediate enforcement of those policies. Immediate suspension or termination is certainly possible, with relatively minimal communication. (To take a public GCP example: GCP would terminate without means to protest as recently as 2018, though that has changed. Its suspension guidelines and CTA restrictions offer clear statements of swift and significantly automated enforcement, including prevention of cryptocurrency mining for CTA customers who aren’t on invoicing, even though it’s perfectly legal.) The watchword for the cloud providers is “business risk management” when it comes to CTA customers.

Customers that are on a CTA but are spending a lot of money — and seem to be legit businesses with an established history on the platform — generally get a little more latitude in enforcement. (And if enforcement is automated, there may be a sliding threshold for automated actions based on spend history.) Similarly, customers on a CTA but who are actively negotiating an EA or engaged in the enterprise sales process may get more latitude.

Often-contrary to the handling of CTA customers, providers usually assume an EA customer who has breached the AUP has done so unintentionally. (For instance, one of the customer’s salespeople may have sent spam, or a customer VM may have been compromised and is now being used as part of a botnet.) Consequently, the provider tends to believe that what the customer needs is notification that something is wrong, education on why it’s problematic, and help in addressing the issue. EA customers are often completely spared from any automated form of policy enforcement. While business risk management is still a factor for the service provider, this is often couched politely as helping the customer avoid risk exposure for the customer’s own business.

Providers do, however, generally firmly hold the line on “the customer needs to deal with the problem”. For instance, I’ve encountered cloud customers who have said to me, “Well, my security breach isn’t so bad, and I don’t have time/resources to address this compromised VM, so I’d like more than 30 days grace to do so, how do I make my provider agree?” when the service provider has reasonably taken the stance that the breach potentially endangers others, and mandated that the customer promptly (immediately) address the breach. In many cases, the provider will offer technical assistance if necessary. Service providers vary in their response to this sort of recalcitrance and the extent of their enforcement actions. For instance, AWS normally takes actions against the narrowest feasible scope — i.e. against only the infrastructure involved in the policy violation. Broadly, cloud providers don’t punish customers for violations, but customers must do something about such violations.

Some providers have some form of variant of a “three strikes” policy, or escalating enforcement. For instance, if a customer has repeated issues — for example, it seems unable implement effective anti-spam compliance for itself, or it constantly fails to maintain effective security in a way that could impact other customers or the cloud provider’s services, or it can’t effectively moderate content it offers to the public, or it can’t prevent its employees from distributing illegally copied music using corporate cloud resources — then repeated warnings and limited enforcement actions can turn into suspensions or termination. Thus, even EA customers are essentially obliged treat every policy violation as something that they need to strive to ensure will not recur in the future. Resolution of a given violation is not evidence that the customer is in effective compliance with the agreement.

Bottom Line

It’s not unusual for entirely legitimate, well-intentioned businesses to breach the ToS or AUP, but this is normally rare; a business might do this once or twice over the course of many years. New cloud customers on a CTA may also innocently exhibit behaviors that trigger automated enforcement actions that use algorithms to look for usage patterns that may be indicative of fraud or abuse. Service providers will take enforcement actions based on the customer history, the contractual agreement, and other business-risk and customer-relationship factors.

Intent matters. Accidental breaches are likely to be treated with a great deal more kindness. If breaches recur, though, the provider is likely to want to see evidence that the business has an effective plan for preventing further such issues. Even if the customer is willing to absorb the technical, legal, or business risks associated with a violation, the service provider is likely to insist that the issue be addressed — to protect their own services, their own customers, and for the customer’s own good.

(Update: Gartner clients, I have published a research note: “What is the risk of actually losing your cloud provider?“)

Terms of Service: From anti-spam to content takedown

Many consumers are familiar with the terms of service (ToS) that govern their use of consumer platforms that contain user-generated content (UGC), such as Facebook, Twitter, and YouTube. But many people are less familiar with the terms of service and acceptable use policy (AUP) that governs the relationships between businesses and their service providers.

In light of the recent decision undertaken by Amazon Web Services (AWS) to suspend service to Parler, a Twitter-like social network that was used to plan the January 6th insurrection at the US Capitol, there have been numerous falsehoods circulating on social media that seem related to a lack of understanding of service provider behavior, ToS and AUPs: claims of “free speech” violations, a coordinated conspiracy between big tech companies, and the like. This post is intended to examine, without judgment, how service providers — cloud, hosting and colo, CDN and other Internet infrastructure providers, and the ISPs that connect everyone to the Internet — came to a place of industry “standards” for behavior, and how enforcement is handled in B2B situations.

The TL;DR summary: The global service provider community, as a result of anti-spam efforts dating back to the mid-90s, enforces extremely similar policies governing content, including user-generated content, through a combination of B2B peer pressure and contractual obligations. Business customers who contravene these norms have very few options.

These norms will  greatly limit  Parler’s options for a new home. Many sites with far-right and similarly controversial content have ultimately ended up using a provider in a supply chain that relies on Russian connectivity, thus dodging the Internet norms that prevail in the rest of the world.

Internet Architecture and Service Provider Dependencies

While the Internet is a collection of loosely federated networks that are in theory independent from one another, it is also an interdependent web of interconnections between those networks. There are two ways that ISPs connect with one another — through “settlement-free peering” (essentially an exchange of traffic between two ISPs that view themselves as equals) and through the purchase of “transit” (making a “downstream ISP” the customer of an “upstream ISP”).

This results in a three-tier model for ISPs.  The Tier 1 ISPs are big global carriers of network connectivity — companies like AT&T, BT and NTT — who have settlement-free peers with each other,  and sell transit to smaller ISPs. Tier 2 ISPs are usually regional, and have settlement-free peers with others in and around their region, but also are reliant on transit from the Tier 1s. Tier 3 ISPs are entirely dependent on purchasing transit. ISPs at all three tiers also sell connectivity directly to businesses and/or consumers.

In practice, this means that ISPs are generally contractually bound to other ISPs. All transit contracts are governed by terms of service that normally incorporate, by reference, an AUP.  Even settlement-free peering agreements are legal contracts, which normally includes the mutual agreement to maintain and enforce some form of AUP. (In the earlier days of the Internet, peering was done on a handshake, but anything of that sort is basically a legacy that can come to an abrupt end should one party suddenly decide to behave badly.)

AUP documents are interesting because they are deliberately created as living documents, allowing AUPs to be adapted to changing circumstances — unlike standard contract terms, which apply for the length of what is usually a multiyear contract. AUPs are also normally ironclad; it’s usually difficult to impossible for a business to get any form of AUP exemption written into their contract. Most contracts provide minimal or no notice for AUP changes. Businesses tend to simply agree to them because most businesses do not plan to engage in the kind of behavior that violates an AUP — and because they don’t have much choice.

The existence of ISP tiering means that upstream providers have significant influence over the behavior of their downstream. Upstream ISPs normally mandate that their downstream ISPs — and other service providers that use their connectivity, like hosting providers — enforce an AUP that enables the downstream provider to be compliant with the upstream’s terms of service. Downstream providers that fail to do so can have their connectivity temporary suspended or their contract terminated. And between the Tier 1 providers, peer pressure ensures a common global understanding and enforcement of  acceptable behavior on the Internet.

Note that this has all occurred in the absence of regulation. ISPs have come to these arrangements through decisions about what’s good for their individual businesses first and foremost, with the general agreement that these community standards for AUPs are good for the community of network operators as a whole.

We’re Here Because Nobody Likes Spammers

So how did we arrive at this state in the first place?

In the mid-90s, as the Internet was growing rapidly, in the near-total absence of regulation, spam was a growing problem. Spam came from both legitimate businesses who simply weren’t aware of or didn’t especially care about Internet etiquette, as well as commercial spammers (bad actors with deceptive or fraudulent ads, and/or illegal/grey-market products).

Many B2B ISPs did not feel that it was necessarily their responsibility to intervene, despite general distaste for spammers — and, sometimes, a flood of consumer complaints. Some percentage of spammers were otherwise “good customers” — i.e. they paid their bills on time and bought a lot of bandwidth. Many more, however, obtained services under fraudulent pretenses, didn’t pay their bills, or tended not to pay on time.

Gradually, the community of network operators came to a common understanding that spammers were generally bad for business, whether they were your own customers, or whether they were the customers of, say, a web hosting company that you provided Internet connectivity for.

This resulted in upstream ISPs exerting pressure on downstream ISPs. Downstream ISPs, in turn, exerted pressure on their customers — kicking spammers off their networks and pushing hosters to kick spammers out of hosting environments. ISPs formalized AUPs. AUP enforcement took longer. Many ISPs were initially pretty shoddy and inconsistent in their enforcement — either because they needed the revenue they were getting from spammers, or due to unwillingness or inability to fund a staff to deal with abuse, or corporate lawyers who urged caution. It took years, but ISPs eventually arrived at AUPs that were contractually enforceable, processes for handling complaints, and relatively consistent enforcement. Legislation like the CAN-SPAM act in the US didn’t hurt, but by the time CAN-SPAM was passed (in 2003), ISPs had already arrived at a fairly successful commercial resolution to the problem.

Because anti-spam efforts were largely fueled by agreements enshrined in B2B contracts, and not in government regulation, there was never full consistency across the industry. Different ISPs created different AUPs — some stricter and some looser. Different ISPs wrote different terms of service into their contracts, with different “cure” periods (a period of time that a party in the contract is given to come into compliance with a contractual requirement). Different ISPs had different attitudes towards balancing “customer service” versus their responsibilities to their upstream providers and to the broader community of network operators.

Consequently, there’s nothing that says “We need to receive X number of spam complaints before we take action,” for instance. Some providers may have internal process standards for this. A lot of enforcement simply takes place via automated algorithms; i.e. if a certain threshold of users reports something as spam, enforcement actions take place. Providers effectively establish, through peer norms, what constitutes “effective” enforcement in accordance with terms of service obligations. Providers don’t need to threaten each other with network disconnection, because a norm has been established. But the implicit threat — and the contractual teeth behind that threat — always remains.

Nobody really likes terminating customers. So there are often fairly long cure periods, recognizing that it can take a while for a customer to properly comply with an AUP. In the suspension letter that AWS sent Parler, AWS cites communications “over the past several weeks”. Usually the providers look for their customers to demonstrate good-faith efforts, but may take suspension or termination efforts if it does not look like a good-faith effort to comply is being made, or if it appears that the effort, no matter how seemingly earnest, does not seem likely to bring compliance within a reasonable time period. 30 days is a common timeframe specified as a cure period in contracts (and is the cure period in the AWS standard Enterprise Agreement), but cloud provider click-through agreements (such as the AWS Customer Agreement) do not normally have a cure period, allowing immediate action to be taken at the provider’s discretion.

What Does This Have to Do With Policing Users on Social Media?

When providers established anti-spam AUPs, they also added a whole laundry list of offenses beyond spamming. Think of that list, “Everything a good corporate lawyer thought an ISP might ever want to terminate a customer for doing.” Illegal behavior, harassment, behavior that disrupts provider operations, behavior that threatens the safety/security/operations of other businesses, etc. are all prohibited.

Hosting companies — eventually followed by cloud providers like AWS, Microsoft Azure, and Google Cloud Platform, as well as companies that hold key roles in the Internet ecosystem (domain registrars and the companies that operate the DNS; content delivery networks like Akamai and Cloudflare, etc.) — were essentially obliged to incorporate their upstream ISP usage policies into their own terms of service and AUPs, and to enforce those policies on their users if they wanted to stay connected to the Internet. Some such providers have also explicitly chosen not to sell to customers in certain business segments — for instance, no gambling, or no pornography, even if the business is fully legitimate and legal (for instance, like MGM Resorts or Playboy) — through limiting what’s allowed in their terms of service. An AUP may restrict activities that are perfectly legal in a given jurisdiction.

Even extremely large tech companies that have their own data centers, like Facebook and Apple, are ultimately beholden to ISPs. (Google is something of an odd case because in addition to owning their own data centers, they are one of the largest global network operators. Google owns extensive fiber routes and peers with Tier 1 ISPs as an equal.) And even though AWS has, to some degree, a network of its own, it is effectively a Tier 2 ISP, making it beholden to the AUPs of its upstream. Other cloud providers are typically mostly or fully transit-dependent, and are thus entirely beholden to their upstream.

In short: Everyone who hosts content companies, and the content companies themselves, is essentially trapped, by the chain of AUP obligations, to policing content to ensure that it is not illegal, harassing, or otherwise seen as commercially problematic.

You have to go outside the normal Internet supply chain — for instance, to the Russian service providers — before you escape the commercial arrangements that bound notions of good business behavior on the Internet. It doesn’t matter what a provider’s philosophical alignment is. Commercially, they simply can’t really push back on the established order. And because these agreements are global, regulation at a single-country level can’t really force these agreements to be significantly more or less restrictive, because of the globalized nature of peering/transit; providers generally interconnect in multiple countries.

It also means that these aren’t just “Silicon Valley” standards. These are global norms for behavior, which means they are not influenced solely by the relatively laissez-faire content standards of the United States, but by the more stringent European and APAC environments.

It’s an interesting result of what happens when businesses police themselves. Even without formal industry-association “rules” or regulatory obligations, a fairly ironclad order can emerge that exerts extremely effective downstream pressure (as we saw in the cases of 8Chan and the Daily Stormer back in 2019).

Does being multicloud help with terms of service violations?

Some people will undoubtedly ask, “Would it have helped Parler to have been multicloud?” Parler has already said that they are merely bare-metal customers of AWS, reducing technical dependencies / improving portability. But their situation is such that they would almost certainly have had the exact same issue if they had been running on Microsoft Azure, Google Cloud Platform, or even Oracle Cloud Infrastructure as well (even though the three companies have top executives with political views spanning the spectrum). A multicloud strategy won’t help any business that violates AUP norms.

AWS and its cloud/hosting competitors are usually pretty generous when working with business customers that unintentionally violate their AUPs. But a business that chooses not to comply is unlikely to find itself welcome anywhere, which makes multicloud deployment largely useless as a defensive strategy.

The multi-headed hydra of cloud resilience

Clients have recently been asking a lot more questions about the comparative resilience of cloud providers.

Identity services are a particular point of concern (for instance, the Azure AD outage of October 1st and Google Cloud IAM outage of March 26th) since when identity is down, the customer can’t access the cloud provider’s control plane (and it may impact service use in general) — plus there’s generally no way for the customer to work around such issues.

The good news is, hyperscale cloud providers do a pretty good job of being robust. However, the risk of smaller, more hosting-like providers can be much higher — and there are notable differences between the hyperscalers, too.

Operations folks know: Everything breaks. Physical stuff fails, software is buggy, and people screw up (a lot). A provider can try its best to reduce the number of failures, limit the “blast radius” of a problem, limit the possibility of “cascading failures”, and find ways to mitigate the impact on users. But you can’t avoid failure entirely. Systems that are resilient recover quickly from failure.

If you chop off the head of a hydra, it grows back — quickly. We can think about five key factors — heads of the hydra — that influence the robustness, resilience, and observed (“real world”) availability of cloud services:

  • Physical design: The design of physical things, such as the data center and the hardware used to deliver services.
  • Logical (software) design: The design of non-physical things, especially software — all aspects of the service architecture that is not related to a physical element.
  • Implementation quality: The robustness of the actual implementation, encompassing implementation skill, care and meticulousness, and the effectiveness of quality-assurance (QA) efforts.
  • Deployment processes: The rollout of service changes is the single largest cause of operational failures in cloud services. The quality of these processes, the automation used in the processes, and the degree to which humans are given latitude to use good judgment (or poor judgment) thus have a material impact on availability.
  • Operational processes: Other operational processes, such as monitoring, incident management — and, most importantly, problem management — impact the cloud provider’s ability to react quickly to problems, mitigate issues, and ensure that the root causes of incidents are addressed. Both proactive and reactive maintenance efforts can have an impact on availability.

A sixth factor, Transparency, isn’t directly related to keeping the hydra alive, but matters to customers as they plan for their own application architectures and risk management — contributing to customer resilience. Transparency includes making architectural information to customers, as well as delivering outage-related visibility and insight to customers. Customers need real-world info — like current and historical outage reports and the root-cause-analysis port-mortems that offer insight into what went wrong and why (and what the provider is doing about it).

When you think about cloud service resilience (or the resilience of your own systems), think about it in terms of those factors. Don’t think about it like you think about on-premises systems, where people often think primarily about hardware failures or a fire in the data center. Rather, you’re dealing with systems where software issues are almost always the root cause. Physical robustness still matters, but the other four factors are largely about software.

Beware of vendors bearing transformation Turkish Delight

“It is a lovely place, my house,” said the Queen. “I am sure you would like it. There are whole rooms full of Turkish Delight, and what’s more, I have no children of my own. I want a nice boy whom I could bring up as a Prince and who would be King of Narnia when I am gone. While he was Prince he would wear a gold crown and eat Turkish Delight all day long; and you are much the cleverest and handsomest young man I’ve ever met. I think I would like to make you the Prince—some day, when you bring the others to visit me.” — The White Witch (C.S. Lewis; The Lion, The Witch, and the Wardrobe)

When most people read the Narnia novels as children, they have no idea what Turkish Delight is. Its obscurity in recent decades has allowed everyone to imagine it as an entirely wonderful substance, carrying all their hopes and dreams of the perfect candy.

So, too, do people pour all of their business hopes and dreams into a nebulously-defined future of “digital transformation”.

Because the cloud is such a key enabling technology for digital business, I have plenty of discussions with clients who have been promised grand “digital transformation” outcomes by cloud providers and cloud MSPs. But it certainly not a phenomenon limited to the cloud. Hardware vendors and ISVs, outsourcers, consultancies, etc. are all selling this dream. While I can think of vendors who are more guilty of this than others, it’s a cross-IT-industry phenomenon.

Beware all digital transformation promises. Especially the ones where the vendor promises to partner with you to change the future of your industry or reinvent/revolutionize/disrupt X (where X is what your industry does).

I’ve quietly watched a string of broken transformation promises over the last few years, gently privately warning clients in inquiry conversations that you generally can’t trust these sorts of vendor promises. These behaviors have become much more prominent recently, though. And a colleague recently told me about a conversation that seemed like just a bridge too far: a large tech vendor promising to partner with a small Midwestern industrial manufacturer (tech laggards not doing anything exciting) to create transformative products, as part of a sales negotiation. (This vendor had not previously exhibited such behavior, so it was doubly startling.)

Clients come to us with tales of vendors who, in the course of sales discussions, promises to partner with them — possibly even dangling the possibility of a joint venture — to launch a transformational digital business, revolutionize the future of their industry, or the like. (Note that this is distinct from companies that sell transformation consulting. They promise to help you figure out the future, not form a business partnership to create that future — i.e. McKinsey, Deloitte, etc.)

Usually, neither the customer nor the vendor have a concrete idea of what that looks like. Usually, the vendor refuses to put this partnership notion in writing as a formal contract. On the rare occasion that there is a contract, it is pretty vague, does not oblige the vendor to put forth any business ideas, and allows the vendor to refuse any business idea and investment. In other words, it has zero teeth. Because it’s  so open-ended, the customer can fill the void with all their Turkish Delight dreams.

Moreover, the vendor may sometimes dangle samples of transformation-oriented services and consulting during the sales process. The customer gobbles down these sweet nuggets, and then stares mournfully at the empty box of transformation candy. For the promise of more, they’ll cheerfully betray their enterprise procurement standards, while the sourcing managers stand on the sidelines frantically waving contract-related warnings.

Listen to your sourcing managers when they warn you that the proposed “partnership” is a fiction. The White Witch probably doesn’t have your best interests at heart. Good digital transformation promises — ones that are likely to actually be kept — have concrete outcomes. They specify what the partnership will be developing, together with timelines, budgets, and the legal entity (such as a JV) that will be used to deliver the products/services. Or they specify the specific consulting services that will be provided — workshops, deliverables from those workshops, work-for-hire agreements with specific costs (and discounts, if applicable), and so forth.

Without concrete contractual outcomes, the vendor can vanish the candy into thin air with no repercussions. Sure, in a concrete transformation proposal, the end result will probably not be your Turkish Delight dreams. It might resemble a bowl of ordinary M&Ms. Or maybe a tasty grab-bag of Lindt truffles. (You’d have to get particularly lucky for it get much beyond the realm of grocery-store candy, though.) But you’re much more likely to actually get a good outcome.

Off-hand, I can think of one public example where a prominent “change the industry” vendor partnership with an enterprise, seems to have resulted in a credible product: Microsoft’s Connected Vehicle Platform. There, Microsoft signed a deal with a collection of automakers, each of whom had specific outcomes they wished to achieve — outcomes which could be realistically achieved in a reasonable amount of time, and representing industry advancement but not anything truly revolutionary. Microsoft built upon those individual projects to deliver a platform which would move the industry forward, which was announced with a clear mission and a timeframe for launch. Sure, it didn’t “change the future of cars”, but it brought tangible benefits to the customers.

Vendors often try to sell to who you hope to be, rather than who you are now. Your aspirations aren’t bad. Just make sure that your aspirations are well-defined and there’s a realistic roadmap to achieve them. Hope is not a strategy. The vendor may have little incentive not to promise everything  you could dream of, in order to get you to sign a large purchase agreement.

%d bloggers like this: