Don’t be surprised when “move fast and break things” results in broken stuff
Of late, I’ve been talking to a lot of organizations that have learned cloud lessons the hard way — and even more organizations who are newer cloud adopters who seem absolutely determined to make the same mistakes. (Note: Those waving little cloud-repatriation flags shouldn’t be hopeful. Organizations are fixing their errors and moving on successfully with their cloud adoption.)
If your leadership adopts the adage, “Move fast and break things!” then no one should be surprised when things break. If you don’t adequately manage your risks, sometimes things will break in spectacularly public ways, and result in your CIO and/or CISO getting fired.
Many organizations that adopt that philosophy (often with the corresponding imposition of “You build it, you run it!” upon application teams) not only abdicate responsibility to the application teams, but they lose all visibility into what’s going on at the application team level. So they’re not even aware of the risks that are out there, much less whether those risks are being adequately managed. The first time central risk teams become aware of the cracks in the foundation might be when the building collapses in an impressive plume of dust.
(Note that boldness and the willingness to experiment are different from recklessness. Trying out new business ideas that end up failing, attempting different innovative paths for implementing solutions that end up not working out, or rapidly trying a bunch of different things to see which works well — these are calculated risks. They’re absolutely things you should do if you can. That’s different from just doing everything at maximum speed and not worrying about the consequences.)
Just like cloud cost optimization might not be a business priority, broader risk management (especially security risk management) might not be a business priority. If adding new features is more important than address security vulnerabilities, no one should be shocked when vulnerabilities are left in a state of “busy – fix later”. (This is quite possibly worse than “drunk – fix later“, as that at least implies that the fix will be coming as soon as the writer sobers up, whereas busy-ness is essentially a state that tends to persist until death).
It’s faster to build applications that don’t have much if any resilience. It’s faster to build applications if you don’t have to worry about application security (or any other form of security). It’s faster to build applications if you don’t have to worry about performance or cost. It’s faster to build applications if you only need to think about the here-and-now and not any kind of future. It is, in short, faster if you are willing to accumulate meaningful technical debt that will be someone else’s problem to deal with later. (It’s especially convenient if you plan to take your money and run by switching jobs, ensuring you’re free of the consequences.)
“We hope the business and/or dev teams will behave responsibly” is a nice thought, but hope is not a strategy. This is especially true when you do little to nothing to ensure that those teams have the skills to behave responsibly, are usefully incentivized to behave responsibly, and receive enough governance to verify that they are behaving responsibly.
When it all goes pear-shaped, the C-level IT executives (especially the CIO, chief information security officer, and the chief risk officer) are going to be the ones to be held accountable and forced to resign under humiliating circumstances. Even if it’s just because “You should have known better than to let these risks go ungoverned”.
(This usually holds true even if business leaders insisted that they needed to move too quickly to allow risk to be appropriately managed, and those leaders were allowed to override the CIO/CISO/CRO, business leaders pretty much always escape accountability here, because they aren’t expected to have known better. Even when risk folks have made business leaders sign letters that say, “I have been made aware of the risks, and I agree to be personally responsible for them” it’s generally the risk leaders who get held accountable. The business leaders usually get off scott-free even with the written evidence.)
Risk management doesn’t entail never letting things break. Rather, it entails a consideration of risk impacts and probabilities, and thinking intelligently about how to deal with the risks (including implementing compensating controls when you’re doing something that you know is quite risky). But one little crack can, in combination with other little cracks (that you might or might or might not be aware of), result in big breaches. Things rarely break because of black swan events. Rather, they break because you ignored basic hygiene, like “patch known vulnerabilities”. (This can even impact big cloud providers, i.e. the recent Azurescape vulnerability, where Microsoft continued to use 2017-era known-vulnerable open-source code in production.)
However, even in organizations with central governance of risk, it’s all too common to have vulnerability management teams inform you-build-it-you-run-it dev teams that they need to fix Known Issue X. A busy developer will look at their warning, which gives them, say, 30 days to fix the vulnerability, which is within the time bounds of good practice. Then on day 30, the developer will request an extension, and it will probably be granted, giving them, say, another 30 days. When that runs out, the developer will request another extension, and they will repeat this until they run out the extension clock, whereupon usually 90 days or more have elapsed. At that point there will probably be a further delay for the security team to get involved in an enforcement action and actually fix the thing.
There are no magic solutions for this, especially in organizations where teams are so overwhelmed and overworked that anything that might possibly be construed as optional or lower-priority gets dropped on the floor, where it is trampled, forgotten, and covered in old chewing gum. (There are non-magical solutions that require work — more on that in future research notes.)
Moving fast and breaking things takes a toll. And note that sometimes what breaks are people, as the sheer number of things they need to cope with overload their coping mechanisms and they burn out (either in impressive pillars or flame, or quiet extinguishment into ashes).
Hunting the Dread Gazebo of Repatriation
(Confused by the title of this post? Read this brief anecdote.)
The myth of cloud repatriation refuses to die, and a good chunk of the problem is that users (and poll respondents) use “repatriation” is a wild array of ways, but non-cloud vendors want you to believe that “repatriation” means enterprises packing up all their stuff in the cloud and moving it back into their internal data centers — which occurs so infrequently that it’s like a sasquatch sighting.
A non-comprehensive list of the ways that clients use the term “repatriation” that have little to nothing to do with what non-cloud vendors (or “hybrid”) would like you to believe:
Outsourcing takeback. The origin of the term comes from orgs that are coming back from traditional IT outsourcing. However, we also hear cloud architects say they are “repatriating” when they gradually take back management of cloud workloads from a cloud MSP; the workloads stay in the cloud, though.
Migration pause. Some migrations to IaaS/IaaS+PaaS do not go well. This is often the result of choosing a low-quality MSP for migration assistance, or rethinking the wisdom of a lift-and-shift. Orgs will pause, switch MSPs and/or switch migration approaches (usually to lift-and-optimize), and then resume. Some workloads might be temporarily returned on-premise while this occurs.
SaaS portfolio rationalization. Sprawling adoption of SaaS, at the individual, team, department or business-unit level, can result in one or more SaaS applications being replaced with other, official, corporate SaaS (for instance, replacing individual use of Dropbox with an org-wide Google Drive implementation as part of G-Suite). Sometimes, the org might choose to build on-premises functionality instead (for instance, replacing ad-hoc SaaS analytics with an on-prem data warehouse and enterprise BI solution). This is overwhelmingly the most common form of “cloud repatriation”.
Development in the cloud, production on premises. While the dev/prod split of environments is much less common than it used to be, some organizations still develop in cloud IaaS and then run the app in an on-prem data center in production. Orgs like this will sometimes say they “repatriate” the apps for production.
The Oops. Sometimes organizations attempt to put an application in the cloud and it Just Doesn’t Go Well. Sometimes the workload isn’t a good match for cloud services in general. Sometimes the workload is just a bad match for the particular provider chosen. Sometimes they make a bad integrator choice, or their internal cloud skills are inadequate to the task. Whatever it is, people might hit the “abort” button and either rethink and retry in the cloud, or give up and put it on premises (either until they can put together a better plan, or for the long term).
Of course, there are the sasquatch sightings, too, like the Dropbox migration from AWS (also see the five-year followup), but those stories rarely represent enterprise-comparable use cases. If you’re one of the largest purchasers of storage on the planet, and you want custom hardware, absolutely, DIY makes sense. (And Dropbox continues to do some things on AWS.)
Customers also engage in broader strategic application portfolio rationalizations that sometimes result in groups of applications being shifted around, based on changing needs. While the broader movement is towards the cloud, applications do sometimes come back on-premises, often to align to data gravity considerations for application and data integration.
None of these things are in any way equivalent to the notion that there’s a broad or even common movement of workloads from the cloud back on-premises, though, especially for those customers who have migrated entire data centers or the vast majority of their IT estate to the cloud.
(Updated with research: In my note for Gartner clients, “Moving Beyond the Myth of Repatriation: How to Handle Cloud Projects Failures”, I provide detailed guidance on why cloud projects fail, how to reduce the risks of such projects, and how — or if — to rescue troubled cloud projects.)
Why transparency matters in the cloud
A number of people have asked if the advice that Gartner is giving to clients about the cloud, or about Amazon, has changed as a result of Amazon’s outage. The answer is no, it hasn’t.
In a nutshell:
1. Every cloud IaaS provider should be evaluated individually. They’re all different, even if they seem to be superficially based off the same technology. The best provider for you will be dependent upon your use case and requirements. You absolutely can run mission-critical applications in the cloud — you just need to choose the right provider, right solution, and architect your application accordingly.
2. Just like infrastructure in your own data center, cloud IaaS requires management, governance, and a business continuity / disaster recovery plan. Know your risks, and figure out what you’re going to do to mitigate them.
3. If you’re using a SaaS vendor, you need to vet their underlying infrastructure (regardless of whether it’s their own data center, colo, hosting, or cloud).
The irony of the cloud is that you’re theoretically just buying something as a service without worrying about the underlying implementation details — but most savvy cloud computing buyers actually peer at the underlying implementation in grotesquely more detail than, say, most managed hosting customers ever look at the details of how their environment implemented by the provider. The reason for this is that buyers lack adequate trust that the providers will actually offer the availability, performance, and security that they claim they will.
Without transparency, buyers cannot adequately assess their risks. Amazon provides some metrics about what certain services are engineered to (S3 durability, for instance), but there are no details for most of them, and where there are metrics, they are usually for narrow aspects of the service. Moreover, very few of their services actually carry SLAs, and those SLAs are narrow and specific (as everyone discovered recently in this last outage, since it was EBS and RDS that were down and neither have SLAs, with EC2 technically unaffected, so nobody’s going to be able to claim SLA credits).
Without objectively understanding their risks, buyers cannot determine what the most cost-effective path is. Your typical risk calculation multiplies the probability of downtime by the cost of downtime. If the cost to mitigate the risk is lower than this figure, then you’re probably well-advised to go do that thing; if not, then, at least in terms of cold hard numbers, it’s not worth doing (or you’re better off thinking about a different approach that alters the probability of downtime, the cost of downtime, or the mitigation strategy).
Note that this kind of risk calculation can go out the window if the real risk is not well understood. Complex systems — and all global-class computing infrastructures are enormously complex under the covers — have nondeterministic failure modes. This is a fancy way of saying, basically, that these systems can fail in ways that are entirely unpredictable. They are engineered to be resilient to ordinary failure, and that’s the engineering risk that a provider can theoretically tell you about. It’s the weird one-offs that nobody can predict, and are the things that are likely to result in lengthy outages of unknown, unknowable length.
It’s clear from reading Amazon customer reactions, as well as talking to clients (Amazon customers and otherwise) over the last few days, that customers came to Amazon with very different sets of expectations. Some were deep in rose-colored-glasses land, believing that Amazon was sufficiently resilient that they didn’t have to really invest in resiliency themselves (and for some of them, a risk calculation may have made it perfectly sane for them to run just as they were). Others didn’t trust the resiliency, and used Amazon for non-mission-critical workloads, or, if they viewed continuous availability as critical, ran multi-region infrastructures. But what all of these customers have in common is the simple fact that they don’t really know how much resiliency they should be investing in, because Amazon doesn’t reveal enough details about its infrastructure for them to be able to accurately judge their risk.
Transparency does not necessarily mean having to reveal every detail of underlying implementation (although plenty of buyers might like that). It may merely mean releasing enough details that people can make calculations. I don’t have to know the details of the parts in a disk drive to be able to accept a mean time between failure (MTBF) or annualized failure rate (AFR) from the manufacturer, for instance. Transparency does not necessarily require the revelation of trade secrets, although without trust, transparency probably includes the involvement of external auditors.
Gartner clients may find the following research notes helpful:
- Evaluating Cloud Infrastructure as a Service
- Nine Contractual Terms to Reduce Risk in Cloud Contracts
- Amazon EC2: Is It Ready for the Enterprise?
Amazon outage and the auto-immune vulnerabilities of resiliency
Today is Judgment Day, when Skynet becomes self-aware. It is, apparently, also a very, very bad day for Amazon Web Services.
Lots of people have raised questions today about what Amazon’s difficulties today mean for the future of cloud IaaS. My belief is that this doesn’t do anything to the adoption curve — but I do believe that customers who rely upon Amazon to run their businesses will, and should, think hard about the resiliency of their architectures.
It’s important to understand what did and did not happen today. There’s been a popular impression that “EC2 is down”. It’s not. To understand what happened, though, some explanation of Amazon’s infrastructure is necessary.
Amazon divides its infrastructure into “regions”. You can think of a region as basically analogous to “a data center”. For instance, US-East-1 is Amazon’s Northern Virginia data center, while US-West-1 is Amazon’s Silicon Valley data center. Each region, in turn, is divided into multiple “availability zones” (AZs). You can think of an AZ is basically analogous to “a cluster” — it’s a grouping of physical and logical resources. Each AZ is designated by letters — for instance, US-East-1a, US-East-1b, etc. However, each of these designations are customer-specific (which is why Amazon’s status information cannot easily specify which AZ is affected by a problem).
Amazon’s virtual machine offering is the Elastic Compute Cloud (EC2). When you provision an EC2 “instance” (Amazon’s term for a VM), you also get an allocation of “instance storage”. Instance storage is transient — it exists only as long as the VM exists. Consequently, it’s not useful for storing anything that you actually want to keep. To get persistent storage, you use Amazon’s Elastic Block Store (EBS), which is basically just network-attached storage. Many people run databases on EC2 that are backed by EBS, for instance. Because that’s such a common use case, Amazon offers the Relational Database Service (RDS), which is basically an EC2 instance running MySQL.
Amazon’s issues today are with EBS, and with RDS, both in the US-East-1 region. (My guess is that the issues are related, but Amazon has not specifically stated that they are.) Customers who aren’t in the US-East-1 region aren’t affected (customers always choose which region and specific AZs they run in). Customers who don’t use EBS or RDS are also unaffected. However, use of EBS is highly commonplace, and likely just about everyone using EC2 for a production application or Web site is reliant upon EBS. Consequently, even though EC2 itself has been running just fine, the issues have nevertheless had a major impact on customers. If you’re storing your data on EBS, the issues with EBS have made your data inaccessible, or they’ve made access to that data slow and unreliable. Ditto with RDS. Obviously, if you can’t get to your data, you’re not going to be doing much of anything.
In order to get Amazon’s SLA for EC2, you, as a customer, have to run your application in multiple AZs within the same region. Running in multiple AZs is supposed to isolate you from the failure of any single AZ. In practice, of course, this only provides you so much protection — since the AZs are typically all in the same physical data center, anything that affects that whole data center would probably affect all the AZs. Similarly, the AZs are not totally isolated from one another, either physically or logically.
However, when you create an EBS volume, you place it in a specific availability zone, and you can only attach that EBS volume to EC2 instances within that same availability zone. That complicates resiliency, since if you wanted to fail over into another AZ, you’d still need access to your data. That means if you’re going to run in multiple AZs, you have to replicate your data across multiple AZs.
One of the ways you can achieve this is with the Multi-AZ option of RDS. If you’re running a MySQL database and can do so within the constraints of RDS, the multi-AZ option lets you gain the necessary resiliency for your database without having to replicate EBS volumes between AZs.
As one final caveat, data transfer within a region is free and fast — it’s basically over a local LAN, after all. By contrast, Amazon charges you for transfers between regions, which goes over the Internet and has the attendant cost and latency.
Consequently, there are lots of Amazon customers who are running in just a single region. A lot of those customers may be running in just a single AZ (because they didn’t architect their app to easily run in multiple AZs). And of the ones who are running in multiple AZs, a fair number are reliant upon the multi-AZ functionality of RDS.
That’s why today’s impacts were particularly severe. US-East-1 is Amazon’s most popular region. The problems with EBS impacted the entire region, as did the RDS problems (and multi-AZ RDS was particularly impacted), not just a single AZ, so if you were multiple-AZ but not multi-region, the resiliency you were theoretically getting was of no help to you. Today, people learned that it’s not necessarily adequate to run in multiple AZs. (Justin Santa Barbara has a good post about this.)
My perspective on this is pretty much exactly what I would tell a traditional Web hosting customer who’s running only in one data center: If you want more resiliency, you need to run in more than one data center. And on Amazon, if you want more resiliency, you need to not only be multi-AZ but also multi-region.
Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means that you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was EBS and RDS which weren’t, and neither of those services have SLAs.
So how did Amazon end up with a problem that affected all the AZs within the US-East-1 region? Well, according to their status dashboard, they had some sort of network problem last night in their east coast data center. That problem resulted in their automated resiliency mechanisms attempting to re-mirror a large number of EBS volumes. This impacted one of the AZs, but it also overloaded the control infrastructure for EBS in that region. My guess is that RDS also uses this same storage infrastructure, so the capacity shortages and whatnot created by all of this activity ended up also impacting RDS.
My colleague Jay Heiser, who follows, among other things, risk management, calls this “auto-immune disease” — i.e., resiliency mechanisms can sometimes end up causing you harm. (We’ve seen auto-immune problems happen before in a prior Amazon S3 outage, as well as a Google Gmail outage.) The way to limit auto-immune damage is isolation — ensuring limits to the propagation.
Will some Amazon customers pack up and leave? Will some of them swear off the cloud? Probably. But realistically, we’re talking about data centers, and infrastructure, here. They can and do fail. You have to architect your app to have continuous availability across multiple data centers, if it can never ever go down. Whether you’re running your own data center, running in managed hosting, or running in the cloud, you’re going to face this issue. (Your problems might be different — i.e., your own little data center isn’t going to have the kind of complex problem that Amazon experienced today — but you’re still going to have downtime-causing issues.)
There are a lot of moving parts in cloud IaaS. Any one of them going wrong can bork your entire site/application. Your real problem is appropriate risk mitigation — the risk of downtime and its attendant losses, versus the complications and technical challenges and costs created by infrastructure redundancy.