Monthly Archives: June 2010
Abstracting IaaS… or not
Customer inquiry around cloud IaaS these days is mostly of a practical sort — less “what is my broad long-term strategy going to be” or “help me with my initial pilot” like it was last year, and more “I’m going to do something big and serious, help me figure out how to source it”.
My inquiry volume is nothing short of staggering (my entire work day is shoved into back-to-back 30-minute calls, so if you talk to me and I sound a bit anxious to keep to schedule, that’s why). I’m currently clinging to the desperate hope that if I spend more time writing, people will consult the written work first, which will free me from having to go over basics in calls and hopefully result in better answers for clients as well as greater sanity for myself.
Thus, I have been trying to cram in a lot of writing in my evenings. At the moment, I’m working on a series of notes covering IaaS soup to nuts, going over everything from the different ways that compute resources are implemented and offered, to the minutiae of what capabilities are available in customer portals.
It strikes me that in more than ten years of covering the hosting industry as an analyst, this is the first time that I’ve written deep-dive architectural notes. No one has really cared in the past whether or not, say, a vendor uses NPIV in their storage fabric, or whether the chips in the servers support EPT/RVI. That’s all changing with cloud IaaS, once people get down into the weeds and look into adopting it for production systems.
It’s vastly ironic that now, in this age of the fluffy wonderful abstraction of infrastructure as a service, IT buyers are obsessing over the minutiae of exactly how this stuff is implemented.
It matters, of course. A core is not just a core; the performance of that core for your apps is going to determine bang-for-your-buck if you’re compute bound. The fine niggling details of implementation of fill-in-the-blank-here will result in different sorts of security vulnerabilities and different ways that those vulnerabilities are addressed. And so on. The IT buyers who are delving into this stuff aren’t being paranoid or crazy, really; when you get right down to what’s going through their heads, they want to evaluate how it’s done versus how they’d do it themselves.
It’s a key difference between IaaS and PaaS thinking in the heads of customers. In PaaS, you trust that it will work as long as you write to the APIs, and you surrender control over the underlying implementation. In IaaS, you’re getting something so close to bare metal that you start really wondering about what’s there, because you’re comparing it directly to your own data center.
I think that over time this will be something that simply gets addressed with SLAs that guarantee particular levels of availability, performance, and so forth, along with some transparency and strong written guarantees around security. But the industry hasn’t hit that level of maturity yet, which means that for the moment, customers will and probably should do deep dives scrutinizing exactly what it is that they’re being offered when they contemplate IaaS solutions.
The cloud is not magic
Just because it’s in the cloud doesn’t make it magic. And it can be very, very dangerous to assume that it is.
I recently talked to an enterprise client who has a group of developers who decided to go out, develop, and run their application on Amazon EC2. Great. It’s working well, it’s inexpensive, and they’re happy. So Central IT is figuring out what to do next.
I asked curiously, “Who is managing the servers?”
The client said, well, Amazon, of course!
Except Amazon doesn’t manage guest operating systems and applications.
It turns out that these developers believed in the magical cloud — an environment where everything was somehow mysteriously being taken care of by Amazon, so they had no need to do the usual maintenance tasks, including worrying about security — and had convinced IT Operations of this, too.
Imagine running Windows. Installed as-is, and never updated since then. Without anti-virus, or any other security measures, other than Amazon’s default firewall (which luckily defaults to largely closed).
Plus, they also assumed that auto-scaling was going to make their app magically scale. But their app isn’t designed to scale horizontally, and auto-scaling can only add instances; it can’t automagically make an application able to use them. Somebody is going to be an unhappy camper.
Cautionary tale for IT shops: Make sure you know what the cloud is and isn’t getting you.
Cautionary tale for cloud providers: What you’re actually providing may bear no resemblance to what your customer thinks you’re providing.
Hope is not engineering
My enterprise clients frequently want to know why fill-in-the-blank-cloud-IaaS only has a 99.95% SLA. “That’s more than four hours of downtime a year!” they cry. “More than twenty minutes a month! I can’t possibly live with that! Why can’t they offer anything better than that?”
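The downtime figures those clients cite are just the SLA percentage turned into a time budget. A minimal sketch of that arithmetic (illustrative only; real SLAs define their own measurement windows and exclusions):

```python
# Downtime allowed under a given availability SLA.
# Illustrative arithmetic only, not any provider's actual SLA terms.

HOURS_PER_YEAR = 24 * 365  # ignoring leap years

def downtime_allowed(sla_pct, window_hours):
    """Hours of downtime permitted over a window at a given SLA percentage."""
    return window_hours * (1 - sla_pct / 100)

yearly = downtime_allowed(99.95, HOURS_PER_YEAR)
monthly = downtime_allowed(99.95, HOURS_PER_YEAR / 12)

print(f"99.95% SLA: {yearly:.2f} hours/year")        # about 4.38 hours
print(f"99.95% SLA: {monthly * 60:.1f} min/month")   # about 21.9 minutes
```

Which is exactly where the “more than four hours a year, more than twenty minutes a month” complaint comes from.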
The answer to that is simple: There is a significant difference between engineering and hope. Many internal IT organizations, for instance, set service-level objectives based on what they hope to achieve, rather than on the level that the solution is engineered to achieve, i.e., the level it can mathematically be expected to deliver based on the calculated mean time between failures (MTBF) of each component of the service. Many organizations are lucky enough to achieve service levels that are higher than the engineered reliability of their infrastructure. IaaS providers, however, are likely to base their SLAs on their engineered reliability, not on hope.
If a service provider is telling you the SLA is 99.95%, it usually means they’ve got a reasonable expectation, mathematically, of delivering a level of availability that’s 99.95% or higher.
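One way to see why the engineered number sits where it does: when a service depends on every link in a chain being up, the expected availability is the product of the component availabilities. A rough sketch, with made-up component figures (assumptions for illustration, not any provider’s actual data):

```python
# Engineered availability of a serial dependency chain: if every
# component must be up for the service to be up, expected availability
# is the product of the individual availabilities.
# The component figures below are illustrative assumptions.

components = {
    "power (utility + UPS)": 0.9999,
    "network":               0.9998,
    "storage":               0.9999,
    "hypervisor host":       0.9995,
}

availability = 1.0
for name, a in components.items():
    availability *= a

print(f"Engineered availability: {availability:.4%}")
# roughly 99.91% -- lower than any single component, which is why a
# provider's SLA sits below the reliability of its best-behaved parts.
```

The composite is always worse than the weakest component, so an honest, engineering-based SLA ends up looking “low” compared to the reliability of any individual piece.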
My enterprise client, with his data center that has a single UPS and no generator (much less dual power feeds, multiple carriers and fiber paths, etc.), with a single, non-HA, non-load-balanced server (which might not even have dual power supplies, dual NICs, etc.), will tell me that he’s managed to have 100% uptime on this application in the past year, so fie on you, Mr. Cloud Provider.
I believe that uptime claim. He’s gotten lucky. (Or maybe he hasn’t gotten lucky, but he’ll tell me that the power outage was an anomaly and won’t happen again, or that incident happened during a maintenance window so it doesn’t count.)
A service provider might be willing to offer you a higher SLA. It’s going to cost you, because once you get past a certain point, mathematically improving your reliability starts to get really, really expensive.
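The diminishing returns are easy to see with the standard redundancy formula: n independent redundant instances, any one of which can carry the load, give availability 1 − (1 − a)^n. A sketch with an assumed per-instance availability (the numbers are illustrative, not a pricing model):

```python
# Why pushing availability past a certain point gets expensive:
# each additional redundant instance costs a full extra unit but
# removes an ever-shrinking slice of expected downtime.
# The 0.999 per-instance figure is an illustrative assumption.

def parallel_availability(a, n):
    """Availability of n independent redundant instances,
    any one of which can carry the load."""
    return 1 - (1 - a) ** n

single = 0.999  # assumed availability of one instance

for n in (1, 2, 3):
    avail = parallel_availability(single, n)
    downtime_min = (1 - avail) * 365 * 24 * 60
    print(f"{n} instance(s): {avail:.9f} "
          f"(~{downtime_min:.2f} min/year expected downtime)")
```

Going from one instance to two buys you almost the entire improvement; the third instance costs just as much and buys almost nothing. That, in a nutshell, is why each additional nine on an SLA costs dramatically more than the last.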
Now, that said, I’m not necessarily a fan of all cloud IaaS providers’ SLAs. But I encourage anyone looking at them (or a traditional hosting SLA, for that matter), to ponder the difference between engineering and hope.