Category Archives: Infrastructure

Overpromising

I’ve turned one of my earlier blog entries, Smoke-and-mirrors and cloud software, into a full-blown research note: “Software on Amazon’s Elastic Compute Cloud: How to Tell Hype From Reality” (clients only). It’s a Q&A to put to your software vendor if they suggest that you deploy their solution on EC2, or if you want to do that yourself and are wondering what vendor support you’ll actually get. The information is specific to Amazon (since most client inquiries of this type involve Amazon), but somewhat applicable to other cloud compute service providers, too.

More broadly, I’ve noticed an increasing tendency on the part of cloud compute vendors to over-promise. It’s not credible, and it leaves prospective customers scratching their heads and feeling like someone has tried to pull a fast one on them. Worse still, it could leave more gullible businesses going into implementations that ultimately fail. This is exactly what drives the Trough of Disillusionment of the hype cycle and hampers productive mainstream adoption.

Customers: When you have doubts about a cloud vendor’s breezy claims that sure, it will all work out, ask them to propose a specific solution. If you’re wondering how they’ll handle X, Y, or Z, ask them and don’t be satisfied with assurances that you (or they) will figure it out.

Vendors: I believe that if you can’t give the customer the right solution, you’re better off letting him go do the right thing with someone else. Stretching your capabilities can be positive for both you and your customer, but if your solution isn’t the right path, or is a significantly more difficult path than an alternative, both of you are likely to be happier if that customer doesn’t buy from you right now, at least not in that particular context. Better to come back to this customer eventually, when your technology is mature enough to meet his needs, or to focus on those of the customer’s needs that do suit what you can offer right now. If you screw up a premature implementation, chances are that you won’t get the chance to grow this business the way that you hoped. There are enough early adopters with needs you can meet that you should be going after them instead. There’s nothing wrong with serving start-ups and getting “foothold” implementations in enterprises; don’t bite off more than you can chew.

Almost a decade of analyst experience has shown me that it’s hard for a vendor to get a second chance with a customer if they screwed up the first encounter. Even if, many years later, the vendor has a vastly augmented set of capabilities and is managed entirely differently, a burned customer still tends to look at them through the lens of that initial experience, and often takes that attitude to the various companies they move to. My observation is that in IT outsourcing, customers certainly hold vendor “grudges” for more than five years, and may do so for more than a decade. This is hugely important in emerging markets, as it can dilute early-mover advantages as time progresses.

Job-based vs. request-based computing

Companies are adopting cloud systems infrastructure services in two different ways: job-based, non-interactive “batch processing” computing; and request-based, interactive computing that demands real-time response. The two have distinct requirements, but much as in the olden days of time-sharing, they can potentially share the same infrastructure.

Job-based computing is usually of a number-crunching nature — scientific or high-performance computing. This is the sort of work that users usually like to run on parallel computers with very fast interconnects (InfiniBand or the equivalent), but in the cloud, total compute time may be traded for lower cost, and, eventually, algorithms may be altered to reduce dependency on server-to-server or server-to-storage communication. Putting these jobs in the cloud generally reduces reliance on, and scheduling time on, a fixed amount of supercomputing infrastructure. Alternatively, job-based computing in the cloud may represent one-time computationally intensive projects (transcoding, for instance).

Request-based computing, on the other hand, demands instant response to interaction. This kind of use of the cloud is classically for Web hosting, whether the interaction is based on a user with a browser, or another server making Web services requests. Most of this kind of computing is not CPU-intensive.
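To make the contrast concrete, here is a minimal sketch (purely illustrative Python, not tied to any particular provider): the job-based workload runs through a finite set of work items and exits, caring about total throughput rather than per-item latency, while the request-based workload is a long-lived process that must answer each incoming request immediately.

```python
# Illustrative only: the same computation framed as a batch job vs. a request handler.
from http.server import BaseHTTPRequestHandler, HTTPServer

def crunch(item: int) -> int:
    return item * item                      # stand-in for real number-crunching

# Job-based: work through a finite set of items, then exit; per-item latency barely matters.
def run_batch_job(work_items):
    return [crunch(item) for item in work_items]   # results would typically go to storage

# Request-based: a long-lived server that must respond to each interaction right away.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = str(crunch(42)).encode()     # a small amount of work per request
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    print(run_batch_job(range(10)))
    HTTPServer(("localhost", 8080), Handler).serve_forever()
```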

Observation: Most cloud compute services today target request-based computing, and this is the logical evolution of the hosting industry. However, a significant amount of large-enterprise immediate-term adoption is job-based computing.

Dilemma for cloud providers: Optimize infrastructure with low-power low-cost processors for request-based computing? Or try to balance job-based and request-based compute in a way that maximizes efficient use of faster CPUs?

“Enterprise class” cloud

There seems to be an endless parade of hosting companies eager to explain to me that they have an “enterprise class” cloud offering. (Cloud systems infrastructure services, to be precise; I continue to be careless in my shorthand on this blog, although all of us here at Gartner are trying to get into the habit of using cloud as an adjective attached to more specific terminology.)

If you’re a hosting vendor, get this into your head now: Just because your cloud compute service is differentiated from Amazon’s doesn’t mean that you’re differentiated from any other hoster’s cloud offering.

Yes, these offerings are indeed targeted at the enterprise. Yes, there are in fact plenty of non-startups who are ready, willing, and eager to adopt cloud infrastructure. Yes, there are features that they want (or need) that they can’t get on some of the existing cloud offerings, especially those of the early entrants. But none of that makes these offerings unique.

These offerings tend to share the following common traits:

1. “Premium” equipment. Name-brand everything. HP blades, Cisco gear except for F5’s ADCs, etc. No white boxes.

2. VMware-based. This reflects the fact that VMware is overwhelmingly the most popular virtualization technology used in enterprises.

3. Private VLANs. Enterprises perceive private VLANs as more secure.

4. Private connectivity. That usually means Internet VPN support, but also the ability to drop your own private WAN connection into the facility. Enterprises who are integrating cloud-based solutions with their legacy infrastructure often want to be able to get MPLS VPN connections back to their own data center.

5. Colocated or provider-owned dedicated gear. Not all workloads virtualize well, and some things are available only as hardware. If you have Oracle RAC clusters, you are almost certainly going to run them on dedicated servers. People have Google search appliances, hardware ADCs custom-configured for complex tasks, black-box encryption devices, etc. Dedicated equipment is not going away for a very, very long time. (Clients only: See statistics and advice on what not to virtualize.)

6. Managed service options. People still want support, managed services, and professional services; the cloud simplifies and automates some operations tasks, but we have a very long way to go before it fulfills its potential to reduce IT operations labor costs. And this, of course, is where most hosters will make their money.

These are traits that it doesn’t take a genius to think of. Most are known requirements established through a decade and a half of hosting industry experience. If you want to differentiate, you need to get beyond them.

On-demand cloud offerings are a critical evolution stage for hosters. I continue to be very, very interested in hearing from hosters who are introducing this new set of capabilities. For the moment, there’s also some differentiation in which part of the cloud conundrum a hoster has decided to attack first, creating provider differences for both the immediate offerings and the near-term roadmap offerings. But hosters are making a big mistake by thinking their cloud competition is Amazon. Amazon certainly is a competitor now, but a hoster’s biggest worry should still be other hosters, given the worrisome similarities in the emerging services.

Verizon and Carpathia launch hybrid offerings

Two public cloud announcements from hosting providers this week, with some interesting similarities…

Verizon

Verizon has launched its Computing as a Service (CaaS) offering. This is a virtual data center (VDC) offering, which means that it provides a Web-based GUI within which you provision and manage your infrastructure. You contract for CaaS itself on a one-year term, paying for that base access monthly. Within CaaS, you can provision “farms”, which are individual virtual data centers. Within a farm, you can provision servers (along with the usual storage, load-balancing, firewall, etc.). Farms and servers are on-demand, with daily pricing.

Two things make the Verizon offering distinctive (at least temporarily). First, farms can contain both physical servers and virtual (VMware-based) servers, on an HP C-class blade platform; while hybridized offerings have become increasingly common, Verizon is one of the few to allow them to be managed out of a unified GUI. Second, Verizon offers managed services across the whole platform. By default, you get basic management (including patch management) for the OS and Verizon-provided app infrastructure. You can also upgrade to full managed service. It looks like, compared to similar providers, the Verizon offering is going to be extremely cost-competitive.

Carpathia Hosting

In yet another example of a smaller hoster “growing up” with serious cloud computing ambitions, Carpathia has released an offering it calls Cloud Orchestration. It’s a hybrid utility hosting model, combining its managed dedicated hosting service (AlwaysOn) with scaling on its virtual server offering, InstantOn.

Carpathia has stated it’s the first hybrid offering; I don’t agree that it is. However, I do think that Carpathia has rolled out a notable number of features on its cloud platform (Citrix Xen-based). It’s made a foray into the cloud storage space, based on ParaScale. It also has auto-scaling, including auto-provisioning based on performance and availability SLA violations (the only vendor I know of that currently offers that feature). OS patch management is included, as are other basic managed hosting services. Check out Carpathia CTO Jon Greaves’s blog post on the value proposition, for an indication of where their thinking is at.

Side thought: Carpathia is one of the few Xen-based cloud providers to use Citrix XenServer, rather than open-source Xen. However, now that Citrix is offering XenServer for free, it seems likely that service providers will gradually drift that way. Live migration (XenMotion) will probably be the main thing that drives that switch.

Amazon’s CloudWatch and other features

Catching up on some commentary…

Amazon recently introduced three new features: monitoring, load-balancing, and auto-scaling. (As usual, Werner Vogels has further explanation, and RightScale has a detailed examination.)

The monitoring service, called CloudWatch, provides utilization metrics for your running EC2 instances. This is a premium service on top of the regular EC2 fee; it costs 1.5 cents per instance-hour. The data is persisted for just two weeks, but is independent of running instances. If you need longer-term historical graphing, you’ll need to retrieve and archive the data yourself. There’s some simple data aggregation, but anyone who needs real correlation capabilities will want to feed this data back into their own monitoring tools.
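For illustration, here is a minimal sketch of that retrieve-and-archive pattern. It uses the current Python SDK (boto3) rather than the 2009-era API, and the instance ID is a hypothetical placeholder; the idea is simply to pull recent datapoints and persist them somewhere of your own before the retention window expires.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Pull the last 24 hours of average CPU utilization for one instance (hypothetical ID),
# so the datapoints can be written out to longer-term storage of your choosing.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=end - timedelta(hours=24),
    EndTime=end,
    Period=300,                      # 5-minute datapoints
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    # Replace print() with a write to your own archive (database, flat file, etc.).
    print(point["Timestamp"].isoformat(), point["Average"])
```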

CloudWatch is required in order to use the auto-scaling service, since that service uses the monitoring data to figure out when to launch or terminate instances. Basically, you define business rules for scaling that are based on the available CloudWatch metrics. Developers should take note that this is not magical auto-scaling. Adding or subtracting instances based on metrics isn’t rocket science. The tough part is usually writing an app that scales horizontally, plus automatically and seamlessly making the other configuration changes necessary when you change the number of virtual servers in its capacity pool. (I field an awful lot of client calls from developers under the delusion that they can just write code any way they want, and that simply putting their application on EC2 will remove all worries about scalability.)
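As a rough sketch of what such a business rule looks like when wired up via the API (again boto3, with a hypothetical auto-scaling group name; the thresholds are arbitrary examples): a scaling policy describes the capacity change, and a CloudWatch alarm on average CPU decides when to trigger it.

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# A scaling policy that adds one instance to a hypothetical auto-scaling group.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier",          # hypothetical group name
    PolicyName="scale-out-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# The "business rule": if average CPU across the group stays above 70% for two
# consecutive 5-minute periods, fire the scaling policy above.
cloudwatch.put_metric_alarm(
    AlarmName="web-tier-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-tier"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```

None of this helps if the application itself can’t add and remove nodes gracefully; the rule only decides when to change the instance count.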

The new load-balancing service essentially serves both global and local functions — between availability zones, and between instances within a zone. It’s auto-scaling-aware, but its health checks are connection-based, rather than using CloudWatch metrics. However, it’s free to EC2 customers and does not require use of CloudWatch. Customers who have been using HAproxy are likely to find this useful. It won’t touch the requirements of those who need full-fledged application delivery controller (ADC) functionality and have been using Zeus or the like.
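For example, a connection-based health check on a classic load balancer can be expressed roughly as follows (a boto3 sketch; the load balancer name is hypothetical). Note that the target is a TCP port, not a CloudWatch metric.

```python
import boto3

elb = boto3.client("elb")    # the classic Elastic Load Balancing API

# Connection-based health check: the load balancer opens a TCP connection to port 80
# on each registered instance; CloudWatch metrics play no part in the health decision.
elb.configure_health_check(
    LoadBalancerName="my-web-lb",            # hypothetical name
    HealthCheck={
        "Target": "TCP:80",
        "Interval": 30,                      # seconds between checks
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 3,
    },
)
```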

As always, Amazon’s new features eat into the differentiating capabilities of third-party tools (RightScale, Elastra, etc.), but the “most, but not all, of the way there” nature of their implementations means that third-party tools still add value on top of the baseline. That’s particularly true given that only the load-balancing feature is free.

VMware takes stake in Terremark

I have been crazily, insanely busy, and my frequency of blog posting has suffered for it. On the plus side, I’ve been busy because a huge number of people — users, vendors, investors — want to talk about cloud.

I’ve seen enough questions about VMware investing $20 million in Terremark that I figured I’d write a quick take, though.

Terremark is a close VMware partner (and their service provider of the year for 2008). Data Return (acquired by Terremark in 2007) was the first to have a significant VMware-based utility hosting offering, dating all the way back to 2005. Terremark has since also gotten good traction with its VMware-based Enterprise Cloud offering, which is a virtual data center service. However, Terremark is not just a hosting/cloud provider; it also does carrier-neutral colocation. It has been sinking capital into data center builds, so an external infusion, particularly one directed specifically at funding the cloud-related engineering efforts, is probably welcome.

Terremark has been the leading-edge service provider for VMware-based on-demand infrastructure. It is to VMware’s advantage to get service providers to use its cutting-edge stuff, particularly the upcoming vCloud, as soon as possible, so giving Terremark money to accelerate its cloud plans is a perfectly good tactical move. I don’t think it’s necessary to read any big strategic message into this investment, although undoubtedly it’s interesting to contemplate.

If you worry about hardware, it’s not cloud

If you need more RAM, and you have to call your service provider, who then has to order the RAM, wait until it arrives, and install it in a physical server before you actually get more memory, and who then bills you on a one-off basis for buying and installing that RAM, you’re not doing cloud computing. If you have to negotiate the price of that RAM each time they buy some, you are really, really not doing cloud computing.

I talked to a client yesterday who is in exactly this situation, with a small vendor who calls themselves a cloud computing provider. (I am not going to name names on my blog, in this case.)

Cloud infrastructure services should not be full of one-offs. (The example I cited is merely the worst of this service provider’s offenses against cloud concepts.) It’s reasonable to hybridize cloud solutions with non-cloud solutions, but for basic things — compute cores, RAM, storage, bandwidth — if it’s not on-demand, seamless, and nigh-instant, it’s not cloud, at least not in any reasonable definition of public cloud computing. (“Private cloud”, in the sense of in-house, virtualized data centers, adopts some but not all traits of the public cloud, to varying degrees, and therefore gets cut more slack.)

Cloud infrastructure should be a fabric, not individual VMs that are tied to specific physical servers.

Recent research

I’m at Gartner’s business continuity management summit (BCM2) this week, and my second talk, upcoming later this morning, is on the relevance of colocation and cloud computing (i.e., do-it-yourself external solutions) to disaster recovery.

My recent written research has all been focused on cloud, although plenty of my day-to-day client time has been spent on more traditional services — colocation, data center leasing, managed hosting, CDN services. Yet cloud remains a persistent hot topic, particularly since it’s now difficult to have a discussion about most of the other areas I cover without also getting into utility/cloud and future data center strategy.

Here’s what I’ve published recently:

How to Select a Cloud Computing Infrastructure Provider. This is a lengthy document that takes you methodically through the selection process of a provider for cloud infrastructure services, and provides an education in the sorts of options that are currently available. There’s an accompanying Toolkit: Comparing Cloud Computing Infrastructure Providers, which is a convenient spreadsheet for collecting all of this data for multiple providers, and scoring each of them according to your needs.

Cool Vendors in Cloud Computing System and Application Infrastructure, 2009. Our Cool Vendors notes highlight small companies that we think are doing something notable. These aren’t vendor recommendations, just a look at things that are interesting in the marketplace. This year’s selections were AppZero, Engine Yard, Enomaly, LongJump, ServePath (GoGrid), Vaultscape, and Voxel. (Note for the cynical: Cool Vendor status can’t be bought, in any way, shape, or form; client status is not a consideration at any point, and these kinds of small vendors often don’t have the money to spend on research anyway.)

Key Issues for Managed and Professional Network Services, 2009. I’m not the primary author for this, but I contributed to the section on cloud-based services. This note is targeted at carriers and other network service providers, providing a broad overview of things they need to be thinking about in the next year.

I’m keeping egregiously busy. I recently did my yearly corporate work plan, showing my productivity metrics. I’ve already done a full year of work, based on our average productivity metrics, and it’s April. That’s the kind of year it’s been. It’s an exciting time in the market, though.

Next round, Akamai vs. Limelight

In CDN news this past weekend, a judge has overturned the jury verdict in the Akamai vs. Limelight patent infringement case. Akamai has said it intends to appeal.

The judge cited Muniauction v. Thomson Corp. as the precedent for a judgment as a matter of law, which basically says that if you have a method claim in a patent that involves steps performed by multiple parties, you cannot claim direct infringement unless one party exercises control over the entire process.

I have not read the court filing yet, but based on the citation of precedent, it’s a good guess that because the patented CDN methods generally involve steps beyond the provider’s control, the case falls under this precedent. That was unexpected, at least to me, and for the IP law watchers among you, rather fascinating: in our increasingly federated, distributed, outsourced IT world, this would seem to raise a host of intellectual property issues for multi-party transactions, which are in some ways inherent to Web services.

McKinsey on cloud computing

McKinsey is claiming, in a report called Clearing the Air on Cloud Computing, that cloud infrastructure (specifically Amazon EC2) is as much as 150% more expensive than in-house data center infrastructure (specifically a set of straw-man assumptions given by McKinsey).

In my opinion, McKinsey’s report lacks analytical rigor. They’ve crunched all data center costs down to a “typical” cost of assets, but in reality, these costs vary massively depending upon the size of one’s IT infrastructure. They’ve reduced the cloud to the specific example of Amazon. They seem to have an inconsistent definition of what a compute core actually is. And they’ve simply assumed that cloud infrastructure gets you a 10% labor savings. That’s one heck of an assumption, given that the whole analysis is underpinned by it. The presentation is full of very pretty charts, but they are charts founded on what appears to be a substantial amount of guesswork.

Interestingly, McKinsey also talks about enterprises setting their internal SLAs at 99.99%, vs. Amazon’s 99.95% on EC2. However, most businesses meet those SLAs through luck. Most enterprise data centers have mathematical uptimes below 99.99% (i.e., availability as calculated from mean time between failures and mean time to repair), and a single server sitting in one of those data centers certainly has a mathematical uptime below that point. There is a vast gulf between engineering for reliability and just trying to avoid attracting the evil eye. (Of course, sometimes cloud providers die at the hands of their own engineering safeguards.) Everyone wants 99.99% availability — but they often decide against paying for it, once they find out what it actually costs to reliably, mathematically achieve it.
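To put rough numbers on that gulf, a back-of-the-envelope sketch (the component availabilities at the end are illustrative placeholders, not measured figures):

```python
# What the SLA percentages mean in annual downtime, and why a single server
# inside even a good data center sits below the facility's own availability.
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (0.9999, 0.9995):
    downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.2%} availability -> ~{downtime_minutes:.0f} minutes of downtime per year")

# Serial dependencies roughly multiply: illustrative figures only.
facility, server = 0.9998, 0.999
print(f"facility x server -> ~{facility * server:.4%} effective availability")
```

(99.99% works out to roughly 53 minutes of downtime a year; 99.95% to roughly 4.4 hours.)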

In my December note, Dataquest Insight: A Service Provider Roadmap to the Cloud Infrastructure Transformation, I wrote that Gartner’s Key Metrics data for servers (fully loaded, broken-out costs for running data centers of various sizes) showed that for larger IT infrastructure bases, cloud infrastructure represented a limited cost savings on a TCO basis — but that it was highly compelling for small and mid-sized infrastructures. (Note that business size and infrastructure size don’t necessarily correlate; infrastructure size depends on how heavily the business relies on IT.) Our Key Metrics numbers — a database gathered from examining the costs of thousands of businesses, broken down into hardware, software, data center facilities, labor, and more — show internal costs far higher than McKinsey cites, even for larger, more efficient organizations.

The primary cost savings from cloud infrastructure do not come from savings on the hard assets. If you do an analysis based on the assumption that this is where it saves you money, your analysis will be flawed. Changing capex to opex, and taking advantage of the greater purchasing power of a cloud provider, can and will drive significant financial benefits for small to mid-size IT organizations that use the cloud. However, a substantial chunk of the benefits come from reducing labor costs. You cannot analyze the cost of the cloud and simply handwave away the labor differences. Labor costs on a per-CPU basis also vary widely — for instance, a larger IT organization with substantial automation is going to have much lower per-CPU costs than a small business with a network admin who does everything by hand.
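A toy sensitivity check makes the point; every number below is a purely illustrative placeholder (not Key Metrics data, and not McKinsey’s figures). Whether the cloud comes out cheaper or more expensive swings almost entirely on what you assume about labor.

```python
# Purely illustrative placeholder numbers -- not real cost data.
def annual_cost(infrastructure, labor, labor_savings=0.0):
    """Annual cost: infrastructure spend plus labor, with a fractional labor savings applied."""
    return infrastructure + labor * (1.0 - labor_savings)

# Hypothetical in-house baseline: hardware + facilities = 140, labor = 120 (arbitrary units).
in_house = annual_cost(infrastructure=140.0, labor=120.0)

# Hypothetical cloud equivalent: assume the raw infrastructure is priced somewhat higher,
# then vary only the assumed labor savings and watch the verdict flip.
for labor_savings in (0.10, 0.30, 0.50):
    cloud = annual_cost(infrastructure=160.0, labor=120.0, labor_savings=labor_savings)
    delta = (cloud - in_house) / in_house
    print(f"assumed labor savings {labor_savings:.0%}: cloud is {delta:+.1%} vs. in-house")
```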

I’ve been planning to publish some research analyzing the cost of cloud infrastructure vs. the internal data center, based on our Key Metrics data. I’ve also been planning to write, along with one of my colleagues with a finance background, an analysis of cloud financial benefits from a cost of capital perspective. I guess I should get on that…
