Blog Archives
Recent research
I’m at Gartner’s business continuity management summit (BCM2) this week, and my second talk, coming up later this morning, is on the relevance of colocation and cloud computing (i.e., do-it-yourself external solutions) to disaster recovery.
My recent written research has all been focused on cloud, although plenty of my day-to-day client time has been spent on more traditional services — colocation, data center leasing, managed hosting, CDN services. Yet cloud remains a persistently hot topic, particularly since it’s now difficult to have a discussion about most of the other areas I cover without also getting into utility/cloud and future data center strategy.
Here’s what I’ve published recently:
How to Select a Cloud Computing Infrastructure Provider. This is a lengthy document that takes you methodically through the selection process of a provider for cloud infrastructure services, and provides an education in the sorts of options that are currently available. There’s an accompanying Toolkit: Comparing Cloud Computing Infrastructure Providers, which is a convenient spreadsheet for collecting all of this data for multiple providers, and scoring each of them according to your needs.
Cool Vendors in Cloud Computing System and Application Infrastructure, 2009. Our Cool Vendors notes highlight small companies that we think are doing something notable. These aren’t vendor recommendations, just a look at things that are interesting in the marketplace. This year’s selections were AppZero, Engine Yard, Enomaly, LongJump, ServePath (GoGrid), Vaultscape, and Voxel. (Note for the cynical: Cool Vendor status can’t be bought in any way, shape, or form; client status is not a consideration at any point, and these kinds of small vendors often don’t have the money to spend on research anyway.)
Key Issues for Managed and Professional Network Services, 2009. I’m not the primary author for this, but I contributed to the section on cloud-based services. This note is targeted at carriers and other network service providers, providing a broad overview of things they need to be thinking about in the next year.
I’m keeping egregiously busy. I recently did my yearly corporate work plan, which tracks my productivity metrics. Measured against our averages, I’ve already done a full year’s worth of work, and it’s April. That’s the kind of year it’s been. It’s an exciting time in the market, though.
McKinsey on cloud computing
McKinsey is claiming, in a report called Clearing the Air on Cloud Computing, that cloud infrastructure (specifically Amazon EC2) is as much as 150% more expensive than in-house data center infrastructure (specifically a set of straw-man assumptions given by McKinsey).
In my opinion, McKinsey’s report lacks analytical rigor. They’ve crunched all data center costs down to a “typical” cost of assets, but in reality, these costs vary massively with the size of one’s IT infrastructure. They’ve reduced the cloud to the specific example of Amazon. They seem to have an inconsistent definition of what a compute core actually is. And they’ve simply assumed that cloud infrastructure gets you a 10% labor savings. That’s one heck of an assumption, given that the whole analysis is underpinned by it. The presentation is full of very pretty charts, but they are charts founded on what appears to be a substantial amount of guesswork.
Interestingly, McKinsey also talks about enterprises setting their internal SLAs at 99.99%, vs. Amazon’s 99.95% on EC2. However, most businesses meet those SLAs through luck. Most enterprise data centers have mathematical uptimes below 99.99% (i.e., availability as calculated from mean time between failures and mean time to repair), and a single server sitting in one of those data centers certainly has a mathematical uptime below that point. There is a vast gulf between engineering for reliability, and just trying to avoid attracting the evil eye. (Of course, sometimes cloud providers die at the hands of their own engineering safeguards.) Everyone wants 99.99% availability — but they often decide against paying for it, once they find out what it actually costs to achieve it reliably, in the mathematical sense.
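To make those SLA figures concrete, here’s the quick arithmetic as a trivial Python sketch (nothing here beyond the definition of availability):

    # Translate availability percentages into allowed downtime per year.
    HOURS_PER_YEAR = 24 * 365

    for sla in (0.9999, 0.9995):
        downtime_minutes = (1 - sla) * HOURS_PER_YEAR * 60
        print('%.2f%% allows %.0f minutes of downtime per year' % (sla * 100, downtime_minutes))

    # 99.99% allows 53 minutes of downtime per year
    # 99.95% allows 263 minutes of downtime per year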
In my December note, Dataquest Insight: A Service Provider Roadmap to the Cloud Infrastructure Transformation, I wrote that Gartner’s Key Metrics data for servers (fully loaded, broken-out costs for running data centers of various sizes) showed that for larger IT infrastructure bases, cloud infrastructure represented a limited cost savings on a TCO basis — but that it was highly compelling for small and mid-sized infrastructures. (Note that business size and infrastructure size don’t necessarily correlate; infrastructure size depends on how heavily the business relies on IT.) Our Key Metrics numbers — a database gathered from examining the costs of thousands of businesses, broken down into hardware, software, data center facilities, labor, and more — show internal costs far higher than McKinsey cites, even for larger, more efficient organizations.
The primary cost savings from cloud infrastructure do not come from the hard assets. If you do an analysis based on the assumption that this is where it saves you money, your analysis will be flawed. Changing capex to opex, and taking advantage of a cloud provider’s greater purchasing power, can and will drive significant financial benefits for small to mid-size IT organizations that use the cloud. However, a substantial chunk of the benefits comes from reducing labor costs. You cannot analyze the cost of the cloud and simply handwave away the labor differences. Labor costs on a per-CPU basis also vary widely — for instance, a large IT organization with substantial automation is going to have much lower per-CPU labor costs than a small business with a network admin who does everything by hand.
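To illustrate the shape of that calculation (every number below is invented for illustration; none of this is Key Metrics data):

    # Hardware amortization plus the labor share, per server, per month.
    # All figures are made up; the point is the sensitivity to staffing ratios.
    def monthly_cost_per_server(hw_price, amortization_years,
                                admin_salary, servers_per_admin):
        hardware = hw_price / (amortization_years * 12.0)
        labor = admin_salary / 12.0 / servers_per_admin
        return hardware + labor

    # Same $3,000 server over 3 years, same $90k/year admin:
    print(monthly_cost_per_server(3000, 3, 90000, 20))   # hands-on shop: ~$458/month
    print(monthly_cost_per_server(3000, 3, 90000, 500))  # heavily automated: ~$98/month

The hardware line is identical in both cases; the staffing ratio alone swings the per-server cost by a factor of more than four.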
I’ve been planning to publish some research analyzing the cost of cloud infrastructure vs. the internal data center, based on our Key Metrics data. I’ve also been planning to write, along with one of my colleagues with a finance background, an analysis of cloud financial benefits from a cost of capital perspective. I guess I should get on that…
Google App Engine and other tidbits
As anticipated, Java support on Google App Engine has been announced. To date, GAE has supported only the Python programming language. In keeping with the “phenomenal cosmic power, itty bitty living space” sandboxing that’s become common to cloud execution environments, GAE/Java carries all the restrictions of GAE/Python. However, the already containerized nature of Java applications means that the restrictions probably won’t feel as significant to developers. Many Python libraries and frameworks are not “pure Python”; they include C extensions for speed. Java libraries and frameworks, by contrast, are usually pure Java; the biggest issues in porting Java code to the GAE environment are likely to be the restrictions on system calls and the lack of threads. In generic terms, GAE/Java offers a servlet environment. The other things developers are likely to miss are support for JMS and JMX (Java’s messaging and monitoring, respectively).
Overall, the Java introduction is a definite plus for GAE, and is presumably also an important internal proof point for them — a demonstration that GAE can scale and work with other languages. Also, because there are lots of languages that now target the Java virtual machine (i.e., they’ve got compilers/interpreters that produce byte code for the Java VM) — Clojure and Scala, for instance — as well as ports of other languages, like JRuby, we’ll likely see additional languages available on GAE ahead of Google’s own support for those environments.
Google also followed through on an earlier announcement, adding support for scheduled tasks (“cron”). Basically, at a scheduled time, GAE cron will invoke a URL that you specify. This is useful, but probably not everything people were hoping it would be. It’s still subject to GAE’s normal restrictions; it doesn’t let you invoke a long-running background process. It requires a shift in thinking — for instance, instead of doing a once-daily data cleanup run at 4 am, you ought to be doing cleanup throughout the day, every couple of minutes, a bit of your data set at a time.
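As a rough sketch of that pattern (the handler and model names below are hypothetical, and it assumes a cron.yaml entry pointing at /tasks/cleanup every couple of minutes):

    # Incremental cleanup driven by GAE cron. Each invocation is an
    # ordinary web request, subject to GAE's normal time limits.
    from google.appengine.ext import db, webapp
    from google.appengine.ext.webapp.util import run_wsgi_app

    class Record(db.Model):
        expired = db.BooleanProperty(default=False)

    class CleanupHandler(webapp.RequestHandler):
        def get(self):
            # Process a small slice of the data set per invocation.
            stale = Record.all().filter('expired =', True).fetch(50)
            db.delete(stale)
            self.response.out.write('deleted %d records' % len(stale))

    application = webapp.WSGIApplication([('/tasks/cleanup', CleanupHandler)])

    def main():
        run_wsgi_app(application)

    if __name__ == '__main__':
        main()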
All of that is going to be chewed over thoroughly by the press and blogosphere, and I’ve contributed my two cents to a soon-to-be-published Gartner take on the announcement and GAE itself, so now I’ll point out something that I don’t think has been widely noticed: the unladen-swallow project plan.
unladen-swallow is apparently an initiative within Google’s compiler optimization team, with the goal of achieving a 5x speed-up in CPython (i.e., the normal, mainstream implementation of Python), starting from the 2.6 base (the current version, which is a transition point between the 2.5 used by App Engine and the substantially different Python 3.0). The developers intend to achieve this speed-up in part by moving from the existing custom VM to one built on top of LLVM. (I’ve mentioned Google’s interest in LLVM in the past.) I think this particular approach answers some of the mystery surrounding Google and Python 3.0 — it seems to indicate a longer-term commitment to the existing 2.x base, while still being transition-friendly. As is typical with Google’s work with open-source code, they plan to release these changes back to the community.
All of which goes back to a point of mine earlier this week: Although programming language communities strongly resemble fandoms, languages are increasingly fungible. We’re a long way from platform maturity, too.
AWS in Eclipse, and Azure announcements
Amazon’s announcement for today, with timing presumably associated with EclipseCon, is an AWS toolkit for the Eclipse IDE.
Eclipse, an open-source project originally created by IBM and now stewarded by the Eclipse Foundation (IBM also offers commercial products built on it), is one of the most popular IDEs (the other is Microsoft Visual Studio). Originally designed for Java applications, it has since been extended to support many other languages and environments.
Integrating with Eclipse is a useful step for Amazon, and hopefully other cloud providers will follow suit. It’s also a competitive response to the integration that Microsoft has done between Visual Studio and its Azure platform.
Speaking of Azure, as part of a set of announcements, Microsoft has said that it’s supporting non-.NET languages on Azure via FastCGI. FastCGI is a web server interface that keeps your application running in a persistent process, so scripts are compiled and loaded once instead of on every request, which reduces computational overhead. You can run most languages under it, including Java, but it doesn’t really give you the full feature set that you get from tight, language-specific integration with the web server. (Note that because .NET’s languages encompass anything that supports the CLR, users already had some reasonable access to non-C# languages on Azure — implementations like Ruby.NET, IronRuby, IronPython, etc.)
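For the curious, here’s the general FastCGI model in miniature — a generic Python example using the flup library, not anything Azure-specific. The process starts once and then serves many requests, so import and compile costs are paid a single time.

    # A long-lived FastCGI worker: the web server forwards requests to
    # this persistent process instead of spawning a new one per hit.
    from flup.server.fcgi import WSGIServer

    def app(environ, start_response):
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return ['Hello from a persistent FastCGI worker\n']

    if __name__ == '__main__':
        WSGIServer(app).run()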
Also, in an interesting Q&A on a ZDnet blog post, Microsoft said that there will be no private Azure-based clouds, i.e., enterprises won’t be able to take the Azure software and host it in their own data centers. What’s not clear is whether software written for Azure will be portable into the enterprise environment. Portability of this sort is a feature that Microsoft, with its complete control over the entire stack, is uniquely well-positioned to deliver.
Gartner BCM summit pitches
I’ve just finished writing one of my presentations for Gartner’s Business Continuity Management Summit. My pitch looks at colocation, as well as the future of cloud infrastructure, for disaster recovery purposes. (My other pitch at the conference is on network resiliency.)
When I started writing this, I expected that the providers who had indicated they’d have formal cloud DR services out shortly would be able to brief me on what they were planning to offer. Unfortunately, that turned out not to be the case. So the pitch is now more focused on do-it-yourself cloud DR.
Lightweight DR services have appeared in, and disappeared from, the market at an interesting rate ever since Inflow (many years and many acquisitions ago) began offering a service aimed at smaller mid-market customers that typically couldn’t afford full-service DR solutions. It’s a natural complement to colocation (in fact, a substantial percentage of the people who use colo do it for a secondary site), and now, a natural complement to the cloud.
Research du jour
My newest research notes are all collaborative efforts.
Forecast: Sizing the Cloud; Understanding the Opportunities in Cloud Services. This is Gartner’s official take on cloud segmentation and forecasting through 2013. It was a large-team effort; my contribution was primarily on the compute services portion.
Invest Insight: Content Delivery Network Arbitrage Increases Market Competition. This is a note specifically for Gartner Invest clients, written in conjunction with my colleague Frank Marsala (a former sell-side analyst who heads up our telecom sector for investors). It’s primarily about Conviva, and also touches on Cotendo, but its key point is to look not at particular companies, but at long-term, technology-enabled trends.
Cool Vendors in Cloud Computing Management and Professional Services, 2009. This is part of our annual “cool vendors” series highlighting small vendors who we think are doing something notable. It’s a group effort, and we pick the vendors via committee. (And no, there is no way to buy your way into the report.) This year’s picks (never a secret, since vendors usually do press releases) are Appirio, CohesiveFT, Hyperic, RightScale, and Ylastic.
Sun, IBM, and the cloud
The morning’s hot rumor: IBM and Sun are in acquisition talks. The punditry is in full swing in the press. My mailbox here at work is filling rapidly with research-community discussion of the implications, too. (As if Cisco’s Unified Computing System wasn’t creating enough controversy for the week.)
Don’t let that buzz drown out Sun’s cloud announcement, though. An insider has useful detailed comments, along with links to the API itself. It’s Q-Layer inside, a RESTful API on top, and clearly in the early stages of development. I’ll likely post some further commentary once I get some time to read through all the documentation and think it through.
Linkage du jour
Tossing a few links out there…
In the weekend’s biggest cloud news, Microsoft’s Azure was down for 22 hours. It’s now back up, with no root cause known.
Geva Perry has posted a useful Zoho Sheet calculator for figuring out whether an Amazon EC2 reserved instance will save you money over an unreserved instance.
Craig Balding has posted a down-to-earth dissection of PCI compliance in the cloud, and the practical reality that cloud infrastructure providers tend to deal with PCI compliance by encouraging you to push the actual payment stuff off to third parties.
Google App Engine updates
For those of you who haven’t been following Google’s updates to App Engine, I want to call your attention to a number of recent announcements. At the six-month point of the beta, I asked when App Engine would be enterprise-ready; now, as we approach the one-year mark, these announcements show progress and a roadmap toward addressing many of the issues I raised in my previous post.
Paid usage. Google is now letting applications grow beyond the free limits. You set quotas for various resources, and pay for what you use. I still have concerns about the quota model, but being able to bill for these services is an important step for Google. Google intends to be price-competitive with Amazon, but there’s an important difference: there’s still some free service. Google anticipates that the free quotas are enough to serve about five million page views (5 MPVs) per month. 5 MPVs is a lot; it pretty much means that if you’re willing to write to the platform, you can easily host your hobby project on it for free. For that matter, many enterprises don’t get 5 MPVs’ worth of hits on an individual Web app or site each month — it’s just that the platform restrictions are a barrier to mainstream adoption.
Less aggressive limits and fewer restrictions. Google has removed or reduced some limits and restrictions that were significant frustrations for developers.
Promised new features. Google has announced that it’s going to provide APIs for some vital bits of functionality that it doesn’t currently allow, like the ability to run scheduled jobs and background processes.
Release of Python 3.0. While there’s no word on how Google plans to manage the 3.0 transition for App Engine, it’s interesting to see how many Python contributors have been absorbed into Google.
Speaking personally, I like App Engine. Python is my strongest scripting language, so I prefer to write in it whenever possible. I also like Django, though I’ll grant that Google’s own framework is easier to get started with (it’s very easy to crank out basic stuff). Like a lot of people, I’ve had trouble adjusting to the non-relational database, but that’s mostly a matter of programming practice. It is, however, clear that the platform is still in its early stages. (I once spent several hours of a weekend tearing my hair out over something that didn’t work, only to eventually find that it was a known bug in the engine.) But Google continues to work at improving it, and it’s worth keeping an eye on to see what it will eventually become. Just don’t expect it to be enterprise-ready this year.
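For a flavor of that adjustment (model and property names here are hypothetical): the datastore has no joins, so data you would normalize in SQL often gets duplicated onto the entity itself.

    from google.appengine.ext import db

    class BlogPost(db.Model):
        title = db.StringProperty(required=True)
        # Denormalized: the author's display name lives on each post,
        # rather than being joined in from an Author table at query time.
        author_name = db.StringProperty()
        posted = db.DateTimeProperty(auto_now_add=True)

    # The ten most recent posts; no JOIN needed (or possible).
    recent = BlogPost.all().order('-posted').fetch(10)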
Amazon announces reserved instances
Amazon’s announcement du jour is “reserved instances” for EC2.
Basically, with a reserved instance, you pay an up-front non-refundable fee for a one-year term or a three-year term. That buys you a discount on the usage fee for that instance, during that period of time. Reserved instances are only available for Unix flavors (i.e., no Windows) and, at present, only in the US availability zones.
Let’s do some math to see what the cost savings turn out to be.
An Amazon small instance (1 virtual core equivalent to a 1.0-1.2 GHz 2007 Opteron or Xeon) is normally $0.10 per hour. Assuming 720 hours in a month, that’s $72 a month, or $864 per year, if you run that instance full-time.
Under the reserved instance pricing scheme, you pay $325 for a one-year term, then $0.03 per hour. That works out to $21.60 per month, or about $259 per year. Add in the reservation fee and you’re at roughly $584 for the year, averaging out to about $49 per month — a pretty nice cost savings.
On a three-year basis, unreserved would cost you $2,592; reserved, full-time, is a $500 one-time fee plus usage, for a grand total of about $1,278. That’s a big savings over the base price, averaging out to about $35 per month.
This is important because at the unreserved prices, on a three-year cash basis, it’s cheaper to just buy your own servers. At the reserved price, does that equation change?
Well, let’s see. Today, in a Dell PowerEdge R900 (a reasonably popular server for virtualized infrastructure), I can get a four-socket server populated with quad-cores for around $15,000. That’s sixteen Xeon cores clocking at more than 2 GHz. Call it $1000 per modern core; split up over a 3-year period, that’s about $28 per month. Cheaper than the reserved price, and much less than the unreserved price.
Now, this is a crude, hardware-only, three-year cash calculation, of course, and not a TCO calculation. But it shows that if you plan to run your servers full-time on Amazon, it’s not as cheap as “it’s just three cents an hour!” makes it sound.
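For anyone who wants to check the arithmetic, here it is as a quick script (prices as quoted above; the per-core hardware figure comes from the Dell example):

    # Cash cost of one EC2 small instance running full-time.
    HOURS_PER_MONTH = 720

    def total_cost(months, hourly_rate, upfront_fee=0.0):
        return upfront_fee + hourly_rate * HOURS_PER_MONTH * months

    print(total_cost(12, 0.10))        # unreserved, 1 year:   864.0
    print(total_cost(12, 0.03, 325))   # reserved,   1 year:   584.2
    print(total_cost(36, 0.10))        # unreserved, 3 years: 2592.0
    print(total_cost(36, 0.03, 500))   # reserved,   3 years: 1277.6

    # Owned hardware at ~$1,000 per modern core, hardware only:
    print(1000.0 / 36)                 # ~27.8 per core-month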