Monthly Archives: April 2009
I’m at Gartner’s business continuity management summit (BCM2) this week, and my second talk, upcoming later this morning, is on the relevance of colocation and cloud computing (i.e., do-it-yourself external solutions) to disaster recovery.
My recent written research has all been focused on cloud, although plenty of my day-to-day client time has been spent on more traditional services: colocation, data center leasing, managed hosting, CDN services. Yet cloud remains a persistent hot topic, particularly since it’s now difficult to have a discussion about most of the other areas I cover without also getting into utility/cloud and future data center strategy.
Here’s what I’ve published recently:
How to Select a Cloud Computing Infrastructure Provider. This is a lengthy document that takes you methodically through the selection process of a provider for cloud infrastructure services, and provides an education in the sorts of options that are currently available. There’s an accompanying Toolkit: Comparing Cloud Computing Infrastructure Providers, which is a convenient spreadsheet for collecting all of this data for multiple providers, and scoring each of them according to your needs.
Cool Vendors in Cloud Computing System and Application Infrastructure, 2009. Our Cool Vendors notes highlight small companies that we think are doing something notable. These aren’t vendor recommendations, just a look at things that are interesting in the marketplace. This year’s selections were AppZero, Engine Yard, Enomaly, LongJump, ServePath (GoGrid), Vaultscape, and Voxel. (Note for the cynical: Cool Vendor status can’t be bought, in any way, shape, or form; client status is not a consideration at any point, and these kinds of small vendors often don’t have the money to spend on research anyway.)
Key Issues for Managed and Professional Network Services, 2009. I’m not the primary author for this, but I contributed to the section on cloud-based services. This note is targeted at carriers and other network service providers, providing a broad overview of things they need to be thinking about in the next year.
I’m keeping egregiously busy. I recently did my yearly corporate work plan, showing my productivity metrics. I’ve already done a full year of work, based on our average productivity metrics, and it’s April. That’s the kind of year it’s been. It’s an exciting time in the market, though.
The judge cited Muniauction v. Thomson Corp. as the precedent for a judgment as a matter of law. That precedent basically says that if you have a method claim in a patent that involves steps performed by multiple parties, you cannot claim direct infringement unless one party exercises control over the entire process.
I have not read the court filing yet, but based on the citation of precedent, it’s a good guess that because the CDN patent methods generally involve steps beyond the provider’s control, they fall under this precedent. Unexpected, at least to me, and for those IP law watchers among you, rather fascinating, since in our increasingly federated, distributed, outsourced IT world, this would seem to raise a host of intellectual property issues for multi-party transactions, which are in some ways inherent to web services.
McKinsey is claiming, in a report called Clearing the Air on Cloud Computing, that cloud infrastructure (specifically Amazon EC2) is as much as 150% more expensive than in-house data center infrastructure (specifically a set of straw-man assumptions given by McKinsey).
In my opinion, McKinsey’s report lacks analytical rigor. They’ve crunched down all data center costs to a “typical” cost of assets, but in reality, these costs vary massively depending upon the size of one’s IT infrastructure. They’ve reduced the cloud to the specific example of Amazon. They seem to have an inconsistent definition of what a compute core actually is. And they’ve simply assumed that cloud infrastructure gets you a 10% labor savings. That’s one heck of an assumption, given that the whole analysis is underpinned by that. The presentation is full of very pretty charts, but they are charts founded on what appears to be a substantial amount of guesswork.
Interestingly, McKinsey also talks about enterprises setting their internal SLAs at 99.99%, vs. Amazon’s 99.95% on EC2. However, most businesses meet those SLAs through luck. Most enterprise data centers have mathematical uptimes below 99.99% (i.e., calculated mean time between failure), and a single server sitting in one of those data centers certainly has a mathematical uptime below that point. There is a vast gulf between engineering for reliability, and just trying to avoid attracting the evil eye. (Of course, sometimes cloud providers die at the hands of their own engineering safeguards.) Everyone wants 99.99% availability — but they often decide against paying for it, once they find out what it actually costs to reliably mathematically achieve it.
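To make the gap between those two SLA figures concrete, here is a minimal sketch of the arithmetic. It assumes a 365-day year, and the `serial_availability` helper assumes that components fail independently and that the service is up only when every component is up. Both are simplifications of mine, not anything from the McKinsey report, and the 99.9% single-server figure in the demo is purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime implied by an availability fraction (e.g. 0.9999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, which real outages often are not.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    for label, availability in [("99.99%", 0.9999), ("99.95%", 0.9995)]:
        minutes = downtime_minutes_per_year(availability)
        print(f"{label}: {minutes:.1f} minutes of downtime per year")
    # Hypothetical: a 99.99% facility hosting a single 99.9% server.
    combined = serial_availability(0.9999, 0.999)
    print(f"facility plus one server: {combined:.4%}")
```

Under these assumptions, 99.99% allows roughly 53 minutes of downtime a year and 99.95% roughly 263 minutes (about 4.4 hours), while a single server inside a 99.99% facility cannot itself reach 99.99% unless the server never fails.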
In my December note, Dataquest Insight: a Service Provider Roadmap to the Cloud Infrastructure Transformation, I wrote that Gartner’s Key Metrics data for servers (fully-loaded, broken-out costs for running data centers of various sizes) showed that for larger IT infrastructure bases, cloud infrastructure represented a limited cost savings on a TCO basis — but that it was highly compelling for small and mid-sized infrastructures. (Note that business size and infrastructure size don’t correlate; that depends on how heavily the business depends on IT.) Our Key Metrics numbers — a database gathered from examining the costs of thousands of businesses, broken down into hardware, software, data center facilities, labor, and more — show internal costs far higher than McKinsey cites, even for larger, more efficient organizations.
The primary cost savings for cloud infrastructure do not come from savings on the hard assets. If you do an analysis based on the assumption that this is where it saves you money, your analysis will be flawed. Changing capex to opex, and taking advantage of the greater purchasing power of a cloud provider, can and will drive significant financial benefits for small to mid-size IT organizations that use the cloud. However, a substantial chunk of the benefits comes from reducing labor costs. You cannot analyze the cost of the cloud and simply handwave the labor differences. Labor costs on a per-CPU basis vary widely as well; for instance, a larger IT organization with substantial automation is going to have much lower per-CPU costs than a small business with a network admin who does everything by hand.
I’ve been planning to publish some research analyzing the cost of cloud infrastructure vs. the internal data center, based on our Key Metrics data. I’ve also been planning to write, along with one of my colleagues with a finance background, an analysis of cloud financial benefits from a cost of capital perspective. I guess I should get on that…
One of the ongoing refrains of the analyst job is listening to clients gripe, day in and day out, about the things they don’t like about their vendors. Sometimes these things are niggling annoyances. Sometimes, though, these things are rage-inducing, or, in clients who tend to take everything calmly in stride, at least a distinct issue that materially impacts the service that they receive.
Sometimes these issues are recurring problems with a given vendor. I can tell you, for instance, that Vendor X has a process and organizational structure in place which essentially incentivizes its operations staff to kick requests from department to department without anyone being accountable for problems being resolved; unsurprisingly, this results in long resolution times for complex cross-functional issues, and frustrated customers. If you are with Vendor X, it’s something that you have to live with, since Vendor X’s internal politics do not permit fixing the core problem.
Sometimes, however, these issues are out of the ordinary, and would benefit from escalation. Yet most of the time, the customer has not said anything to their provider about the issues they’re having, even if they’re so unhappy they’re planning to leave. Or if they’ve said something, they haven’t escalated into management. They don’t want to rock the boat, or disrupt the “relationship”. They’d rather suffer.
Since I have executive-level contacts at most of the service providers that our clients use, I usually offer to put such clients in touch with someone at their vendor who can see to it that real attention gets paid to the problem. Generally, unless their project is on the brink of failure, clients refuse that offer. Sometimes, they’ll permit me to raise the issue with the vendor, in a more anonymous fashion — i.e., something that doesn’t identify them personally, but which might provide just enough of a hint that the vendor can figure out who it is they ought to be helping.
I don’t get this. You are not dating your vendor. If you wait for them to bring you roses and chocolate, you are going to be disappointed. They will not read your mind, or recognize that you are quietly sulking and waiting for them to notice just how hurt you are and beg you to love them again. You are paying what is sometimes an egregious amount of money for services, and you deserve to get what you’re paying for.
To the vendors who wonder why they get anonymized passed-on complaints from analysts: It’s because analysts can be sort of like a combination of newspaper advice columnists, girl-gossip circles, and therapists. We can only do so much to coax clients into being honest with their vendors.
To the IT buyers out there: When you’re dealing with vendor frustrations, why do you seethe in silence, rather than complaining and escalating?
We are, it seems, in the midst of a wave of distributed denial of service attacks. The victims include:
- Neustar’s UltraDNS. (Problems with specific regional DNS clusters, with little customer-visible impact.)
- Register.com. (Severe impact on Web hosting and email customers.)
- GoGrid. (Severe impact on cloud hosting customers.)
- ThePlanet. (Attack on their DNS servers, with severe impact on customers.)
The attack on ThePlanet is unusual in that it received minimal attention in the press, despite the company being one of the largest Web hosters, and having Cisco Guard (DDoS mitigation) appliances in place. Also, the status updates were eventually issued via Twitter, rather than a more expected form of customer communication. Here’s the full text, aggregated off Twitter:
“Between 2:30am and 5:00am CDT on April 8, The Planet’s name servers were flooded again with a large brute force (DDoS) attack. Unlike the previous attack, this attack did not appear to be DNS-specific; instead, targeted resources indirectly supporting DNS services. Because the nature of this attack was different from the previous event, mirroring the response to the previous attack was ineffective. Once our investigation determined the nature of the attack, we applied filters throughout our DNS support system to alleviate the effects. The Planet’s network and DNS performance have been restored, and the attack originator has ceased actions. Any lingering issues may be indicative of a different problem that may have been exacerbated by the attack and should be resolved quickly. We are working on several projects to help mitigate similar attacks in the future. Once those plans are in order, we will update the DNS Status announcement thread in our community forums. We understand that other providers are experiencing similar events. We will reach out to them, pool our information and then work together to find consistencies between attacks. Our goal is to establish best practices as an industry to better respond to these recent events.”
Jose Nazario of Arbor Networks claims these attacks are not Conficker at work, which makes this wave of attacks even more interesting.
The takeaway from this: Customers understand if you get DDoS’d. They don’t put up with a lack of communication. It’s enormously difficult to communicate with customers in the midst of a crisis, especially one that takes down customer-facing infrastructure in a customer-impacting way, but it’s also incredibly critical. Clearly, not everyone in the company is out trying to troubleshoot the problem, so you can usefully put them to work reaching out to your customers, if you have the policies and procedures in place to do so successfully.
Something to think about today, no matter who you are and who you work for: What policies do you have in place for customer communications when a crisis hits your company? (Book recommendation: Eric Dezenhall’s Damage Control, which is a hard-edged, realistic look at communication in a crisis, including coping with competitors who are deliberately fanning the negative-PR flames.)
As anticipated, Java support on Google App Engine has been announced. To date, GAE has supported only the Python programming language. In keeping with the “phenomenal cosmic power, itty bitty living space” sandboxing that’s become common to cloud execution environments, GAE/Java has all the restrictions of GAE/Python. However, the already containerized nature of Java applications means that the restrictions probably won’t feel as significant to developers. Many Python libraries and frameworks are not “pure Python”; they include C extensions for speed. Java libraries and frameworks are, by contrast, usually pure Java; the biggest issues for porting Java into the GAE environment are likely to be the restrictions on system calls and the lack of threads. Generically, GAE/Java offers servlets. The other things that developers are likely to miss are support for JMS and JMX (Java’s messaging and monitoring, respectively).
Overall, the Java introduction is a definite plus for GAE, and is presumably also an important internal proof point for them — a demonstration that GAE can scale and work with other languages. Also, because there are lots of languages that now target the Java virtual machine (i.e., they’ve got compilers/interpreters that produce byte code for the Java VM) — Clojure and Scala, for instance — as well as ports of other languages, like JRuby, we’ll likely see additional languages available on GAE ahead of Google’s own support for those environments.
Google also followed through on an earlier announcement, adding support for scheduled tasks (“cron”). Basically, at a scheduled time, GAE cron will invoke a URL that you specify. This is useful, but probably not everything people were hoping it would be. It’s still subject to GAE’s normal restrictions; this doesn’t let you invoke a long-running background process. It requires a shift in thinking: for instance, instead of doing the once-daily data cleanup run at 4 am, you ought to be doing cleanup throughout the day, every couple of minutes, a bit of your data set at a time.
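That "a bit at a time" pattern can be sketched in plain Python. This is not App Engine code: the in-memory list, the `is_stale` callback, and `cleanup_batch` itself are hypothetical stand-ins for a datastore and for whatever handler sits behind the URL that cron invokes. The only point is that each invocation does a small, bounded amount of work and remembers where to resume.

```python
from typing import Callable, List, Tuple


def cleanup_batch(
    records: List[dict],
    cursor: int,
    batch_size: int,
    is_stale: Callable[[dict], bool],
) -> Tuple[List[dict], int, bool]:
    """Examine at most `batch_size` records starting at `cursor`.

    Returns (surviving records, cursor for the next invocation, pass finished?).
    The cursor resets to 0 once the whole data set has been scanned.
    """
    end = min(cursor + batch_size, len(records))
    kept = [r for r in records[cursor:end] if not is_stale(r)]
    survivors = records[:cursor] + kept + records[end:]
    # Deleted records shift later ones left, so resume right after the kept ones.
    next_cursor = cursor + len(kept)
    finished_pass = next_cursor >= len(survivors)
    return survivors, (0 if finished_pass else next_cursor), finished_pass


if __name__ == "__main__":
    data = [{"id": i, "expired": i % 3 == 0} for i in range(10)]
    cursor, done = 0, False
    # Each loop iteration stands in for one scheduled invocation.
    while not done:
        data, cursor, done = cleanup_batch(data, cursor, 4, lambda r: r["expired"])
    print([r["id"] for r in data])
```

In a real deployment the cursor would have to be persisted between invocations (the sketch just keeps it in a local variable), and the batch size would be chosen so that one batch comfortably fits inside the platform's per-request limits.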
All of that is going to be chewed over thoroughly by the press and blogosphere, and I’ve contributed my two cents to a soon-to-be-published Gartner take on the announcement and GAE itself, so now I’ll point out something that I don’t think has been widely noticed: the unladen-swallow project plan.
unladen-swallow is apparently an initiative within Google’s compiler optimization team, with a goal of achieving a 5x speed-up in CPython (i.e., the normal, mainstream, implementation of Python), starting from the 2.6 base (the current version, which is a transition point between the 2.5 used by App Engine, and the much-different Python 3.0). The developers intend to achieve this speed-up in part by moving from the existing custom VM to one built on top of LLVM. (I’ve mentioned Google’s interest in LLVM in the past.) I think this particular approach answers some of the mystery surrounding Google and Python 3.0 — this seems to indicate longer-term commitment to the existing 2.x base, while still being transition-friendly. As is typical with Google’s work with open-source code, they plan to release these changes back to the community.
All of which goes back to a point of mine earlier this week: Although programming language communities strongly resemble fandoms, languages are increasingly fungible. We’re a long way from platform maturity, too.
A recent interview with some Twitter developers, on Twitter’s use of Scala, has touched off a fair amount of controversy in the Ruby community, and prompted Todd Hoff of High Scalability to muse on an interesting statement: At some point, the cost of servers outweighs the cost of programmers.
We all know that the scripting languages that are frequently favored in Web development today (Ruby, Python, and PHP) do not perform as well as Java, and Java in turn can be outperformed by well-written native C/C++ code. However, these popular dynamic programming languages typically lead to better programmer productivity. The argument has been that it’s more cost-effective to have more productive developers than it is to buy less infrastructure. There is a point, though, when that scale equation can be flipped on its head: when the cost of the servers, due to the performance sacrifices, gets too high. (I would add that you can’t look at simple hardware spend alone, either. You’ve got an infrastructure TCO to look at. It’s not just about more people to maintain more servers, either; that equation is not linear, as a sysadmin can manage more systems if they’re all identical and there are good automation tools. But systems that are struggling due to performance issues soak up operations time with daily firefighting.)
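The flip point can be written down as a toy model. Every number below is a made-up placeholder, not data from Twitter, High Scalability, or Gartner, and the linear cost model is itself an assumption (as noted above, operations cost does not really scale linearly with server count). The `Stack` class and `break_even_load` function are hypothetical names for illustration. The sketch only shows the shape of the argument: the more productive language costs less in developer time but more per unit of load, so there is a load beyond which the faster language is cheaper overall.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """Hypothetical yearly cost profile of building on one language."""
    name: str
    developer_cost: float   # yearly developer spend (placeholder value)
    load_per_server: float  # units of load one server can carry (placeholder)
    cost_per_server: float  # yearly fully loaded cost per server (placeholder)

    def yearly_cost(self, load: float) -> float:
        servers = math.ceil(load / self.load_per_server)
        return self.developer_cost + servers * self.cost_per_server


def break_even_load(productive: Stack, fast: Stack) -> float:
    """Load above which `fast` is cheaper, ignoring whole-server rounding.

    Solves dev_p + (L / r_p) * c_p = dev_f + (L / r_f) * c_f for L.
    """
    per_load_gap = (productive.cost_per_server / productive.load_per_server
                    - fast.cost_per_server / fast.load_per_server)
    if per_load_gap <= 0:
        raise ValueError("the 'fast' stack is not cheaper per unit of load")
    return (fast.developer_cost - productive.developer_cost) / per_load_gap


if __name__ == "__main__":
    # Placeholder figures chosen only to make the crossover visible.
    scripting = Stack("scripting", 500_000, 100, 5_000)
    compiled = Stack("compiled", 800_000, 400, 5_000)
    crossover = break_even_load(scripting, compiled)
    for load in (crossover / 2, crossover, crossover * 2):
        print(load, scripting.yearly_cost(load), compiled.yearly_cost(load))
```

With these placeholder inputs the two stacks cost the same at a load of 8,000 units; below that the productive language wins, above it the faster one does. A more honest model would replace the flat per-server cost with a curve that captures automation and firefighting effects.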
Twitter’s developers are not advocating that people abandon what they know and love, but they’re forging a new path for themselves, with an open-source language developed in academia. Scala can be compiled to either Java or .NET bytecode, allowing it to interoperate bidirectionally with Java and CLR code; this is important for driving adoption because programmers generally like to work with languages that have a solid base of libraries (i.e., someone else has conveniently done the work of producing code for commonly-needed capabilities), and because this makes it possible for Scala to leverage the existing tools community for Java and .NET. Scala’s equivalent of Rails, i.e., a convenient framework, is Lift.
Scala doesn’t have much adoption now, but it’s worth noting that the rapid pace of Web 2.0 innovation is capable of driving extremely fast uptake of things that turn out to solve real-world problems. (For comparison: Not long ago, practically no one had heard of Hadoop, either, but it’s built quite a bit of buzz now.) That’s important for anyone contemplating the long-term future of particular platforms, particularly APaaS offerings that are tied to specific programming languages. The favored platforms can and do change in a tidal fashion — just look at the Google trend graph for Ruby on Rails to see just how aggressively interest can increase over a single year (2005 to 2006).
As a coda to all of this, Twitter’s Alex Payne has a smart blog post, noting that social media fills the vacuum between peer-reviewed journals and water-cooler conversations, yet deploring the fact that in these mediums, emotion can rule over what is measurable. The takeaway — whether you’re an IT manager, a marketing manager at a vendor, or an investor — from my perspective, is this: There’s an emotional context to programming language choice. These are not merely technical communities; these are fandoms, and they form part of a developer’s self-identity.