Monthly Archives: December 2008
Scaling limits and friendly failure
I’m on vacation, and I’ve been playing World of Goo (possibly the single-best construction puzzle game since 1991’s Lemmings by Psygnosis). I was reading the company’s blog (2D Boy), when I came across an entry about BlueHost’s no-notice termination of 2D Boy’s hosting.
And that got me thinking about “unlimited” hosting plans, throttling, limits, and the other challenges of running mass-market hosting — all issues also directly applicable to cloud computing.
BlueHost is a large and reputable provider of mass-market shared hosting. Their accounts are “unlimited”, and their terms of service essentially say that you can consume resources until you negatively impact other customers.
Now, in practice there are limits, and customers are sort of expected to know whether or not their needs fit shared hosting. Most people plan accordingly — although there have been some spectacular failures to do so, such as Sk*rt, a female-focused Digg competitor launched using BlueHost, prompting vast wonder at what kind of utter lack of thought results in trying to launch a high-traffic social networking site on a $7 hosting plan. Unlike Sk*rt, though, it was reasonable for 2D Boy to expect that shared hosting would cover their needs — hosting a small corporate site and blog. They were two guys who were making an indie garage game getting a gradual traffic ramp thanks to word-of-mouth, not an Internet company doing a big launch.
Limits are necessary, but no-notice termination of a legitimate company is bad customer service, however you slice it. Moreover, it’s avoidable bad customer service. Whatever mechanism is used to throttle, suspend service, etc. ought to be adaptable to sending out a warning alert: the “hey, if you keep doing this, you will be in violation of our policies and we’ll have to terminate you” note. Maybe even a, “hey, we will continue to serve your traffic for $X extra, and you have Y time to find a new host or reduce your traffic to normal volumes”. BlueHost does not sell anything beyond its $7 plan, so it has no upsell path; a provider with an upgrade path would hopefully have tried to encourage a migration, rather than executing a cold-turkey cut-off. (By the way, I have been on the service provider side of this equation, so I have ample sympathy for the vendor’s position against a customer whose usage is excessive, but I also firmly believe that no-notice termination of legitimate businesses is not the way to go.)
Automated elastic scaling is the key feature of a cloud. Consequently, limits, and the way they are enforced technically and managed from a customer service standpoint, will be one of the ways that cloud infrastructure providers differentiate their services.
A vendor’s approach to limits has to be tied to their business goals. Similarly, what a customer desires out of limits must also be tied to their business goals. The customer wants reliable service within a budget. The vendor wants to be fairly compensated and ensure that his infrastructure remains stable.
Ideally, on cloud infrastructure, a customer scales seamlessly and automatically until the point where he is in danger of exceeding his budget. At that point, the system should alert him automatically, allowing him to increase his budget. If he doesn’t want to pay more, he will experience degraded service; degradation should mean slower or lower-priority service, or an automatic switch to a “lite” site, rather than outright failure.
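To make the "fail friendly" idea concrete, here is a minimal sketch of the decision a provider's limit system might make on each billing check. The names (`Action`, `limit_action`) and the 80% warning threshold are my own assumptions for illustration, not any vendor's actual policy.

```python
from enum import Enum


class Action(Enum):
    SERVE = "serve normally"
    WARN = "serve normally and alert the customer"
    DEGRADE = "serve degraded (slower, lower priority, or a lite site)"


def limit_action(spend, budget, warn_fraction=0.8):
    """Decide how to treat a customer given spend so far and their budget.

    warn_fraction is a hypothetical threshold: the share of the budget
    at which the customer gets an alert and a chance to raise the budget.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if spend >= budget:
        # Over budget: degrade service rather than cut it off.
        return Action.DEGRADE
    if spend >= warn_fraction * budget:
        return Action.WARN
    return Action.SERVE
```

The point of the sketch is that there is no "terminate" branch: the worst outcome is degraded service, and the customer hears about it before getting there.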
Perhaps when you get right down to it, it’s really about what the failure mode is. Fail friendly. A vendor has a lot more flexibility in imposing limits if it can manage that.
Peer influence and the use of Magic Quadrants
The New Scientist has an interesting article commenting that the long tail may be less potent than previously postulated — and that peer pressure creates a winner-take-all situation.
I was jotting this blog post about Gartner clients and the target audience for the Magic Quadrant, and that article got me thinking about the social context for market research and vendor recommendations.
Gartner’s client base is primarily mid-sized business to large enterprise — our typical client is probably $100 million or more in revenue, but we also serve a lot of technology companies who are smaller than that. Beyond that subscription base, though, we also talk to people at conferences; those attendees usually represent a much more diverse set of organizations. But it’s the subscription base that we mostly talk to. (I carry an unusually high inquiry load — I’ll talk to something on the order of 700 clients this year.)
Normally, I’m interested in the comprehensive range of a vendor’s business (at least insofar as it’s relevant to my coverage). When I do an MQ, though, my subscriber base is the lens through which I evaluate companies. While I’m interested in the ways vendors service small businesses at other times, when it’s in the context of an MQ, I care only about a vendor’s relevance to our clients — i.e., the IT buyers who subscribe to Gartner services and who are reading the MQ to figure out what vendors they want to short-list.
Sometimes, when vendors think about our client base, they mistakenly assume that it’s Fortune 1000 and the largest of enterprises. While we serve those companies, we have more than 10,000 client organizations — so obviously, we serve a lot more than giant entities. The customers I talk to day after day may have a single cabinet in colocation — or fifty data centers of their own. (Sometimes both.) They might have one or two servers in managed hosting, or dozens of websites deployed via multi-dozen-server contracts. They might deliver less than a TB of content per month via a CDN, or they might be one of the largest media companies on the planet, with staggering video volumes.
These clients span an enormous range of wants and needs, but they have one significant common denominator: They are the kinds of companies that subscribe to a research and advisory firm, which means they make enough tech purchases to justify the cost of a research contract, and they have a culture which values (or at least bureaucratically mandates) seeking a neutral outside opinion.
That ideal of objectivity, however, often masks something more fundamental that ties back to the article that I mentioned: namely, the fact that many clients have an insatiable hunger to know, “What are companies like mine doing?” They are not necessarily seeking best practice, but common practice. Sometimes they seek the assurance that their non-ideal situation is not dissimilar to that of their peers at similar companies. (Although the opening line of Tolstoy’s Anna Karenina, “Happy families are all alike, but every unhappy family is unhappy in its own way”, quite possibly applies to IT departments, too.)
This is also reflected in the fact that customers often have a deep desire to talk to other customers of the same vendor, on an informal and social basis. That hunger is sometimes satisfied by online forums, but the larger the company, the more reluctant they are to discuss their business in public, although they may still share freely in a one-on-one or directly personal context.
IBM was the ultimate winner-take-all company (to use the New Scientist phrase) — the company that everyone was buying from, thus guaranteeing that you were unlikely to get fired buying IBM. Arguably, it and its brethren still are at the fat forefront of the outsourced IT infrastructure market share curve, while the bazillion hosting companies out there are spread out over the long tail. Even within the narrower confines of pure hosting, which is a highly fragmented market, and despite massive amounts of online information, peer influence has concentrated market share in the hands of relatively few vendors.
To quote the article: Which leads to a curious puzzle: why, when we have so much information at our fingertips, are we so concerned with what our peers like? Don’t we trust our own judgement? Watts thinks it is partly a cognitive problem. Far from liberating us, the proliferation of choice that modern technology has brought is overwhelming us — making us even more reliant on outside cues to determine what we like.
So I can sum up: A Magic Quadrant is an outside cue, offering expert opinion that factors in aggregated peer opinion.
Pricing transparency and CDNs
It is possible that I am going to turn out to be mildly wrong about something. I predicted that neither Amazon’s CloudFront CDN nor the comparable Rackspace/Limelight offering (Mosso Cloud Files) would really impact the mainstream CDN market. I am no longer as certain that’s going to be the case, as it appears that behavioral economics play into these decisions more than one might expect. The impact is subtle, but I think it’s there.
I’m not talking about the giant video deals, mind you; those guys already get prices well below that of the cloud CDNs. I’m talking about the classic bread-and-butter of the CDN market, the e-commerce and enterprise customers, significant B2B and B2C brands that have traditionally been Akamai loyalists, or been scattered with smaller players like Mirror Image.
Simply put, the cloud CDNs put indirect pressure on mainstream CDN prices, and will absorb some new mainstream (enterprise but low-volume) clients, for a simple reason: Their pricing is transparent. $0.22/GB for Rackspace/Limelight. $0.20/GB for SoftLayer/Internap. $0.17/GB for Amazon CloudFront. And so on.
Transparent pricing forces people to rationalize what they’re buying. If I can buy Limelight service on zero commit for $0.22/GB, there’s a fair chance that I’m going to start wondering just what exactly Akamai is giving me that’s worth paying $2.50/GB for on a multi-TB commit. Now, the answer to that might be, “DSA Secure that speeds up my global e-commerce transactions and is invaluable to my business”, but that answer might also be, “The same old basic static caching I’ve been doing forever and have been blindly signing renewals for.” It is going to get me to wonder things like, “What are the actual competitive costs of the services I am using?” and, “What is the business value of what I’m buying?” It might not alter what people buy, but it will certainly alter their perception of value.
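The arithmetic that a transparent price list invites is easy to sketch. The per-GB rates below are the ones quoted in this post (with $2.50/GB as my example Akamai commit price); the 5,000 GB monthly volume is purely an assumed figure for illustration.

```python
# Per-GB prices quoted in the post, held in cents so the arithmetic is exact.
RATE_CENTS_PER_GB = {
    "Rackspace/Limelight": 22,
    "SoftLayer/Internap": 20,
    "Amazon CloudFront": 17,
    "Akamai (example commit price)": 250,
}


def monthly_cost_dollars(gb_delivered, cents_per_gb):
    """Cost in dollars of delivering gb_delivered at a flat per-GB rate."""
    return gb_delivered * cents_per_gb / 100


def compare(gb_delivered):
    """Map each provider to its cost for the same delivered volume."""
    return {
        name: monthly_cost_dollars(gb_delivered, cents)
        for name, cents in RATE_CENTS_PER_GB.items()
    }


if __name__ == "__main__":
    # 5,000 GB per month is an assumed volume, not a figure from the post.
    for name, cost in compare(5000).items():
        print(f"{name}: ${cost:,.2f}")
```

A flat-rate comparison like this ignores commits, tiers, and value-added services, which is exactly why it provokes the "what am I actually paying for?" question.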
Since grim October, businesses have really cared about what things cost and what benefit they’re getting out of them. Transparent pricing really amps up the scrutiny, as I’m discovering as I talk to clients about CDN services. And remember that people can be predictably irrational.
While I’m on the topic of cloud CDNs: There have been two recent sets of public performance measurements for Rackspace (Mosso) Cloud Files on Limelight. One is part of a review by Matthew Sacks, and the other is Rackspace’s own posting of Gomez metrics comparing Cloud Files with Amazon CloudFront. The Limelight performance is, unsurprisingly, overwhelmingly better.
What I haven’t seen yet is a direct performance comparison of regular Limelight and Rackspace+Limelight. The footprint appears to be the same, but differences in cache hit ratios (probable, given that content on Cloud Files will likely get fewer eyeballs) and the like will create performance differences on a practical level. I assume it creates no differences for testing purposes, though (i.e., the usual “put a 10k file on two CDNs”), unless Limelight prioritizes Cloud Files requests differently.
Google’s pricing for App Engine
Google made a number of App Engine-related announcements earlier this week. The most notable of these was a preview of the future paid service, which allows you to extend App Engine’s quotas. Google has previously hinted at pricing, and at their developer conference this past May, they asserted that effectively, the first 5 MPV (million page views) are free, and thereafter, it’d be about $40 per MPV.
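As a rough sketch of what the hinted pricing would mean, the function below applies "first 5 MPV free, about $40 per MPV after that" to a given traffic level. These are the figures Google asserted in May, not final prices, and the function name is my own.

```python
FREE_MPV = 5                 # million page views served at no charge (hinted figure)
DOLLARS_PER_EXTRA_MPV = 40   # approximate price per additional MPV (hinted figure)


def estimated_cost(million_page_views):
    """Estimated charge in dollars for a given number of million page views."""
    if million_page_views < 0:
        raise ValueError("page views cannot be negative")
    billable = max(0, million_page_views - FREE_MPV)
    return billable * DOLLARS_PER_EXTRA_MPV
```

So a site doing 7.5 MPV would pay for 2.5 MPV, or roughly $100, under these assumed numbers.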
The problem is not the price. It’s the way that the quotas are structured. Basically, it looks like Google is going to allow you to raise the quota caps, paying for however much you go over, but never to exceed the actual limit that you set. That means Google is committing itself to a quota model, not backing away from it.
Let me explain why quotas suck as a way to run your business.
Basically, the way App Engine’s quotas work is like this: As you begin to approach the limit (currently Google-set, but eventually set by you), Google will start denying some of your requests. If you’re reaching the limit of a metered API call, when your app tries to make that call, Google will return an exception, which your app can catch and handle; inelegant, but at least something you can present to the user as a handled error. However, if you’re reaching a more fundamental limit, like bandwidth, Google will begin answering page requests with the 403 HTTP status code. 403 is an error that prevents your user from getting the page at all, and there’s no elegant way to handle it in App Engine (no custom error pages).
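Here is a minimal sketch of the "catch and handle" case for a metered API call. Everything in it (`QuotaExceeded`, `send_mail`, `handle_request`) is a hypothetical stand-in written for illustration; it is not App Engine's actual API.

```python
class QuotaExceeded(Exception):
    """Hypothetical stand-in for the over-quota exception a platform raises."""


def send_mail(remaining_quota):
    """Pretend metered API call: fails once the quota is used up."""
    if remaining_quota <= 0:
        raise QuotaExceeded("mail quota exhausted")
    return "sent"


def handle_request(remaining_quota):
    """Return (status, body) for a page that needs the metered call."""
    try:
        send_mail(remaining_quota)
        return 200, "Your message was sent."
    except QuotaExceeded:
        # The app still controls the response, so the user sees a
        # handled error instead of a bare failure.
        return 503, "We're very busy right now. Please try again later."
```

The bandwidth case has no equivalent: the 403 is issued before any application code runs, so there is nothing for the app to catch.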
As you approach quota, Google tries to budget your requests so that only some of them fail. If you get a traffic spike, it’ll drop some of those requests so that it still has quota left to serve traffic later. (Steve Jones’ SOA blog chronicles quite a bit of empirical testing, for those who want to see what this “throttling” looks like in practice.)
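I don't know the algorithm Google actually uses, but one plausible way to "budget" requests is to compare how much of the day's quota is gone with how much of the day has elapsed, and drop a growing share of requests as usage runs ahead of pace. The function below is a sketch of that assumed scheme only.

```python
def drop_probability(used_fraction, elapsed_fraction):
    """Share of requests to drop, under an assumed pacing scheme.

    used_fraction:    portion of the daily quota already consumed (>= 0)
    elapsed_fraction: portion of the day already elapsed (0..1)
    """
    if used_fraction < 0 or not 0 <= elapsed_fraction <= 1:
        raise ValueError("fractions out of range")
    if used_fraction >= 1:
        return 1.0   # quota gone: every request fails
    if used_fraction <= elapsed_fraction:
        return 0.0   # on or under pace: serve everything
    # Ahead of pace: drop in proportion to how far ahead usage is,
    # relative to the time remaining in the day.
    return (used_fraction - elapsed_fraction) / (1.0 - elapsed_fraction)
```

Under this scheme a site that has burned 75% of its quota by midday would see half of its requests dropped for the rest of the day, even after the spike that caused the problem has passed.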
The problem is, now you’ve got what are essentially random failures of your application. If you’ve got failing API calls, you’ve got to handle the error and your users will probably try again — exacerbating your quota problem and creating an application headache. (For instance, what if I have to make two database API calls to commit data from an operation, and the first succeeds but the second fails? Now I have data inconsistency, and thanks to API calls continuing to fail, quite possibly no way to fix it. Google’s Datastore transactions are restricted to operations on the same entity group, so transactions will not deal with all such problems.) Worse still, if you’ve got 403 errors, your site is functionally down, and your users are getting a mysterious error. As someone who has a business online, do you really want, under circumstances of heavy traffic, your site essentially failing randomly?
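The two-call inconsistency is easy to reproduce in miniature. The sketch below uses a toy in-memory store with a call quota; the class and function names are hypothetical and only illustrate the failure pattern, not the Datastore API.

```python
class QuotaExceeded(Exception):
    """Hypothetical over-quota error raised by the toy store."""


class MeteredStore:
    """Toy key-value store that allows only a limited number of writes."""

    def __init__(self, calls_left):
        self.calls_left = calls_left
        self.data = {}

    def put(self, key, value):
        if self.calls_left <= 0:
            raise QuotaExceeded("write quota exhausted")
        self.calls_left -= 1
        self.data[key] = value


def record_order(store, order_id, item, stock_left):
    """Commit one operation as two separate writes (no transaction)."""
    store.put(("order", order_id), item)      # first write succeeds
    store.put(("stock", item), stock_left)    # second write may hit the quota


if __name__ == "__main__":
    store = MeteredStore(calls_left=1)
    try:
        record_order(store, 1, "widget", 9)
    except QuotaExceeded:
        # The order exists but the stock count was never updated, and any
        # repair write would fail for the same reason.
        print(sorted(store.data))
```

Retrying does not help here, because the retry is itself a metered call against the same exhausted quota.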
Well, one might counter, if you don’t want that to happen, just set your quota limits really really high — high enough that you never expect a request to fail. The problem with that, though, is that if you do it, you have no way to predict what your costs actually will be, or to throttle high traffic in a more reasonable way.
If you’re on traditional computing infrastructure, or, say, a cloud like Amazon EC2, you decide how many servers to provision. Chances are that under heavy traffic, your site performance would degrade — but you would not get random failures. And you would certainly not get random failures outside of the window of heavy traffic. The quota system under use by Google means that you could get past the spike, have enough quota left to serve traffic for most of the rest of the day, but still cross the over-quota-random-drop threshold later in the day. You’d have to go micro-manage (temporarily adjusting your allowable quota after a traffic spike, say) or just accept a chance of failure. Either way, it is a terrible way to operate.
This is yet another example of how Google App Engine is not, and in the near term will not be, ready for prime time, and how, more broadly, Google is continuing to fail to understand the basic operational needs of people who run their businesses online. It’s not just risk-averse enterprises who can’t use something with this kind of problem. It’s the start-ups, too. Amazon has set a very high bar for reliability and understanding of what you need to run a business online, and Google is devoting lots of misdirected technical acumen to implementing something that doesn’t hit the mark.
Aflexi, a new CDN aggregator
Aflexi has announced its launch, which is slated for January of 2009.
Aflexi is a CDN aggregator, targeting small Web hosters much in the same way that Velocix’s Metro product targets broadband providers. (What’s old is new again: remember Content Bridge and CDN peering, a hot idea back in 2001?)
Here’s how it works: Aflexi operates a marketplace and CDN routing infrastructure (i.e., the DNS-based brain servers that tell an end-user client what server to pull content from), and also supplies Linux-based CDN server software.
Web hosters can pay a nominal fee of $150 to register with Aflexi, granting them the right to deploy unlimited copies of Aflexi’s CDN server software. (Aflexi is recommending a minimum of a dual-core server with 4 GB of RAM and 20-30 GB of storage, for these cache servers. That is pretty much “any old hardware you have lying around.”) A hoster can put these servers wherever he likes, and is responsible for their connectivity and so forth. The Web hoster then registers his footprint, desired price for delivering a GB of traffic, and any content restrictions (like “no adult content”) on Aflexi’s marketplace portal.
Content owners can come to the portal to shop for CDN services. If they’re going through one of Aflexi’s hosting partners, they may be limited in their choices, at the hoster’s discretion. The content owner chooses which CDNs he wants to aggregate. Then, he can simply go live; Aflexi will serve his content only over the CDNs he’s chosen. Currently, the content routing is based upon the usual CDN performance metrics; Aflexi plans to offer price/performance routing late next year. Aflexi takes a royalty of 0.8 cents per GB (thus, under a penny); the remainder of the delivery fee goes to whatever hoster served a particular piece of content. Customers will typically be billed through their hoster; Aflexi integrates with the Parallels control panel (they’re packaging in the APS format).
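The money flow described above can be sketched in a few lines. The 0.8-cent royalty comes from Aflexi's announcement; the hoster names, latency numbers, the $0.05/GB price, and the "lowest latency wins" routing rule are all assumptions of mine for illustration.

```python
ROYALTY_MILLS_PER_GB = 8  # 0.8 cents per GB; 1 mill = 0.1 cent = $0.001


def pick_hoster(chosen, latency_ms):
    """Route to the lowest-latency hoster among those the content owner chose.

    latency_ms maps hoster name to a measured latency; hosters the owner
    did not choose are never considered, however fast they are.
    """
    candidates = [h for h in chosen if h in latency_ms]
    if not candidates:
        raise ValueError("no chosen hoster has a measurement")
    return min(candidates, key=lambda h: latency_ms[h])


def split_fee(gb_delivered, price_mills_per_gb):
    """Return (aflexi_dollars, hoster_dollars) for one hoster's delivery."""
    if price_mills_per_gb < ROYALTY_MILLS_PER_GB:
        raise ValueError("price does not cover the royalty")
    royalty = gb_delivered * ROYALTY_MILLS_PER_GB
    hoster = gb_delivered * (price_mills_per_gb - ROYALTY_MILLS_PER_GB)
    return royalty / 1000, hoster / 1000
```

For example, 1,000 GB delivered by a hoster charging an assumed $0.05/GB (50 mills) would yield $8 to Aflexi and $42 to the hoster.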
Broadly, although the idea of aggregation isn’t new, the marketplace is an interesting take on it. This kind of federated model raises significant challenges in terms of business concerns — the ability to offer an SLA across a diversified base, and ensuring that content is not tampered with, are likely at the forefront of those concerns. Also, a $150 barrier to entry is essentially negligible, which means there will have to be some strenuous efforts to keep out bad actors in the ecosystem.
Aflexi sees the future of the CDN market as being hosters. I disagree, given that most hosters don’t own networks. However, I do believe that hosting and CDN are natural matches from a product standpoint, and that hosters need to have some form of CDN strategy. It’s clear that Aflexi wants to target small Web hosters and their small-business customers. They’re going to occupy a distinct niche, but I wonder how well that approach will hold up against Rackspace-plus-Limelight and Amazon’s CloudFront, which have solid credibility and are targeted at small customers. But the existence of Aflexi will offer small hosters a CDN option beyond pure resale.
Aflexi says its initial launch hosters will include ThePlanet. That in and of itself is an interesting tidbit, as ThePlanet (which is one of the largest providers of simple dedicated hosting in the world) currently resells EdgeCast.
One more odd little tidbit: The CEO is Whei Meng Wong, previously of UltraUnix, but also, apparently, previously of an interesting SpamHaus ROKSO record (designating a hoster who is a spam haven — willing to host the sites that spammers advertise). Assuming that it’s the same person, which it appears to be, that reputation could have significant effects upon Aflexi’s ability to attract legitimate customers — either hosters or content owners.
The company is funded through a Malaysian government grant. The CTO is Wai-Keen Woon; the VP of Engineering is Yuen-Chi Lian. Neither of them appears to have executive experience, or indeed, much experience period — the CTO’s Facebook profile says he’s an ’07 university graduate. The CEO’s blog seems to indicate he is also an ’07 graduate. So this is apparently a fresh-out-of-college group-of-buddies company — notably, without either a Sales or Marketing executive that they deemed worth mentioning in their launch presentation.
Bottom line, though: This is another example of CDN services moving up a level towards software overlays. The next generation of providers own software infrastructure and the CDN routing brain, but don’t deploy a bunch of servers and network capacity themselves.
Tips for a Magic Quadrant
It has been a remarkably busy December, with my client inquiries dominated by colocation calls, and it looks like the last bit of the year’s inquiries will be rounded out with last-minute year-end deals for CDN services. I’ve published what I’m going to publish this year, so I’m focusing on my first-quarter 2009 agenda, and all the preparations that go into the Magic Quadrant for Web Hosting.
We’re looking at probably double the number of providers that we had last year, with the high likelihood that nobody at the new providers has gone through an MQ process in some previous life. That means a certain amount of handholding, as well as an aggressive spin-up to learn providers that we don’t know well yet: providers who are entering the enterprise space but don’t necessarily have many enterprise clients yet.
I’m going to devote a certain amount of blog space over the next couple of weeks to talking about what it’s like to do an MQ, because I imagine it’s something that both IT buyers and vendors are occasionally curious about. Keep in mind that this will be personal narrative, though; what’s true for me is not necessarily true for other analysts, including my usual partner-in-crime for this particular MQ.
The quick tips for vendors:
1. Know who Gartner is advising and therefore, what our clients care about (and thus, the products and services of yours that matter to them).
2. Be able to concisely and concretely articulate what makes you different from your competitors.
3. Have a vision of the market, and be able to explain how it ties into the way that you run your company and into your product plans for the future.
4. Make sure your customer references still like you.
Ranking of ISPs
Datahounds might be interested in the Renesys ISP rankings for 2008.
Renesys is a company that specializes in collecting data about the Internet, focused upon the peering ecosystem. Its rankings are essentially a matter of size — how much IP address space ends up transiting each provider?
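To give a feel for what "how much IP address space" means as a size metric, here is a rough sketch that totals the addresses covered by a set of prefixes per provider and sorts the result. This is only an illustration of the counting idea using Python's standard `ipaddress` module; it is not Renesys's methodology, and the provider names and prefixes are made up (drawn from documentation address ranges).

```python
import ipaddress


def address_space(prefixes):
    """Count distinct addresses covered by a list of CIDR prefixes.

    collapse_addresses merges overlapping and adjacent networks, so a
    /25 inside an already-listed /24 is not counted twice.
    """
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return sum(n.num_addresses for n in ipaddress.collapse_addresses(networks))


def rank_providers(transit):
    """Sort providers by the amount of address space transiting them."""
    sizes = {name: address_space(prefixes) for name, prefixes in transit.items()}
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```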
Among the interesting data points: Level 3 has overtaken Sprint for the #1 spot, Global Crossing has continued its rapid climb to become #3, Telia Sonera has grown steadily, and, broadly, Asia is a huge source for growth.
Analysts travel a lot. As I think over the year, here are the cities where I’ve visited clients during 2008…
Northeast: Baltimore, Boston, New York City, Philadelphia, Stamford, Washington DC
South: Atlanta, Birmingham, Charlotte, Memphis, Miami, Nashville, Richmond
Midwest: Austin, Chicago, Dallas/Fort Worth, Detroit, Houston, Milwaukee, Minneapolis/St. Paul, San Antonio, St. Louis
West: Las Vegas, Los Angeles, Portland, San Diego, San Jose, San Francisco
Canada: Montreal, Toronto
Google builds a CDN for its own content
An article in the Wall Street Journal today describes Google’s OpenEdge initiative (along with a lot of spin around net neutrality, resulting in a Google reply on its public policy blog).
Basically, Google is trying to convince broadband providers to let it place caches within their networks — effectively, pursuing the same architecture that a deep-footprint CDN like Akamai uses, but for Google content alone.
Much of the commentary around this seems to center on the idea that if Google can use this to obtain better performance for its content and applications, everyone else is at a disadvantage and it’s a general stab to net neutrality. (Even Om Malik, who is not usually given to mindless panic, asserts, “If Google can buy better performance for its service, your web app might be at a disadvantage. If the cost of doing business means paying baksheesh to the carriers, then it is the end of innovation as we know it.”)
I think this is an awful lot of hyperbole. Today, anyone can buy better performance for their Web content and applications by paying money to a CDN. And in turn, the CDNs pay baksheesh, if you want to call it that, to the carriers. Google is simply cutting out the middleman, and given that it accounts for more traffic on the Internet than most CDNs, it’s neither illogical nor commercially unreasonable.
Other large content providers — Microsoft and AOL notably on a historical basis — have built internal CDNs in the past; Google is just unusual in that it’s attempting to push those caches deeper into the network on a widespread basis. I’d guess that it’s YouTube, more than anything else, that’s pushing Google to make this move.
This move is likely driven at least in part by the fact that most of the broadband providers simply don’t have enough 10 Gbps ports for traffic exchange (and space and power constraints in big peering points like Equinix’s aren’t helping matters, making it artificially hard for providers to get the expansions necessary to put big new routers into those facilities). Video growth has sucked up a ton of capacity. Google, and YouTube in particular, is a gigantic part of video traffic. If Google is offering to alleviate some of that logjam by putting its servers deeper into a broadband provider’s network, that might be hugely attractive from a pure traffic engineering standpoint. And providers likely trust Google to have enough remote management and engineering expertise to ensure that those cache boxes are well-behaved and not annoying to host. (Akamai has socialized this concept well over much of the last decade, so this is not new to the providers.)
I suspect that Google wouldn’t even need to pay to do this. For the broadband providers, the traffic engineering advantages, and the better performance to end-users, might be enough. In fact, this is the same logic that explains why Akamai doesn’t pay for most of its deep-network caches. It’s not that this is unprecedented. It’s just that this is the first time that an individual content provider has reached the kind of scale where they can make the same argument as a large CDN.
The cold truth is that small companies generally do not enjoy the same advantages as large companies. If you are a small company making widgets, chances are that a large company making widgets has a lower materials cost than you do, because they are getting a discount for buying in bulk. If you are a small company doing anything whatsoever, you aren’t going to see the kind of supplier discounts that a large company gets. The same thing is true for bandwidth and, for that matter, for CDN services. And big companies often leverage their scale into greater efficiency, to boot; for instance, unsurprisingly, Gartner’s metrics data shows that the average cost of running servers drops as you get more servers in your data center. Google employs both scale and efficiency leverage.
One of the key advantages of the emerging cloud infrastructure services, for start-ups, is that such services offer the leverage of scale, on a pay-by-the-drink basis. With cloud, small providers can essentially get the advantage of big providers by banding together into consortiums or paying an aggregator. However, on the deep-network CDN front, this probably won’t help. Highly distributed models work very well for extremely popular content. For long-tail content, cache hit ratios can be too low for it to be really worthwhile. That’s why it’s doubtful that you’ll see, say, Amazon’s CloudFront CDN push deep rather than continuing to follow a megaPOP model.
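The cache hit ratio point can be illustrated with an idealized model. Assume request popularity follows a Zipf-like curve (the object at rank r gets requests in proportion to 1/r^s) and that a cache simply holds the most popular objects. The model, the function name, and the parameter values in the usage note are my assumptions, not measurements from any CDN.

```python
def zipf_hit_ratio(catalog_size, cache_slots, exponent):
    """Fraction of requests served from a cache holding the top objects.

    catalog_size: number of distinct objects that get requested
    cache_slots:  how many of the most popular objects the cache holds
    exponent:     Zipf exponent; lower values mean a flatter, longer tail
    """
    if catalog_size <= 0 or cache_slots < 0:
        raise ValueError("sizes must be non-negative, catalog non-empty")
    # Relative request weight of each object, most popular first.
    weights = [1.0 / rank ** exponent for rank in range(1, catalog_size + 1)]
    cached = weights[:min(cache_slots, catalog_size)]
    return sum(cached) / sum(weights)
```

Comparing, say, `zipf_hit_ratio(100000, 1000, 1.2)` with `zipf_hit_ratio(100000, 1000, 0.6)` shows the effect: with the same small cache, the head-heavy catalog is served mostly from cache while the long-tail catalog mostly misses, which is why pushing many small caches deep into networks pays off only for very popular content.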
Ironically, because caching techniques aren’t as efficient for small content providers, it might actually be useful to them to be able to buy bandwidth at a higher QoS.
Anti-virus integration with cloud storage
Anti-virus vendor Authentium is now offering its AV-scanning SDK to cloud providers.
Authentium, unlike most other AV vendors, has traditionally focused on the gateway; they offer an SDK designed to be embedded in applications and appliances. (Notably, Authentium is the scanning engine used by Google’s Postini service.) So courting cloud providers is logical for them.
Anti-virus integration makes particular sense for cloud storage providers. Users of cloud storage upload millions of files a day. Many businesses that use cloud storage do so for user-generated content. AV-scanning a file as part of an upload could be just another API call — one that could be charged for on a per-operation basis, just like GET, PUT, and other cloud storage operations. That would turn AV scanning into a cloud Web service, making it trivially easy for developers to integrate AV scanning into their applications. It’d be a genuine value-add for using cloud storage — a reason to do so beyond “it’s cheap”.
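Here is a sketch of what "scan as just another metered operation" could look like from the developer's side. The store, the `scan=True` flag, and the toy scanner are all hypothetical; no cloud storage provider's or AV vendor's actual API is being described.

```python
class InfectedUpload(Exception):
    """Raised when the scanner rejects an upload."""


def toy_scanner(data):
    """Hypothetical stand-in for an AV engine: True means the data is clean."""
    return b"FAKE-VIRUS-SIGNATURE" not in data


class ScanningStore:
    """Toy object store that meters SCAN alongside PUT and GET."""

    def __init__(self, scanner=toy_scanner):
        self.scanner = scanner
        self.objects = {}
        self.operations = {"PUT": 0, "GET": 0, "SCAN": 0}

    def scan(self, data):
        self.operations["SCAN"] += 1   # billable per operation
        return self.scanner(data)

    def put(self, key, data, scan=False):
        if scan and not self.scan(data):
            raise InfectedUpload(key)  # rejected before it is stored
        self.operations["PUT"] += 1
        self.objects[key] = data

    def get(self, key):
        self.operations["GET"] += 1
        return self.objects[key]
```

The design choice worth noting is that a rejected upload incurs a SCAN charge but no PUT, so the operation counts map directly onto a per-operation bill.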
More broadly, security vendors have become interested in offering scanning as a service, although most have desktop installed bases to defend, and thus are looking at it as a supplement as opposed to a replacement for traditional desktop AV products; see the past news on McAfee’s Project Artemis or Trend Micro’s Smart Protection Network for examples.