Blog Archives
The culture of service
I recently finished reading Punching In, a book by Alex Frankel. It’s about his experience working as a front-line employee in a variety of companies, from UPS to Apple. The book is focused upon corporate culture, the indoctrination of customer-facing employees, and how such employees influence the customer experience. And that got me thinking.
Culture may be the distinguishing characteristic between managed hosting companies. Managed hosting is a service industry. You make an impression upon the customer with every single touch, from the response to the initial request for information, to the day the customer says good-bye and moves on. (The same is true for more service-intensive cloud computing and CDN providers, too.)
I had the privilege, more than a decade ago, of spending several years working at DIGEX (back when all-uppercase names were trendy, before the chain of acquisitions that led to the modern Digex, absorbed into Verizon Business). We were a classic ISP of the mid-90s — we offered dial-up, business frame relay and leased lines, and managed hosting. Back then, DIGEX had a very simple statement of differentiation: “We pick up the phone.” Our CEO used to dial our customer service number during road shows, promising that a human being would pick up in two rings or less. (To my knowledge, that demo never went wrong.) We wanted to be the premium service company in the space, and a culture of service really did permeate the company — the idea that, as individuals and as an organization, we were going to do whatever it took to make the customer happy.
For those of you who have never worked in a culture like that: It’s awesome. Most of us, I think, take pleasure in making our customers happy; it gives meaning to our work, and creates the feeling that we are not merely chasing the almighty dime. Cultures genuinely built around service idolize doing right by the customer, and they focus on customer satisfaction as the key metric. (That, by the way, means that you’ve got to be careful in picking your customers, so that you only take business that you know that you can service well and still make a profit on.)
You cannot fake great customer service. You have to really believe in it, from the highest levels of executive management down to the grunt who answers the phones. You’ve got to build your company around a set of principles that govern what great service means to you. You have to evaluate and compensate employees accordingly, and you’ve got to offer everyone the latitude to do what’s right for your customers — people have to know that the management chain will back them up and reward them for it.
Importantly, great customer service is not equivalent to heroics. Some companies have cultures, especially in places like IT operations, where certain individuals ride in like knights to save the day. But heroics almost always implies that something has gone wrong — that service hasn’t been what it needed to be. Great service companies, on the other hand, ensure that the little things are right — that routine interactions are pleasant and seamless, that processes and systems help employees to deliver better service, and that everyone is incentivized to cooperate across functions and feel ownership of the customer outcome.
When I talk to hosting companies, I find that many of them claim to value customer service, but their culture and the way they operate directly undermine their ability to deliver great service. They haven’t built service-centric cultures, they haven’t hired people who value service (admittedly tricky: you have to hire smart, competent geeks who also like, and are good at, talking to people), and they aren’t organized and incentivized to deliver great service.
Similarly, CDN vendors have a kind of tragedy of growth. Lots of people love new CDNs because at the outset, there’s an extremely high-touch support model — if you’ve got a problem, you’re probably going to get an engineer on the phone with you right away, a guy who may have written the CDN software or architected the network, who knows everything inside and out and can fix things promptly. As the company grows, the support model has to scale — so the engineers return to the back room and entry-level lightly-technical support folks take their place. It’s a necessity, but that doesn’t mean that customers don’t miss having that kind of front-line expertise.
So ask yourself: What are the features of your corporate culture that create the delivery of great customer service, beyond a generic statement like “customers matter to us”? What do you do to inspire your front-line employees to be insanely awesome?
Scaling limits and friendly failure
I’m on vacation, and I’ve been playing World of Goo (possibly the single best construction puzzle game since 1991’s Lemmings by Psygnosis). I was reading the company’s blog (2D Boy) when I came across an entry about BlueHost’s no-notice termination of 2D Boy’s hosting.
And that got me thinking about “unlimited” hosting plans, throttling, limits, and the other challenges of running mass-market hosting — all issues also directly applicable to cloud computing.
BlueHost is a large and reputable provider of mass-market shared hosting. Their accounts are “unlimited”, and their terms of service essentially say that you can consume resources until you negatively impact other customers.
Now, in practice there are limits, and customers are sort of expected to know whether or not their needs fit shared hosting. Most people plan accordingly — although there have been some spectacular failures to do so, such as Sk*rt, a female-focused Digg competitor launched on BlueHost, prompting vast wonder at what kind of utter lack of thought leads to launching a high-traffic social networking site on a $7 hosting plan. Unlike Sk*rt, though, it was reasonable for 2D Boy to expect that shared hosting would cover their needs — hosting a small corporate site and blog. They were two guys making an indie garage game whose traffic ramped gradually through word of mouth, not an Internet company doing a big launch.
Limits are necessary, but no-notice termination of a legitimate company is bad customer service, however you slice it. Moreover, it’s avoidable bad customer service. Whatever mechanism is used to throttle, suspend service, etc. ought to be adaptable to sending out a warning alert: the “hey, if you keep doing this, you will be in violation of our policies and we’ll have to terminate you” note. Maybe even a, “hey, we will continue to serve your traffic for $X extra, and you have Y time to find a new host or reduce your traffic to normal volumes”. BlueHost does not sell anything beyond its $7 plan, so it has no upsell path; a provider with an upgrade path would hopefully have tried to encourage a migration, rather than executing a cold-turkey cut-off. (By the way, I have been on the service provider side of this equation, so I have ample sympathy for the vendor’s position against a customer whose usage is excessive, but I also firmly believe that no-notice termination of legitimate businesses is not the way to go.)
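That warning mechanism doesn’t need to be complicated. Here’s a minimal sketch of the idea; the thresholds, account model, and notify/suspend hooks are all invented for illustration and don’t reflect any provider’s actual systems:

```python
# Hypothetical soft-limit check, run periodically against per-account usage.
# The thresholds and the notify/suspend hooks are invented for illustration.

SOFT_LIMIT_CPU_SECONDS = 40000   # warn when usage crosses this
HARD_LIMIT_CPU_SECONDS = 50000   # suspend only after a warning has been sent

def review_account(account, cpu_seconds_used, notify, suspend):
    """Warn first; only suspend an account that has already been warned."""
    if cpu_seconds_used >= HARD_LIMIT_CPU_SECONDS and account.warned:
        suspend(account, reason="sustained usage beyond shared-hosting policy")
    elif cpu_seconds_used >= SOFT_LIMIT_CPU_SECONDS and not account.warned:
        notify(account,
               "Your site is approaching our shared-hosting resource limits. "
               "Please reduce usage, or contact us about an upgrade or "
               "migration, to avoid suspension.")
        account.warned = True
```

The point is simply that the same metering needed to enforce a hard cut-off can send the warning first.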
Automated elastic scaling is the key feature of a cloud. Consequently, limits, and the way they are enforced technically and managed from a customer-service standpoint, will be one of the ways that cloud infrastructure providers differentiate their services.
A vendor’s approach to limits has to be tied to their business goals. Similarly, what a customer desires out of limits must also be tied to their business goals. The customer wants reliable service within a budget. The vendor wants to be fairly compensated and ensure that his infrastructure remains stable.
Ideally, on cloud infrastructure, a customer scales seamlessly and automatically until the point where he is in danger of exceeding his budget. At that point, the system should alert him automatically, allowing him to increase his budget. If he doesn’t want to pay more, he will experience degraded service; degradation should mean slower or lower-priority service, or an automatic switch to a “lite” site, rather than outright failure.
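As a rough sketch of that decision logic (every name and number here is illustrative, not any vendor’s actual API or behavior):

```python
# Sketch of budget-aware degradation instead of hard failure.
# All names and numbers are illustrative, not any vendor's actual API.

def handle_request(request, spend_so_far, monthly_budget,
                   serve_full, serve_lite, alert_owner):
    remaining = monthly_budget - spend_so_far
    if remaining > 0.2 * monthly_budget:
        return serve_full(request)      # plenty of headroom: normal service
    if remaining > 0:
        # Nearing the budget: warn the owner (rate-limited in practice)
        # and degrade to a lighter, cheaper-to-serve page.
        alert_owner("You've spent 80% of your monthly budget; raise it to "
                    "avoid degraded service.")
        return serve_lite(request)
    # Budget exhausted: keep serving something minimal rather than failing.
    alert_owner("Monthly budget exhausted; serving a 'lite' site until it's raised.")
    return serve_lite(request)
```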
Perhaps when you get right down to it, it’s really about what the failure mode is. Fail friendly. A vendor has a lot more flexibility in imposing limits if it can manage that.
Google’s pricing for App Engine
Google made a number of App Engine-related announcements earlier this week. The most notable of these was a preview of the future paid service, which allows you to extend App Engine’s quotas. Google has previously hinted at pricing, and at their developer conference this past May, they asserted that effectively, the first 5 MPV (million page views) are free, and thereafter, it’d be about $40 per MPV.
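To make that concrete, using those quoted figures (which may well change by launch): a site serving 12 million page views in a month would get the first 5 million free and pay roughly 7 x $40 = $280 for the rest.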
The problem is not the price. It’s the way that the quotas are structured. Basically, it looks like Google is going to allow you to raise the quota caps, paying for however much you go over, but never to exceed the actual limit that you set. That means Google is committing itself to a quota model, not backing away from it.
Let me explain why quotas suck as a way to run your business.
Basically, the way App Engine’s quotas work is like this: As you begin to approach the limit (currently Google-set, but eventually set by you), Google will start denying those requests. If you’re reaching the limit of a metered API call, when your app tries to make that call, Google will return an exception, which your app can catch and handle; inelegant, but at least something you can present to the user as a handled error. However, if you’re reaching a more fundamental limit, like bandwidth, Google will begin returning page requests with a 403 HTTP status code. A 403 prevents your user from getting the page at all, and there’s no elegant way to handle it in App Engine (no custom error pages).
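For the API-call case, the handled failure looks roughly like this in the Python runtime; treat the module and class names as approximate and check the SDK documentation for the exact exception to catch:

```python
# Rough sketch of handling an over-quota failure on a metered API call in
# App Engine's Python runtime. Module and class names are approximate.
from google.appengine.runtime import apiproxy_errors

def save_score(score_entity):
    try:
        score_entity.put()   # metered Datastore call
        return True
    except apiproxy_errors.OverQuotaError:
        # Quota exhausted: the caller can at least show a "try again later"
        # page. No comparable handling exists for the bandwidth case, where
        # Google itself returns a bare 403.
        return False
```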
As you approach quota, Google tries to budget your requests so that only some of them fail. If you get a traffic spike, it’ll drop some of those requests so that it still has quota left to serve traffic later. (Steve Jones’ SOA blog chronicles quite a bit of empirical testing, for those who want to see what this “throttling” looks like in practice.)
The problem is, now you’ve got what are essentially random failures of your application. If you’ve got failing API calls, you’ve got to handle the error and your users will probably try again — exacerbating your quota problem and creating an application headache. (For instance, what if I have to make two database API calls to commit data from an operation, and the first succeeds but the second fails? Now I have data inconsistency, and thanks to API calls continuing to fail, quite possibly no way to fix it. Google’s Datastore transactions are restricted to operations on the same entity group, so transactions will not deal with all such problems.) Worse still, if you’ve got 403 errors, your site is functionally down, and your users are getting a mysterious error. As someone who has a business online, do you really want, under circumstances of heavy traffic, your site essentially failing randomly?
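To make the two-call hazard concrete, here is an invented example; the entities and logic are hypothetical, and the exception name is approximate:

```python
# Invented example of the two-call hazard: the entities live in different
# entity groups, so a single Datastore transaction can't cover both writes.
from google.appengine.runtime import apiproxy_errors

def transfer(debit_account, credit_account, amount):
    debit_account.balance -= amount
    debit_account.put()                  # first metered call succeeds
    try:
        credit_account.balance += amount
        credit_account.put()             # second call fails over quota
    except apiproxy_errors.OverQuotaError:
        # The debit is already committed; if API calls keep failing, there is
        # no clean way to undo it. db.run_in_transaction() would only help
        # if both entities shared one entity group.
        raise
```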
Well, one might counter, if you don’t want that to happen, just set your quota limits really really high — high enough that you never expect a request to fail. The problem with that, though, is that if you do it, you have no way to predict what your costs actually will be, or to throttle high traffic in a more reasonable way.
If you’re on traditional computing infrastructure, or, say, a cloud like Amazon EC2, you decide how many servers to provision. Chances are that under heavy traffic, your site performance would degrade — but you would not get random failures. And you would certainly not get random failures outside of the window of heavy traffic. The quota system in use by Google means that you could get past the spike, have enough quota left to serve traffic for most of the rest of the day, but still cross the over-quota-random-drop threshold later in the day. You’d have to micro-manage (temporarily adjusting your allowable quota after a traffic spike, say) or just accept a chance of failure. Either way, it is a terrible way to operate.
This is yet another example of how Google App Engine is not and will not be near-term ready for prime-time, and how more broadly, Google is continuing to fail to understand the basic operational needs of people who run their businesses online. It’s not just risk-averse enterprises who can’t use something with this kind of problem. It’s the start-ups, too. Amazon has set a very high bar for reliability and understanding of what you need to run a business online, and Google is devoting lots of misdirected technical acumen to implementing something that doesn’t hit the mark.
Cloud research
I am spending as much of my research time as possible on cloud these days, although my core coverage (colocation, hosting, and CDNs) still demands most of my client-facing time.
Reflecting the fact that hosting and cloud infrastructure services are part of the same broad market (if you’re buying service from Joyent or GoGrid or MediaTemple or the like, you’re buying hosting), the next Gartner Magic Quadrant for Web Hosting will include cloud providers. That means I’m currently busy working on an awful lot of stuff, preparatory to beginning the formal process in January. I know we’ll be dealing with a lot of vendors who have never participated in a Magic Quadrant before, which should make this next iteration personally challenging but hopefully very interesting to our clients and exciting to vendors in the space.
Anyway, I have two new research notes out today:
Web Hosting and Cloud Infrastructure Prices, North America, 2008. This defines a segmentation for the emerging cloud infrastructure services market, and provides guidance on current pricing for the various categories of Web hosting services, including cloud services.
Dataquest Insight: A Service Provider Road Map to the Cloud Infrastructure Transformation. This is a note targeted at hosting companies, carriers, IT outsourcers, and others who are in, or plan to enter, the hosting or cloud infrastructure services markets. It’s a practical guide to the evolving market, with a look at product and customer segmentation, the financial impacts, and the practicalities of evolving from traditional hosting to the cloud.
Gartner clients only for those notes, sorry.
IronScale launches
Sacramento-based colocation provider RagingWire has launched a subsidiary, StrataScale, whose first product is a managed cloud hosting service, IronScale. (I’ve mentioned this before, but the launch is now official.) I’ll be posting more on it once I’ve had time to check out a demo, but here’s a quick take:
What’s interesting is that IronScale is not a virtualized service. The current offering is on dedicated hardware — similar to the approach taken by SoftLayer, but this is a managed service. But it has the key cloud trait of elasticity — the ability to scale up and down at will, without commitments.
IronScale has automated fast provisioning (IronScale claims 3 minutes for the whole environment), management through the OS layer (including services like patch management), an integrated environment that includes the usual network suspects (firewall, load balancing, SSL acceleration), and a 100% uptime SLA. You can buy service on a month-to-month basis or an annual contract. This is a virtual data center offering; there’s a Web portal for provisioning plus a Web services API, along with some useful tricks like cloning and snapshots.
It’s worth noting that cloud infrastructure services, in their present incarnation, are basically just an expansion of the hosting market — moving the bar considerably in terms of expected infrastructure flexibility. This is real-time infrastructure, virtualized or not. It’s essentially a challenge to other companies who offer basic managed services — Rackspace, ThePlanet, and so on — but you can also expect it to compete with the VDC hosting offerings that target mid-sized and enterprise organizations.
Recently-published research
Here’s a quick round-up of some of my recently-published research.
Is Amazon EC2 Right For You? This is an introduction to Amazon’s Elastic Compute Cloud, written for a mildly technical audience. It summarizes Amazon’s capabilities, the typical business case for using it, and what you’ve got to do to use it. If you’re an engineer looking for a quick briefing, or you want to show a “what this newfangled thing is” summary to your manager, or you’re an investor trying to understand what exactly it is that Amazon does, this is the document for you.
Dataquest Insight: Web Hosting, North America, 2006-2012. This is an in-depth look at the colocation and hosting business, together with market forecasts and trends. (Investors may also want to look at the Invest Implications.)
Dataquest Insight: Content Delivery Networks, North America, 2006-2012. This is an in-depth look at the CDN market, segment-by-segment, with market forecasts and trends. (Investors may also want to look at the Invest Implications.)
You’ll need to be a Gartner subscriber (or purchase the individual document) in order to view these pieces.
Upcoming research (for publication in the next month): A pricing guide for Web hosting and cloud infrastructure services; a classification scheme and service provider roadmap for cloud offerings; a toolkit for CDN requirements gathering and price estimation; a framework for gathering video requirements; and a CDN selection guide.
Quick takes: Comcast, Cogent, IronScale
Some quick takes on recent news:
Comcast P4P Results. Comcast is one of the ISPs working with hybrid-P2P CDN Pando Networks on a trial, and is showing better numbers than its competitors. The takeaway: Broadband ISPs are actively interested in P2P, CDN, and figuring out a way to monetize all of the video delivery they’re doing to their end-users.
Sprint Depeers Cogent (and Repeers). In this latest round of Cogent’s peering disputes, it’s arguing over a contract it signed with Sprint. The takeaway: Cogent is trying to keep its costs down, and is responsible for driving down bandwidth costs for everyone; its competitors are hitting back, rooted in the belief that Cogent is able to keep its prices low because it isn’t pulling its fair share of traffic carriage, which gets expressed in disputes over peering settlements.
IronScale Launches. RagingWire (a colo provider in Sacramento) has launched a managed hosting offering. Like SoftLayer, this is rapidly-provisioned dedicated servers and associated infrastructure, but unlike most of the competition in this space, it’s a managed solution. The takeaway: Like I wrote almost three years ago, it’s not about virtualization, it’s about flexibility. (“Beyond the Hype“: clients only, sorry.)
What Rackspace’s cloud moves mean
Last week, Rackspace made a bunch of announcements about its cloud strategy. I wrote previously about its deal with Limelight; now I want to contemplate its two acquisitions, Jungle Disk and Slicehost. (I have been focused on writing research notes in the last week, or I would have done this sooner…)
Jungle Disk provides online storage, via Amazon S3. Its real strength is in its easy-to-use interface; you can make your Jungle Disk storage look like a network drive, it has automated backup into the cloud, and there are premium features like Web-based (WebDAV) access. Files are stored encrypted. You pay for their software, then pay the S3 charges; there’s a monthly recurring fee from them only if you get their “plus” service. The S3 account is yours, so if you decide to dump Jungle Disk, you can keep using your storage.
The Jungle Disk acquisition looks like a straightforward feature addition — it’s a value-add for Rackspace’s Cloud Files offering, and Rackspace has said that Jungle Disk will be offering storage on both platforms. It’s a popular brand in the S3 backup space, and it’s a scrappy little self-funded start-up.
I suspect Rackspace likes scrappy little self-funded start-ups. The other acquisition, Slicehost, is also one. At this point, outright buying smart and ambitious entrepreneurial engineers with cloud experience is not a bad plan for Rackspace, whose growth has already resulted in plenty of hiring challenges.
Slicehost is a cloud hosting company. They offer unmanaged Linux instances on a Xen-based platform; their intellectual property comes in the form of their toolset. What’s interesting about this acquisition is that this kind of “server rental” — for $X per month, you can get server hardware (except this time it’s virtual rather than physical) — is actually akin to Rackspace’s old ServerBeach business (sold to Peer 1 back in 2004), not to Rackspace’s current managed hosting business.
Rackspace got out of the ServerBeach model because it was fundamentally different from their “fanatical support” desires, and because it has much less attractive returns on invested capital. The rental business offers a commodity at low prices, where you hope that nobody calls you because that’s going to eat your margin on the deal; you are ultimately just shoving hardware at the customer. What Rackspace’s managed hosting customers pay for is to have their hands held. The Slicehost model is the opposite of that.
Cloud infrastructure providers hope, of course, that they’ll be able to offer enough integrated value-adds on top of the raw compute to earn higher margins and gain greater stickiness. It’s clear that Rackspace wants to be a direct competitor to Amazon (and companies like Joyent). Now the question is exactly how they’re going to reconcile that model with the fanatical support model, not to mention their ROIC model.
Cloud risks and organizational culture
I’ve been working on a note about Amazon EC2, and pondering how different the Web operations culture of Silicon Valley is from that of the typical enterprise IT organization.
Silicon Valley’s prevailing Ops culture is about speed. There’s a desperate sense of urgency that seems to prevail there, a relentless expectation that you can be the Next Big Thing, if only you can get there fast enough. Or, alternatively, you are the Current Big Thing, and it is all you can do to keep up with your growth, or at least not have the Out Of Resources truck run right over you.
Enterprise IT culture tends to be about risk mitigation. It is about taking your time, being thorough, making the right decisions, and ensuring that nothing bad happens as a result of them.
To techgeeks at start-ups in the Valley (and I mean absolutely no disparagement by this, as I was one, and perhaps still would be, if I hadn’t become an analyst), the promise and usefulness of cloud computing is obvious. The question is not if; it is when — when can I buy a cloud that has the particular features I need to make my life easier? But: Simplify my architecture? Solve my scaling problems and improve my availability? Give me infrastructure the instant I need it, and charge me only when I get it? I want it right now. I wanted it yesterday, I wanted it last year. Got a couple of problems? Hey, everyone makes mistakes; just don’t make them twice. If I’d done it myself, I’d have made mistakes too; anyone would have. We all know this is hard. No SLA? Just fix it as quickly as you can, and let me know what went wrong. It’s not like I’m expecting you to go to Tahiti while my infrastructure burns; I know you’ll try your best. Sure, it’s risky, but heck, my whole business is a risk! No guts, no glory!
Your typical enterprise IT guy is aghast at that attitude. He does not have the problem of waking up one morning and discovering that his sleepy little Facebook app has suddenly gotten the attention of teenyboppers world-wide and now he needs a few hundred or a few thousand servers right this minute, while he prays that his application actually scales in a somewhat linear fashion. He’s not dealing with technology he’s built himself that might or might not work. He isn’t pushing the limits and having to call the vendor to report an obscure bug in the operating system. He isn’t being asked to justify his spending to the board of directors. He lives in a world of known things — budgets worked out a year in advance, relatively predictable customer growth, structured application development cycles stretched out over months, technology solutions that are thoroughly supported by vendors. And so he wants to try to avoid introducing unknowns and risks into his environment.
Despite eight years at Gartner, advising clients that are mostly fairly conservative in their technology decisions, I still find myself wanting to think in early-adopter mode. In trying to write for our clients, I’m finding it hard to shift from that mode. It’s not that I’m not skeptical about the cloud vendors (and I’m trying to be hands-on with as many platforms as I can, so I can get some first-hand understanding and a reality check). It’s that I am by nature rooted in that world that doesn’t care as much about risk. I am interested in reasonable risk versus the safest course of action.
Realistically, enterprises are going to adopt cloud infrastructure in a very different way and at a very different pace than fast-moving technology start-ups. At the moment, few enterprises are compelled towards that transformation in the way that the Web 2.0 start-ups are — their existing solutions are good enough, so what’s going to make them move? All the strengths of cloud infrastructure — massive scalability, cost-efficient variable capacity, Internet-readiness — are things that most enterprises don’t care about that much.
That’s the decision framework I’m trying to work out next.
I am actively interested in cloud infrastructure adoption stories, especially from “traditional” enterprises who have made the leap, even in an experimental way. If you’ve got an experience to share, using EC2, Joyent, Mosso, EngineYard, Terremark’s Infinistructure, etc., I’d love to hear it, either in a comment on my blog or via email at lydia dot leong at gartner dot com.
Rackspace buys itself some cloud
Rackspace’s cloud event resulted in a very significant announcement: the acquisition of Slicehost and Jungle Disk. There’s also an announced Limelight partnership (unknown at the moment what this means, as the two companies already have a relationship), and a Sonian partnership to offer email archiving to Rackspace’s Mailtrust hosted email business.
My gut reaction: Very interesting moves. Signals an intent to be much more aggressive in the cloud space than I think most people were expecting.