Scaling limits and friendly failure
I’m on vacation, and I’ve been playing World of Goo (possibly the single-best construction puzzle game since 1991’s Lemmings by Psygnosis). I was reading the company’s blog (2D Boy), when I came across an entry about BlueHost’s no-notice termination of 2D Boy’s hosting.
And that got me thinking about “unlimited” hosting plans, throttling, limits, and the other challenges of running mass-market hosting — all issues also directly applicable to cloud computing.
BlueHost is a large and reputable provider of mass-market shared hosting. Their accounts are “unlimited”, and their terms of service essentially says that you can consume resources until you negatively impact other customers.
Now, in practice there are limits, and customers are sort of expected to know whether or not their needs fit shared hosting. Most people plan accordingly — although there have been some spectacular failures to do so, such as Sk*rt, a female-focused Digg competitor launched using BlueHost, prompting vast wonder at what kind of utter lack of thought results in trying to launch a high-traffic social networking site on a $7 hosting plan. Unlike Sk*rt, though, it was reasonable for 2D Boy to expect that shared hosting would cover their needs — hosting a small corporate site and blog. They were two guys who were making an indie garage game getting a gradual traffic ramp thanks to word-of-mouth, not an Internet company doing a big launch.
Limits are necessary, but no-notice termination of a legitimate company is bad customer service, however you slice it. Moreover, it’s avoidable bad customer service. Whatever mechanism is used to throttle, suspend service, etc. ought to be adaptable to sending out a warning alert: the “hey, if you keep doing this, you will be in violation of our policies and we’ll have to terminate you” note. Maybe even a, “hey, we will continue to serve your traffic for $X extra, and you have Y time to find a new host or reduce your traffic to normal volumes”. BlueHost does not sell anything beyond its $7 plan, so it has no upsell path; a provider with an upgrade path would hopefully have tried to encourage a migration, rather than executing a cold-turkey cut-off. (By the way, I have been on the service provider side of this equation, so I have ample sympathy for the vendor’s position against a customer whose usage is excessive, but I also firmly believe that no-notice termination of legitimate businesses is not the way to go.)
Automated elastic scaling is the key feature of a cloud, and consequently, limits and the way that they’re enforced technically and managed from a customer service standpoint, will be one of the ways that cloud infrastructure providers differentiate their services.
A vendor’s approach to limits has to be tied to their business goals. Similarly, what a customer desires out of limits must also be tied to their business goals. The customer wants reliable service within a budget. The vendor wants to be fairly compensated and ensure that his infrastructure remains stable.
Ideally, on cloud infrastructure, a customer scales seamlessly and automatically until the point where he is in danger of exceeding his budget. At that point, the system should alert him automatically, allowing him to increase his budget. If he doesn’t want to pay more, he will experience degraded service; degradation should mean slower or lower-priority service, or an automatic switch to a “lite” site, rather than outright failure.
Perhaps when you get right down to it, it’s really about what the failure mode is. Fail friendly. A vendor has a lot more flexibility in imposing limits if it can manage that.