Google’s pricing for App Engine
Google made a number of App Engine-related announcements earlier this week. The most notable of these was a preview of the future paid service, which allows you to extend App Engine’s quotas. Google has previously hinted at pricing, and at their developer conference this past May, they asserted that effectively, the first 5 MPV (million page views) are free, and thereafter, it’d be about $40 per MPV.
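To make the arithmetic concrete, here is a rough sketch of that pricing as a function. The two constants come straight from the figures above; the function name is mine, and the real bill will be metered per resource rather than per page view, so treat this as a back-of-the-envelope estimate only.

```python
FREE_MPV = 5.0        # million page views served at no charge (figure quoted above)
PRICE_PER_MPV = 40.0  # approximate dollars per million page views beyond that

def estimated_cost(page_views_millions):
    """Rough dollar cost for a given traffic volume, in millions of page views."""
    billable = max(0.0, page_views_millions - FREE_MPV)
    return billable * PRICE_PER_MPV

print(estimated_cost(7.5))
```

Anything at or under 5 MPV comes out to zero, and each additional million page views adds roughly $40.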
The problem is not the price. It’s the way that the quotas are structured. Basically, it looks like Google is going to allow you to raise the quota caps, paying for however much you go over, but never to exceed the actual limit that you set. That means Google is committing itself to a quota model, not backing away from it.
Let me explain why quotas suck as a way to run your business.
Basically, the way App Engine’s quotas work is like this: As you begin to approach the limit (currently Google-set, but eventually set by you), Google will start denying requests. If you’re reaching the limit of a metered API call, when your app tries to make that call, Google will raise an exception, which your app can catch and handle; inelegant, but at least something you can present to the user as a handled error. However, if you’re reaching a more fundamental limit, like bandwidth, Google will begin answering page requests with the 403 HTTP status code. A 403 prevents your user from getting the page at all, and there’s no elegant way to handle it in App Engine (no custom error pages).
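The "catch and handle" case looks roughly like the sketch below. Everything here is a stand-in so the example runs on its own: the `OverQuotaError` class, the `send_welcome_email` call, and the choice of a 503 response are my assumptions for illustration, not App Engine's actual API.

```python
class OverQuotaError(Exception):
    """Hypothetical stand-in for the exception raised by a metered call that is over quota."""

def send_welcome_email(address, over_quota=False):
    # Hypothetical metered API call; the flag simulates the quota being exhausted.
    if over_quota:
        raise OverQuotaError("mail quota exhausted")
    return "sent to " + address

def handle_signup(address, over_quota=False):
    """Return an (HTTP status, message) pair instead of letting the error escape."""
    try:
        send_welcome_email(address, over_quota)
        return 200, "Welcome! Check your inbox."
    except OverQuotaError:
        # The user sees a handled error, but the work they asked for still did not happen.
        return 503, "We could not send your welcome email right now. Please try again later."
```

Note that this only covers the metered-API case. When the limit is on something like bandwidth, the 403 is returned before your code ever runs, so there is nothing to catch.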
As you approach quota, Google tries to budget your requests so that only some of them fail. If you get a traffic spike, it’ll drop some of those requests so that it still has quota left to serve traffic later. (Steve Jones’ SOA blog chronicles quite a bit of empirical testing, for those who want to see what this “throttling” looks like in practice.)
The problem is, now you’ve got what are essentially random failures of your application. If you’ve got failing API calls, you’ve got to handle the error and your users will probably try again — exacerbating your quota problem and creating an application headache. (For instance, what if I have to make two database API calls to commit data from an operation, and the first succeeds but the second fails? Now I have data inconsistency, and thanks to API calls continuing to fail, quite possibly no way to fix it. Google’s Datastore transactions are restricted to operations on the same entity group, so transactions will not deal with all such problems.) Worse still, if you’ve got 403 errors, your site is functionally down, and your users are getting a mysterious error. As someone who has a business online, do you really want, under circumstances of heavy traffic, your site essentially failing randomly?
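Here is a small sketch of that two-write problem. The `FakeDatastore` class and its call counter are invented for illustration (the real Datastore API looks different); the point is only that the quota can run out between two writes that belong together.

```python
class OverQuotaError(Exception):
    """Hypothetical stand-in for an over-quota failure on a datastore call."""

class FakeDatastore:
    """Toy in-memory store that allows a fixed number of writes before failing."""
    def __init__(self, calls_left):
        self.calls_left = calls_left
        self.rows = {}

    def put(self, key, value):
        if self.calls_left <= 0:
            raise OverQuotaError(key)
        self.calls_left -= 1
        self.rows[key] = value

def place_order(store, order_id, item, stock_left):
    # Two writes that should succeed or fail together, but are not in one transaction.
    store.put("order:" + order_id, item)
    store.put("stock:" + item, stock_left)

store = FakeDatastore(calls_left=1)  # quota runs out after the first write
try:
    place_order(store, "1001", "widget", 9)
    outcome = "committed"
except OverQuotaError:
    outcome = "partial write"

print(outcome, sorted(store.rows))
```

The order record exists but the stock count was never updated, and any cleanup write you attempt will hit the same exhausted quota.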
Well, one might counter, if you don’t want that to happen, just set your quota limits really really high — high enough that you never expect a request to fail. The problem with that, though, is that if you do it, you have no way to predict what your costs actually will be, or to throttle high traffic in a more reasonable way.
If you’re on traditional computing infrastructure, or, say, a cloud like Amazon EC2, you decide how many servers to provision. Chances are that under heavy traffic, your site performance would degrade, but you would not get random failures. And you would certainly not get random failures outside of the window of heavy traffic. The quota system Google uses means that you could get past the spike, have enough quota left to serve traffic for most of the rest of the day, but still cross the over-quota-random-drop threshold later in the day. You’d have to go micro-manage (temporarily adjusting your allowable quota after a traffic spike, say) or just accept a chance of failure. Either way, it is a terrible way to operate.
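A toy simulation shows how failures can land hours after the spike. To be clear, this is my own simplified model and not Google's actual algorithm: I assume throttling starts once half the daily quota is used, and that from then on the remaining quota is rationed evenly across the hours left in the day. The threshold, the numbers, and the function name are all made up for illustration.

```python
def simulate_day(hourly_demand, daily_quota, throttle_at=0.5):
    """Return the number of dropped requests for each hour of the day."""
    remaining = float(daily_quota)
    dropped = []
    for hour, demand in enumerate(hourly_demand):
        hours_left = len(hourly_demand) - hour
        used = daily_quota - remaining
        if used >= throttle_at * daily_quota:
            # Assumed behavior: ration what is left evenly over the rest of the day.
            cap = remaining / hours_left
        else:
            cap = remaining
        served = min(demand, cap)
        remaining -= served
        dropped.append(demand - served)
    return dropped

calm_day = [100] * 24
spike_day = [100] * 24
spike_day[2] = 1500  # one-hour traffic spike early in the day

calm_drops = simulate_day(calm_day, daily_quota=3600)
spike_drops = simulate_day(spike_day, daily_quota=3600)
print(spike_drops)
```

In this model the spike hour itself is served in full. The drops begin two hours later and continue until midnight: 10 out of every 100 requests fail during perfectly ordinary traffic, while the calm day never drops anything.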
This is yet another example of how Google App Engine is not and will not be near-term ready for prime-time, and how more broadly, Google is continuing to fail to understand the basic operational needs of people who run their businesses online. It’s not just risk-averse enterprises who can’t use something with this kind of problem. It’s the start-ups, too. Amazon has set a very high bar for reliability and understanding of what you need to run a business online, and Google is devoting lots of misdirected technical acumen to implementing something that doesn’t hit the mark.
Posted on December 19, 2008, in Infrastructure and tagged cloud, Google, hosting. 2 Comments.
Your concerns are understandable. But with billing enabled, App Engine applications will be able to consume their budgets at the rate they choose. There will be an upper limit in place for the sake of safety, but this will affect very few applications. The documentation covers this idea in the section on burst limits. More detail will be added when billing launches.
My concern is the budget itself, as opposed to the absolute limits.
Businesses need to be able to plan, so they’ve got to be able to set limits, of course. In fact, most businesses will probably choose to set relatively narrow absolute limits. We’ve found that even deep-pocketed businesses are highly sensitive about knowing what range their bills in a given month are going to fall into.
Most of the time, when you plan system capacity within a budget, you basically decide that when your traffic exceeds particular boundaries, you’re going to simply deal with the fact that it will be slow for a while — i.e., you’re accepting some level of degradation of service. But normally you architect so that it’s slow rather than actually failing.
The thing about the quota system is that under temporary anomalous load, you don’t get degradation. You get throttling, as App Engine tries to prevent you from eating all your resource allocation at once, and throttling means failures. You also end up with only *some* requests failing rather than *all* of them, which may actually lead to more rather than less user frustration.
What the system needs is graceful degradation — where the app is still served but under less desirable circumstances, or a “lite” site (designed by the content owner) is automatically switched to, or the like. I hope that there will be an alerting mechanism, as well.
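As a sketch of what I mean by graceful degradation, the platform (or the app, if it were given the information) could step down through cheaper versions of the site as the budget is consumed. The 80% threshold and the variant names below are hypothetical; nothing like this exists in App Engine as far as I know.

```python
def choose_response(used_units, budget_units, lite_percent=80):
    """Pick which version of the site to serve, given how much budget is used."""
    if used_units >= budget_units:
        # Budget gone: serve an owner-designed static notice rather than a bare error.
        return "static-notice"
    if used_units * 100 >= budget_units * lite_percent:
        # Nearing the budget: switch to the cheaper "lite" site.
        return "lite"
    return "full"

print(choose_response(85, 100))
```

Every visitor still gets a page. The experience gets worse as the budget runs down, but it gets worse predictably and for everyone, instead of failing at random for some.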
I think the system also needs greater transparency, so application owners can understand exactly the circumstances under which App Engine will decide that they’re in danger of running out of resources, as well as clear warnings (and alerts) generated when the app begins to approach the threshold (i.e., before throttling begins). Similarly, when an app gets throttled, there ought to be a way for the content owner to grant temporary resource bumps for that day, without changing his whole budgeting mechanism — and for the control panel to clearly indicate how much resource needs to be allocated in order to ensure that App Engine doesn’t start to throttle again. It’s the unpredictability of the throttling that’s part of the problem.