A few days ago, an unexpected side-effect of some new code caused a major Gmail outage. Last year, a small bug triggered a series of cascading failures that resulted in a major Amazon outage. These are not the first cloud failures, nor will they be the last.
Cloud failures are as complex as the underlying software that powers them. No longer do you have isolated systems; you have complex, interwoven ecosystems, delicately orchestrated by a swarm of software programs. In presenting simplicity to the user, the cloud provider takes on the burden of dealing with that complexity themselves.
People sometimes say that these clouds aren’t built to enterprise standards. In one sense, they aren’t — most aren’t intended to meet enterprise requirements in terms of feature-set. In another sense, though, they are engineered to far exceed anything that the enterprise would ever think of attempting themselves. Massive-scale clouds are designed to never, ever, fail in a user-visible way. The fact that they do fail nonetheless should not be a surprise, given the potential for human error encoded in software. It is, in fact, surprising that they don’t visibly fail more often. Every day, within these clouds, a whole host of small errors that would be outages if they occurred within the enterprise — server hardware failures, storage failures, network failures, even some software failures — are handled invisibly by the back-end. Most of the time, the self-healing works the way it’s supposed to. Sometimes it doesn’t. The irony in both the Gmail outage and the S3 outage is that both appear to have been caused by the very software components that were actively trying to create resiliency.
To run infrastructure on a massive scale, you are utterly dependent upon automation. Automation, in turn, depends on software, and no matter how intensively you QA your software, you will have bugs. It is extremely hard to test complex multi-factor failures. There is nothing that indicates that either Google or Amazon are careless about their software development processes or their safeguards against failure. They undoubtedly hate failure as much as, and possibly more than, their customers do. Every failure means sleepless nights, painful internal post-mortems, lost revenue, angry partners, and embarrassing press. I believe that these companies do, in fact, diligently seek to seamlessly handle every error condition they can, and that they generally possess sufficient quantity and quality of engineering talent to do it well.
But the nature of the cloud — the one homogeneous fabric — magnifies problems. Still, that’s not isolated to the cloud alone. Let’s not forget VMware’s license bug from last year. People who normally booted up their VMs at the beginning of the day were pretty much screwed. It took VMware the better part of a day to produce a patch — and their original announced timeframe was 36 hours. I’m not picking on VMware — certainly you could find yourself with a similar problem with any kind of widely deployed software that was vulnerable to a bug that caused it all to fail.
Enterprise-quality software produced the SQL Slammer worm, after all. In the cloud, we ain’t seen nothing yet…
I heard a radio ad today for Google Federal. It sounded like every other “please, government IT purchasing person, buy our stuff” ad that you hear on news radio in Washington DC. It was a far cry from the sort of ad that one expects to hear from Google, and to hear a federal-targeted ad from them, period, was sort of fascinating.
The Federal government can still afford stuff (and is probably one of the few bright spots in purchasing this year, period). It’s the states that are screwed, and seriously thinking about alternatives to traditional IT.
My colleague Tom Austin has posted a call for Google to be more transparent about enterprise usage of Google Apps. This was triggered by a TechCrunch article on Google’s reduction of the number of free users for a given Google Apps account.
I’ve been wondering how many businesses use Google Apps almost exclusively for messaging, and how many of them make substantial use of collaboration. My expectation is that a substantial number of the folks with custom domains on Google Apps solely or almost-solely do email or email forwarding. For instance, for my WordPress.com-hosted blog, I have no option for email for that domain other than via Google Apps, because WordPress.com has explicit MX record support for them and nobody else — so I use that to forward email for that domain to my regular email account. Given how heavily bloggers have driven domain registrations and “vanity” domains, I’d expect Google Apps to be wrapped up pretty heavily in that phenomenon. This is not to discount the small business, of course, whose usage of this kind of service also becomes more significant over time.
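For reference, pointing a custom domain’s mail at Google Apps amounted to a handful of MX records along these lines (zone-file format; the hostnames are as Google documented them around that time — verify against the current list before relying on them):

```
example.com.  IN  MX  1   ASPMX.L.GOOGLE.COM.
example.com.  IN  MX  5   ALT1.ASPMX.L.GOOGLE.COM.
example.com.  IN  MX  5   ALT2.ASPMX.L.GOOGLE.COM.
```

This is exactly the kind of configuration WordPress.com pre-baked for Google Apps and nobody else.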
Those statistics aside, though, and going back to Tom’s thoughts on transparency, I think he’s right, if Google intends to court the enterprise market in the way that the enterprise is accustomed to being courted. I am uncertain if Google intends that, though, especially when fighting more featureful, specialized vendors in order to get an enterprise clientele is likely a waste of resources at the moment. The type of enterprise who is going to adopt this kind of solution is probably not the kind of enterprise who wants to see a bunch of case studies and feel reassured by them; they’re independent early adopters with high tolerance for risk. (This goes back to a point I made in a previous post: Enterprise IT culture tends to be about risk mitigation.)
Google made a number of App Engine-related announcements earlier this week. The most notable of these was a preview of the future paid service, which allows you to extend App Engine’s quotas. Google has previously hinted at pricing, and at their developer conference this past May, they asserted that effectively, the first 5 MPV (million page views) are free, and thereafter, it’d be about $40 per MPV.
The problem is not the price. It’s the way that the quotas are structured. Basically, it looks like Google is going to allow you to raise the quota caps, paying for however much you go over, but never to exceed the actual limit that you set. That means Google is committing itself to a quota model, not backing away from it.
Let me explain why quotas suck as a way to run your business.
Basically, the way App Engine’s quotas work is like this: As you begin to approach the limit (currently Google-set, but eventually set by you), Google will start denying those requests. If you’re reaching the limit of a metered API call, when your app tries to make that call, Google will return an exception, which your app can catch and handle; inelegant, but at least something you can present to the user as a handled error. However, if you’re reaching a more fundamental limit, like bandwidth, Google will begin returning page requests with the 403 HTTP status code. 403 is an error that prevents your user from getting the page at all, and there’s no elegant way to handle it in App Engine (no custom error pages).
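To make the “catchable” failure mode concrete, here’s a minimal sketch using a stand-in exception class (App Engine’s real over-quota exception lives in its runtime package; nothing below is actual GAE code, just the handling pattern):

```python
class OverQuotaError(Exception):
    """Stand-in for App Engine's over-quota exception; the real one
    lives in the GAE runtime package. Used here only for illustration."""

def fetch_profile(user_id, datastore_get):
    # Wrap a metered API call so quota exhaustion degrades into a page
    # we control, instead of leaking a raw exception to the user.
    try:
        return {"status": 200, "body": datastore_get(user_id)}
    except OverQuotaError:
        # Handled error: ugly, but at least it's our own error page --
        # unlike the bare 403 served when a bandwidth quota runs out.
        return {"status": 503,
                "body": "Temporarily over capacity; please try again soon."}

def ok_get(user_id):
    return {"id": user_id, "name": "demo"}

def over_quota_get(user_id):
    raise OverQuotaError()
```

The bandwidth case has no equivalent hook: the 403 happens before your code ever runs.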
As you approach quota, Google tries to budget your requests so that only some of them fail. If you get a traffic spike, it’ll drop some of those requests so that it still has quota left to serve traffic later. (Steve Jones’ SOA blog chronicles quite a bit of empirical testing, for those who want to see what this “throttling” looks like in practice.)
The problem is, now you’ve got what are essentially random failures of your application. If you’ve got failing API calls, you’ve got to handle the error and your users will probably try again — exacerbating your quota problem and creating an application headache. (For instance, what if I have to make two database API calls to commit data from an operation, and the first succeeds but the second fails? Now I have data inconsistency, and thanks to API calls continuing to fail, quite possibly no way to fix it. Google’s Datastore transactions are restricted to operations on the same entity group, so transactions will not deal with all such problems.) Worse still, if you’ve got 403 errors, your site is functionally down, and your users are getting a mysterious error. As someone who has a business online, do you really want, under circumstances of heavy traffic, your site essentially failing randomly?
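The data-inconsistency scenario is worth spelling out. Below is a toy illustration — the `flaky_put` and `transfer` helpers are hypothetical stand-ins for datastore calls, not GAE APIs — showing how one logical operation split across two metered writes leaves half-committed state when the second write hits quota:

```python
class OverQuotaError(Exception):
    """Stand-in for a metered API call being denied at the quota limit."""

def flaky_put(store, key, value, fail):
    # Simulates a datastore write that may be rejected for quota reasons.
    if fail:
        raise OverQuotaError()
    store[key] = value

def transfer(store, fail_second):
    # One logical operation committed as two separate puts. If the
    # second put is denied, the first has already landed. (Datastore
    # transactions only span a single entity group, so they don't help
    # when the writes cross groups.)
    flaky_put(store, "debit", -10, fail=False)
    flaky_put(store, "credit", +10, fail=fail_second)

store = {}
try:
    transfer(store, fail_second=True)
except OverQuotaError:
    pass
# The debit landed but the credit didn't -- inconsistent state, and if
# API calls keep failing, no immediate way to repair it.
```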
Well, one might counter, if you don’t want that to happen, just set your quota limits really really high — high enough that you never expect a request to fail. The problem with that, though, is that if you do it, you have no way to predict what your costs actually will be, or to throttle high traffic in a more reasonable way.
If you’re on traditional computing infrastructure, or, say, a cloud like Amazon EC2, you decide how many servers to provision. Chances are that under heavy traffic, your site performance would degrade — but you would not get random failures. And you would certainly not get random failures outside of the window of heavy traffic. The quota system under use by Google means that you could get past the spike, have enough quota left to serve traffic for most of the rest of the day, but still cross the over-quota-random-drop threshold later in the day. You’d have to micro-manage (temporarily adjusting your allowable quota after a traffic spike, say) or just accept a chance of failure. Either way, it is a terrible way to operate.
This is yet another example of how Google App Engine is not and will not be near-term ready for prime-time, and how more broadly, Google is continuing to fail to understand the basic operational needs of people who run their businesses online. It’s not just risk-averse enterprises who can’t use something with this kind of problem. It’s the start-ups, too. Amazon has set a very high bar for reliability and understanding of what you need to run a business online, and Google is devoting lots of misdirected technical acumen to implementing something that doesn’t hit the mark.
Basically, Google is trying to convince broadband providers to let it place caches within their networks — effectively, pursuing the same architecture that a deep-footprint CDN like Akamai uses, but for Google content alone.
Much of the commentary around this seems to center on the idea that if Google can use this to obtain better performance for its content and applications, everyone else is at a disadvantage and it’s a general stab to net neutrality. (Even Om Malik, who is not usually given to mindless panic, asserts, “If Google can buy better performance for its service, your web app might be at a disadvantage. If the cost of doing business means paying baksheesh to the carriers, then it is the end of innovation as we know it.”)
I think this is an awful lot of hyperbole. Today, anyone can buy better performance for their Web content and applications by paying money to a CDN. And in turn, the CDNs pay baksheesh, if you want to call it that, to the carriers. Google is simply cutting out the middleman, and given that it accounts for more traffic on the Internet than most CDNs do, it’s neither illogical nor commercially unreasonable.
Other large content providers — Microsoft and AOL notably on a historical basis — have built internal CDNs in the past; Google is just unusual in that it’s attempting to push those caches deeper into the network on a widespread basis. I’d guess that it’s YouTube, more than anything else, that’s pushing Google to make this move.
This move is likely driven at least in part by the fact that most of the broadband providers simply don’t have enough 10 Gbps ports for traffic exchange (and space and power constraints in big peering points like Equinix’s aren’t helping matters, making it artificially hard for providers to get the expansions necessary to put big new routers into those facilities). Video growth has sucked up a ton of capacity. Google, and YouTube in particular, is a gigantic part of video traffic. If Google is offering to alleviate some of that logjam by putting its servers deeper into a broadband provider’s network, that might be hugely attractive from a pure traffic engineering standpoint. And providers likely trust Google to have enough remote management and engineering expertise to ensure that those cache boxes are well-behaved and not annoying to host. (Akamai has socialized this concept well over much of the last decade, so this is not new to the providers.)
I suspect that Google wouldn’t even need to pay to do this. For the broadband providers, the traffic engineering advantages, and the better performance to end-users, might be enough. In fact, this is the same logic that explains why Akamai doesn’t pay for most of its deep-network caches. It’s not that this is unprecedented. It’s just that this is the first time that an individual content provider has reached the kind of scale where they can make the same argument as a large CDN.
The cold truth is that small companies generally do not enjoy the same advantages as large companies. If you are a small company making widgets, chances are that a large company making widgets has a lower materials cost than you do, because they are getting a discount for buying in bulk. If you are a small company doing anything whatsoever, you aren’t going to see the kind of supplier discounts that a large company gets. The same thing is true for bandwidth — and for that matter, for CDN services. And big companies often leverage their scale into greater efficiency, to boot; for instance, unsurprisingly, Gartner’s metrics data shows that the average cost of running servers drops as you get more servers in your data center. Google employs both scale and efficiency leverage.
One of the key advantages of the emerging cloud infrastructure services, for start-ups, is that such services offer the leverage of scale, on a pay-by-the-drink basis. With cloud, small providers can essentially get the advantage of big providers by banding together into consortiums or paying an aggregator. However, on the deep-network CDN front, this probably won’t help. Highly distributed models work very well for extremely popular content. For long-tail content, cache hit ratios can be too low for it to be really worthwhile. That’s why it’s doubtful that you’ll see, say, Amazon’s CloudFront CDN push deep rather than continuing to follow a megaPOP model.
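The hit-ratio point is easy to demonstrate with a toy simulation (all numbers here are illustrative, not measurements of any real CDN): an edge cache does beautifully on head-heavy traffic and terribly on a uniform long tail.

```python
import random
from collections import OrderedDict

def hit_ratio(requests, cache_size):
    """Hit ratio of a simple LRU cache over a request stream."""
    cache, hits = OrderedDict(), 0
    for item in requests:
        if item in cache:
            hits += 1
            cache.move_to_end(item)       # mark as recently used
        else:
            cache[item] = True
            if len(cache) > cache_size:
                cache.popitem(last=False) # evict least recently used
    return hits / len(requests)

rng = random.Random(42)
catalog = 100_000  # total distinct objects

# Head-heavy traffic: everyone watches the same few hot videos.
head = [rng.randint(0, 99) for _ in range(50_000)]
# Long-tail traffic: requests spread evenly across the whole catalog.
tail = [rng.randint(0, catalog - 1) for _ in range(50_000)]

popular_ratio = hit_ratio(head, cache_size=1000)   # near 1.0
longtail_ratio = hit_ratio(tail, cache_size=1000)  # near cache/catalog, ~1%
```

With a 1,000-object cache, the hot-content stream hits almost every time, while the long-tail stream hits roughly one percent of the time — which is why deep caches make sense for YouTube-style traffic and not for a general-purpose long-tail CDN.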
Ironically, because caching techniques aren’t as efficient for small content providers, it might actually be useful to them to be able to buy bandwidth at a higher QoS.
Google announced something very interesting yesterday: their Native Client project.
The short form of what this does: You can develop part or all of your application client in a language that compiles down to native code (for instance, C or C++, compiled to x86 assembly), then let the user run it in their browser, in a semi-sandboxed environment that theoretically prevents malicious code from being executed.
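As a vastly simplified analogy for the approach (the real system verifies actual x86 machine code, instruction alignment, and jump targets; this toy just conveys the shape of the idea): before running untrusted code, statically scan it and reject anything outside a whitelisted subset of operations.

```python
# Toy analogy for static validation of untrusted native code: nothing
# runs unless every instruction is in an approved subset. All names
# here are invented for illustration.
SAFE_OPS = {"add", "sub", "mul", "load", "store", "jump_aligned"}

def validate(instructions):
    """Return True only if every symbolic instruction is whitelisted."""
    return all(op in SAFE_OPS for op in instructions)

assert validate(["add", "load", "store"])
assert not validate(["add", "syscall"])  # direct syscalls are forbidden
```

The hard part, of course, is doing this soundly for real machine code, where attackers can try to hide instructions inside other instructions — hence the enormous security challenge.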
It’s an ambitious project, not to mention one that is probably making every black-hat hacker on the planet drool right now. The security challenges inherent in this are enormous.
Adobe has previously had a similar thought, in the form of Alchemy, a labs project for a C/C++ compiler that generates code for AVM2 (the virtual machine inside the Flash player). But Google takes the idea all the way down to true native code.
The broader trend has been towards managed code environments and just-in-time compilers (JITs). But the idea of native code with managed-code-like protections is certainly extremely interesting, and the techniques developed will likely be interesting in the broader context of malware prevention in non-browser applications, too.
And while we’re talking about lower-level application infrastructure pies that Google has its fingers in, it’s worth noting that Google has also exhibited significant interest in LLVM (which stands for Low-Level Virtual Machine). LLVM is an open-source project now sponsored by Apple, who hired its developer and is now using it within MacOS X. In layman’s terms, LLVM makes it easier for developers to write new programming languages, and makes it possible to develop composite applications using multiple programming languages. A compiler or interpreter developer can generate LLVM instructions rather than compiling to native code, then let LLVM take care of dealing with the back-end, the final stage of getting it to run natively. But LLVM also makes it easier to do analysis of code, something that is going to be critical if Google’s efforts with Native Client are to succeed. I am somewhat curious if Google’s interests intersect here, or if they’re entirely unrelated (not all that uncommon in Google’s chaotic universe).
There’s no free lunch on the Internet. That’s the title of a research note that I wrote over a year ago, to explain the peering ecosystem to clients who wanted to understand how the money flows. What we’ve got today is the result of a free market. Precursor’s Scott Cleland thinks that’s unfair — he claims Google uses 21 times more bandwidth than it pays for. Now, massive methodological flaws in his “study” aside, his conclusions betray an utter lack of understanding of the commercial arrangements underlying today’s system of Internet traffic exchange in the United States.
Internet service providers (whether backbone providers or broadband providers) offer bandwidth at a particular price, or a settlement-free peering arrangement. Content providers negotiate for the lowest prices they can get. ISPs interconnect with each other for a fee, or settlement-free. And everyone’s trying to minimize their costs.
So, let’s say that you’re a big content provider (BCP). You, Mr. BCP, want to pay as little for bandwidth as possible. So if you’ve got enough clout, you can go to someone with broadband eyeballs, like Comcast, and say, “Please can I have free peering?” And Comcast will look at your traffic, and say to itself, “Hmm. If I don’t give you free peering, you’ll go buy bandwidth from someone like Level 3, and I will have to take your singing cow videos over my peer with them. That will increase my traffic there, which will have implications for my traffic ratios, which might mean that Level 3 would charge me for the traffic. It’s better for me to take your traffic directly (and get better performance for my end-users, too) than to congest my other peer.”
That example is a cartoonishly grotesque oversimplification, but you get the idea: Comcast is going to consider where your traffic is flowing and decide whether it’s in their commercial interest to give you settlement-free peering, charge you a low rate for bandwidth, or tell you that you have too little traffic and you can pay them more money or buy from someone else. They’re not carrying your traffic as some kind of act of charity on their part. Money is changing hands, or the parties involved agree that the arrangement is fairly reciprocal and therefore no money needs to change hands.
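The decision logic resembles the published peering policies many large ISPs maintain: enough traffic volume to justify a port, and in/out traffic roughly balanced. A toy sketch (thresholds and outcomes are invented for illustration; real policies also weigh geography, backbone costs, and politics):

```python
def peering_decision(out_gbps, in_gbps, min_volume_gbps=1.0, max_ratio=2.0):
    """Caricature of an ISP's peering policy: small players buy transit,
    lopsided traffic pays, roughly reciprocal traffic peers for free."""
    volume = out_gbps + in_gbps
    if volume < min_volume_gbps:
        return "buy transit"          # too little traffic to bother peering
    ratio = max(out_gbps, in_gbps) / max(min(out_gbps, in_gbps), 0.001)
    if ratio > max_ratio:
        return "paid peering"         # unbalanced: someone pays
    return "settlement-free peering"  # reciprocal: no money changes hands
```

A content giant pushing balanced-enough volume lands in the last branch, which is the whole argument Google can now make that a small content provider cannot.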
Cleland’s suggestion that Google is somehow being subsidized by end-users or by the ISPs is ludicrous. Google isn’t forcing anyone to peer with them, or holding a gun to anyone’s head to sell them cheap bandwidth. Providers are doing it because it’s a commercially reasonable thing to do. And users are paying for Internet access — and part of the value that they’re paying for is access to Google. The cost of accessing content is implicit in what users are paying.
Now, are users paying too little for what they get? Maybe. But nobody forced the ISPs to sell them broadband at low prices, either. Sure, the carriers and cable companies are in a price war — but this is capitalism. It’s a free market. A free market is not license to act stupidly and hope that there’s a bailout coming down the road. If you, a vendor, price a service below what it costs, expect to pay the piper eventually. Blaming content providers for not paying their “fair share” is nothing short of whining about a commercial situation that ISPs have gotten themselves into and continue to actively promote.
Google has posted a response to Cleland’s “research” that’s worth reading, as are the other commentaries it links to. I’ll likely be posting my own take on the methodological flaws and dubious facts, as well.
Walt Mossberg of the Wall Street Journal has a detailed first look. Andrew Garcia of eWeek has a lengthy review. John Brandon of Computerworld has a first look and review round-up. But the reviews thus far have been focused on the core phone functionality, and it’s not clear to what extent the available third-party apps explore the capabilities of Android.
I am personally looking forward to checking out the new phone. I was an early user of the T-Mobile Sidekick (aka the Danger Hiptop), and I loved its rendering of webpages (and its smart proxy that reduced image sizes, did reformatting, and so on), its useful keyboard, its generally easy-to-use functionality, and the fact that it stored all of its data on the network, removing the need to ever back up the device. I was disappointed when the company did not follow through on its promise of broad third-party apps; despite the release of an SDK and an app store, you couldn’t use third-party apps without voiding your warranty.
These days I carry a corporate-issued Cingular 8525 (aka HTC Hermes), but despite it being a very powerful Windows Mobile smartphone, I actually use fewer apps than I did on my Sidekick. I use my phone to tether my laptop, for SSH access to my home network, and for basic functionality (calls, SMS, browser), but despite one of the best keyboards of any current smartphone it’s still not good enough for real note-taking (with serious annoyances like the lack of a double-quote key), the browser falls well short of the Sidekick’s, the lack of network storage means I’m reluctant to trust myself to put a lot of data on it, and the UI is uninspired. So I’m quite eager to see what Android, which represents the next generation of thinking from the key figures of the Sidekick team, is going to be able to do for me. But I don’t want to return to T-Mobile (and I need AT&T for our corporate plan anyway), which means I’m going to be stuck waiting.
On another note, I’m wondering how many Android developers will choose to put the back-ends of their applications on Google App Engine. Browsing around, it seems like developers are worried about exceeding GAE quotas — everyone likes to think their app will be popular, and quota-exceeded messages are deadly, since they are functionally equivalent to downtime. GAE also requires development in Python, whereas Android requires development in Java, but I suspect that’s probably not too significant.
I haven’t really seen anything on hosting for iPhone applications, thus far, except for Morph using it as a marketing ploy. (Morph seems to be a cloud infrastructure overlay provider leveraging Amazon EC2 et al.)
Hosting the back-end for mobile apps is really no different than hosting any other kind of application, of course, but I’m curious what service providers are turning out to be popular for them. Such hosting providers could also potentially offer value-adds like mobile application acceleration, especially for enterprise-targeted mobile apps.
We’ve now hit the six-month mark on Google App Engine. And it’s still in beta. Few of the significant shortcomings in making GAE production-ready for “real applications” have been addressed.
In an internal Gartner discussion this past summer, I wrote:
The restrictions of the GAE sandbox are such that people writing complex, commercial Web 2.0 applications are quickly going to run into things they need and can’t have. Using your own domain requires Google Apps. The ability to do network callouts is minimal, which means that integrating with anything that’s not on GAE ranges from limited to potentially impossible (and their URL fetcher can’t even do basic HTTP authentication). Everything has to be spawned via an HTTP request and all such requests must be short-lived, so you cannot run any persistent or cron-started background processes; this is a real killer since you cannot do any background maintenance. Datastore write performance is slow; so are large queries. The intent is that nothing you do is computationally expensive, and this is strictly enforced. You can’t do anything that accesses the filesystem. There’s a low limit to the total number of files allowed, and the largest possible file size is a mere 1 MB (and these limits are independent of the storage limit; you will be able to buy more storage but it looks like you won’t be allowed to buy yourself out of limitations like these). And so on.
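The basic-auth gap, at least, has a well-known workaround if the fetcher lets you set custom headers (whether a given fetcher accepts an Authorization header is a separate question): construct the header by hand. A self-contained sketch, not GAE-specific:

```python
import base64

def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header manually, for URL
    fetchers that accept arbitrary headers but have no native support
    for HTTP authentication."""
    token = base64.b64encode(
        "{0}:{1}".format(username, password).encode("utf-8")
    ).decode("ascii")
    return {"Authorization": "Basic " + token}

# e.g. pass basic_auth_header("user", "pass") as the headers argument
# of whatever fetch call the platform provides.
```

It works, but it’s exactly the sort of thing a platform courting commercial developers shouldn’t make you hand-roll.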
Presumably over time Google will lift at least some of these restrictions, but in the near term, it seems unlikely to me that Web 2.0 startups will make commitments to the platform. This is doubly true because Google is entirely in control of what the restrictions will be in the future, too. I would not want to be the CTO in the unpleasant position of having my business depend on the Web 2.0 app my company’s written to the GAE framework, discovering that Google had just changed its mind and decided to enforce tighter restrictions that now prevented my app from working / scaling.
GAE, at least in the near term, suits apps that are highly self-contained, and very modest in scope. This will suit some Web 2.0 start-ups, but not many, in my opinion. GAE has gone for simplicity rather than power, at present, which is great if you are building things in your free time but not so great if you are hoping to be the next MySpace, or even 37Signals (Basecamp).
Add to that the issues about the future of Python. Python 3.0 — the theoretical future of Python — is very different from the 2.x branch. 3.0 support may take a while. So might support for the transition version, 2.6. The controversy over 3.0 has bifurcated the Python community at a time when GAE is actually helping to drive Python adoption, and it leaves developers wondering whether they ought to be thinking about GAE on 2.5 or GAE on 3.0 — or if they can make any kind of commitment to GAE at all with so much uncertainty.
These issues and more have been extensively explored by the blogosphere. The High Scalability blog’s aggregation of the most interesting posts is worth a look from anyone interested in the technical issues that people have found.
Google has been more forthcoming about the quotas and how to deal with them. I’ve made the assumption that quota limitations will eventually be replaced by paid units. The more serious limitations are the ones that are not clearly documented, and have more recently come to light, like the offset limit and the fact that the 1 MB limit doesn’t just apply to files, it also applies to data structures.
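When a store caps object size, the standard workaround is to shard large values into sub-limit chunks and reassemble on read. A minimal sketch (the 1,000,000-byte cap is illustrative; it sidesteps the size limit but not the extra API calls each chunk costs you against quota):

```python
LIMIT = 1_000_000  # illustrative per-object size cap, roughly 1 MB

def chunk(blob, limit=LIMIT):
    """Split a byte string into pieces that each fit under a per-object
    size cap. Reassembly is just b"".join(pieces)."""
    return [blob[i:i + limit] for i in range(0, len(blob), limit)]

data = b"x" * 2_500_000
pieces = chunk(data)  # three pieces: 1 MB, 1 MB, 0.5 MB
```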
As this beta progresses, it becomes less and less clear what Google intends to limit as an inherent part of the business goals (and perhaps technical limitations) of the platform, and what they’re simply constraining in order to prevent their currently-free infrastructure from being voraciously gobbled up.
At present, Google App Engine remains a toy. A cool toy, but not something you can run your business on. Amazon, on the other hand, proved from the very beginning that EC2 was not a toy. Google needs to start doing the same, because you can bet that when Microsoft releases their cloud, they will pay attention to making it business-ready from the start.