CloudPundit: Massive-Scale Computing

the business of Internet infrastructure, cloud computing, and virtual worlds

Posts Tagged ‘Google’

Gmail, Macquarie, and US regulation

Posted by Lydia Leong on January 12, 2010

Google continues to successfully push Gmail into higher education, in an Australian deal with Macquarie University. (Microsoft is its primary competitor in this market, but for Microsoft, most such Live@edu deals represent cannibalization of their higher ed Exchange base.)

That, by itself, isn’t a particularly interesting announcement. Email SaaS is a huge trend, and the low-cost .edu offerings have been gaining particular momentum. What caught my eye was this:

The university was hesitant to move staff members on to Gmail due to regulatory and cost factors. They were concerned that their email messages would be subject to draconian US law. In particular, they were worried about protecting their intellectual property under the Patriot Act and Digital Millennium Copyright Act, Mr. Bailey said. “In the end, Google agreed to store that data under EU jurisdiction, which we accepted,” he said.

That tells us that Google can divide their data storage into zones if need be, as one would expect, but it also tells us that they can do so for particular customers (presumably, given Google’s approach to the world, as a configurable, automated thing, and not as a one-off).

However, the remark about the Patriot Act and DMCA is what really caught my attention. DMCA is a worry for universities (due to the high likelihood of pirated media), but USA PATRIOT is a significant worry for a lot of the non-US clients that I talk to about cloud computing, especially those in Europe — to the point where I speak with clients who won’t use US-based vendors, even if the infrastructure itself is in Europe. (Australian clients are more likely to end up with a vendor that has somewhat local infrastructure to begin with, due to the latency issues.)

Cross-border issues are a serious barrier to cloud adoption in Europe in general, often due to regulatory requirements to keep data within-country (or sometimes less stringently, within the EU). That will make it more difficult for European cloud computing vendors to gain significant operational scale. (Whether this will also be the case in Asia remains to be seen.)

But if you’re in the US, it’s worth thinking about how the Patriot Act is perceived outside the US, and how it and any similar measures will limit the desire to use US-based cloud vendors. A lot of US-based folks tell me that they don’t understand why anyone would worry about it, but the “you should just trust that the US government won’t abuse it” story plays considerably less well elsewhere in the world.

Posted in Industry

A hodgepodge of links

Posted by Lydia Leong on July 3, 2009

This is just a round-up of links that I’ve recently found to be interesting.

Barroso and Hölzle (Google): Warehouse-Scale Computing. This is a formal lecture-paper covering the design of what these folks from Google refer to as WSCs. They write, “WSCs differ significantly from traditional data centers: they belong to a single organization, use a relatively homogeneous hardware and system software platform, and share a common systems management layer. Often, much of the application, middleware, and system software is built in-house compared to the predominance of third-party software running in conventional data centers. Most importantly, WSCs run a smaller number of very large applications (or Internet services), and the common resource management infrastructure allows significant deployment flexibility.” The paper is wide-ranging but written to be readily understandable by the mildly technical layman. Highly recommended for anyone interested in cloud.

Washington Post: Metrorail Crash May Exemplify Automation Paradox. The WaPo looks back at serious failures of automated systems, and quotes a “growing consensus among experts that automated systems should be designed to enhance the accuracy and performance of human operators rather than to supplant them or make them complacent. By definition, accidents happen when unusual events come together. No matter how clever the designers of automated systems might be, they simply cannot account for every possible scenario, which is why it is so dangerous to eliminate ‘human interference’.” Definitely something to chew over in the cloud context.

Malcolm Gladwell: Priced to Sell. The author of The Tipping Point takes on Chris Anderson’s Free, and challenges the notion that information wants to be free. In turn, Seth Godin thinks Gladwell is wrong, and the book seems to be setting off some healthy debate.

Bruce Robertson: Capacity Planning Equals Budget Planning. My colleague Bruce riffs off a recent blog post of mine, and discusses how enterprise architects need to change the way they design solutions.

Martin English: Install SAP on Amazon Web Services. An interesting blog devoted to how to get SAP running on AWS. This is for people interested in hands-on instructions.

Robin Burkinshaw: Being homeless in the Sims 3. This blog tells the story, in words and images, of “Alice and Kev”, a pair of characters that the author (a game design student) created in the Sims 3. It’s a fascinating bit of user-generated content, and a very interesting take on what can be done with modern sandbox-style games.

Posted in Industry

Google and Salesforce.com

Posted by Lydia Leong on May 31, 2009

While I’ve been out of the office, Google has made some significant announcements. My colleague Ray Valdes has been writing about Google Wave and its secret sauce. I highly encourage you to go read his blog.

Google and Salesforce.com continue to build on their partnership. In April, they unveiled Salesforce for Google Apps. Now, they’re introducing Force.com for Google App Engine.

The announcement, in a nutshell, is this: There are now public Salesforce APIs that can be downloaded, and will work on Google App Engine (GAE). Those APIs are a subset of the functionality available in Force.com’s regular Web Services APIs. Check out the User Guide for details.

Note that this is not a replacement for Force.com and its (proprietary) Apex programming language. Salesforce clearly articulates web services vs. Force.com in its developer guide. Rather, this should be thought of as easing the curve for developers who want to extend their Web applications for use with Salesforce data.

A question that lingers in my mind: Normally, on Force.com, a Developer Edition account means that you can’t affect your organization’s live data. If a similar restriction exists on the GAE version of the APIs, it’s not mentioned in the documentation. I wonder if you can do very lightweight apps, using live data, with just a Developer Edition account with Salesforce, if you do it through GAE. If so, that would certainly open up the realm of developers who might try building something on the platform.

My colleague Eric Knipp has also blogged about the announcement. I’d encourage you to read his analysis.

Posted in Industry

Google App Engine and other tidbits

Posted by Lydia Leong on April 8, 2009

As anticipated, Java support on Google App Engine has been announced. To date, GAE has supported only the Python programming language. In keeping with the “phenomenal cosmic power, itty bitty living space” sandboxing that’s become common to cloud execution environments, GAE/Java has all the restrictions of GAE/Python. However, the already containerized nature of Java applications means that the restrictions probably won’t feel as significant to developers. Many Python libraries and frameworks are not “pure Python”; they include C extensions for speed. Java libraries and frameworks are, by contrast, usually pure Java; the biggest issues for porting Java into the GAE environment are likely to be the restrictions on system calls and the lack of threads. Generically, GAE/Java offers servlets. The other things that developers are likely to miss are support for JMS and JMX (Java’s messaging and monitoring, respectively).

Overall, the Java introduction is a definite plus for GAE, and is presumably also an important internal proof point for them — a demonstration that GAE can scale and work with other languages. Also, because there are lots of languages that now target the Java virtual machine (i.e., they’ve got compilers/interpreters that produce byte code for the Java VM) — Clojure and Scala, for instance — as well as ports of other languages, like JRuby, we’ll likely see additional languages available on GAE ahead of Google’s own support for those environments.

Google also followed through on an earlier announcement, adding support for scheduled tasks (“cron”). Basically, at a scheduled time, GAE cron will invoke a URL that you specify. This is useful, but probably not everything people were hoping it would be. It’s still subject to GAE’s normal restrictions; this doesn’t let you invoke a long-running background process. It requires a shift in thinking: for instance, instead of doing the once-daily data cleanup run at 4 am, you ought to be doing cleanup throughout the day, every couple of minutes, a bit of your data set at a time.
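To make that shift in thinking concrete, here is a minimal sketch of the "a bit of your data set at a time" pattern. It deliberately uses no App Engine APIs: the in-memory dictionary stands in for a datastore, and `BATCH_SIZE`, `cleanup_batch`, and `run_full_pass` are hypothetical names chosen for illustration. Each call to `cleanup_batch` represents one short, cron-triggered request that cleans a small slice and returns a cursor for the next invocation.

```python
from typing import Dict, Optional

BATCH_SIZE = 3  # hypothetical: small enough to finish within one short request


def cleanup_batch(records: Dict[int, dict], cursor: Optional[int]) -> Optional[int]:
    """Delete stale records from one slice of the key space.

    Returns the cursor to resume from, or None once a full pass is complete.
    """
    # Only look at keys beyond the point where the previous invocation stopped.
    remaining = sorted(k for k in records if cursor is None or k > cursor)
    batch = remaining[:BATCH_SIZE]
    for key in batch:
        if records[key].get("stale"):
            del records[key]
    # If this batch covered every remaining key, the pass is done.
    if len(batch) == len(remaining):
        return None
    return batch[-1]


def run_full_pass(records: Dict[int, dict]) -> int:
    """Simulate repeated cron invocations; return how many were needed."""
    cursor = None
    invocations = 0
    while True:
        cursor = cleanup_batch(records, cursor)
        invocations += 1
        if cursor is None:
            return invocations


if __name__ == "__main__":
    store = {i: {"stale": i % 2 == 0} for i in range(1, 9)}
    print(run_full_pass(store), sorted(store))
```

In a real deployment the cursor would have to be persisted between requests (for example, in the datastore itself), since each invocation is a separate request with no shared memory.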

All of that is going to be chewed over thoroughly by the press and blogosphere, and I’ve contributed my two cents to a soon-to-be-published Gartner take on the announcement and GAE itself, so now I’ll point out something that I don’t think has been widely noticed: the unladen-swallow project plan.

unladen-swallow is apparently an initiative within Google’s compiler optimization team, with a goal of achieving a 5x speed-up in CPython (i.e., the normal, mainstream, implementation of Python), starting from the 2.6 base (the current version, which is a transition point between the 2.5 used by App Engine, and the much-different Python 3.0). The developers intend to achieve this speed-up in part by moving from the existing custom VM to one built on top of LLVM. (I’ve mentioned Google’s interest in LLVM in the past.) I think this particular approach answers some of the mystery surrounding Google and Python 3.0 — this seems to indicate longer-term commitment to the existing 2.x base, while still being transition-friendly. As is typical with Google’s work with open-source code, they plan to release these changes back to the community.

All of which goes back to a point of mine earlier this week: Although programming language communities strongly resemble fandoms, languages are increasingly fungible. We’re a long way from platform maturity, too.

Posted in Infrastructure

Google App Engine updates

Posted by Lydia Leong on March 13, 2009

For those of you who haven’t been following Google’s updates to App Engine, I want to call your attention to a number of recent announcements. At the six-month point of the beta, I asked when App Engine would be enterprise-ready; now, as we come to almost the year mark, these announcements show the progress and roadmap to addressing many of the issues I mentioned in my previous post.

Paid usage. Google is now letting applications grow beyond the free limits. You set quotas for various resources, and pay for what you use. I still have concerns about the quota model, but being able to bill for these services is an important step for Google. Google intends to be price-competitive with Amazon, but there’s an important difference — there’s still some free service. Google anticipates that the free quotas are enough to serve about five million page views. 5 MPVs is a lot; it pretty much means that if you’re willing to write to the platform, you can easily host your hobby project on it for free. For that matter, many enterprises don’t get 5 MPVs worth of hits on an individual Web app or site each month — it’s just that the platform restrictions are a barrier to mainstream adoption.

Less aggressive limits and fewer restrictions. Google has removed or reduced some limits and restrictions that were significant frustrations for developers.

Promised new features. Google has announced that it’s going to provide APIs for some vital bits of functionality that it doesn’t currently allow, like the ability to run scheduled jobs and background processes.

Release of Python 3.0. While there’s no word on how Google plans to manage the 3.0 transition for App Engine, it’s interesting to see how many Python contributors have been absorbed into Google.

Speaking personally, I like App Engine. Python is my strongest scripting language skill, so I prefer to write in it whenever possible. I also like Django, though I appreciate that Google’s framework is easier to get started with than Django (it’s very easy to crank out basic stuff). Like a lot of people, I’ve had trouble adjusting to the non-relational database, but that’s mostly a matter of programming practice. It is, however, clear that the platform is still in its early stages. (I once spent several hours of a weekend tearing my hair out at something that didn’t work, only to eventually find that it was a known bug in the engine.) But Google continues to work at improving it, and it’s worth keeping an eye on to see what it will eventually become. Just don’t expect it to be enterprise-ready this year.

Posted in Infrastructure

Cloud failures

Posted by Lydia Leong on February 27, 2009

A few days ago, an unexpected side-effect of some new code caused a major Gmail outage. Last year, a small bug triggered a series of cascading failures that resulted in a major Amazon outage. These are not the first cloud failures, nor will they be the last.

Cloud failures are as complex as the underlying software that powers them. No longer do you have isolated systems; you have complex, interwoven ecosystems, delicately orchestrated by a swarm of software programs. In presenting simplicity to the user, the cloud provider takes on the burden of dealing with that complexity themselves.

People sometimes say that these clouds aren’t built to enterprise standards. In one sense, they aren’t — most aren’t intended to meet enterprise requirements in terms of feature-set. In another sense, though, they are engineered to far exceed anything that the enterprise would ever think of attempting themselves. Massive-scale clouds are designed to never, ever, fail in a user-visible way. The fact that they do fail nonetheless should not be a surprise, given the potential for human error encoded in software. It is, in fact, surprising that they don’t visibly fail more often. Every day, within these clouds, a whole host of small errors that would be outages if they occurred within the enterprise — server hardware failures, storage failures, network failures, even some software failures — are handled invisibly by the back-end. Most of the time, the self-healing works the way it’s supposed to. Sometimes it doesn’t. The irony in both the Gmail outage and the S3 outage is that both appear to have been caused by the very software components that were actively trying to create resiliency.

To run infrastructure on a massive scale, you are utterly dependent upon automation. Automation, in turn, depends on software, and no matter how intensively you QA your software, you will have bugs. It is extremely hard to test complex multi-factor failures. There is nothing that indicates that either Google or Amazon are careless about their software development processes or their safeguards against failure. They undoubtedly hate failure as much as, and possibly more than, their customers do. Every failure means sleepless nights, painful internal post-mortems, lost revenue, angry partners, and embarrassing press. I believe that these companies do, in fact, diligently seek to seamlessly handle every error condition they can, and that they generally possess sufficient quantity and quality of engineering talent to do it well.

But the nature of the cloud, the one homogeneous fabric, magnifies problems. Still, that’s not isolated to the cloud alone. Let’s not forget VMware’s license bug from last year. People who normally booted up their VMs at the beginning of the day were pretty much screwed. It took VMware the better part of a day to produce a patch, and their original announced timeframe was 36 hours. I’m not picking on VMware; certainly you could find yourself with a similar problem with any kind of widely deployed software that was vulnerable to a bug that caused it all to fail.

Enterprise-quality software produced the SQL Slammer worm, after all. In the cloud, we ain’t seen nothing yet…

Posted in Infrastructure

Google Federal

Posted by Lydia Leong on February 3, 2009

I heard a radio ad today for Google Federal. It sounded like every other “please, government IT purchasing person, buy our stuff” ad that you hear on news radio in Washington DC. It was a far cry from the sort of ad that one expects to hear from Google, and to hear a federal-targeted ad from them, period, was sort of fascinating.

The Federal government can still afford stuff (and is probably one of the few bright spots in purchasing this year, period). It’s the states that are screwed, and seriously thinking about alternatives to traditional IT.

Posted in Marketing

Google Apps and enterprises

Posted by Lydia Leong on January 26, 2009

My colleague Tom Austin has posted a call for Google to be more transparent about enterprise usage of Google Apps. This was triggered by a TechCrunch article on Google’s reduction of the number of free users for a given Google Apps account.

I’ve been wondering how many businesses use Google Apps almost exclusively for messaging, and how many of them make substantial use of collaboration. My expectation is that a substantial number of the folks with custom domains on Google Apps solely or almost-solely do email or email forwarding. For instance, for my WordPress.com-hosted blog, I have no option for email for that domain other than via Google Apps, because WordPress.com has explicit MX record support for them and nobody else — so I use that to forward email for that domain to my regular email account. Given how heavily bloggers have driven domain registrations and “vanity” domains, I’d expect Google Apps to be wrapped up pretty heavily in that phenomenon. This is not to discount the small business, of course, whose usage of this kind of service also becomes more significant over time.

Those statistics aside, though, and going back to Tom’s thoughts on transparency, I think he’s right, if Google intends to court the enterprise market in the way that the enterprise is accustomed to being courted. I am uncertain if Google intends that, though, especially when fighting more featureful, specialized vendors in order to get an enterprise clientele is likely a waste of resources at the moment. The type of enterprise who is going to adopt this kind of solution is probably not the kind of enterprise who wants to see a bunch of case studies and feel reassured by them; they’re independent early adopters with high tolerance for risk. (This goes back to a point I made in a previous post: Enterprise IT culture tends to be about risk mitigation.)

Posted in Applications

Google’s pricing for App Engine

Posted by Lydia Leong on December 19, 2008

Google made a number of App Engine-related announcements earlier this week. The most notable of these was a preview of the future paid service, which allows you to extend App Engine’s quotas. Google has previously hinted at pricing, and at their developer conference this past May, they asserted that effectively, the first 5 MPV (million page views) are free, and thereafter, it’d be about $40 per MPV.
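For a feel of what those numbers imply, here is a back-of-the-envelope sketch. It assumes the hinted figures (first 5 MPV free, roughly $40 per MPV after that) apply as a simple linear rate per billing period; the function name and the linear model are illustrative assumptions, not Google's actual price list, which meters individual resources rather than page views.

```python
FREE_MPV = 5.0        # million page views served at no charge (figure quoted above)
PRICE_PER_MPV = 40.0  # approximate dollars per additional million page views


def estimated_cost(page_views_millions: float) -> float:
    """Estimate the bill in dollars under the assumed linear model."""
    if page_views_millions < 0:
        raise ValueError("page views cannot be negative")
    billable = max(0.0, page_views_millions - FREE_MPV)
    return billable * PRICE_PER_MPV


if __name__ == "__main__":
    for mpv in (3, 5, 8, 20):
        print(f"{mpv} MPV -> ${estimated_cost(mpv):,.2f}")
```

Under these assumptions, a site doing 8 MPV would pay for 3 billable MPV, or about $120.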

The problem is not the price. It’s the way that the quotas are structured. Basically, it looks like Google is going to allow you to raise the quota caps, paying for however much you go over, but never to exceed the actual limit that you set. That means Google is committing itself to a quota model, not backing away from it.

Let me explain why quotas suck as a way to run your business.

Basically, the way App Engine’s quotas work is like this: As you begin to approach the limit (currently Google-set, but eventually set by you), Google will start denying some of your requests. If you’re reaching the limit of a metered API call, when your app tries to make that call, Google will raise an exception, which your app can catch and handle; inelegant, but at least something you can present to the user as a handled error. However, if you’re reaching a more fundamental limit, like bandwidth, Google will begin answering page requests with the 403 HTTP status code. 403 is an error that prevents your user from getting the page at all, and there’s no elegant way to handle it in App Engine (no custom error pages).
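The metered-API case looks roughly like the sketch below. Everything here is a simulation: `OverQuotaError` is a locally defined stand-in for whatever exception the platform raises, and `send_welcome_email` and `handle_signup` are hypothetical names. The point is only that, in this case, your code still gets to decide what the user sees.

```python
from typing import Tuple


class OverQuotaError(Exception):
    """Hypothetical stand-in for the exception raised by an over-quota API call."""


def send_welcome_email(address: str, quota_left: int) -> str:
    """Pretend metered API call: fails once the simulated quota is exhausted."""
    if quota_left <= 0:
        raise OverQuotaError("email quota exhausted")
    return f"sent to {address}"


def handle_signup(address: str, quota_left: int) -> Tuple[int, str]:
    """Return an (HTTP status, body) pair, degrading gracefully when over quota."""
    try:
        send_welcome_email(address, quota_left)
    except OverQuotaError:
        # The app still controls the response, so the user sees a handled error.
        return 503, "We could not send your confirmation email; please try again later."
    return 200, "Welcome! Check your inbox."


if __name__ == "__main__":
    print(handle_signup("user@example.com", quota_left=1))
    print(handle_signup("user@example.com", quota_left=0))
```

There is no equivalent of that `except` block for the bandwidth case: the 403 is issued before your application code runs at all.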

As you approach quota, Google tries to budget your requests so that only some of them fail. If you get a traffic spike, it’ll drop some of those requests so that it still has quota left to serve traffic later. (Steve Jones’ SOA blog chronicles quite a bit of empirical testing, for those who want to see what this “throttling” looks like in practice.)

The problem is, now you’ve got what are essentially random failures of your application. If you’ve got failing API calls, you’ve got to handle the error and your users will probably try again — exacerbating your quota problem and creating an application headache. (For instance, what if I have to make two database API calls to commit data from an operation, and the first succeeds but the second fails? Now I have data inconsistency, and thanks to API calls continuing to fail, quite possibly no way to fix it. Google’s Datastore transactions are restricted to operations on the same entity group, so transactions will not deal with all such problems.) Worse still, if you’ve got 403 errors, your site is functionally down, and your users are getting a mysterious error. As someone who has a business online, do you really want, under circumstances of heavy traffic, your site essentially failing randomly?
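The two-call scenario can be sketched as follows. This is a simulation under stated assumptions, not App Engine code: `QuotaLimitedStore` is a hypothetical in-memory store whose metered calls succeed or fail according to a scripted list of outcomes, and `record_order` is an illustrative operation that needs two writes to land together. Note that the compensating delete is itself a metered call, so it can fail for the same reason the second write did.

```python
from typing import Dict, Iterable


class OverQuotaError(Exception):
    """Hypothetical stand-in for an over-quota failure on a metered call."""


class QuotaLimitedStore:
    """In-memory store whose calls succeed or fail per a scripted sequence."""

    def __init__(self, outcomes: Iterable[bool]) -> None:
        self._outcomes = iter(outcomes)
        self.data: Dict[str, str] = {}

    def _meter(self) -> None:
        # Once the script runs out, treat every further call as over quota.
        if not next(self._outcomes, False):
            raise OverQuotaError("quota exhausted")

    def put(self, key: str, value: str) -> None:
        self._meter()
        self.data[key] = value

    def delete(self, key: str) -> None:
        self._meter()
        self.data.pop(key, None)


def record_order(store: QuotaLimitedStore, order_id: int, item: str) -> str:
    """Write an order and reserve stock; report what state we ended up in."""
    try:
        store.put(f"order:{order_id}", item)
    except OverQuotaError:
        return "nothing written"
    try:
        store.put(f"stock:{item}", "reserved")
    except OverQuotaError:
        try:
            store.delete(f"order:{order_id}")  # compensating call, also metered
        except OverQuotaError:
            return "inconsistent"
        return "rolled back"
    return "committed"


if __name__ == "__main__":
    for script in ([True, True], [True, False, True], [True, False, False]):
        store = QuotaLimitedStore(script)
        print(script, record_order(store, 1, "widget"), store.data)
```

The third script is the troubling one: the order exists, the stock reservation does not, and the cleanup attempt was denied as well.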

Well, one might counter, if you don’t want that to happen, just set your quota limits really really high — high enough that you never expect a request to fail. The problem with that, though, is that if you do it, you have no way to predict what your costs actually will be, or to throttle high traffic in a more reasonable way.

If you’re on traditional computing infrastructure, or, say, a cloud like Amazon EC2, you decide how many servers to provision. Chances are that under heavy traffic, your site performance would degrade — but you would not get random failures. And you would certainly not get random failures outside of the window of heavy traffic. The quota system under use by Google means that you could get past the spike, have enough quota left to serve traffic for most of the rest of the day, but still cross the over-quota-random-drop threshold later in the day. You’d have to go micro-manage (temporarily adjusting your allowable quota after a traffic spike, say) or just accept a chance of failure. Either way, it is a terrible way to operate.

This is yet another example of how Google App Engine is not and will not be near-term ready for prime-time, and how more broadly, Google is continuing to fail to understand the basic operational needs of people who run their businesses online. It’s not just risk-averse enterprises who can’t use something with this kind of problem. It’s the start-ups, too. Amazon has set a very high bar for reliability and understanding of what you need to run a business online, and Google is devoting lots of misdirected technical acumen to implementing something that doesn’t hit the mark.

Posted in Infrastructure

Google builds a CDN for its own content

Posted by Lydia Leong on December 15, 2008

An article in the Wall Street Journal today describes Google’s OpenEdge initiative (along with a lot of spin around net neutrality, resulting in a Google reply on its public policy blog).

Basically, Google is trying to convince broadband providers to let it place caches within their networks — effectively, pursuing the same architecture that a deep-footprint CDN like Akamai uses, but for Google content alone.

Much of the commentary around this seems to center on the idea that if Google can use this to obtain better performance for its content and applications, everyone else is at a disadvantage and it’s a general stab to net neutrality. (Even Om Malik, who is not usually given to mindless panic, asserts, “If Google can buy better performance for its service, your web app might be at a disadvantage. If the cost of doing business means paying baksheesh to the carriers, then it is the end of innovation as we know it.”)

I think this is an awful lot of hyperbole. Today, anyone can buy better performance for their Web content and applications by paying money to a CDN. And in turn, the CDNs pay baksheesh, if you want to call it that, to the carriers. Google is simply cutting out the middleman, and given that it accounts for more traffic on the Internet than most CDNs, it’s neither illogical nor commercially unreasonable.

Other large content providers — Microsoft and AOL notably on a historical basis — have built internal CDNs in the past; Google is just unusual in that it’s attempting to push those caches deeper into the network on a widespread basis. I’d guess that it’s YouTube, more than anything else, that’s pushing Google to make this move.

This move is likely driven at least in part by the fact that most of the broadband providers simply don’t have enough 10 Gbps ports for traffic exchange (and space and power constraints in big peering points like Equinix’s aren’t helping matters, making it artificially hard for providers to get the expansions necessary to put big new routers into those facilities). Video growth has sucked up a ton of capacity. Google, and YouTube in particular, is a gigantic part of video traffic. If Google is offering to alleviate some of that logjam by putting its servers deeper into a broadband provider’s network, that might be hugely attractive from a pure traffic engineering standpoint. And providers likely trust Google to have enough remote management and engineering expertise to ensure that those cache boxes are well-behaved and not annoying to host. (Akamai has socialized this concept well over much of the last decade, so this is not new to the providers.)

I suspect that Google wouldn’t even need to pay to do this. For the broadband providers, the traffic engineering advantages, and the better performance to end-users, might be enough. In fact, this is the same logic that explains why Akamai doesn’t pay for most of its deep-network caches. It’s not that this is unprecedented. It’s just that this is the first time that an individual content provider has reached the kind of scale where they can make the same argument as a large CDN.

The cold truth is that small companies generally do not enjoy the same advantages as large companies. If you are a small company making widgets, chances are that a large company making widgets has a lower materials cost than you do, because they are getting a discount for buying in bulk. If you are a small company doing anything whatsoever, you aren’t going to see the kind of supplier discounts that a large company gets. The same thing is true for bandwidth, and for that matter, for CDN services. And big companies often leverage their scale into greater efficiency, to boot; for instance, unsurprisingly, Gartner’s metrics data shows that the average cost of running servers drops as you get more servers in your data center. Google employs both scale and efficiency leverage.

One of the key advantages of the emerging cloud infrastructure services, for start-ups, is that such services offer the leverage of scale, on a pay-by-the-drink basis. With cloud, small providers can essentially get the advantage of big providers by banding together into consortiums or paying an aggregator. However, on the deep-network CDN front, this probably won’t help. Highly distributed models work very well for extremely popular content. For long-tail content, cache hit ratios can be too low for it to be really worthwhile. That’s why it’s doubtful that you’ll see, say, Amazon’s CloudFront CDN push deep rather than continuing to follow a megaPOP model.
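A rough way to see the cache hit ratio point is the sketch below. It assumes request popularity follows a Zipf-like distribution (a common modeling assumption, not something taken from any provider's data) and that a small deep-network cache ideally holds only the most popular objects. The function names, catalog size, cache size, and exponents are all illustrative.

```python
from typing import List


def zipf_weights(n_objects: int, exponent: float) -> List[float]:
    """Unnormalized request weights: the object at rank r gets 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_objects + 1)]


def ideal_hit_ratio(n_objects: int, exponent: float, cache_slots: int) -> float:
    """Share of requests served by a cache holding the most popular objects."""
    weights = zipf_weights(n_objects, exponent)
    return sum(weights[:cache_slots]) / sum(weights)


if __name__ == "__main__":
    catalog, slots = 100_000, 1_000  # hypothetical catalog and deep-cache sizes
    # Higher exponent: traffic concentrated on a few hits. Lower: long tail.
    for label, exponent in (("head-heavy", 1.2), ("long-tail", 0.6)):
        ratio = ideal_hit_ratio(catalog, exponent, slots)
        print(f"{label}: {ratio:.0%} of requests served from cache")
```

With the same small cache, the head-heavy catalog is served mostly from cache while the long-tail catalog mostly misses, which is the economics behind keeping long-tail content in a few large POPs.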

Ironically, because caching techniques aren’t as efficient for small content providers, it might actually be useful to them to be able to buy bandwidth at a higher QoS.

Posted in Industry