Blog Archives

Google Compute Engine and live migration

Google Compute Engine (GCE) has been a potential cloud-emperor contender in the shadows, and although GCE is still in beta, it’s been widely speculated that Google will likely be the third vendor in the trifecta of big cloud IaaS market-share leaders, along with Amazon Web Services (AWS) and Microsoft Windows Azure.

Few would doubt Google’s technology prowess, if it decides to commit itself to a business, though. A critical question has remained, though: Will Google be able to deliver technology capabilities that can be used by mere mortals in the enterprise, and market, sell, contract for, and deliver service in a way that such businesses can use? (Its ability to serve ephemeral large-scale compute workloads, and perhaps meet the needs of start-ups, is not in doubt.)

One of the most heartburn-inducing aspects of GCE has been its scheduled maintenance, To quote Google: “For scheduled zone maintenance windows, Google takes an entire zone offline for roughly two weeks to perform various, disruptive maintenance tasks.” Basically, Google has said, “Your data center will be going away for up to two weeks. Deal with it. You should be running in multiple zones anyway.”

Even most cloud-native start-ups aren’t capable of easily executing this way. Remember that most applications are architected to have their data locally, in the same zone as the compute. Without using Google’s PaaS capabilities (like Datastore), this means that the customer needs to move and/or replicate storage into another zone, which also increases their costs. Many applications aren’t large enough to warrant the complexity of a multi-zone implementation, either — not only business applications, but also smaller start-ups, mobile back-end implementations, and so forth.

So inherently, a hard-line stance on taking zones offline for maintenance, limited GCE’s market opportunity. Despite positioning this as a hard-line stance previously, Google has clearly changed its mind, introducing “transparent maintenance”. This is accomplished with a combination of live migration technology, and some innovations related to their implementation of physical data center maintenance. It’s an interesting indication of Google listening to prospects and customers and flexing to do something that has not been the Google Way.

Not only will Google’s addition of migration help data center maintenance, but more importantly, it will mitigate downtime related to host maintenance. Although AWS, for instance, tries to minimize host maintenance in order to avoid instance downtime or reboots, host maintenance is necessary — and it’s highly useful to have a technology that allows you to host maintenance without downtime for the instances, because this encourages you not to delay host maintenance (since you want to update the underlying host OS, hypervisor, etc.).

VMware-based providers almost always do live migration for host maintenance, since it’s one of the core compelling features of VMware. But AWS, and many competitors that model themselves after AWS, don’t. I hope that Google’s decision to add live migration into GCE pushes the rest of the market — and specifically AWS, which today generally sets the bar for customer expectations — into doing the same, because it’s a highly useful infrastructure resilience feature, and it’s important to customers.

More broadly, though, AWS hasn’t really had innovation competitors to date. Microsoft Azure is a real competitor, but other than in PaaS, they’ve largely been playing catch-up. Thanks to its extensive portfolio of internal technologies, Google has the potential ability to inject truly new capabilities into the market. Similar to what customers have seen with AWS — when AWS has been successful at introducing capabilities that many customers weren’t really even aware that they wanted — I expect Google is going to launch truly innovative capabilities that will turn into customer demands. It’s not that AWS is going to simply mount a competitive response — it will become a situation where customers ask for these capabilities, pushing AWS to respond. That should be excellent for the market.

It’s worth noting that the value of Google is not just GCE — it is Google Cloud Platform as a whole, including the PaaS elements. This is similarly true with Microsoft Azure. And although AWS seems to broadly bucketed as IaaS, in reality their capabilities overlap into the PaaS space. These vendors understand that the goal is the ability to develop and deliver business capaiblities more quickly — not to provide cheap infrastructure.

Capabilities equate lock-in, by the way, but historically, businesses have embraced lock-in whenever it results in more value delivered.

Performance can be a disruptive competitive advantage

All of us are used to going to travel sites, especially for airline tickets, and waiting a while for the appropriate results to be extracted and displayed to us. I recently saw Google Flight Search for the first time and was astonished by its raw speed — essentially completely instant.

I frequently talk to customers about acceleration solutions, and discuss the business value of performance. Specifically, this is a look at business metrics that measure the success of a website or application — time spent on your site, conversion rate, shopping basket value, page views, ad views, transactions processed, employee productivity, decline in call center volume, and so forth. You compare the money associated with these metrics, against the cost of the solutions, to look at comparative ROI.

The business value of performance is usually tied to industry in a narrow and specific way, because users have a particular set of expectations and needs. For instance, for travel sites, a certain amount of performance is necessary in order to make the site usable, but the long waits for searches are things that users are conditioned to, making their overall performance expectations relatively low. Travel sites usually discover that generalized site responsiveness improve the user experience and cause revenue per site visit to increase — but only up to a certain point, at which point in time it plateaus, as the site has enough responsiveness that users aren’t discouraged from using it, and they’re going to buy what they came to buy.

Google Flight Search proves that you can “break through” the performance ceiling to actually entirely change the user experience, though. This is not the kind of incremental improvement you can achieve through acceleration techniques, though; instead, it’s a core change that affects the thing that is slowest, which is generally the back-end database and business logic, not the network. This can actually be a disruptive competitive advantage.

I typically ask my CDN clients, “What are the factors that make your site slow?” In many cases, they need to do something that goes beyond what edge caching or even network optimization (dynamic acceleration) can achieve. They need to reduce their page weight, or write better pages (and may benefit from front-end optimization techniques), or to improve the back-end responsiveness. Acceleration techniques are often used to band-aid a core problem with performance, just like CDN professional services to make a site cacheable are often used to band-aid a core problem with site structure. At some point in time it becomes more cost-effective to fix the core problem.

Too few businesses design their websites and applications with speed in mind.

Would you like to run Apache Wave, Grandma?

As many people already know, Google is sunsetting Google Wave. This has led to Google sending an email to people who previously signed up for Wave. The bit in the email that caught my eye was this:

If you would like to continue using Wave, there are a number of open source projects, including Apache Wave. There is also an open source project called Walkaround that includes an experimental feature that lets you import all your Waves from Google.

For an email sent to Joe Random Consumer, it’s remarkably clueless as to what consumers actually can comprehend. Grandma is highly unlikely to understand what the heck that means.

Tip for everyone offering a product or service to consumers: Any communication with consumers about that product or service should be in language that Grandma can understand.

Google’s mod_pagespeed and Cotendo

Those of you who are Gartner clients know that in the last year, my colleague Joe Skorupa and I have become excited about the emergence of software-based application acceleration via page optimization approaches, as exemplified by vendors like Aptimize and Strangeloop Networks. (Clients: See Cool Vendors in Enterprise Networking, 2010.) This approach to acceleration enhances the performance of Web-based content and applications, by automatically optimizing the page output of webservers according to the best practices described in books like High Performance Web Sites by Steve Souders. Techniques of this sort include automatically combining JavaScript files (which reduces overall download time), optimizing the order of the scripts, and rewriting HTML so that the browser can display the page more quickly.

Page optimization techniques can often provide significant acceleration boosts (2x or more) even when other acceleration techniques are in use, such as a hardware ADC with acceleration module (offered as add-ons by F5 and Citrix NetScaler, for instance), or a CDN (including CDN-based dynamic acceleration). Since early this year, we’ve been warning our CDN clients that this is a critical technology development to watch and to consider adopting in their networks. It’s a highly sensible thing to deploy on a CDN, for customers doing whole site delivery; the CDN offloads the computational expense of doing the optimization (which can be significant), and then caches the result (and potentially distributing the optimized pages to other nodes on the CDN). That gets you excellent, seamless acceleration for essentially no effort on the part of the customer.

Google’s Page Speed project provides free and open-source tools designed to help site owners implement these best practices. Google has recently released an open-source module, called mod_pagespeed, for the popular Apache webserver. This essentially creates an open-source competitor to commercial vendors like Aptimize and Strangeloop. Add the module into your Apache installation, and it will automatically try to optimize your pages for you.

Now, here’s where it gets interesting for CDN watchers: Cotendo has partnered wih Google. Cotendo is deploying the Google code (modified, obviously, to run on Cotendo’s proxy caches, which are not Apache-based), in order to be able to offer the page optimization service to its customers.

I know some of you will automatically be asking now, “What does this mean for Akamai?” The answer to that is, “Losing speed trials when it’s Akamai DSA vs. Cotendo DSA + Page Speed Automatic, until they can launch a competing service.” Akamai’s acceleration service hasn’t changed much since the Netli acquisition in 2007, and the evolution in technology here has to be addressed. Page optimization plus TCP optimization is generally much faster than TCP optimization alone. That doesn’t just have pricing implications; it has implications for the competitive dynamics of the space, too.

I fully expect that page optimization will become part of the standard dynamic acceleration service offerings of multiple CDNs next year. This is the new wave of innovation. Despite the well-documented nature of these best practices, organizations still frequently ignore them when coding — and even commercial packages like Sharepoint ignore them (Sharepoint gets a major performance boost when page optimization techniques are applied, and there are solutions like Certeon that are specific to it). So there’s a very broad swathe of customers that can benefit easily from these techniques, especially since they provide nice speed boosts even in environments where the network latency is pretty decent, like delivery within the United States.

Bookmark and Share

Gmail, Macquarie, and US regulation

Google continues to successfully push Gmail into higher education, in an Australian deal with Macquarie University. (Microsoft is its primary competitor in this market, but for Microsoft, most such Live@edu represent cannibalization of their higher ed Exchange base.)

That, by itself, isn’t a particularly interesting announcement. Email SaaS is a huge trend, and the low-cost .edu offerings have been gaining particular momentum. What caught my eye was this:

The university was hesitant to move staff members on to Gmail due to regulatory and cost factors. They were concerned that their email messages would be subject to draconian US law. In particular, they were worried about protecting their intellectual property under the Patriot Act and Digital Millennium Copyright Act, Mr. Bailey said. “In the end, Google agreed to store that data under EU jurisdiction, which we accepted,” he said.

That tells us that Google can divide their data storage into zones if need be, as one would expect, but it also tells us that they can do so for particular customers (presumably, given Google’s approach to the world, as a configurable, automated thing, and not as a one-off).

However, the remark about the Patriot Act and DMCA is what really caught my attention. DMCA is a worry for universities (due to the high likelihood of pirated media), but USA PATRIOT is a significant worry for a lot of the non-US clients that I talk to about cloud computing, especially those in Europe — to the point where I speak with clients who won’t use US-based vendors, even if the infrastructure itself is in Europe. (Australian clients are more likely to end up with a vendor that has somewhat local infrastructure to begin with, due to the latency issues.)

Cross-border issues are a serious barrier to cloud adoption in Europe in general, often due to regulatory requirements to keep data within-country (or sometimes less stringently, within the EU). That will make it more difficult for European cloud computing vendors to gain significant operational scale. (Whether this will also be the case in Asia remains to be seen.)

But if you’re in the US, it’s worth thinking about how the Patriot Act is perceived outside the US, and how it and any similar measures will limit the desire to use US-based cloud vendors. A lot of US-based folks tell me that they don’t understand why anyone would worry about it, but the “you should just trust that the US government won’t abuse it” story plays considerably less well elsewhere in the world.

Bookmark and Share

A hodgepodge of links

This is just a round-up of links that I’ve recently found to be interesting.

Barroso and Holzle (Google): Warehouse-Scale Computing. This is a formal lecture-paper covering the design of what these folks from Google refer to as WSCs. They write, “WSCs differ significantly from traditional data centers: they belong to a single organization, use a relatively homogenous hardware and system software platform, and share a common systems management layer. Often, much of the application, middleware, and system software is built in-house compared to the predominance of third-party software running in conventional data centers. Most importantly, WSCs run a smaller number of very large applications (or Internet services), and the common resource management infrastructure allows significant deployment flexibility.” The paper is wide-ranging but written to be readily understandable by the mildly technical layman. Highly recommended for anyone interested in cloud.

Washington Post: Metrorail Crash May Exemplify Automation Paradox. The WaPo looks back at serious failures of automated systems, and quotes a “growing consensus among experts that automated systems should be designed to enhance the accuracy and performance of human operators rather than to supplant them or make them complacent. By definition, accidents happen when unusual events come together. No matter how clever the designers of automated systems might be, they simply cannot account for every possible scenario, which is why it is so dangerous to eliminate ‘human interference’.” Definitely something to chew over in the cloud context.

Malcolm Gladwell: Priced to Sell. The author of The Tipping Point takes on Chris Anderon’s Free, and challenges the notion that information wants to be free. In turn, Seth Godin thinks Gladwell is wrong, and the book seems to be setting off some healthy debate.

Bruce Robertson: Capacity Planning Equals Budget Planning. My colleague Bruce riffs off a recent blog post of mine, and discusses how enterprise architects need to change the way they design solutions.

Martin English: Install SAP on Amazon Web Services. An interesting blog devoted to how to get SAP running on AWS. This is for people interested in hands-on instructions.

Robin Burkinshaw: Being homeless in the Sims 3. This blog tells the story, in words and images, of “Alice and Kev”, a pair of characters that the author (a game design student) created in the Sims 3. It’s a fascinating bit of user-generated content, and a very interesting take on what can be done with modern sandbox-style games.

Bookmark and Share

Google and Salesforce.com

While I’ve been out of the office, Google has made some significant announcements. My colleague Ray Valdes has been writing about Google Wave and its secret sauce. I highly encourage you to go read his blog.

Google and Salesforce.com continue to build on their partnership. In April, they unveiled Salesforce for Google Apps. Now, they’re introducing Force.com for Google App Engine.

The announcement, in a nutshell, is this: There are now public Salesforce APIs that can be downloaded, and will work on Google App Engine (GAE). Those APIs are a subset of the functionality available in Force.com’s regular Web Services APIs. Check out the User Guide for details.

Note that this is not a replacement for Force.com and its (proprietary) Apex programming language. Salesforce clearly articulates web services vs. Force.com in its developer guide. Rather, this should be thought of as easing the curve for developers who want to extend their Web applications for use with Salesforce data.

A question that lingers in my mind: Normally, on Force.com, a Developer Edition account means that you can’t affect your organization’s live data. If a similar restriction exists on the GAE version of the APIs, it’s not mentioned in the documentation. I wonder if you can do very lightweight apps, using live data, with just a Developer Edition account with Salesforce, if you do it through GAE. If so, that would certainly open up the realm of developers who might try building something on the platform.

My colleague Eric Knipp has also blogged about the announcement. I’d encourage you to read his analysis.

Bookmark and Share

Google App Engine and other tidbits

As anticipated, Java support on Google App Engine has been announced. To date, GAE has supported only the Python programming language. In keeping with the “phenomenal cosmic power, itty bitty living space” sandboxing that’s become common to cloud execution environments, GAE/Java has all the restrictions of GAE/Python. However, the already containerized nature of Java applications means that the restrictions probably won’t feel as significant to developers. Many Python libraries and frameworks are not “pure Python”; they include C extensions for speed. Java libraries and frameworks are, by contrast, usually pure Java; the biggest issues for porting Java into the GAE environment are likely to be the restrictions on system calls and the lack of threads. Generically, GAE/Java offers servlets. The other things that developers are likely to miss are support for JMS and JMX (Java’s messaging and monitoring, respectively).

Overall, the Java introduction is a definite plus for GAE, and is presumably also an important internal proof point for them — a demonstration that GAE can scale and work with other languages. Also, because there are lots of languages that now target the Java virtual machine (i.e., they’ve got compilers/interpreters that produce byte code for the Java VM) — Clojure and Scala, for instance — as well as ports of other languages, like JRuby, we’ll likely see additional languages available on GAE ahead of Google’s own support for those environments.

Google also followed through on an earlier announcement, adding support for scheduld tasks (“cron”). Basically, at a scheduled time, GAE cron will invoke a URL that you specify. This is useful, but probably not everything people were hoping it would be. It’s still subject to GAE’s normal restrictions; this doesn’t let you invoke a long-running background process. It requires a shift in thinking — for instance, instead of doing the once-daily data cleanup run at 4 am, you ought to be doing cleanup throughout the day, every couple of minutes, a bit of your data set at a time.

All of that is going to be chewed over thoroughly by the press and blogosphere, and I’ve contributed my two cents to a soon-to-be-published Gartner take on the announcement and GAE itself, so now I’ll point out something that I don’t think has been widely noticed: the unladen-swallow project plan.

unladen-swallow is apparently an initiative within Google’s compiler optimization team, with a goal of achieving a 5x speed-up in CPython (i.e., the normal, mainstream, implementation of Python), starting from the 2.6 base (the current version, which is a transition point between the 2.5 used by App Engine, and the much-different Python 3.0). The developers intend to achieve this speed-up in part by moving from the existing custom VM to one built on top of LLVM. (I’ve mentioned Google’s interest in LLVM in the past.) I think this particular approach answers some of the mystery surrounding Google and Python 3.0 — this seems to indicate longer-term commitment to the existing 2.x base, while still being transition-friendly. As is typical with Google’s work with open-source code, they plan to release these changes back to the community.

All of which goes back to a point of mine earlier this week: Although programming language communities strongly resemble fandoms, languages are increasingly fungible. We’re a long way from platform maturity, too.

Bookmark and Share

Google App Engine updates

For those of you who haven’t been following Google’s updates to App Engine, I want to call your attention to a number of recent announcements. At the six-month point of the beta, I asked when App Engine would be enterprise-ready; now, as we come to almost the year mark, these announcements show the progress and roadmap to addressing many of the issues I mentioned in my previous post.

Paid usage. Google is now letting applications grow beyond the free limits. You set quotas for various resources, and pay for what you use. I still have concerns about the quota model, but being able to bill for these services is an important step for Google. Google intends to be price-competitive with Amazon, but there’s an important difference — there’s still some free service. Google anticipates that the free quotas are enough to serve about five million page views. 5 MPVs is a lot; it pretty much means that if you’re willing to write to the platform, you can easily host your hobby project on it for free. For that matter, many enterprises don’t get 5 MPVs worth of hits on an individual Web app or site each month — it’s just that the platform restrictions are a barrier to mainstream adoption.

Less aggressive limits and fewer restrictions. Google has removed or reduced some limits and restrictions that were significant frustrations for developers.

Promised new features. Google has announced that it’s going to provide APIs for some vital bits of functionality that it doesn’t currently allow, like the ability to run scheduled jobs and background processes.

Release of Python 3.0. While there’s no word on how Google plans to manage the 3.0 transition for App Engine, it’s interesting to see how many Python contributors have been absorbed into Google.

Speaking personally, I like App Engine. Python is my strongest scripting language skill, so I prefer to write in it whenever possible. I also like Django, though I appreciate that Google’s framework is easier to get started with than Django (it’s very easy to crank out basic stuff). Like a lot of people, I’ve had trouble adjusting to the non-relational database, but that’s mostly a matter of programming practice. It is, however, clear that the platform is still in its early stages. (I once spent several hours of a weekend tearing my hair out at something that didn’t work, only to eventually find that it was a known bug in the engine.) But Google continues to work at improving it, and it’s worth keeping an eye on to see what it will eventually become. Just don’t expect it to be enterprise-ready this year.

Bookmark and Share

Cloud failures

A few days ago, an unexpected side-effect of some new code caused a major Gmail outage. Last year, a small bug triggered a series of cascading failures that resulted in a major Amazon outage. These are not the first cloud failures, nor will they be the last.

Cloud failures are as complex as the underlying software that powers them. No longer do you have isolated systems; you have complex, interwoven ecosystems, delicately orchestrated by a swarm of software programs. In presenting simplicity to the user, the cloud provider takes on the burden of dealing with that complexity themselves.

People sometimes say that these clouds aren’t built to enterprise standards. In one sense, they aren’t — most aren’t intended to meet enterprise requirements in terms of feature-set. In another sense, though, they are engineered to far exceed anything that the enterprise would ever think of attempting themselves. Massive-scale clouds are designed to never, ever, fail in a user-visible way. The fact that they do fail nonetheless should not be a surprise, given the potential for human error encoded in software. It is, in fact, surprising that they don’t visibly fail more often. Every day, within these clouds, a whole host of small errors that would be outages if they occurred within the enterprise — server hardware failures, storage failures, network failures, even some software failures — are handled invisibly by the back-end. Most of the time, the self-healing works the way it’s supposed to. Sometimes it doesn’t. The irony in both the Gmail outage and the S3 outage is that both appear to have been caused by the very software components that were actively trying to create resiliency.

To run infrastructure on a massive scale, you are utterly dependent upon automation. Automation, in turn, depends on software, and no matter how intensively you QA your software, you will have bugs. It is extremely hard to test complex multi-factor failures. There is nothing that indicates that either Google or Amazon are careless about their software development processes or their safeguards against failure. They undoubtedly hate failure as much as, and possibly more than, their customers do. Every failure means sleepless nights, painful internal post-mortems, lost revenue, angry partners, and embarrassing press. I believe that these companies do, in fact, diligently seek to seamlessly handle every error condition they can, and that they generally possess sufficient quantity and quality of engineering talent to do it well.

But the nature of the cloud — the one homogenous fabric — magnifies problems. Still, that’s not isolated to the cloud alone. Let’s not forget VMware’s license bug from last year. People who normally booted up their VMs at the beginning of the day were pretty much screwed. It took VMware the better part of a day to produce a patch — and their original announced timeframe was 36 hours. I’m not picking on VMware — certainly you could find yourself with a similar problem with any kind of widely deployed software that was vulnerable to a bug that caused it all to fail.

Enterprise-quality software produced the SQL Slammer worm, after all. In the cloud, we ain’t seen nothing yet…

Bookmark and Share

Follow

Get every new post delivered to your Inbox.

Join 70 other followers

%d bloggers like this: