Infrastructure resilience, fast VM restart, and Google Compute Engine
If you read Gartner research, you’ve probably noticed that we’ve started referring to something called “fast VM restart”. We consider it to be a critical infrastructure resiliency feature for many business application workloads.
Many applications are small. Really, really small. They take a fraction of a CPU core, and less than 1 GB of RAM. And they’ll never get any bigger. (They drive the big wins in server consolidation.) Most applications in mainstream businesses are like that. I often refer to these as “paperwork apps” — somebody fills out a form, that form is routed and processed, and eventually someone runs a report. Businesses have a zillion of these and continue to write more. When an organization says they have hundreds, or thousands, of apps, most of them are paperwork apps. They can be built by not especially bright or skilled programmers, and for resilience, they rely on the underlying infrastructure platform.
A couple of things can happen to these kinds of paperwork apps in the future:
- They can be left on-premise to run as-is within an enterprise virtualization environment (that maybe eventually becomes private cloud-ish), relying on its infrastructure resilience.
- They can be migrated into a cloud IaaS environment, relying on it for infrastructure resilience.
- They can be migrated onto a PaaS, either on-premise or from a service provider, relying on it for resilience.
- They can be moved to business process management (BPM) platforms, either via on-premise deployment of a BPM suite, or a BPM PaaS, thereby making resilience the problem of the BPM software.
Note the thing that’s not on that list: Re-architecting the application for application-level resilience. That requires developers skilled enough to do it, and it requires running the application in a distributed fashion, which isn’t economical given how few resources these apps consume.
Of the various scenarios above, the lift-and-shift onto cloud IaaS is a very likely one for many applications. But businesses want to be comfortable that the availability will be comparable to their existing on-premise infrastructure.
So what does infrastructure resilience mean? In short, it means minimal downtime due to either planned maintenance or a failure of the underlying hardware. Live migration is the most common modern technique used to mitigate downtime for planned maintenance that impacts the physical host. Fast VM restart is the most common technique used to mitigate downtime due to hardware failure.
Fast VM restart is built into nearly all modern hypervisors. It’s not magical — a shocking number of VMware customers believe that VM HA means they’ll instantly get a workload onto a happy healthy host from a failed host (i.e., they confuse live migration with VM HA). Fast VM restart is basically a technique to rapidly detect that a physical host has failed, and restart the VMs that were on that host, on some other host. It doesn’t necessarily need to be implemented at the virtualization level — you can just have monitoring that runs at a very short polling interval and that orchestrates a move-and-restart of VMs when it sees a host failure, for instance. (Of course, you need a storage and network architecture that makes this viable, too.)
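To make the monitoring-based approach concrete, here is a minimal sketch of the detect-and-restart loop in Python. It is a toy simulation, not any hypervisor's or provider's actual implementation: the `Host` class, the `poll_once` function, and the three-missed-polls threshold are all hypothetical, and the `healthy` flag stands in for the result of a real health probe.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Host:
    """A physical host as seen by the monitor (hypothetical model)."""
    name: str
    healthy: bool = True          # stand-in for the result of a health probe
    missed_polls: int = 0         # consecutive failed probes
    vms: List[str] = field(default_factory=list)


def poll_once(hosts: List[Host], failure_threshold: int = 3) -> List[Tuple[str, str, str]]:
    """Run one polling cycle.

    Returns a list of (vm, failed_host, new_host) for every VM restarted
    during this cycle.
    """
    restarts = []
    for host in hosts:
        if host.healthy:
            host.missed_polls = 0
            continue
        host.missed_polls += 1
        # Only declare the host dead after several consecutive misses.
        if host.missed_polls < failure_threshold or not host.vms:
            continue
        survivors = [h for h in hosts if h.healthy]
        if not survivors:
            continue  # nowhere to restart; leave the VMs assigned
        for vm in list(host.vms):
            # Place each VM on the least-loaded healthy host.
            target = min(survivors, key=lambda h: len(h.vms))
            host.vms.remove(vm)
            target.vms.append(vm)
            restarts.append((vm, host.name, target.name))
    return restarts
```

The threshold is the main design choice here: a lower value detects failures faster but risks restarting VMs because of a transient network blip. The "restart" is only a reassignment in this sketch; a real restart depends on the VM's disk being reachable from the new host, which is why the storage and network architecture matters.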
Clearly, not all applications are happy when they get what is basically an unexpected reboot, but this is the level of infrastructure resiliency that works just fine for non-distributed applications. When customers babble about the reliability of their on-premise VMware-based infrastructure, this is pretty much what they mean. They think it has value. They’re willing to pay more for it. There’s no real reason why it shouldn’t be implemented by every cloud IaaS provider that intends to take general business applications, not just the VMware-based providers.
By the way: Lost in the news about live migration in Google Compute Engine has been an interesting new subtlety. I missed noticing this in my first read-through of the announcement, since it was phrased purely in the context of maintenance, and only a re-read while finishing up this blog post led me to wonder about general-case restart. And I haven’t seen this mentioned elsewhere:
The new GCE update also adds fast VM restart, which Google calls Automatic Restart. To quote the new documentation, “You can set up Google Compute Engine to automatically restart an instance if it is taken offline by a system event, such as a hardware failure or scheduled maintenance event, using the automaticRestart setting.” Answering a query from me, Google said that the restart time is dependent upon the type of failure, but that in most cases, it should be under three minutes.
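As a rough illustration of what that setting looks like in practice, the sketch below builds the kind of scheduling options an instance might carry. Only `automaticRestart` comes from the documentation quoted above. The `onHostMaintenance` field and its `MIGRATE` / `TERMINATE` values are my recollection of the Compute Engine API and should be checked against Google's current documentation, and the `scheduling_options` helper itself is hypothetical.

```python
import json


def scheduling_options(automatic_restart=True, on_host_maintenance="MIGRATE"):
    """Build a scheduling-options dict for an instance (hypothetical helper).

    automaticRestart is the setting named in Google's documentation;
    onHostMaintenance and its allowed values are an assumption here.
    """
    if on_host_maintenance not in ("MIGRATE", "TERMINATE"):
        raise ValueError("on_host_maintenance must be MIGRATE or TERMINATE")
    return {
        "automaticRestart": bool(automatic_restart),
        "onHostMaintenance": on_host_maintenance,
    }


if __name__ == "__main__":
    # Restart after a failure, and live-migrate during maintenance.
    print(json.dumps(scheduling_options(), indent=2))
```

The two options map onto the two halves of infrastructure resilience discussed earlier: one covers hardware failure (restart), the other covers planned maintenance (migrate).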
So a gauntlet has been (subtly) thrown. This is one of the features that enterprises most want in an “enterprise-grade” cloud IaaS offering. Now Google has it. Will AWS, Microsoft, and others follow suit?