A few days ago, an unexpected side-effect of some new code caused a major Gmail outage. Last year, a small bug triggered a series of cascading failures that resulted in a major Amazon S3 outage. These are not the first cloud failures, nor will they be the last.
Cloud failures are as complex as the underlying software that powers them. No longer do you have isolated systems; you have complex, interwoven ecosystems, delicately orchestrated by a swarm of software programs. In presenting simplicity to the user, the cloud provider takes on the burden of dealing with that complexity itself.
People sometimes say that these clouds aren’t built to enterprise standards. In one sense, they aren’t: most aren’t intended to meet enterprise requirements in terms of feature set. In another sense, though, they are engineered to far exceed anything that the enterprise would ever think of attempting itself. Massive-scale clouds are designed to never, ever, fail in a user-visible way. The fact that they do fail nonetheless should not be a surprise, given the potential for human error encoded in software. It is, in fact, surprising that they don’t visibly fail more often. Every day, within these clouds, a whole host of small errors that would be outages if they occurred within the enterprise (server hardware failures, storage failures, network failures, even some software failures) are handled invisibly by the back-end. Most of the time, the self-healing works the way it’s supposed to. Sometimes it doesn’t. The irony in both the Gmail outage and the S3 outage is that both appear to have been caused by the very software components that were actively trying to create resiliency.
To run infrastructure on a massive scale, you are utterly dependent upon automation. Automation, in turn, depends on software, and no matter how intensively you QA your software, you will have bugs. It is extremely hard to test complex multi-factor failures. There is nothing to indicate that either Google or Amazon is careless about its software development processes or its safeguards against failure. They undoubtedly hate failure as much as, and possibly more than, their customers do. Every failure means sleepless nights, painful internal post-mortems, lost revenue, angry partners, and embarrassing press. I believe that these companies do, in fact, diligently seek to handle every error condition they can seamlessly, and that they generally have enough engineering talent, in both quantity and quality, to do it well.
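To put a rough number on why multi-factor failures are so hard to test, here is a back-of-the-envelope sketch. It simply counts how many distinct sets of components could fail together. The figure of 1,000 components and the function name `failure_scenarios` are my own assumptions for illustration, not anything Google or Amazon has published, and the count ignores ordering and timing, so real systems are worse than this.

```python
from math import comb


def failure_scenarios(components: int, max_simultaneous: int) -> int:
    """Count distinct sets of 1..max_simultaneous components failing together.

    Hypothetical helper for illustration only: it treats every component as
    interchangeable and ignores the order and timing of failures.
    """
    return sum(comb(components, k) for k in range(1, max_simultaneous + 1))


if __name__ == "__main__":
    # Assumed system size of 1,000 components (an illustrative number).
    for k in (1, 2, 3):
        print(f"up to {k} simultaneous failures: {failure_scenarios(1000, k):,} scenarios")
```

Even at that modest assumed size, going from single failures to any three failing together takes you from a thousand scenarios to well over a hundred million, which is why no test suite covers them all.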
But the nature of the cloud, the one homogeneous fabric, magnifies problems. Still, that’s not unique to the cloud. Let’s not forget VMware’s license bug from last year. People who normally booted up their VMs at the beginning of the day were pretty much screwed. It took VMware the better part of a day to produce a patch, and their originally announced timeframe was 36 hours. I’m not picking on VMware; you could certainly find yourself with a similar problem with any widely deployed software vulnerable to a bug that causes all of it to fail at once.
Enterprise-quality software gave us the vulnerability that the SQL Slammer worm exploited, after all. In the cloud, we ain’t seen nothing yet…