The myth of zero downtime
Every time there’s a major Amazon outage, someone says something like, “Regular Web hosters and colocation companies don’t have outages!” I saw an article to that effect in my Twitter stream today, and finally decided that the topic deserves a blog post. (The article seemed rather linkbait-ish, so I’m not going to link it.)
It is an absolute myth that you will not have downtime in colocation or Web hosting. It is also a complete myth that you won’t have downtime in cloud IaaS run by traditional Web hosting or data center outsourcing providers.
The typical managed hosting customer experiences roughly one outage a year. This figure comes from thirteen years of asking Gartner clients, day in and day out, about their operational track record. These outages are typically related to hardware failure, although sometimes they are related to service provider network outages (often caused by device misconfiguration, which can obliterate any equipment or circuit redundancy). Some customers are lucky enough to never experience any outages over the course of a given contract (usually two to three years for complex managed hosting), but this is actually fairly rare, because most customers aren’t architected to be resilient to anything but the most trivial of infrastructure failures. (Woe betide the customer who has a serious hardware failure on a database server.) The “one outage a year” figure does not include any outages that customers might have caused themselves through application failure.
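To get a feel for how rare an outage-free contract should be, here is a rough back-of-the-envelope sketch. It assumes outages arrive independently at a steady average rate of one per year (a Poisson model), which is my simplification for illustration and not something the client data establishes. The function name `prob_no_outage` is made up for this example.

```python
import math

def prob_no_outage(outages_per_year: float, contract_years: float) -> float:
    """Chance of zero outages over a contract, ASSUMING outages are
    independent and arrive at a constant average rate (Poisson model)."""
    return math.exp(-outages_per_year * contract_years)

# "Roughly one outage a year" over the usual two- to three-year contract.
for years in (2, 3):
    chance = prob_no_outage(1.0, years)
    print(f"{years}-year contract: {chance:.1%} chance of zero outages")
```

Under those assumptions, the odds of a clean record come out at roughly 14% for a two-year contract and 5% for a three-year one. Real outages are not this tidy, so treat the numbers as an order-of-magnitude illustration only.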
The typical colocation facility in the US is built to Tier III standards, with a mathematical expected availability of about 99.98%. In Europe, colocation facilities are often built to Tier II standards instead, for an expected availability of about 99.75%. Many colocation facilities do indeed manage to go for many years without an outage. So do many enterprise data centers, including Tier I facilities that have no redundancy whatsoever. The mathematics of the situation don’t say that you will have an outage; these are merely probabilities over the long term. Moreover, there will be an additional percentage of downtime that is caused by human error. Single-data-center kings who proudly proclaim that their one data center has never had an outage have gotten lucky.
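Availability percentages are easier to reason about as hours. The sketch below converts the two figures above into expected downtime per year. It assumes a 365-day year and treats the percentage as a long-run average, so no individual year is guaranteed to match. The function name is my own.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def expected_downtime_hours(availability_pct: float) -> float:
    """Long-run average hours of downtime per year implied by an
    availability percentage. An expectation, not a forecast."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for label, pct in [("Tier III, about 99.98%", 99.98),
                   ("Tier II, about 99.75%", 99.75)]:
    print(f"{label}: {expected_downtime_hours(pct):.2f} hours/year")
```

That works out to about 1.75 hours a year at 99.98% and about 21.9 hours a year at 99.75%, before any human-caused downtime is added on top.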
The amount of publicity that a data center outage gets is directly related to its tenant constituency. The outage at the 365 Main colocation facility in San Francisco a few years back was widely publicized, for instance, because that facility happened to house a lot of Internet properties, including ones directly associated with online publications. There have been significant outages at many other colocation facilities over the years, though, that were never noted in the press. I’ve found out about them because they were mentioned by end-user clients, or because the vendor disclosed them.
Amazon outages — and indeed, more broadly, outages at large-scale providers like Google — get plenty of press because of their mass effects, and the fact that they tend to impact large Internet properties, making the press aware that there’s a problem.
Small cloud providers often have brief outages, along with long maintenance windows and sometimes lengthy maintenance downtimes. You’re rolling the dice wherever you go. Don’t assume that just because you haven’t read about an outage in the press, it hasn’t occurred. Whether you decide on managed hosting, dedicated hosting, colocation, or cloud IaaS, you want to know a provider’s track record: their actual availability over a multi-year period, with maintenance windows counted as downtime. Especially for global businesses with 24×7 uptime requirements, it’s not okay to be down at 5 am Eastern, which is prime time in both Europe and Asia.
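If a provider will share its downtime history, you can work out that multi-year figure yourself. The sketch below is one way to do it. The `Downtime` record, the `availability_pct` function, and every number in the sample history are hypothetical, invented purely to show how much the answer moves when maintenance windows count as downtime.

```python
from dataclasses import dataclass

@dataclass
class Downtime:
    hours: float
    planned: bool  # True = maintenance window, False = unplanned outage

def availability_pct(events, period_hours, include_planned=True):
    """Percent of the period the service was up. With include_planned=False,
    maintenance windows are ignored, as many provider SLAs do."""
    down = sum(e.hours for e in events if include_planned or not e.planned)
    return 100 * (1 - down / period_hours)

# Hypothetical three-year history, made up for illustration.
period = 3 * 365 * 24
history = [
    Downtime(2.0, planned=False),
    Downtime(0.5, planned=False),
    Downtime(4.0, planned=True),
    Downtime(4.0, planned=True),
    Downtime(6.0, planned=True),
]

print(f"Excluding maintenance: {availability_pct(history, period, False):.3f}%")
print(f"Including maintenance: {availability_pct(history, period, True):.3f}%")
```

The second number is the one that matters if your users are awake during the provider’s maintenance window.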
Sure, there are plenty of reasons to worry about availability in the cloud, especially the possibility of lengthy outages made worse by the fundamental complexity that underlies many of these infrastructures. But you shouldn’t buy into the myth that your local Web hoster or colocation provider necessarily has better odds of availability, especially if you have a non-redundant architecture.