Availability and the Microsoft CDN study
This post is the third in a series examining the Microsoft CDN study. My first post examined what was measured, and the second post looked at the blind spots created by the vantage-point discovery method they used. This time, I want to look at the availability and maintenance claims made by the study.
CDNs are inherently built for resilience. The whole point of a CDN is that individual servers can fail (or be taken offline for maintenance) with little impact on performance. Indeed, entire locations can fail without affecting the availability of the whole.
If you’re a CDN, then the fewer nodes you have, the more impact the total failure of a node will have on the overall performance your end-users see. The flip side is that megaPOP-architecture CDNs generally place their nodes in highly resilient facilities with extremely broad access to connectivity. The most likely scenario that takes out an entire such node is a power failure, which in such facilities generally requires a cascading chain of failures (although a single critical point can still go wrong, as with the 365 Main outage of last year). By contrast, the closer you get to the edge, the higher the likelihood that you’re not in a particularly good facility and that you’re getting connectivity from just one provider; failure is more probable, but it also has less impact on performance.
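To make that trade-off concrete, here is a rough sketch of the arithmetic. It assumes equal-capacity nodes that fail independently, which real CDNs certainly do not guarantee, and every number in it is hypothetical rather than taken from the study. The function name `failure_profile` is mine.

```python
import math


def failure_profile(node_count, failure_probability):
    """Return (impact per node failure, expected fraction down, std dev of fraction down).

    Simplifying assumptions (not from the study): every node carries an
    equal share of capacity, and nodes fail independently with the same
    probability at any given moment.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")

    # Losing one of N equal nodes removes 1/N of total capacity.
    per_failure_impact = 1.0 / node_count
    # N nodes, each down with probability p and worth 1/N: N * p * (1/N) = p.
    expected_down = failure_probability
    # Binomial spread of the fraction down: sqrt(p * (1 - p) / N).
    std_dev_down = math.sqrt(
        failure_probability * (1.0 - failure_probability) / node_count
    )
    return per_failure_impact, expected_down, std_dev_down


# Hypothetical figures, chosen only to illustrate the shape of the trade-off.
megapop = failure_profile(node_count=20, failure_probability=0.001)
edge = failure_profile(node_count=2000, failure_probability=0.01)

for name, profile in (("megaPOP", megapop), ("edge", edge)):
    impact, expected, spread = profile
    print(f"{name}: impact per failure={impact:.4%}, "
          f"expected down={expected:.4%}, std dev={spread:.4%}")
```

Under these assumptions the expected fraction of capacity that is down depends only on the per-node failure probability. The node count changes how large each individual failure is and how widely the outcome swings.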
Because the Microsoft study likely missed a significant number of Akamai server deployments, especially local deployments, it may underestimate Akamai’s single-server downtime, if you assume that such local servers are statistically more likely to be subject to failure.
I would expect, however, that most wider-scale CDN outages are related not to asset failure (facility or hardware), but to software errors. CDNs, especially large CDNs, are extraordinarily complex software systems. There are scaling challenges inherent in such systems, which is why CDNs often experience instability issues as part of their growing pains.
The problem with the Microsoft study of availability is that whether or not a particular server or set of servers responds to requests is not really germane to availability per se. What is useful to know is the variance in performance that results from servers being unavailable, and what percentage of the time the CDN selects a content server that is actually unavailable or is performing poorly. The variance plays into that edge-vs-megaPOP question, and the selection indicates the quality of the CDN’s software algorithms as well as real-world performance. The Microsoft study doesn’t help us there.
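As a sketch of what measuring those two things might look like, suppose you had a log of client-side probes recording which server the CDN selected and what happened next. The `Probe` record, its field names, the 500 ms "slow" threshold, and the sample data are all hypothetical; none of this comes from the study.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Optional, Sequence


@dataclass
class Probe:
    """One hypothetical client request routed by the CDN."""
    selected_server: str
    responded: bool
    latency_ms: Optional[float]  # None when the server never answered


def bad_selection_rate(probes: Sequence[Probe], slow_threshold_ms: float = 500.0) -> float:
    """Fraction of requests sent to a server that was down or too slow."""
    if not probes:
        raise ValueError("need at least one probe")
    bad = sum(
        1
        for p in probes
        if not p.responded or p.latency_ms is None or p.latency_ms > slow_threshold_ms
    )
    return bad / len(probes)


def latency_variance(probes: Sequence[Probe]) -> float:
    """Population variance of latency across the requests that were answered."""
    latencies = [p.latency_ms for p in probes if p.responded and p.latency_ms is not None]
    return pvariance(latencies) if latencies else 0.0


# Made-up sample: two good selections, one dead server, one slow server.
sample = [
    Probe("edge-a", True, 40.0),
    Probe("edge-a", True, 45.0),
    Probe("edge-b", False, None),
    Probe("pop-1", True, 900.0),
]

print(f"bad selections: {bad_selection_rate(sample):.0%}")
print(f"latency variance (ms^2): {latency_variance(sample):.1f}")
```

The point of the sketch is that both numbers are computed from what the CDN actually chose for a user, not from whether any given server happened to be up.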
Similarly, whether or not a particular server is in service does not indicate what the actual maintenance cost of the CDN is. Part of the core skillset of a CDN company is the ability to maintain very large amounts of hardware without using a lot of people. A CDN could very readily have automated processes that pull servers out of service and execute software updates and the like with little to no human intervention.
Next up: Some conclusions.