What the Microsoft CDN study measures
Cheng Huang et al.’s Microsoft Research and NYU collaboration, a study entitled Measuring and Evaluating Large-Scale CDNs, is worth a closer look. This is the first of what I expect will be a series of posts aiming to explain what was studied and what it means.
The study charts the Akamai and Limelight CDNs, and compares their performance. Limelight has publicly responded, based on questions from Dan Rayburn.
I want to begin by talking about what this study does and doesn’t measure.
The study measures two things: latency to the CDN’s DNS server, and latency to the CDN’s content server. This is latency in the purest network sense — the milliseconds of transit time between the origin measurement point (the “vantage point”) and a particular CDN server. The study uses a modified King methodology, which means the origin measurement points are open recursive DNS servers. In plain English, that means that the origin measurement points are ordinary DNS resolvers — the servers provided by ISPs, universities, and some businesses that keep their resolvers outside the firewall. The paper states that 282,700 unique resolvers were used as vantage points.
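For the curious, this is roughly what a King-style measurement looks like. A minimal sketch, not the study’s code: it assumes the dnspython library, uses placeholder IPs and hostnames, and takes a single sample where the real methodology takes many and controls for caching.

```python
"""Rough sketch of a King-style latency estimate (not the study's code).

Idea: latency(resolver -> target DNS server) is approximately
(time for the resolver to recursively fetch a name served by the target)
minus (time for the resolver to answer something it already has cached).
Assumes the dnspython package; IPs and names below are placeholders.
"""
import time

import dns.message
import dns.query

RESOLVER_IP = "192.0.2.53"                # placeholder: an open recursive resolver
CACHED_NAME = "www.example.com"           # a name the resolver very likely has cached
TARGET_NAME = "cdn-hostname.example.net"  # placeholder: a name whose authoritative server is the CDN's DNS


def timed_query(name: str, server: str) -> float:
    """Return round-trip time in milliseconds for a single UDP DNS query."""
    query = dns.message.make_query(name, "A")
    start = time.perf_counter()
    dns.query.udp(query, server, timeout=2.0)
    return (time.perf_counter() - start) * 1000.0


to_resolver = timed_query(CACHED_NAME, RESOLVER_IP)   # roughly our latency to the resolver
via_resolver = timed_query(TARGET_NAME, RESOLVER_IP)  # our latency + resolver-to-CDN-DNS latency
print(f"estimated resolver-to-CDN-DNS latency: {via_resolver - to_resolver:.1f} ms")
```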
Open recursive DNS servers (I’m just going to call them “resolvers” for short) are typically at the core of networks, not at the edge. They sit in the data centers of service providers and organizations; in the case of service providers, they may sit at major aggregation points. For instance, I’m a MegaPath DSL customer; the two MegaPath-based resolvers provided to me sit at locations with ping times that average 18 ms and 76 ms. The issues with this are particularly acute given the study’s resolver discovery methodology — open authoritatives found by a reverse DNS lookup. Among other things, this results in large, diverse networks being significantly under-represented.
So what this study emphatically does not measure is latency to the end user. Instead, think of it as latency to the core of a very broad spectrum of networks, where “the core” means a significant data center or aggregation point, and “networks” mean service provider networks as well as enterprise networks. This is going to be very important when we consider the Akamai/Limelight performance comparison.
Content delivery performance can typically be broken down into the “start time” — the amount of time that passes until the first byte of content is delivered to the user — and the “transfer time”, which is how long it takes for the content to actually get delivered.
The first component of the start time is the DNS resolution time. The URL is typically a human-readable name; this has to get turned into an IP address that a computer can understand. This is where CDNs are magic — they take that hostname and turn it into the IP address of a “good”, “nearby” CDN server to get the content from. This component is what the study is measuring when it’s measuring the CDN DNS servers. The performance of this involves (there’s a small illustrative sketch after this list):
- the network latency between the end-user and his resolver
- the network latency between his resolver and the CDN’s DNS server
- the amount of time it takes for the CDN’s DNS server to return a response to the query (the CDN has to figure out which server it wants to serve the content from, which takes some computational cycles to process; in order to cut down computational time, it tends to be a “good enough” server rather than “the optimal” server)
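To make the DNS piece concrete: the easiest way to watch a CDN’s DNS at work is to resolve the same CDN-fronted hostname through resolvers in different places and compare the answers. A minimal sketch, assuming the dnspython library; the hostname is a placeholder, and the resolver IPs are just examples of public resolvers.

```python
"""Resolve the same CDN-fronted hostname via different resolvers.

A CDN's DNS will typically hand back different edge-server IPs depending
on where the querying resolver sits. The hostname below is a placeholder;
swap in any CDN-delivered hostname. Assumes the dnspython package.
"""
import dns.resolver

CDN_HOSTNAME = "images.example-cdn.net"  # placeholder CDN-fronted hostname
RESOLVERS = ["8.8.8.8", "1.1.1.1"]       # example public resolvers

for server in RESOLVERS:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    answers = resolver.resolve(CDN_HOSTNAME, "A")
    ips = ", ".join(rr.address for rr in answers)
    print(f"via {server}: {ips}")
```

Different answers from different resolvers are the CDN’s request routing doing its job; identical answers everywhere usually mean a megaPOP-style footprint or anycast.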
The start time has another component, which is how long it takes for the CDN content server to find the file it’s going to serve, and start spitting it out over the network to the end user. This is a function of server performance and workload, but it’s also a function of whether or not the content is in cache. If it’s not in cache, it’s got to go fetch it from the origin server. Therefore, a cache miss is going to greatly increase the start time. The study doesn’t measure this at all, of course.
The transfer time itself is dependent upon the server performance and workload, but also upon the network performance between the CDN’s content server and the end user. This involves not just latency, but also packet loss (although most networks today have very little packet loss, to the point where some carriers offer 0% packet loss SLAs). During the transfer period, jitter (the variability of network latency) may also matter, since spikes in latency may impact things like video, causing a stream to rebuffer or a progressive-download viewing to pause. In the end, the performance comes down to throughput — how many bytes can be shoved across the pipe, each second. The study measures latency to the content server, but it does not measure throughput, and throughput is the real-world metric for understanding actual CDN performance. Moreover, the study measures latency using a DNS packet — lightweight and singular. So it in no way reflects any TCP/IP tricks that a CDN might be doing in order to optimize its throughput.
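Since start time and throughput are what end users actually feel, this is the kind of measurement that would capture them. A minimal sketch using only the Python standard library; the URL is a placeholder, and a real test would repeat it from many end-user vantage points, with both cold and warm caches.

```python
"""Measure DNS time, time-to-first-byte, and transfer throughput for one URL.

This is the user-visible breakdown that a latency-only study doesn't capture.
The URL below is a placeholder; standard library only.
"""
import socket
import time
import urllib.request
from urllib.parse import urlparse

URL = "http://cdn.example.net/sample-object.jpg"  # placeholder CDN-hosted object

host = urlparse(URL).hostname

t0 = time.perf_counter()
socket.getaddrinfo(host, 80)                      # DNS resolution
t_dns = time.perf_counter()

response = urllib.request.urlopen(URL, timeout=10)
first_chunk = response.read(64 * 1024)            # connect + request + first bytes
t_first = time.perf_counter()

rest_bytes = 0
while True:
    chunk = response.read(64 * 1024)
    if not chunk:
        break
    rest_bytes += len(chunk)
t_done = time.perf_counter()

transfer_s = t_done - t_first
print(f"DNS:          {(t_dns - t0) * 1000:.1f} ms")
print(f"start (TTFB): {(t_first - t_dns) * 1000:.1f} ms (connect + request + first bytes)")
if transfer_s > 0:
    print(f"transfer:     {rest_bytes / transfer_s / 1e6:.2f} MB/s for the remaining {rest_bytes} bytes")
```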
Now, let’s take all this in the context of the Akamai/Limelight comparison that’s being drawn. The study notes that DNS resolution time is 23% higher for Limelight than Akamai, and that Limelight’s content server latency is 114% higher. However, this includes regions for which Limelight has little or no geographic coverage. For instance, in North America, where both companies have good coverage, Akamai has a DNS server delay of 115.81 ms and a content server delay of 67.24 ms, vs. 78.64 ms and 79.03 ms respectively for Limelight. (It’s well-known that Akamai’s DNS resolution can be somewhat slower than competitors’, since its much more extensive and complex network results in greater computational complexity.)
The study theorizes that it’s comparing the Akamai “as far to the edge as possible” approach vs. the Limelight (and most other current-generation CDNs) “megaPOP” approach. In other words, the question being asked is, “How much performance difference is created by not being right at the edge?”
Unfortunately, this study doesn’t actually answer that question, because the vantage points — the open recursive DNS servers — are not at the edge. They’re at the core (whether of service provider or enterprise networks). They’re at locations with fast big-pipe connectivity, and likely located in places with excellent peering — both megaPOP-friendly. A CDN like Akamai is certainly also at those same megaPOP locations, of course, but the methodology means that a lot of vantage points are essentially looking at the same CDN points of presence, rather than the more diverse set that might otherwise be represented by actual end-users. It seems highly likely that the Akamai network performance difference, under conditions where both CDNs feel they have satisfactory coverage, is underestimated by the study’s methodology.
More to come…
Limelight workaround, and an Akamai comparison
DataCenterKnowledge has reported that Limelight has a workaround for the Akamai patents. Limelight’s last SEC filing noted that it amended its agreement with Microsoft (to whom it has licensed its CDN technology, for Microsoft to use in building its own internal CDN), to provide Microsoft with a new version of the software that Limelight believes is non-infringing. That suggests that Limelight has, or will have, a workaround for its own network.
Also, Microsoft and NYU researchers have recently released a paper, Measuring and Evaluating Large-Scale CDNs, that charts the Akamai and Limelight networks, and offers (DNS-based) delay measurements for their DNS resolvers and content servers.
I’ll have more commentary on both topics soon, when I’ve got some more time.
Also, I have decided that I’m going to start adding stock tickers to my tags, whenever I write about something that’s likely to be of interest to investors in a particular company. Hopefully, this will help Gartner Invest clients and others with similar interests to navigate my blog.
CDNs continue to get cheaper
Dan Rayburn has posted his quarterly CDN pricing update. It’s always interesting reading for me, compared to what I see out of Gartner client contracts. I’m somewhat bemused that he finds 250 TB a month to be an “average” customer; the overwhelming majority of CDN customers are below the 100 TB mark, and customers are getting smaller and smaller as prices drop, minimum commits decline, and enterprises revamp their websites with rich media while harboring often-irrational fears about hosting video themselves. I have an awful lot of enterprise clients doing below 2 TB of delivery. But certainly prices are dropping across the board, although it’s really only at the higher volumes that the price drops have been murderous.
Heavy experiments with Amazon
Scott Penberthy of online video provider Heavy has an interesting blog post about trying to replace Rackspace and Akamai with Amazon web services — substituting S3 for Rackspace SAN storage, and direct delivery out of S3 for Akamai CDN services. Not surprisingly, the S3 performance fell well below Akamai performance, but they managed to achieve significant storage cost savings.
Who hosts Warhammer Online?
With the recent launch of EA/Mythic’s Warhammer Online MMORPG comes my usual curiosity about who’s providing the infrastructure.
Mythic has stated publicly that all of the US game servers are located in Virginia, near Mythic’s offices. A couple of traceroutes seem to indicate that they’re in Verizon, almost certainly in colocation (managed hosting is rare for MMOGs), and seem to have purely Verizon connectivity to the Internet. The webservers, on the other hand, look to be split between Verizon and ThePlanet in Dallas. FileBurst (a single-location download hosting service) is used to serve images and cinematics.
During the beta, Mythic used BitTorrent to serve files. With the advent of full release, it doesn’t appear that they’re depending on peer-to-peer any longer — unlike Blizzard, for instance, which uses public P2P in the form of BitTorrent for its World of Warcraft updates, trading off cost with much higher levels of user frustration. MMO updates are probably an ideal case for P2P file distribution — Solid State Networks, a P2P CDN, has done well by that — and with hybrid CDNs (those combining a traditional distributed model with P2P) becoming more commonplace, I’d expect to see that model more often.
However, I’m not keen on either single data center locations or single-homing, for anything that wants to be reliable. I also believe that gaming — a performance-sensitive application — really ought to run in a multi-homed environment. My favorite “why you should use multiple ISPs, even if you’re using a premium ISP that you love” anecdote to my clients is an observation I made while playing World of Warcraft a few years ago. WoW originally used just AT&T’s network (in AT&T colocation). Latency was excellent — most of the time. Occasionally, you’d get a couple of seconds of network burp, where latency would spike hugely. If you’re websurfing, this doesn’t really impact your experience. If you’re playing an online game, you can end up dead. When WoW switched to Internap for the network piece (remaining in AT&T colo), overall latencies went up — but the latencies were still well below the threshold of problematic performance, and more importantly, the latencies were rock-solidly in a narrow window of variability. (This is the same reason multi-homed CDNs with lots of route diversity deliver better consistency of user experience than single-carrier CDNs.)
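That “narrow window of variability” point is easy to put numbers on. Here’s a minimal sketch of the comparison I have in mind; the RTT samples are invented, purely illustrative figures (not measurements of AT&T or Internap), and in practice you’d feed in real ping data.

```python
"""Compare two network paths on latency consistency, not just average latency.

The sample lists below are invented, purely illustrative numbers;
substitute real RTT measurements (e.g., from ping) in practice.
"""
import statistics


def summarize(label: str, rtts_ms: list[float]) -> None:
    """Print mean, standard deviation, p95, and worst-case RTT for one path."""
    ordered = sorted(rtts_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    print(f"{label}: mean {statistics.mean(rtts_ms):.0f} ms, "
          f"stdev {statistics.stdev(rtts_ms):.0f} ms, "
          f"p95 {p95:.0f} ms, worst {max(rtts_ms):.0f} ms")


# Hypothetical path A: usually fast, with occasional multi-second spikes.
path_a = [30, 32, 29, 31, 30, 2100, 33, 30, 31, 1800]
# Hypothetical path B: a bit slower on average, but tightly bounded.
path_b = [55, 58, 61, 57, 56, 60, 59, 58, 62, 57]

summarize("path A (low mean, high jitter)", path_a)
summarize("path B (higher mean, low jitter)", path_b)
```

For web surfing, the occasional two-second spike on path A barely registers; in a game, it gets you killed. That’s why the tighter distribution wins even at a higher mean.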
Companies like Fileburst, by the way, are going to be squarely in the crosshairs of the forthcoming Amazon CDN. Fileburst will do 5 TB of delivery at $0.80 per GB — $3,985/month. At the low end, they’ll do 100 GB or less at $1/GB. The first 100 MB of storage is free, then it’s $2/MB. They’ve got a delivery infrastructure at the Equinix IBX in Ashburn (Northern Virginia, near DC), extensive peering, but any other footprint is vague (they say they have a six-location CDN service, but it’s not clear whether it’s theirs or if they’re reselling).
If Amazon’s CDN pricing is anything like the S3 pricing, they’ll blow the doors off those prices. S3 is $0.15/GB for space and $0.17/GB for the first 10 TB of data transfer. So delivering 5 TB worth of content out of a 1 GB store would cost me $5,785/month with Fileburst, and about $850 with Amazon S3. Even if the CDN premium on data transfer is, say, 100%, that’d still be only $1,700 with Amazon.
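The arithmetic behind those numbers, for anyone who wants to play with the assumptions: a quick sketch using the straight per-GB rates quoted above (which is why the Fileburst delivery line comes out at $4,000 rather than the $3,985 tier price), plus a hypothetical 100% CDN premium on S3 transfer that is my placeholder, not Amazon pricing.

```python
"""Back-of-the-envelope cost comparison from the rates quoted in the post.

Uses flat per-GB arithmetic, so the Fileburst delivery line comes out at
$4,000 rather than the $3,985 tier price quoted; the 100% "CDN premium"
on S3 transfer is a hypothetical, not Amazon pricing.
"""
DELIVERY_GB = 5000   # 5 TB of delivery per month
STORAGE_MB = 1000    # a 1 GB store

# Fileburst: $0.80/GB delivery; first 100 MB of storage free, then $2/MB.
fileburst = DELIVERY_GB * 0.80 + max(0, STORAGE_MB - 100) * 2.00

# Amazon S3: $0.15/GB-month storage, $0.17/GB for the first 10 TB of transfer.
s3 = (STORAGE_MB / 1000) * 0.15 + DELIVERY_GB * 0.17

# Hypothetical: Amazon CDN transfer priced at double the S3 transfer rate.
s3_with_cdn_premium = (STORAGE_MB / 1000) * 0.15 + DELIVERY_GB * 0.17 * 2

print(f"Fileburst:         ${fileburst:,.0f}/month")
print(f"Amazon S3:         ${s3:,.0f}/month")
print(f"S3 + 100% premium: ${s3_with_cdn_premium:,.0f}/month")
```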
Amazon has a key cloud trait — elasticity, basically defined as the ability to scale to zero (or near-zero) as easily as scaling to bogglosity. It’s that bottom end that’s really going to give them the potential to wipe out the zillion little CDNs that primarily have low-volume customers.
Oracle in the cloud… sort of
Today’s keynote at Oracle OpenWorld mentioned that Oracle’s coming to Amazon’s EC2 cloud.
The bottom line is that you can now get some Oracle products, including the Oracle 11g database software, bundled as AMIs (Amazon machine images) for EC2 — i.e., ready-to-deploy — and you can license these products to run in the cloud. Any sysadmin who has ever personally gone through the pain of trying to install an Oracle database from scratch knows how frustrating it can be; I’m curious how much the task has or hasn’t been simplified by the ready-to-run AMIs.
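For context, “ready-to-deploy” here just means launching the published machine image rather than installing Oracle yourself. A minimal sketch of what that launch looks like, using the present-day boto3 SDK (which postdates this post; at the time you’d have used the EC2 command-line API tools), with a placeholder AMI ID, instance type, and key pair name.

```python
"""Launch an EC2 instance from a prebuilt (e.g., vendor-published) AMI.

Sketch only: uses the modern boto3 SDK rather than the 2008-era EC2 API
tools, and the AMI ID, instance type, and key pair name are placeholders.
Assumes AWS credentials are already configured in the environment.
"""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: the published Oracle AMI ID
    InstanceType="m5.xlarge",         # placeholder: size to the database workload
    KeyName="my-keypair",             # placeholder: an existing EC2 key pair
    MinCount=1,
    MaxCount=1,
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"launched {instance_id}; connect and configure the database once it's running")
```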
On the plus side, this is going to address the needs of those companies who simply want to move apps into the cloud, without changing much if anything about their architecture and middleware. And it might make a convenient development and testing platform.
But simply putting a database on cloud infrastructure doesn’t make it a cloud database. Without that crucial distinction, what are the compelling economics or business value-add? It’s cool, but I’m having difficulty thinking of circumstances under which I would tell a client, yes, you should host your production Oracle database on EC2, rather than getting a flexible utility hosting contract with someone like Terremark, AT&T, or Savvis.
Amazon gets into the CDN business
Unsurprisingly, Amazon is getting into the CDN business. (They’re taking notification sign-ups but it’s still in private beta.)
Content delivery is a natural complement to S3 and EC2. There’s already been use and abuse of S3 as a “ghetto CDN”, and at least one commercial hosting provider (Voxel) already offers a productized S3-based CDN. If you’re an EC2 or an S3 customer, chances are high that you’ve got significant static content traffic suited to CDN delivery. Amazon is just gluing together the logical pieces, and like you’d expect, your content on their CDN will reside in S3.
Basic content delivery services can practically be thought of as nothing more than value-added bandwidth (or value-added storage, if you want to think of it that way). Chances are very high that every major carrier, not to mention every major provider of distributed computing services (i.e., infrastructure clouds), is going to end up in the CDN business sooner or later.
GigaOm and Dan Rayburn have more details about the announcement, and come to similar conclusions: Despite how badly the stock market is beating up on Akamai in the wake of this announcement, this really has very little impact on them. I concur with that bottom line.
I noted last year that the CDN market has bifurcated. Amazon’s new offering is going to squarely target the commoditized portion of the market. Of the existing CDNs, it will impact Level 3 and the smaller no-frills CDNs the most. It will probably also have a minor impact on Limelight (which has a significant percentage of commodity CDN traffic), but basically negligible impact upon Akamai, whose customer base is tilting more and more to the high end of this business.
Just like EC2 and S3 have, this new Amazon service is also going to create a market for overlay value-add companies — people who provide easier-to-use interfaces, analytics, and so on, over the Amazon offering. I’d expect to see some of the existing overlay companies provide management toolsets for the new service, and it will probably prompt some hosters to offer CDN services built on top of the Amazon platform.
Amazon’s entry, combining an elastic model with what at this point can reasonably be considered proven scalable infrastructure expertise, constitutes further market expansion, and supports my fundamental belief that CDNs are increasingly going to entirely dominate the front-end webserving tier. Delivery is becoming so cheap for the masses that there’s very little reason to bother with your own front-end infrastructure.
The Cloud skills shift
Something that I’ve been thinking about: The shift to global-class computing, and massively scalable infrastructure, represents a fundamental shift in the skill sets that will be valued in IT Operations.
Those of you who, like myself, have worked at service providers in hyper-growth mode, are already familiar with what occurs when you need to grow at red-shift speeds: You automate everything humanly possible, and you try to standardize the heck out of things. Usually you end up trying to make sure that your infrastructure is horizontally scalable, and that your hardware is as interchangeable as possible, allowing any single server to fail and the system as a whole to go chugging along, while you eventually go yank that server out and replace it with another just-as-generic box that you’ve auto-provisioned.
The shift to the cloud model, whether public or private, basically pushes the idea that every IT organization does that, either in-house or through the services of a provider. It puts the premium on software development / scripting skills — these are the guys who automate things and who write the glue for integrating your toolsets. You’ll have a handful of guys who are your serious architects — the guys who tune and optimize your hardware and storage, design your configurations, and so on. (That might be a single guru, or you might go to consultants for that, alternatively.) You’ll have a few folks who know the operational ins-and-outs of troubleshooting your applications. Everyone else becomes a hardware monkey, entry-level folks who don’t need much more of a skillset than it takes to assemble a PC from parts.
This is writ large in the Google model of Operations, but it’s been true for the last decade in every dot-com of significant size, too. Your hardware operations guys are rack-and-stack types. Everyone else blends systems administration with scripting abilities, and because your toolsets have to scale and be highly maintainable, this is scripting that has the air of serious development, not a one-time thing that can be banged out unreadably.
The routine drudgery of IT Operations is going to get automated away, bit by bit. Right now, many enterprises still operate at a scale, and with a degree of non-standardization, where it’s not necessarily more efficient to automate a task than to simply do it manually. In the cloud model, the balance tips to the automation side, and the basic value of “I can wrangle boxes” declines precipitously.
My advice to sysadmins: If you are not fluent in a scripting language, and/or not capable of writing structured, readable, maintainable, non-hackish code, now is the time to learn.
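To be concrete about what I mean by structured, maintainable scripting (as opposed to a banged-out one-liner), here is a trivial, purely illustrative example, with hypothetical defaults, of the shape such a script should take: functions, argument handling, logging, and a sane exit code that a larger automation system can act on.

```python
"""Tiny example of ops scripting written like maintainable software.

Checks free disk space on a set of paths and exits non-zero if any is
below a threshold, so a scheduler or monitoring system can act on it.
The default paths and threshold are illustrative, not a recommendation.
"""
import argparse
import logging
import shutil
import sys


def check_path(path: str, min_free_pct: float) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    usage = shutil.disk_usage(path)
    free_pct = usage.free / usage.total * 100
    logging.info("%s: %.1f%% free", path, free_pct)
    return free_pct >= min_free_pct


def main() -> int:
    parser = argparse.ArgumentParser(description="Alert on low disk space.")
    parser.add_argument("paths", nargs="*", default=["/"], help="paths to check")
    parser.add_argument("--min-free-pct", type=float, default=10.0,
                        help="minimum acceptable free space, in percent")
    args = parser.parse_args()

    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    failures = [p for p in args.paths if not check_path(p, args.min_free_pct)]
    for path in failures:
        logging.error("%s is below %.1f%% free", path, args.min_free_pct)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```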
Who hosts the ’08 election sites?
The upcoming election in the United States is clearly the election of YouTube and social media. The websites of the presidential candidates are drawing substantial traffic — Nielsen Online estimates 3.4 million unique visitors for Obama for the week ending August 31st, and 1.8 million for McCain (up from 524,000 the week before — a massive jump, apparently the result of naming Sarah Palin as his running mate). This is a huge boost for both campaigns over their unique visitors for the month of May, a quarter ago — 2.3 million for Obama, and 563,000 for McCain.
By dot-com and general media measures, these are respectable but not huge numbers. At the top of the popularity game, Facebook and MySpace boast about 115 million unique visitors world-wide, with MySpace tops in the US with around 75 million. Better comparisons might be the launch of the Age of Conan MMORPG (2.2 million unique visitors in 10 days), or the visitors to the website of Weight Watchers (around 2 million).
Websites have been mission-critical in this campaign. So who hosts all of this infrastructure? The traffic’s tremendously variable, the leadership of the free world is at stake… so who gets entrusted with it all?
We can figure that out.
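The first step is mechanical: resolve each campaign’s hostname, then reverse-resolve the resulting addresses to see whose network they land in, with whois and traceroute as the obvious manual follow-up. A minimal sketch of that first step, using only the standard library; the domains are my assumption of the campaigns’ primary sites.

```python
"""First pass at "who hosts this site": forward and reverse DNS lookups.

Standard library only. The domains below are assumed to be the campaigns'
primary sites; whois and traceroute on the resulting IPs are the obvious
manual follow-up.
"""
import socket

SITES = ["www.barackobama.com", "www.johnmccain.com"]  # assumed campaign domains

for site in SITES:
    try:
        _, _, addresses = socket.gethostbyname_ex(site)
    except socket.gaierror as exc:
        print(f"{site}: lookup failed ({exc})")
        continue
    for ip in addresses:
        try:
            reverse_name = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            reverse_name = "(no reverse DNS)"
        print(f"{site} -> {ip} -> {reverse_name}")
```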