Monthly Archives: October 2008
IDC’s take on cloud
IDC has recently published snippets of their cloud computing outlook on their blog; the data from the user survey is particularly interesting.
Power usage effectiveness
Two interesting blog posts:
Chirag Mehta: Greening the Data Centers
Microsoft: Charging Customers for Power Usage in Microsoft Data Centers
Also, Google has publicly released its data center efficiency measurements, as part of their docs on their commitment to sustainable computing. What they don’t say is the degree to which their green efforts impact the availability of their facilities. Google can afford to have lower individual facility reliability, because their smart distributed infrastructure can seamlessly adapt to failure. Most enterprises don’t have that luxury.
Blog vs. research note
I’ve been grappling with finding the right balance between blogging and writing actual research notes. I am an all-at-once writer — I’m usually at my best when I sit down and write an entire research note at one go, so it comes out as one coherent whole. A research note is something that has usually percolated about in my head for a while and is now ready to be expressed in what I hope is a bit of crystallized clarity. Problematically, though, I spend nearly my entire day on the phone with clients — I often only have 15 minutes between calls, just long enough to attend to the needs of biology and deal with my email. Eight such fragments in no way equate to an actual uninterrupted two hours, or even one hour, which makes it very hard to write substantive documents.
On the other hand, I can write a blog entry in 15 minutes, or in a bunch of 15-minute fragments, because it’s far more stream-of-consciousness. It’s unpolished thought; it can be more disjointed. It can raise questions without trying to provide answers, speculate, and be wooly maunderings rather than actionable advice. It can be trivial in the broader scheme of things, but is given power by immediacy and connectedness. It’s enormously tempting to scribble things down and just let them float out into the world. I became an analyst in part because I like to write, and it’s easy to get sucked into scribbling something whenever I get a chance.
I was googling around for the thoughts of others on this subject, and I came across Andrew Sullivan’s newly-published piece in the Atlantic, “Why I Blog”. His musings in that piece have the elegance of long contemplation, and, I think, he does an excellent job of capturing the nature of blogging, writing:
A blog is not so much daily writing as hourly writing. And with that level of timeliness, the provisionality of every word is even more pressing — and the risk of error or the thrill of prescience that much greater.
Andrew Sullivan’s piece has, perhaps, one of the best indirect answers to the whole Bloggers vs. Analysts question, as well:
A traditional writer is valued by readers precisely because they trust him to have thought long and hard about a subject, given it time to evolve in his head, and composed a piece of writing that is worth their time to read at length and to ponder. Blogs don’t do this and cannot do this — and that limits them far more than it does traditional long-form writing. A blogger will air a variety of thoughts or facts on any subject in no particular order other than that dictated by the passing of time. A writer will instead use time, synthesizing these thoughts, ordering them, weighing which points count more than others, seeing how his views evolved in the writing process itself, and responding to an editor’s perusal of a draft or two. The result is almost always more measured, more satisfying, and more enduring than a blizzard of posts.
I think the need to engage with the wider community and to be more timely will inexorably push analysts towards adding blogging to their output activities (even if not employer-recognized), but it certainly won’t replace traditional research notes. Moreover, social media is here to stay in the lives of analysts; it’s useful and it’s relevant.
Forrester’s Jeremiah Owyang described 7 tenets of the connected analyst in his blog today; it’s a well-encapsulated set of thoughts on how analysts should engage with the community. To me, that emphasis on connection is a shift in the nature of analysts. Although we write research notes, our research clients probably derive the greatest value from the relationship, the one-on-one interactions that consider an individual client’s situation and provide tailored advice. Blogging, on the other hand, is a one-to-many, perhaps many-to-many, activity.
I’ll have something on the order of 800 one-on-one client interactions this year. Many of these clients will have read a research note before talking to me. But they want to talk about it — to privately ask detailed questions, to get help with their specific situation, to understand the data supporting the conclusions, and in short, to get the equivalent of boutique personalization.
Despite my belief in the value of the relationship, though, analyst firms, including mine, still make a lot of money off research subscriptions. And that gives me a professional responsibility to think hard about what to put in a freely-accessible blog versus what to put in a research note that people pay a lot of money for. So in the end, I think that what I’ll be blogging are the things that don’t yet make for good research notes — quick news takes, musings, interesting little tidbits, things that aren’t of ongoing interest to clients, and interaction with the broader blogosphere.
I’m curious to hear the thoughts of others on this subject, whether they’re other Gartner analysts, analysts at competing firms, our clients, or our detractors.
Google’s G1 Android phone
The first real reviews of Google’s first Android phone, the T-Mobile G1 (otherwise known as the HTC Dream), have begun to emerge, a week in advance of its release in stores.
Walt Mossberg of the Wall Street Journal has a detailed first look. Andrew Garcia of eWeek has a lengthy review. John Brandon of Computerworld has a first look and review round-up. But the reviews thus far have been focused on the core phone functionality, and it’s not clear to what extent the available third-party apps explore the capabilities of Android.
I am personally looking forward to checking out the new phone. I was an early user of the T-Mobile Sidekick (aka the Danger Hiptop), and I loved its rendering of webpages (and its smart proxy that reduced image sizes, did reformatting, and so on), its useful keyboard, its generally easy-to-use functionality, and the fact that it stored all of its data on the network, removing the need to ever back up the device. I was disappointed when the company did not follow through on its promise of broad third-party apps; despite release of an SDK and an app store, you couldn’t use third-party apps without voiding your warranty.
These days I carry a corporate-issued Cingular 8525 (aka HTC Hermes), but despite it being a very powerful Windows Mobile smartphone, I actually use fewer apps than I did on my Sidekick. I use my phone to tether my laptop, for SSH access to my home network, and for basic functionality (calls, SMS, browser), but despite one of the best keyboards of any current smartphone it’s still not good enough for real note-taking (with serious annoyances like the lack of a double-quote key), the browser falls well short of the Sidekick’s, the lack of network storage means I’m reluctant to trust myself to put a lot of data on it, and the UI is uninspired. So I’m quite eager to see what Android, which represents the next generation of thinking of the key figures of the Sidekick team, is going to be able to do for me. But I don’t want to return to T-Mobile (and I need AT&T for our corporate plan anyway), which means I’m going to be stuck waiting.
On another note, I’m wondering how many Android developers will choose to put the back-ends of their applications on Google App Engine. Browsing around, it seems like developers are worried about exceeding GAE quotas — everyone likes to think their app will be popular, and quota-exceeded messages are deadly, since they are functionally equivalent to downtime. GAE also requires development in Python, whereas Android requires development in Java, but I suspect that’s probably not too significant.
I haven’t really seen anything on hosting for iPhone applications, thus far, except for Morph using it as a marketing ploy. (Morph seems to be a cloud infrastructure overlay provider leveraging Amazon EC2 et al.)
Hosting the back-end for mobile apps is really no different than hosting any other kind of application, of course, but I’m curious what service providers are turning out to be popular for them. Such hosting providers could also potentially offer value-adds like mobile application acceleration, especially for enterprise-targeted mobile apps.
Software and thick vs. thin-slice computing
I’ve been thinking about the way that the economics of cloud computing infrastructure will impact the way people write applications.
Most of the cloud infrastructure providers out there offer virtual servers as a slice of some larger, physical server; Amazon EC2, GoGrid, Joyent, Terremark Enterprise Cloud, etc. all follow this model. This is in contrast to the abstracted cloud platform provided by Google App Engine or Mosso, which provide arbitrary, unsliced amounts of compute.
The virtual server providers typically provide thin slices — often single cores with 1 to 2 GB of RAM. EC2’s largest available slices are 4 virtual cores plus 15 GB, or 8 virtual cores plus 7 GB, for about $720/month. Joyent’s largest slice is 8 cores with 32 GB, for about $3300/month (including some data transfer). But on the scale of today’s servers, these aren’t very thick slices of compute, and the prices don’t scale linearly — thin slices are much cheaper than thick slices for the same total aggregate amount of compute.
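The non-linear pricing is easy to see with a little arithmetic. Here's a back-of-envelope sketch in Python, using figures loosely based on the prices cited above; the exact per-hour rates are illustrative assumptions, not quoted vendor pricing:

```python
# Compare the per-core cost of thin vs. thick virtual-server slices.
# Prices are illustrative approximations of late-2008 list pricing.

def cost_per_core_month(monthly_price, cores):
    """Monthly price divided by the number of cores in the slice."""
    return monthly_price / cores

# A thick slice: 8 virtual cores for roughly $720/month.
thick = cost_per_core_month(720, 8)        # $90 per core-month

# A thin single-core slice at roughly $0.10/hour, ~720 hours/month.
thin = cost_per_core_month(0.10 * 720, 1)  # $72 per core-month

# Eight thin slices deliver the same aggregate core count for less
# money than one thick slice -- the non-linearity discussed above.
print(f"8 thin slices: ${thin * 8:.0f}/mo vs. 1 thick slice: ${thick * 8:.0f}/mo")
```

The gap widens further once you account for RAM; providers charge a premium for large contiguous allocations of both.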
The abstracted platforms are oriented around thin-slice compute, as well, at least from the perspective of desired application behavior. You can see this in the limitations imposed by Google App Engine; they don’t want you to work with large blobs of data nor do they want you consuming significant chunks of compute.
Now, in that context, contemplate this Intel article: “Kx – Software Which Uses Every Available Core”. In brief, Kx is a real-time database company; they process extremely large datasets, in-memory, parallelized across multiple cores. Their primary customers are financial services companies, who use it to do quantitative analysis on market data. It’s the kind of software whose efficiency increases with the thickness of the available slice of compute.
In the article, Intel laments the lack of software that truly takes advantage of multi-core architectures. But cloud economics are going to push people away from thick-sliced compute — away from apps that are most efficient when given more cores and more RAM. Cloud economics push people towards thin slices, and therefore applications whose performance does not suffer notably as the app gets shuffled from core to core (which hurts cache performance), or when limited to a low number of cores. So chances are that Intel is not going to get its wish.
The Microsoft CDN study
The Microsoft/NYU CDN study by Cheng Huang, Angela Wang, et al., seems to no longer be available. Perhaps it’s simply been temporarily withdrawn pending its presentation at the upcoming Internet Measurement Conference. You can still find it in Google’s cache, HTMLified, by searching for the title “Measuring and Evaluating Large-Scale CDNs”, though.
To sum it up in brief for those who missed reading it while it was readily available: Researchers at Microsoft and the Polytechnic Institute of New York University explored the performance of the Akamai and Limelight CDNs. Using a set of IP addresses derived from end-user clients of the MSN video service, and web hosts in Windows Live search logs, the researchers derived a set of vantage points based on the open-recursive DNS servers authoritative for those domains. They used these vantage points to chart the servers/clusters of the two CDNs. Then, using the King methodology, which measures the latency between DNS servers, they measured the performance of the two CDNs from the perspective of the vantage points. They also measured the availability of the servers. Then, they drew some conclusions about the comparative performance of the CDNs and how to prioritize deployments of new locations.
Both Akamai and Limelight pointed to flaws in the study, and I’ve done a series of posts that critique the methodology and the conclusions.
For convenience, here are the links to my analysis:
What the Microsoft CDN study measures
Blind spots in the Microsoft CDN study
Availability and the Microsoft CDN study
Assessing CDN performance
Hopefully the full PDF of the study will return to public view soon. Despite its flaws, it’s still tremendously interesting and a worthwhile read.
MediaMelon and CDN overlays
MediaMelon has launched, with what they call their “video overlay network”.
I haven’t been briefed by the company yet (although I’ve just sent a request for a briefing), but from the press release and the website, it looks like what they’ve got is a client that utilizes multiple CDNs (and other data sources) to pull and assemble segments of video prior to the user watching the content. The company’s website mentions neither board of directors nor management team, though the press release mentions the CEO, Kumar Subramian.
I’ll post more when I have some details about the company and their technology, but I’ll note that I think that software-based CDN overlay networks are going to be a rising trend. As the high-volume video providers increasingly commoditize their CDN purchases, the value-added services layer will move from CDN-provided and CDN-specific, to CDN-neutral software-only components.
When will Google App Engine be ready?
We’ve now hit the six-month mark on Google App Engine. And it’s still in beta. Few of the significant shortcomings in making GAE production-ready for “real applications” have been addressed.
In an internal Gartner discussion this past summer, I wrote:
The restrictions of the GAE sandbox are such that people writing complex, commercial Web 2.0 applications are quickly going to run into things they need and can’t have. Google Apps is required in order to use your own domain. The ability to do network callouts is minimal, which means that integrating with anything that’s not on GAE ranges from limited to impossible (and their URL fetcher can’t even do basic HTTP authentication). Everything has to be spawned via an HTTP request and all such requests must be short-lived, so you cannot run any persistent or cron-started background processes; this is a real killer since you cannot do any background maintenance. Datastore write performance is slow; so are large queries. The intent is that nothing you do is computationally expensive, and this is strictly enforced. You can’t do anything that accesses the filesystem. There’s a low limit to the total number of files allowed, and the largest possible file size is a mere 1 MB (and these limits are independent of the storage limit; you will be able to buy more storage but it looks like you won’t be allowed to buy yourself out of limitations like these). And so on.
Presumably over time Google will lift at least some of these restrictions, but in the near term, it seems unlikely to me that Web 2.0 startups will make commitments to the platform. This is doubly true because Google is entirely in control of what the restrictions will be in the future, too. I would not want to be the CTO in the unpleasant position of having my business depend on the Web 2.0 app my company’s written to the GAE framework, discovering that Google had just changed its mind and decided to enforce tighter restrictions that now prevented my app from working / scaling.
GAE, at least in the near term, suits apps that are highly self-contained, and very modest in scope. This will suit some Web 2.0 start-ups, but not many, in my opinion. GAE has gone for simplicity rather than power, at present, which is great if you are building things in your free time but not so great if you are hoping to be the next MySpace, or even 37Signals (Basecamp).
Add to that the issues about the future of Python. Python 3.0 — the theoretical future of Python — is very different from the 2.x branch. 3.0 support may take a while. So might support for the transition version, 2.6. The controversy over 3.0 has bifurcated the Python community at a time when GAE is actually helping to drive Python adoption, and it leaves developers wondering whether they ought to be thinking about GAE on 2.5 or GAE on 3.0 — or if they can make any kind of commitment to GAE at all with so much uncertainty.
These issues and more have been extensively explored by the blogosphere. The High Scalability blog’s aggregation of the most interesting posts is worth a look from anyone interested in the technical issues that people have found.
Google has been more forthcoming about the quotas and how to deal with them. I’ve made the assumption that quota limitations will eventually be replaced by paid units. The more serious limitations are the ones that are not clearly documented, and have more recently come to light, like the offset limit and the fact that the 1 MB limit doesn’t just apply to files, it also applies to data structures.
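The 1 MB limit is the kind of constraint developers end up coding around. A minimal sketch of the obvious workaround, splitting a large blob into sub-limit chunks and reassembling it on read; the chunk size and function names here are my own illustration, not a GAE API:

```python
# Hypothetical workaround for a per-entity size cap like GAE's 1 MB
# limit: store a large blob as an ordered list of sub-limit chunks.

LIMIT = 1000 * 1000          # the cap we must stay under, in bytes
CHUNK_SIZE = LIMIT - 1024    # leave headroom for entity overhead

def split_blob(blob: bytes) -> list[bytes]:
    """Split a blob into chunks small enough to store individually."""
    return [blob[i:i + CHUNK_SIZE] for i in range(0, len(blob), CHUNK_SIZE)]

def join_chunks(chunks: list[bytes]) -> bytes:
    """Reassemble the original blob from its ordered chunks."""
    return b"".join(chunks)

data = b"x" * (3 * 1000 * 1000)  # a 3 MB payload
chunks = split_blob(data)
assert all(len(c) < LIMIT for c in chunks)
assert join_chunks(chunks) == data
print(len(chunks))  # 4
```

Of course, every such workaround adds code, datastore round-trips, and failure modes that the developer, not the platform, has to manage — which is exactly the sort of friction that makes startups hesitate.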
As this beta progresses, it becomes less and less clear what Google intends to limit as an inherent part of the business goals (and perhaps technical limitations) of the platform, and what they’re simply constraining in order to prevent their currently-free infrastructure from being voraciously gobbled up.
At present, Google App Engine remains a toy. A cool toy, but not something you can run your business on. Amazon, on the other hand, proved from the very beginning that EC2 was not a toy. Google needs to start doing the same, because you can bet that when Microsoft releases their cloud, they will pay attention to making it business-ready from the start.
The nameserver as CDN vantage point
I was just thinking about the nameserver as a vantage point in the Microsoft CDN study, and I remembered that for the CDNs themselves, the nameserver is normally their point of reference to the customer.
When a content provider uses a CDN, they typically use a DNS CNAME to alias a hostname to a hostname of the CDN provider. For instance, http://www.nbc.com maps to http://www.nbc.com.edgesuite.net; the edgesuite.net domain is owned by Akamai. That means that when a DNS resolver goes to try to figure out what the IP address of that hostname is, it’s going to query the CDN’s DNS servers for that answer. The CDN’s DNS server looks at the IP address of the querying nameserver, and tries to return a server that is good for that location.
Notably, the CDN’s DNS server does not know the user’s actual IP. That information is not present in the DNS query (RFC 1035 specifies the structure of queries).
Therefore, what nameserver you use, and its proximity to where you actually are on the network, will determine how good the CDN’s response actually is.
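A toy model makes the mechanism concrete. The CDN's DNS server sees only the resolver's IP, maps it to a location, and returns the POP nearest that location — not nearest the user. The POP names and latency figures below are made-up illustrations:

```python
# Toy model of resolver-based CDN server selection: the CDN picks
# the POP closest to the *resolver*, because the user's own IP is
# not visible in the DNS query. All locations/latencies are invented.

POP_LATENCY_MS = {
    # resolver location -> estimated latency (ms) to each CDN POP
    "new-york":      {"nyc": 2, "dc": 8, "chicago": 18},
    "washington-dc": {"nyc": 8, "dc": 2, "chicago": 16},
}

def pick_pop(resolver_location: str) -> str:
    """Return the POP with the lowest estimated latency to the resolver."""
    latencies = POP_LATENCY_MS[resolver_location]
    return min(latencies, key=latencies.get)

# A DC-area user behind an anycast resolver that lands in New York
# gets the NYC POP -- even though the DC POP is closer to the user.
print(pick_pop("new-york"))       # nyc
print(pick_pop("washington-dc"))  # dc
```

This is exactly the pattern my informal measurements below bear out: the CDN's guess is only as good as the resolver's proximity to the user.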
I did a little bit of testing, which has some interesting results. I’m using a combination of traceroute and IP geolocation to figure out where things are.
At home, I have my servers configured to use the UltraDNS “DNS Advantage” free resolvers. They return their own ad server rather than NXDOMAIN, which is an annoyance, but they are also very fast, and the speed difference makes a noticeable dent in the amount of time that my mail server spends in (SpamAssassin-based) anti-spam processing. But I can also use the nameservers provided to me by MegaPath; these are open-recursive.
UltraDNS appears to use anycast. The DNS server that it picks for me seems to be in New York. And http://www.nbc.com ends up mapping to an Akamai server that’s in New York City, 12 ms away.
MegaPath does not. Using the MegaPath DNS server, which is in the Washington DC area, somewhere near me, http://www.nbc.com ends up mapping to a server that’s directly off the MegaPath network, but which is 18 ms away. (IP geolocation says it’s in DC, but there’s a 13 ms hop between two points in the traceroute, which is either an awfully slow router or more likely, genuine distance.)
Now, let’s take my friend who lives about 20 miles from me and is on Verizon FIOS. Using Verizon’s DC-area nameserver, he gets the IP address of a server that seems to live off Comcast’s local network — and is a mere 6 ms from me.
For Limelight, I’m looking up http://www.dallascowboys.com. From UltraDNS in NYC, I’m getting a Limelight server that’s 14 ms away in NYC. Via MegaPath, I’m getting one in Atlanta, about 21 ms away. And asking my friend what IP address he gets off a Verizon lookup, I get a server here in Washington DC, 7 ms away.
Summing this up in a chart:
My DNS / CDN ping    Akamai    Limelight
UltraDNS             12 ms     14 ms
MegaPath             18 ms     21 ms
Verizon               6 ms      7 ms
The fact that Verizon has local nameservers and the others don’t makes a big difference as to the quality of a CDN’s guess as to what server it ought to be using. Here’s a callout to service providers: Given the increasing amount of content, especially video, now served from CDNs, local DNS infrastructure is now really important to you. Not only will it affect your end-user performance, but it will also affect how much traffic you’re backhauling across your network or across your peers.
On the surface, this might make an argument for server selection via anycast, which is used by some lower-cost CDNs. Since you can’t rely upon a user’s nameserver actually being close to them, it’s possible that the crude BGP metric could return better results than you’d expect. Anycast isn’t going to cut it if you’ve got lots of nodes, but for the many CDNs out there with a handful of nodes, it might not be that bad.
I went looking for other comparables. I was originally interested in Level 3, and dissected http://www.ageofconan.com (because there was a press release indicating an exclusive deal), but from that, discovered Funcom actually uses CacheFly for the website. funcom.cachefly.net returns the same IP no matter where you look it up from (I tried it locally, and from servers I have access to in Colorado and California). But traceroute clearly shows it’s going to different places, indicating an anycast implementation. Locally, I’ve got a CacheFly server a mere 6 ms away. From California, there’s also a local server, 13 ms away. Colorado, unfortunately, uses Chicago, a full 32 ms away. In the end, this doesn’t tell us much, beyond the fact that CacheFly has limited footprint; we’d need to look at a CDN with enough footprint that uses anycast to see whether it actually returns results better than the nameserver method does.
So here’s something for future researchers to explore: How well does resolver location correspond to user location? How much optimization is lost as a result? And how much better or worse would anycast be?
Assessing CDN performance
This is the fourth and probably final post in a series examining the Microsoft CDN study. The three previous posts covered measurement, the blind spots, and availability. This post wraps up with some conclusions.
The bottom line: The Microsoft study is very interesting reading, but it doesn’t provide any useful information about CDN performance in the real world.
The study’s conclusions are flawed to begin with, but what’s of real relevance to purchasers of CDN services is that even if the study’s conclusions were valid, its narrow focus on one element — one-time small-packet latency to the DNS servers and content servers — doesn’t accurately reflect the components of real-world CDN performance.
Cache hit ratios have a tremendous impact upon real-world CDNs. Moreover, the fallback mechanism on a cache miss is also important — does a miss require going back to the origin, or is there a middle tier? This will determine how much performance is impacted by a miss. The nature of your content and the CDN’s architecture will determine what those cache hit ratios look like, especially for long-tail content.
Throughput determines how quickly you get a file, and how well a CDN can sustain a bitrate for video. Throughput is affected by many factors, and can be increased through TCP/IP optimizations. Consistency of throughput also determines what your overall experience is; start-stop behavior caused by jittery performance can readily result in user frustration.
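The arithmetic here is simple but worth making explicit. A quick sketch, with illustrative file sizes and bitrates of my own choosing:

```python
# Back-of-envelope throughput math: how long a download takes, and
# whether a stream's bitrate is sustainable. Values are illustrative.

def download_seconds(size_mb: float, throughput_mbps: float) -> float:
    """Transfer time for a file at a given sustained throughput."""
    return (size_mb * 8) / throughput_mbps  # megabytes -> megabits

def sustainable(throughput_mbps: float, bitrate_mbps: float) -> bool:
    """A stream stalls unless sustained throughput meets the bitrate."""
    return throughput_mbps >= bitrate_mbps

print(download_seconds(100, 5))  # 160.0 -- a 100 MB file at 5 Mbps
print(sustainable(3.0, 1.5))     # True -- 1.5 Mbps video on a 3 Mbps link
print(sustainable(1.0, 1.5))     # False -- this stream will rebuffer
```

Note that it's the *sustained* figure that matters: a link that averages 3 Mbps but periodically dips below the bitrate still produces the start-stop behavior described above.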
More broadly, the problem is that any method of testing CDNs that doesn’t measure from the edge of the network, using real end-user vantage points, is flawed. Keynote and Gomez provide the best approximations on a day to day basis, but they’re only statistical samples. Gomez’s “Actual Experience” service uses an end-user panel, but that introduces uncontrolled variables into the mix if you’re trying to compare CDNs, and it’s still only sampling.
The holy grail of CDN measurement, of course, is seeing performance in real-time — knowing exactly what users are getting at any given instant from any particular geography. But even if a real-time analytics platform existed, you’d still have to try a bunch of different CDNs to know how they’d perform for your particular situation.
Bottom line: If you want to really test a CDN’s performance, and see what it will do for your content and your users, you’ve got to run a trial.
Then, once you’ve done your trials, you’ve got to look at the performance and the cost numbers, and then ask yourself: What is the business value of performance to me? Does better performance drive real value for you? You need to measure more than just the raw performance — you need to look at time spent on your site, conversion rate, basket value, page views, ad views, or whatever it is that tells you how successful your site is. Then you can make an intelligent decision.
In the end, discussions of CDN architecture are academically interesting, and certainly of practical interest to engineers in the field, but if you’re buying CDN services, architecture is only relevant to you insofar as it results in the quality of the user experience. If you’re a buyer, don’t get dragged into the rathole that is debating the merits of one architecture versus another. Look at real-world performance, and think short-term; CDN contract lengths are getting shorter and shorter, and if you’re a high-volume buyer, what you care about is performance right now and maybe in the next year.