Category Archives: Infrastructure

Akamai article in ACM Queue

My colleague Nick Gall pointed out an article in the ACM Queue that I’d missed: Improving Performance on the Internet, by Akamai’s chief scientist, Tom Leighton.

There is certainly some amount of marketing spin in that article, but it is nonetheless a very good read. If you are looking for a primer on why there are CDNs, or are interested in understanding how the application delivery network service works, this is a great article. Even if you’re not interested in CDNs, the section called “Highly Distributed Network Design” has a superb set of principles for fault-tolerant distributed systems, which I’ll quote here:


  1. Ensure significant redundancy in all systems to facilitate failover.
  2. Use software logic to provide message reliability.
  3. Use distributed control for coordination.
  4. Fail cleanly and restart.
  5. Phase software releases.
  6. Notice and proactively quarantine faults.

One niggle: The article says, “The top 30 networks combined deliver only 50 percent of end-user traffic, and it drops off quickly from there, with a very-long-tail distribution over the Internet’s 13,000 networks.” That statement needs a very important piece of context: the fact that most of those networks do not belong to network operators (i.e., carriers, cable companies, etc.). Many of them are simply “autonomous systems” (in Internet parlance) owned by enterprises, Web hosters, and so forth. That’s why the top 30 account for so much of the traffic, and that percentage would be sharply higher if you also counted the enterprises that buy transit from them. (Those interested in looking at data to do a deeper dive should check out the Routing Report site.)


COBOL comes to the cloud

In this year of super-tight IT budgets and focus on stretching what you’ve got rather than replacing it with something new, Micro Focus is bringing COBOL to the cloud.

Most vendor “support for EC2” announcements are nothing more than hype. Amazon’s EC2 is a Xen-virtualized environment. It supports the operating systems that run in that environment; most customers use Linux. Applications run no differently there than they do in your own internal data center. There’s no magical conveyance of cloud traits. Same old app, externally hosted in an environment with some restrictions.

But Micro Focus (which is focused on COBOL-based products) is actually launching its own cloud service, built on top of partner clouds — EC2, as well as Microsoft’s Azure (previously announced).

Micro Focus has also said it has tweaked its runtime for cloud deployment. They give the example of storing VSAM files as blobs in SQL. This is undoubtedly due to Azure not offering direct access to the filesystem. (For EC2, you can get persistent normal file storage with EBS, but there are restrictions.) I assume that similar tweaks were made wherever the runtime needs to do direct file I/O. Note that this still doesn’t magically convey cloud traits, though.
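
As a rough illustration of the general pattern (not Micro Focus’s actual implementation), here’s a minimal Python sketch of keeping file contents as blobs in a SQL table instead of on a local filesystem; sqlite3 and the table and record names below are stand-ins chosen purely for illustration.

```python
# Minimal sketch of the "files as blobs in SQL" pattern. sqlite3 stands in
# for whatever SQL database the runtime actually targets; the table and
# record names are invented for illustration.
import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS vsam_files (name TEXT PRIMARY KEY, data BLOB)")

def put_file(name, payload):
    # Store raw bytes as a blob rather than writing to a local filesystem.
    conn.execute("INSERT OR REPLACE INTO vsam_files (name, data) VALUES (?, ?)",
                 (name, sqlite3.Binary(payload)))
    conn.commit()

def get_file(name):
    row = conn.execute("SELECT data FROM vsam_files WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None

put_file("CUSTOMER.DAT", b"...record data...")
print(len(get_file("CUSTOMER.DAT")))
```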

It’s interesting to see that Micro Focus has built its own management console around EC2, providing easy deployment of apps based on their technology, and is apparently making a commitment to providing this kind of hosted environment. Amidst all of the burgeoning interest in next-generation technologies, it’s useful to remember that most enterprises have a heavy burden of legacy technologies.

(Disclaimer: My husband was founder and CTO of LegacyJ, a Micro Focus competitor, whose products allow COBOL, including CICS apps, to be deployed within standard J2EE environments — which would include clouds. He doesn’t work there any longer, but I figured I should note the personal interest.)


More cloud news

Another news round-up, themed around “competitive fracas”.

Joyent buys Reasonably Smart. Cloud hoster Joyent has picked up Reasonably Smart, a tiny start-up with an APaaS offering based, unusually enough, on JavaScript and the Git version-control system. GigaOM has an analysis; I’ll probably post my take later, once I get a better idea of exactly what Reasonably Smart does.

DreamHost offers free hosting. DreamHost — one of the more prominent, popular mass-market and SMB hosting providers — is now offering free hosting for certain applications, including WordPress, Drupal, MediaWiki, and PhpBB. There are a limited number of beta invites out there, and DreamHost notes that the service may become $50/year later. (The normal DreamHost base plan is $6/month.) Increasingly, shared hosting companies are having to compete with free application-specific hosting services like WordPress.com and Wikidot, and they’re facing the looming spectre of giants like Google giving away cloud capacity for free. And shared hosting is a cutthroat market already. So, here’s another marketing salvo being fired.

Google goes after Microsoft. Google has announced it’s hiring a sales force to pitch the Premier Edition of Google Apps to customers who are traditionally Microsoft customers. I’d expect the two key spaces where they’ll compete are in email and collaboration, going after the Exchange and Sharepoint base.


Cloud debate: GUI vs. CLI and API

In the greater blogosphere, as well as amongst the cloud analysts across the various research firms, there’s been an ongoing debate over the question, “Does a cloud have to have an API to be a cloud?”

Going beyond that question, though, there are two camps of cloud users emerging — those who prefer the GUI (control panel) approach to controlling their cloud, and those who prefer command-line interfaces and/or APIs. These two camps can probably be classified into the automated and the automators — those users who want easy access to pre-packaged automation, and those users who want to write automation of their own.

This distinction has long existed in the systems administration community — the split between those who rely on the administrator GUIs to do things, vs. those who do everything via the command line, editing config files, and their own scripts. But the advent of cloud computing and associated tools, with their relentless drive towards standardization and automation, is casting these preferences into an increasingly stark light. Moreover, the emerging body of highly sophisticated commercial tools for cloud management (virtual data center orchestration and everything that surrounds it) means that in the future, even those more sophisticated IT operations folks who are normally self-reliant will end up taking advantage of those tools rather than writing stuff from scratch. That suggests that tools will also follow two paths — there will be tools that are designed to be customized via GUI, and tools that are readily decomposable into scriptable components and/or provide APIs.

I’ve previously asserted that cloud drives a skills shift in IT operations personnel, creating a major skills chasm between those who use tools, and those who write tools.

The emerging cloud infrastructure services seem to be pursuing one of two initial paths — exposure via API and thus highly scriptable by the knowledgeable (e.g., Amazon Web Services), and friendly control panel (e.g., Rackspace’s Mosso). While I’d expect that most public clouds will eventually offer both, I expect that both services and do-it-yourself cloud software will tend to emphasize capabilities one way or another, focusing on either the point-and-click crowd or the systems programmers.
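
To make the distinction concrete, here is the kind of trivial glue the API camp writes all the time: a few lines against EC2’s API using the boto library. This is just an illustration, and it assumes AWS credentials are already set in the environment.

```python
# Minimal sketch of the "automator" approach: enumerate EC2 instances from
# a script via the boto library instead of a control panel. Assumes AWS
# credentials are already set in the environment.
import boto

conn = boto.connect_ec2()  # reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY

for reservation in conn.get_all_instances():
    for instance in reservation.instances:
        print("%s  %-10s  %s" % (instance.id, instance.state, instance.dns_name))
```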

(A humorous take on this, via an old Craigslist posting: Keep the shell people alive.)


Touring Amazon’s management console

The newly-released beta of Amazon’s management console is reasonably friendly, but it is not going to let your grandma run her own data center.

I took a bit of a tour today. I’m running Firefox 3 on a Windows laptop, but everything else I’m doing out of a Unix shell — I have Linux and MacOS X servers at home. I already had AWS stuff set up prior to trying this out; I’ve previously used RightScale to get a Web interface to AWS.

The EC2 dashboard starts with a big friendly “Launch instances” button. Click it, and it takes you to a three-tab window for picking an AMI (your server image). There’s a tab for Amazon’s images, one for your own, and one for the community’s (which includes a search function). After playing around with the search a bit (and wishing that every community image came with an actual blurb of what it is), and not finding a Django image that I wanted to use, I decided to install Amazon’s Ruby on Rails stack.

On further experience, the “Select” buttons on this set of tabs seem to have some weird issues: sometimes they’re grayed out and unclickable, and sometimes you’ll click one and it will go gray without ever showing the little “Loading, please wait” box that normally precedes the next tab — leaving you stuck until you cancel the window and try again.

Once you select an image, you’re prompted to select how many instances you want to launch, your instance type, key pair (necessary to SSH into your server), and a security group (firewall config). More twiddly bits, like the availability zone, are hidden in advanced options. Pick your options, click “Launch”, and you’re good to go.
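
For the automators, the rough boto equivalent of that launch dialog looks like the sketch below; the AMI ID, key pair, and security group names are placeholders.

```python
# Rough boto equivalent of the console's launch dialog: image, count,
# instance type, key pair, security group, and (optionally) the
# availability zone. The AMI ID and names below are placeholders.
import boto

conn = boto.connect_ec2()
reservation = conn.run_instances(
    "ami-12345678",            # the AMI chosen from the image tabs
    min_count=1, max_count=1,  # how many instances to launch
    instance_type="m1.small",
    key_name="my-keypair",     # needed to SSH in later
    security_groups=["default"],
    placement="us-east-1b",    # the "advanced" availability-zone option
)
instance = reservation.instances[0]
print("launched %s in %s" % (instance.id, instance.placement))
```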

From the launch window, your options for the firewall default to having a handful of relevant ports (like SSH, webserver, MySQL) open to the world. You can’t make the rules any more granular there; you’ve got to use the Security Group config panel to add a custom rule. I wish that the defaults were slightly stricter, like limiting the MySQL port to Amazon’s back-end.
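
For what it’s worth, the API does let you build tighter rules than the launch-window defaults; a hedged boto sketch follows (the group names are hypothetical, and whether group-scoped rules accept port ranges depends on the EC2 generation).

```python
# Sketch of tightening firewall rules via the API rather than the console:
# SSH and HTTP open to the world, MySQL restricted to another security group
# (e.g., your own front-end servers). Group names are hypothetical, and
# whether group-scoped rules accept port ranges depends on the EC2 generation.
import boto

conn = boto.connect_ec2()
web = conn.create_security_group("web-tier", "front-end web servers")
db = conn.create_security_group("db-tier", "MySQL, reachable only from web-tier")

web.authorize(ip_protocol="tcp", from_port=22, to_port=22, cidr_ip="0.0.0.0/0")
web.authorize(ip_protocol="tcp", from_port=80, to_port=80, cidr_ip="0.0.0.0/0")

# Rather than exposing 3306 to the world, allow it only from the web-tier group.
db.authorize(ip_protocol="tcp", from_port=3306, to_port=3306, src_group=web)
```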

Next, I went to create an EBS volume for user data. This, too, is simple, although initially I did something stupid, failing to notice that my instance had launched in us-east-1b. (Your EBS volume must reside in the same availability zone as your instance, in order for the instance to mount it.)
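
The API version of this step makes the zone constraint explicit, since you pass an availability zone when you create the volume and it has to match the instance’s. A minimal boto sketch, with a placeholder instance ID and size:

```python
# Create an EBS volume in the same availability zone as the instance, then
# attach it as a block device. The instance ID and size are placeholders.
import boto

conn = boto.connect_ec2()
reservation = conn.get_all_instances(instance_ids=["i-12345678"])[0]
instance = reservation.instances[0]

volume = conn.create_volume(10, instance.placement)   # 10 GB, matching zone
conn.attach_volume(volume.id, instance.id, "/dev/sdf")
```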

That’s when I found the next interface quirk — the second time I went to create an EBS volume, the interface continued to insist for fifteen minutes that it was still creating the volume. Normally there’s a very nice Ajax bit that automatically updates the interface when it’s done, but this time, even clicking around the whole management console and trying to come back wouldn’t get it to update the status and thus allow me to attach it to my instance. I had to close out the Firefox tab, and relaunch the console.

Then I remembered that the default key pair I’d created had been set up via RightScale, and I couldn’t remember where I’d stashed the PEM credentials. So that led me to a round of creating a new key pair via the management console (very easy), and having to terminate and launch a new instance using the new key pair (subject to the previously-mentioned interface quirks).

The same interface-gets-stuck-in-an-indeterminate-state problem also seems to affect other things, like the console “Output” button for instances — you get a blank screen rather than the console dump.

That all dealt with, I log into my server via SSH, don’t see the EBS volume mounted, and remember that I need to actually make a filesystem and explicitly mount it. All creating an EBS volume does is allocate you an abstraction on Amazon’s SAN, essentially. This leads me to trying to find documentation for EBS, which leads to the reminder that access to docs on AWS is terrible. The search function on the site doesn’t index articles, and there are far too many articles to just click through the list looking for what you want. A Google search is really the only reasonable way to find things.
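
For the record, the missing step is just a couple of commands run on the instance itself; here is a sketch (the device name and mount point are assumptions carried over from the attach step, and the filesystem choice is mine).

```python
# On the instance itself, as root: an attached EBS volume is just a raw block
# device until you make a filesystem on it and mount it. The device name and
# mount point are assumptions carried over from the attach step.
import subprocess

device = "/dev/sdf"
mount_point = "/mnt/data"

subprocess.check_call(["mkfs.ext3", device])           # one-time: create the filesystem
subprocess.check_call(["mkdir", "-p", mount_point])
subprocess.check_call(["mount", device, mount_point])  # now it's usable storage
```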

All that aside, once I do that, I have an entirely functional server. I terminate the instance, check out my account, see that this little experiment has cost me 33 cents, and feel reasonably satisfied with the world.


News round-up

A handful of quick news-ish takes:

Amazon has released the beta of its EC2 management console. This brings point-and-click friendliness to Amazon’s cloud infrastructure service. A quick glance through the interface makes it clear that effort was made to make it easy to use, beginning with big colorful buttons. My expectation is that a lot of the users who might otherwise have gone to RightScale et al. to get an easy-to-use GUI will now just stick with Amazon’s own console. Most of those users would only ever have used RightScale’s free service, but there’s probably a percentage that would otherwise have been upsold who will now stick with what Amazon offers.

Verizon is courting CDN customers with the “Partner Port Program”. It sounds like this is a “buy transit from us over a direct peer” service — essentially becoming explicit about settlement-based content peering with content owners and CDNs. I imagine Verizon is seeing plenty of content dumped onto its network by low-cost transit providers like Level 3 and Cogent; by publicly offering lower prices and encouraging content providers to seek paid peering with it, it can grab some revenue and improve performance for its broadband users.

Scott Cleland blogged about the “open Internet” panel at CES. To sum up, he seems to think that the conversation is now being dominated by the commercially-minded proponents. That would certainly seem to be in line with Verizon’s move, which essentially implies that they’re resigning themselves to the current peering ecosystem and are going to compete directly for traffic rather than whining that the system is unfair (always disingenuous, given ILEC and MSO complicity in creating the current circumstances of that ecosystem).

I view arrangements that are reasonable from a financial and engineering standpoint, and that do not seek to discriminate based on the nature of the actual content, as the most positive interpretation of network neutrality. And so I’ll conclude by noting that I heard an interesting briefing today from Anagran, a hardware vendor offering flow-based traffic management (i.e., it doesn’t care what you’re doing, it’s just managing congestion). It’s being positioned as an alternative or supplement to Sandvine and the like, offering a way to try to keep P2P traffic manageable without having to do deep-packet inspection (and thus explicit discrimination).


Scaling limits and friendly failure

I’m on vacation, and I’ve been playing World of Goo (possibly the single best construction puzzle game since 1991’s Lemmings by Psygnosis). I was reading the company’s blog (2D Boy) when I came across an entry about BlueHost’s no-notice termination of 2D Boy’s hosting.

And that got me thinking about “unlimited” hosting plans, throttling, limits, and the other challenges of running mass-market hosting — all issues also directly applicable to cloud computing.

BlueHost is a large and reputable provider of mass-market shared hosting. Their accounts are “unlimited”, and their terms of service essentially say that you can consume resources until you negatively impact other customers.

Now, in practice there are limits, and customers are sort of expected to know whether or not their needs fit shared hosting. Most people plan accordingly — although there have been some spectacular failures to do so, such as Sk*rt, a female-focused Digg competitor launched using BlueHost, prompting vast wonder at what kind of utter lack of thought results in trying to launch a high-traffic social networking site on a $7 hosting plan. Unlike Sk*rt, though, it was reasonable for 2D Boy to expect that shared hosting would cover their needs — hosting a small corporate site and blog. They were two guys making an indie garage game, seeing a gradual traffic ramp thanks to word of mouth, not an Internet company doing a big launch.

Limits are necessary, but no-notice termination of a legitimate company is bad customer service, however you slice it. Moreover, it’s avoidable bad customer service. Whatever mechanism is used to throttle, suspend service, etc. ought to be adaptable to sending out a warning alert: the “hey, if you keep doing this, you will be in violation of our policies and we’ll have to terminate you” note. Maybe even a, “hey, we will continue to serve your traffic for $X extra, and you have Y time to find a new host or reduce your traffic to normal volumes”. BlueHost does not sell anything beyond its $7 plan, so it has no upsell path; a provider with an upgrade path would hopefully have tried to encourage a migration, rather than executing a cold-turkey cut-off. (By the way, I have been on the service provider side of this equation, so I have ample sympathy for the vendor’s position against a customer whose usage is excessive, but I also firmly believe that no-notice termination of legitimate businesses is not the way to go.)

Automated elastic scaling is the key feature of a cloud; consequently, limits, and the way they’re enforced technically and managed from a customer service standpoint, will be one of the ways that cloud infrastructure providers differentiate their services.

A vendor’s approach to limits has to be tied to their business goals. Similarly, what a customer desires out of limits must also be tied to their business goals. The customer wants reliable service within a budget. The vendor wants to be fairly compensated and ensure that his infrastructure remains stable.

Ideally, on cloud infrastructure, a customer scales seamlessly and automatically until the point where he is in danger of exceeding his budget. At that point, the system should alert him automatically, allowing him to increase his budget. If he doesn’t want to pay more, he will experience degraded service; degradation should mean slower or lower-priority service, or an automatic switch to a “lite” site, rather than outright failure.
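
As a toy sketch of what that would look like (my own illustration, not any provider’s actual mechanism): check the customer’s spend against their budget per request, warn as they approach it, and degrade to a cheaper response rather than failing outright.

```python
# Toy sketch of "fail friendly": warn as spend approaches the budget, then
# degrade to a cheaper "lite" response instead of returning errors. The
# budget figures and handler shape are invented for illustration.

BUDGET = 100.00          # customer-set monthly budget, in dollars
ALERT_THRESHOLD = 0.80   # warn at 80% of budget

def notify_customer(message):        # stub: wire up to real alerting
    print(message)

def render_full_site(request):       # stubs standing in for real rendering
    return "full page"

def render_lite_site(request):
    return "lite page"

def render_static_apology(request):
    return "sorry page"

def handle_request(request, spend_so_far, cost_full, cost_lite):
    if spend_so_far > ALERT_THRESHOLD * BUDGET:
        notify_customer("You have used %.0f%% of your budget."
                        % (100.0 * spend_so_far / BUDGET))
    if spend_so_far + cost_full <= BUDGET:
        return render_full_site(request)     # normal service
    if spend_so_far + cost_lite <= BUDGET:
        return render_lite_site(request)     # degraded, but not down
    return render_static_apology(request)    # last resort: still not a bare 403

print(handle_request(None, spend_so_far=85.0, cost_full=20.0, cost_lite=0.5))
```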

Perhaps when you get right down to it, it’s really about what the failure mode is. Fail friendly. A vendor has a lot more flexibility in imposing limits if it can manage that.


Pricing transparency and CDNs

It is possible that I am going to turn out to be mildly wrong about something. I predicted that neither Amazon’s CloudFront CDN nor the comparable Rackspace/Limelight offering (Mosso Cloud Files) would really impact the mainstream CDN market. I am no longer as certain that’s going to be the case, as it appears that behavioral economics play into these decisions more than one might expect. The impact is subtle, but I think it’s there.

I’m not talking about the giant video deals, mind you; those guys already get prices well below those of the cloud CDNs. I’m talking about the classic bread-and-butter of the CDN market: the e-commerce and enterprise customers, significant B2B and B2C brands that have traditionally been Akamai loyalists or have scattered among smaller players like Mirror Image.

Simply put, the cloud CDNs put indirect pressure on mainstream CDN prices, and will absorb some new mainstream (enterprise but low-volume) clients, for a simple reason: Their pricing is transparent. $0.22/GB for Rackspace/Limelight. $0.20/GB for SoftLayer/Internap. $0.17/GB for Amazon CloudFront. And so on.

Transparent pricing forces people to rationalize what they’re buying. If I can buy Limelight service on zero commit for $0.22/GB, there’s a fair chance that I’m going to start wondering just what exactly Akamai is giving me that’s worth paying $2.50/GB for on a multi-TB commit. Now, the answer to that might be, “DSA Secure that speeds up my global e-commerce transactions and is invaluable to my business”, but that answer might also be, “The same old basic static caching I’ve been doing forever and have been blindly signing renewals for.” It is going to get me to wonder things like, “What are the actual competitive costs of the services I am using?” and, “What is the business value of what I’m buying?” It might not alter what people buy, but it will certainly alter their perception of value.
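
To put rough numbers on that rationalization exercise (my own back-of-the-envelope; the 5 TB monthly volume is an assumption, and the per-GB prices are the ones quoted above):

```python
# Back-of-the-envelope comparison using the transparent prices quoted above;
# the 5 TB/month delivery volume is an assumption for illustration only.
monthly_gb = 5 * 1000              # roughly 5 TB of delivery per month

cloud_cdn = 0.22 * monthly_gb      # Rackspace/Limelight at $0.22/GB, zero commit
legacy_quote = 2.50 * monthly_gb   # the hypothetical $2.50/GB committed contract

print("Cloud CDN:  $%.2f/month" % cloud_cdn)      # $1100.00
print("Legacy CDN: $%.2f/month" % legacy_quote)   # $12500.00
print("Premium:    %.1fx" % (legacy_quote / cloud_cdn))
```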

Since grim October, businesses have really cared about what things cost and what benefit they’re getting out of them. Transparent pricing really amps up the scrutiny, as I’m discovering as I talk to clients about CDN services. And remember that people can be predictably irrational.

While I’m on the topic of cloud CDNs: There have been two recent sets of public performance measurements for Rackspace (Mosso) Cloud Files on Limelight. One is part of a review by Matthew Sacks, and the other is Rackspace’s own posting of Gomez metrics comparing Cloud Files with Amazon CloudFront. The Limelight performance is, unsurprisingly, overwhelmingly better.

What I haven’t seen yet is a direct performance comparison of regular Limelight and Rackspace+Limelight. The footprint appears to be the same, but differences in cache hit ratios (likely, given that stuff on Cloud Files will likely get fewer eyeballs) and the like will create performance differences on a practical level. I assume it creates no differences for testing purposes, though (i.e., the usual “put a 10k file on two CDNs”), unless Limelight prioritizes Cloud Files requests differently.
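
If someone wanted to run the crude version of that test themselves, it is not much more than timing fetches of the same small object from each hostname. A minimal sketch follows; the URLs are placeholders, and a meaningful comparison needs many geographically distributed agents (which is what Gomez provides), not a single vantage point.

```python
# Crude single-vantage-point version of the "same 10k object on two CDNs"
# test. The URLs are placeholders; a meaningful comparison needs many
# geographically distributed agents and repeated runs over time.
import time
import urllib2

URLS = {
    "limelight-direct": "http://cdn.example.com/test/10k.jpg",
    "cloud-files":      "http://cloudfiles-cdn.example.com/test/10k.jpg",
}

for name, url in URLS.items():
    samples = []
    for _ in range(10):
        start = time.time()
        urllib2.urlopen(url).read()
        samples.append(time.time() - start)
    median = sorted(samples)[len(samples) // 2]
    print("%-18s median %.1f ms over %d fetches" % (name, median * 1000, len(samples)))
```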


Google’s pricing for App Engine

Google made a number of App Engine-related announcements earlier this week. The most notable of these was a preview of the future paid service, which allows you to extend App Engine’s quotas. Google has previously hinted at pricing, and at their developer conference this past May, they asserted that effectively, the first 5 MPV (million page views) are free, and thereafter, it’d be about $40 per MPV.

The problem is not the price. It’s the way that the quotas are structured. Basically, it looks like Google is going to allow you to raise the quota caps, paying for however much you go over, but never to exceed the actual limit that you set. That means Google is committing itself to a quota model, not backing away from it.

Let me explain why quotas suck as a way to run your business.

Basically, the way App Engine’s quotas work is like this: As you begin to approach the limit (currently Google-set, but eventually set by you), Google will start denying requests. If you’re reaching the limit of a metered API call, when your app tries to make that call, Google will return an exception, which your app can catch and handle; inelegant, but at least something you can present to the user as a handled error. However, if you’re reaching a more fundamental limit, like bandwidth, Google will begin answering page requests with a 403 HTTP status code. A 403 prevents your user from getting the page at all, and there’s no elegant way to handle it in App Engine (no custom error pages).
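
For the metered-API case, the catchable failure looks roughly like the sketch below in the Python runtime; the handler and model are invented for illustration, and the point is simply that an over-quota API call surfaces as an exception your code sees, while the bandwidth 403 never reaches your code at all.

```python
# Sketch of handling an over-quota API call in App Engine's Python runtime.
# The handler and model are invented; OverQuotaError is the runtime's own
# exception for metered calls. The 403 bandwidth case never reaches this code.
from google.appengine.ext import db, webapp
from google.appengine.runtime import apiproxy_errors

class GuestbookEntry(db.Model):
    content = db.StringProperty()

class SignHandler(webapp.RequestHandler):
    def post(self):
        try:
            GuestbookEntry(content=self.request.get("content")).put()
            self.response.out.write("Thanks!")
        except apiproxy_errors.OverQuotaError:
            # Inelegant, but at least a handled error you can present:
            self.response.set_status(503)
            self.response.out.write("We're over capacity; please try again later.")
```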

As you approach quota, Google tries to budget your requests so that only some of them fail. If you get a traffic spike, it’ll drop some of those requests so that it still has quota left to serve traffic later. (Steve Jones’ SOA blog chronicles quite a bit of empirical testing, for those who want to see what this “throttling” looks like in practice.)

The problem is, now you’ve got what are essentially random failures of your application. If you’ve got failing API calls, you’ve got to handle the error and your users will probably try again — exacerbating your quota problem and creating an application headache. (For instance, what if I have to make two database API calls to commit data from an operation, and the first succeeds but the second fails? Now I have data inconsistency, and thanks to API calls continuing to fail, quite possibly no way to fix it. Google’s Datastore transactions are restricted to operations on the same entity group, so transactions will not deal with all such problems.) Worse still, if you’ve got 403 errors, your site is functionally down, and your users are getting a mysterious error. As someone who has a business online, do you really want, under circumstances of heavy traffic, your site essentially failing randomly?
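
To make the entity-group caveat concrete (the models here are my own illustration): two writes can be made atomic only if they share an entity group, so failure patterns that span unrelated entities cannot be papered over with a transaction.

```python
# Sketch of the entity-group restriction on Datastore transactions. Model
# names are invented. Writes in the SAME entity group can be made atomic;
# unrelated root entities cannot be combined in one transaction, so a
# mid-operation failure between separate calls can strand your data.
from google.appengine.ext import db

class Account(db.Model):
    balance = db.IntegerProperty(default=0)

class AuditEntry(db.Model):
    note = db.StringProperty()

def credit_with_audit(account_key, amount):
    # Both writes share an entity group (the audit entry's parent is the
    # account), so they commit or fail together inside the transaction.
    account = db.get(account_key)
    account.balance += amount
    AuditEntry(parent=account, note="credit %d" % amount).put()
    account.put()

account_key = Account(balance=100).put()
db.run_in_transaction(credit_with_audit, account_key, 10)

# Trying the same thing across two unrelated root entities inside
# run_in_transaction raises BadRequestError, which is why an over-quota
# failure between two separate, untransacted calls can leave them inconsistent.
```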

Well, one might counter, if you don’t want that to happen, just set your quota limits really really high — high enough that you never expect a request to fail. The problem with that, though, is that if you do it, you have no way to predict what your costs actually will be, or to throttle high traffic in a more reasonable way.

If you’re on traditional computing infrastructure, or, say, a cloud like Amazon EC2, you decide how many servers to provision. Chances are that under heavy traffic, your site performance would degrade — but you would not get random failures. And you would certainly not get random failures outside of the window of heavy traffic. The quota system in use by Google means that you could get past the spike, have enough quota left to serve traffic for most of the rest of the day, but still cross the over-quota-random-drop threshold later in the day. You’d have to micro-manage (temporarily adjusting your allowable quota after a traffic spike, say) or just accept a chance of failure. Either way, it is a terrible way to operate.

This is yet another example of how Google App Engine is not and will not be near-term ready for prime-time, and how more broadly, Google is continuing to fail to understand the basic operational needs of people who run their businesses online. It’s not just risk-averse enterprises who can’t use something with this kind of problem. It’s the start-ups, too. Amazon has set a very high bar for reliability and understanding of what you need to run a business online, and Google is devoting lots of misdirected technical acumen to implementing something that doesn’t hit the mark.


Aflexi, a new CDN aggregator

Aflexi has announced its launch, which is slated for January of 2009.

Aflexi is a CDN aggregator, targeting small Web hosters much in the same way that Velocix’s Metro product targets broadband providers. (What’s old is new again: remember Content Bridge and CDN peering, a hot idea back in 2001?)

Here’s how it works: Aflexi operates a marketplace and CDN routing infrastructure (i.e., the DNS-based brain servers that tell an end-user client which server to pull content from), and supplies Linux-based CDN server software.

Web hosters can pay a nominal fee of $150 to register with Aflexi, granting them the right to deploy unlimited copies of Aflexi’s CDN server software. (Aflexi is recommending a minimum of a dual-core server with 4 GB of RAM and 20-30 GB of storage, for these cache servers. That is pretty much “any old hardware you have lying around.”) A hoster can put these servers wherever he likes, and is responsible for their connectivity and so forth. The Web hoster then registers his footprint, desired price for delivering a GB of traffic, and any content restrictions (like “no adult content”) on Aflexi’s marketplace portal.

Content owners can come to the portal to shop for CDN services. If they’re going through one of Aflexi’s hosting partners, they may be limited in their choices, at the hoster’s discretion. The content owner chooses which CDNs he wants to aggregate. Then, he can simply go live; Aflexi will serve his content only over the CDNs he’s chosen. Currently, the content routing is based upon the usual CDN performance metrics; Aflexi plans to offer price/performance routing late next year. Aflexi takes a royalty of 0.8 cents per GB (thus, under a penny); the remainder of the delivery fee goes to whatever hoster served a particular piece of content. Customers will typically be billed through their hoster; Aflexi integrates with the Parallels control panel (they’re packaging in the APS format).
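
As a toy sketch of the kind of decision the routing brain has to make once price enters the picture (this is my own illustration of generic price/performance selection, not Aflexi’s algorithm, and all the numbers except the 0.8-cent royalty are invented):

```python
# Toy illustration of price/performance selection, NOT Aflexi's actual
# routing logic. Each candidate hoster advertises a per-GB delivery price
# and has a measured latency to the requesting user; the weights and
# normalizers are arbitrary. The 0.8-cent royalty is the figure Aflexi quotes.
AFLEXI_ROYALTY_PER_GB = 0.008

candidates = [
    {"hoster": "hoster-a", "price_per_gb": 0.05, "latency_ms": 40},
    {"hoster": "hoster-b", "price_per_gb": 0.03, "latency_ms": 90},
    {"hoster": "hoster-c", "price_per_gb": 0.08, "latency_ms": 25},
]

def score(c, price_weight=0.5):
    # Lower is better: blend roughly normalized price and latency.
    return (price_weight * c["price_per_gb"] / 0.10
            + (1 - price_weight) * c["latency_ms"] / 100.0)

best = min(candidates, key=score)
payout = best["price_per_gb"] - AFLEXI_ROYALTY_PER_GB
print("route to %s; hoster keeps $%.3f of its $%.3f/GB price"
      % (best["hoster"], payout, best["price_per_gb"]))
```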

Broadly, although the idea of aggregation isn’t new, the marketplace is an interesting take on it. This kind of federated model raises significant challenges in terms of business concerns — the ability to offer an SLA across a diversified base, and ensuring that content is not tampered with, are likely at the forefront of those concerns. Also, a $150 barrier to entry is essentially negligible, which means there will have to be some strenuous efforts to keep out bad actors in the ecosystem.

Aflexi sees the future of the CDN market as being hosters. I disagree, given that most hosters don’t own networks. However, I do believe that hosting and CDN are natural matches from a product standpoint, and that hosters need to have some form of CDN strategy. It’s clear that Aflexi wants to target small Web hosters and their small-business customers. They’re going to occupy a distinct niche, but I wonder how well that approach will hold up against Rackspace-plus-Limelight and Amazon’s CloudFront, which have solid credibility and are targeted at small customers. But the existence of Aflexi will offer small hosters a CDN option beyond pure resale.

Aflexi says its initial launch hosters will include ThePlanet. That in and of itself is an interesting tidbit, as ThePlanet (which is one of the largest providers of simple dedicated hosting in the world) currently resells EdgeCast.

One more odd little tidbit: The CEO is Whei Meng Wong, previously of UltraUnix, but also, apparently, previously of an interesting SpamHaus ROKSO record (designating a hoster who is a spam haven — willing to host the sites that spammers advertise). Assuming that it’s the same person, which it appears to be, that reputation could have significant effects upon Aflexi’s ability to attract legitimate customers — either hosters or content owners.

The company is funded through a Malaysian government grant. The CTO is Wai-Keen Woon; the VP of Engineering is Yuen-Chi Lian. Neither of them appears to have executive experience, or indeed, much experience period — the CTO’s Facebook profile says he’s an ’07 university graduate. The CEO’s blog seems to indicate he is also an ’07 graduate. So this is apparently a fresh-out-of-college group-of-buddies company — notably, without either a Sales or Marketing executive that they deemed worth mentioning in their launch presentation.

Bottom line, though: This is another example of CDN services moving up a level towards software overlays. The next generation of providers own software infrastructure and the CDN routing brain, but don’t deploy a bunch of servers and network capacity themselves.
