Monthly Archives: April 2011

Huawei’s cloud computing ambitions

I was recently in China, visiting Gartner clients and prospects in Beijing and Shanghai, and attending Huawei’s analyst summit.

Why Huawei? I am an analyst covering services, after all, not equipment. But the hardware and software vendors that enable the cloud (the entirety of the ecosystem, so to speak) have become more and more important to my coverage, particularly as service providers try to figure out what technology they should use in their cloud. And so, when Huawei extended an invitation to come hear about their plans in cloud computing, I agreed to fly halfway around the world to listen.

For those of you who are not acquainted with Huawei, they’re a roughly $23B networking equipment manufacturer and the largest supplier of telecom operator gear in the world, having recently surpassed Ericsson for that position.

At their analyst summit, Huawei announced a number of grand ambitions — to become one of the major global device manufacturers (Huawei’s phones and tablets are based on Android), to aggressively grow into the enterprise networking equipment business, and to become an all-in-one solutions provider for cloud computing. That includes Huawei’s product portfolio of modular and container-based data center solutions, servers, storage (via the Huawei-Symantec JV), data center and wide-area networking equipment, and the “cloud stack” of software needed to offer cloud IaaS, whether for an enterprise building a private cloud, or a service provider building a highly scalable public cloud. It also includes a suite of content delivery network (CDN) enablement solutions, targeted at network operators, and integrated into the cloud offering.

This is obviously a grandly ambitious plan, considering that it comes from a vendor that most non-carriers have never heard of, and which faces considerable prejudice in the United States. (Huawei recently got into a kerfuffle over its acquisition of the assets of 3Leaf Systems. The US government recommended, on national security grounds, rejecting Huawei’s purchase of the patent portfolio and its hiring of some people out of a defunct Silicon Valley start-up with no customers to speak of: a ridiculous tempest in a teapot if there ever was one.)

Huawei will join HP, IBM, Dell, and the VCE coalition, among others, in competing to deliver turnkey cloud infrastructure solutions. The product portfolio they claim to have, and the product portfolio they’re developing, are both highly ambitious and R&D-driven. The technical problems that Huawei is tackling are genuinely difficult, though, so due caution needs to be exercised: there is no proof yet that Huawei’s cloud technology scales as claimed.

Huawei will also face a significant barrier in the United States, given the political climate and suspicions about Huawei’s ties to the Chinese government, particularly the military and state security apparatus. This gets us back to the demand of cloud customers to know the underlying components of their solutions. If a service provider chooses to build on Huawei’s technology, will customers trust the solution?

Still, it’s an interesting entry into the cloud-building market from what to me, at least, was an unexpected quarter, and Huawei will clearly be a company to watch going forward — they have a track record of aggressive revenue growth, and plenty of money to throw at R&D.

What CenturyLink is Getting with Savvis

I scribbled off a quick blog post on the CenturyLink acquisition of Savvis but didn’t have time to delve into it in detail at the time. This is a bit of a follow-up.

Savvis has three core businesses:

  • Colocation: Savvis has carrier-diverse (though not, strictly speaking, carrier-neutral), high-quality colocation in data centers around the world. It is one of the most significant players in retail colocation for enterprises. It also has a substantial financial vertical play in its proximity hosting for low-latency trading.
  • Managed Hosting: Savvis is among the market share leaders in managed hosting. It is a highly capable provider, ranked for years as a Leader (and at or near the top of the pack) in Gartner’s Magic Quadrant for the market.
  • Networking: Although Savvis has a history as an ISP, most of its significant networking assets came with its acquisition of Cable and Wireless North America’s assets (which included a substantial amount of MCI assets that had to be divested in the MCI-WorldCom acquisition), a deal Savvis did in order to get Exodus. The networking business has been in slow decline for years, although it has some usefulness in competing with BT Radianz in the proximity hosting context.

As part of their managed hosting business, Savvis has built a significant portfolio of cloud IaaS products. Savvis has historically had a tendency to overcomplicate their product lines, and cloud has been no exception to this rule. The most significant elements in the portfolio are Virtual Intelligent Hosting (utility managed hosting, equivalent to Terremark Infinistructure, AT&T Synaptic Hosting, etc.), and Symphony VPDC (self-service public cloud, equivalent to Terremark Enterprise Cloud, AT&T Synaptic CaaS, Verizon CaaS, NaviSite NaviCloud, etc.), which is divided into tiers of service quality. Savvis also has an array of private cloud services.

We consider Savvis to be highly competitive in enterprise-class cloud IaaS. They do not necessarily have the best service, featurewise, and they are a relatively expensive option, but they have done a credible job of incorporating security into their architecture, emphasized in RFP responses, in a way that customers respond to very strongly. Savvis has also done a good job layering managed services on top of their cloud offerings, and has begun to compete quite aggressively in the cloud-enabled data center outsourcing market segment, targeting mid-market companies.

In short, CenturyLink is buying a very high-quality set of assets. Qwest has a colocation business, but it is marred by poor customer service (it’s pretty hard to deliver poor customer service in a business as simple as colocation, but Qwest has historically managed to do this, although quality varies per-data-center). Qwest also has a managed hosting business, but it’s historically been sub-par to the market and well behind the hosting businesses of AT&T and Verizon. Qwest’s forays into cloud computing are embryonic. Consequently, CenturyLink is vastly accelerating its entry into this business with the Savvis acquisition. Also, given the capabilities gap between CenturyLink and Savvis, customers can probably expect little if any disruption from the acquisition.

It’s clear that carriers, even the less visionary ones, now feel that they need to have solutions to address data center needs, not just networking needs. While some carriers were able to articulate a vision around this relatively early on — AT&T notably — all the other network operators are quickly falling in line, albeit with varying degrees of vision and commitment.

CenturyLink buys Savvis

Ever since Verizon bought Terremark, and Time Warner Cable bought NaviSite earlier this year, people have speculated about the fate of Savvis. Well, today we know: CenturyLink is buying Savvis.

CenturyLink is a carrier, which rolled up a lot of rural telco assets before going on to digest Qwest. Acquiring Savvis signals its cloud computing ambitions — few carriers can afford to be without a cloud strategy, and apparently CenturyLink has decided to buy one rather than to build one. CenturyLink didn’t have much in the way of hosting assets pre-Qwest, and Qwest’s hosting assets were weak; with the exception of pure-plays like Amazon, nearly everyone in the cloud IaaS business has a hosting background. Moreover, while Qwest has been trying to get into the cloud, they are not a player to speak of.

With the Savvis buy, if CenturyLink is smart, three things happen: Savvis gets largely left alone to continue what’s been a pretty successful colocation, hosting, and cloud IaaS business; Savvis incorporates the Qwest data center assets and kicks their hosting business to the curb (migrating the customers onto Savvis managed hosting); and Savvis stops fooling with a networking business save for what’s necessary to deliver proximity hosting. CenturyLink has announced that they’ll be consolidating their hosting assets with Savvis and having Savvis’s current CEO run the unit, so that’s a good sign, at least.

I do believe that early industry consolidation may be bad for cloud innovation, but there’s a certain inevitability to the big network operators picking up leading cloud IaaS providers.

The really interesting question now is whether Rackspace is a target. Their model doesn’t fit anywhere near as well into a carrier, given their focus on customer service; arguably their culture would be annihilated in just about any merger with a likely buyer. Moreover, they have focused upon commodity cloud, while carriers are typically far more interested in enterprise cloud. But that doesn’t mean they’re not a takeover target anyway.

Why transparency matters in the cloud

A number of people have asked if the advice that Gartner is giving to clients about the cloud, or about Amazon, has changed as a result of Amazon’s outage. The answer is no, it hasn’t.

In a nutshell:

1. Every cloud IaaS provider should be evaluated individually. They’re all different, even if they seem to be superficially based off the same technology. The best provider for you will be dependent upon your use case and requirements. You absolutely can run mission-critical applications in the cloud — you just need to choose the right provider, right solution, and architect your application accordingly.

2. Just like infrastructure in your own data center, cloud IaaS requires management, governance, and a business continuity / disaster recovery plan. Know your risks, and figure out what you’re going to do to mitigate them.

3. If you’re using a SaaS vendor, you need to vet their underlying infrastructure (regardless of whether it’s their own data center, colo, hosting, or cloud).

The irony of the cloud is that you’re theoretically just buying something as a service without worrying about the underlying implementation details, yet most savvy cloud computing buyers actually peer at the underlying implementation in grotesquely more detail than, say, most managed hosting customers ever look at the details of how their environment is implemented by the provider. The reason for this is that buyers lack adequate trust that the providers will actually offer the availability, performance, and security that they claim they will.

Without transparency, buyers cannot adequately assess their risks. Amazon provides some metrics about what certain services are engineered to (S3 durability, for instance), but there are no details for most of them, and where there are metrics, they are usually for narrow aspects of the service. Moreover, very few of their services actually carry SLAs, and those SLAs are narrow and specific. Everyone discovered this in the recent outage: it was EBS and RDS that were down, neither of which has an SLA, while EC2 was technically unaffected, so nobody’s going to be able to claim SLA credits.

Without objectively understanding their risks, buyers cannot determine what the most cost-effective path is. Your typical risk calculation multiplies the probability of downtime by the cost of downtime. If the cost to mitigate the risk is lower than this figure, then you’re probably well-advised to go do that thing; if not, then, at least in terms of cold hard numbers, it’s not worth doing (or you’re better off thinking about a different approach that alters the probability of downtime, the cost of downtime, or the mitigation strategy).
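As a rough illustration of that arithmetic, here is a minimal Python sketch. The function names and the example figures (a 10% annual chance of an outage that would cost $500,000, weighed against two possible mitigation budgets) are hypothetical, chosen only to show the comparison.

```python
def expected_loss(probability_of_downtime: float, cost_of_downtime: float) -> float:
    """Expected loss: the probability of downtime times the cost of that downtime."""
    if not 0.0 <= probability_of_downtime <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability_of_downtime * cost_of_downtime


def worth_mitigating(probability_of_downtime: float,
                     cost_of_downtime: float,
                     cost_of_mitigation: float) -> bool:
    """True when mitigation costs less than the expected loss it addresses."""
    return cost_of_mitigation < expected_loss(probability_of_downtime, cost_of_downtime)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(expected_loss(0.10, 500_000))             # expected annual loss
    print(worth_mitigating(0.10, 500_000, 30_000))  # cheaper than the expected loss
    print(worth_mitigating(0.10, 500_000, 80_000))  # more expensive than the expected loss
```

The comparison is only as good as its inputs: if the probability is a guess, so is the answer.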

Note that this kind of risk calculation can go out the window if the real risk is not well understood. Complex systems (and all global-class computing infrastructures are enormously complex under the covers) have nondeterministic failure modes. This is a fancy way of saying, basically, that these systems can fail in ways that are entirely unpredictable. They are engineered to be resilient to ordinary failure, and that’s the engineering risk that a provider can theoretically tell you about. It’s the weird one-offs that nobody can predict, and those are the things that are likely to result in outages of unknown, unknowable length.

It’s clear from reading Amazon customer reactions, as well as talking to clients (Amazon customers and otherwise) over the last few days, that customers came to Amazon with very different sets of expectations. Some were deep in rose-colored-glasses land, believing that Amazon was sufficiently resilient that they didn’t have to really invest in resiliency themselves (and for some of them, a risk calculation may have made it perfectly sane for them to run just as they were). Others didn’t trust the resiliency, and used Amazon for non-mission-critical workloads, or, if they viewed continuous availability as critical, ran multi-region infrastructures. But what all of these customers have in common is the simple fact that they don’t really know how much resiliency they should be investing in, because Amazon doesn’t reveal enough details about its infrastructure for them to be able to accurately judge their risk.

Transparency does not necessarily mean having to reveal every detail of underlying implementation (although plenty of buyers might like that). It may merely mean releasing enough details that people can make calculations. I don’t have to know the details of the parts in a disk drive to be able to accept a mean time between failure (MTBF) or annualized failure rate (AFR) from the manufacturer, for instance. Transparency does not necessarily require the revelation of trade secrets, although without trust, transparency probably includes the involvement of external auditors.
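To show the kind of calculation such a figure enables, here is a small Python sketch that converts an MTBF into an AFR and then estimates the chance of at least one failure across a set of drives. It assumes a constant failure rate (the usual exponential model) and independent failures, which real hardware only approximates; the function names and the example MTBF are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days; leap years ignored


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate, assuming a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def chance_of_any_failure(afr: float, drive_count: int) -> float:
    """Probability that at least one of drive_count independent drives fails in a year."""
    return 1.0 - (1.0 - afr) ** drive_count


if __name__ == "__main__":
    afr = afr_from_mtbf(1_000_000)  # hypothetical 1,000,000-hour MTBF
    print(f"AFR: {afr:.4%}")
    print(f"Chance of any failure among 100 drives: {chance_of_any_failure(afr, 100):.2%}")
```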

Gartner clients may find the following research notes helpful:

and also some older notes on critical questions to ask your SaaS provider, covering the topics of infrastructure, security, and recovery.

Amazon outage and the auto-immune vulnerabilities of resiliency

Today is Judgment Day, when Skynet becomes self-aware. It is, apparently, also a very, very bad day for Amazon Web Services.

Lots of people have raised questions about what Amazon’s difficulties today mean for the future of cloud IaaS. My belief is that this doesn’t do anything to the adoption curve, but I do believe that customers who rely upon Amazon to run their businesses will, and should, think hard about the resiliency of their architectures.

It’s important to understand what did and did not happen today. There’s been a popular impression that “EC2 is down”. It’s not. To understand what happened, though, some explanation of Amazon’s infrastructure is necessary.

Amazon divides its infrastructure into “regions”. You can think of a region as basically analogous to “a data center”. For instance, US-East-1 is Amazon’s Northern Virginia data center, while US-West-1 is Amazon’s Silicon Valley data center. Each region, in turn, is divided into multiple “availability zones” (AZs). You can think of an AZ as basically analogous to “a cluster”: it’s a grouping of physical and logical resources. Each AZ is designated by a letter, for instance US-East-1a, US-East-1b, etc. However, each of these designations is customer-specific (which is why Amazon’s status information cannot easily specify which AZ is affected by a problem).

Amazon’s virtual machine offering is the Elastic Compute Cloud (EC2). When you provision an EC2 “instance” (Amazon’s term for a VM), you also get an allocation of “instance storage”. Instance storage is transient — it exists only as long as the VM exists. Consequently, it’s not useful for storing anything that you actually want to keep. To get persistent storage, you use Amazon’s Elastic Block Store (EBS), which is basically just network-attached storage. Many people run databases on EC2 that are backed by EBS, for instance. Because that’s such a common use case, Amazon offers the Relational Database Service (RDS), which is basically an EC2 instance running MySQL.

Amazon’s issues today are with EBS, and with RDS, both in the US-East-1 region. (My guess is that the issues are related, but Amazon has not specifically stated that they are.) Customers who aren’t in the US-East-1 region aren’t affected (customers always choose which region and specific AZs they run in). Customers who don’t use EBS or RDS are also unaffected. However, use of EBS is highly commonplace, and likely just about everyone using EC2 for a production application or Web site is reliant upon EBS. Consequently, even though EC2 itself has been running just fine, the issues have nevertheless had a major impact on customers. If you’re storing your data on EBS, the issues with EBS have made your data inaccessible, or they’ve made access to that data slow and unreliable. Ditto with RDS. Obviously, if you can’t get to your data, you’re not going to be doing much of anything.

In order to get Amazon’s SLA for EC2, you, as a customer, have to run your application in multiple AZs within the same region. Running in multiple AZs is supposed to isolate you from the failure of any single AZ. In practice, of course, this only provides you so much protection — since the AZs are typically all in the same physical data center, anything that affects that whole data center would probably affect all the AZs. Similarly, the AZs are not totally isolated from one another, either physically or logically.

However, when you create an EBS volume, you place it in a specific availability zone, and you can only attach that EBS volume to EC2 instances within that same availability zone. That complicates resiliency, since if you wanted to fail over into another AZ, you’d still need access to your data. That means if you’re going to run in multiple AZs, you have to replicate your data across multiple AZs.
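The constraint can be pictured with a toy Python model. This is not Amazon’s API; the `Volume` and `Instance` classes and their fields are hypothetical, and exist only to show why a volume pinned to one AZ is of no use to a standby instance in another.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str
    az: str  # a volume lives in exactly one availability zone


@dataclass
class Instance:
    name: str
    az: str
    volumes: list = field(default_factory=list)

    def attach(self, volume: Volume) -> None:
        """Attach a volume, which is only allowed within the same AZ."""
        if volume.az != self.az:
            raise ValueError(
                f"{volume.name} is in {volume.az}; cannot attach to {self.name} in {self.az}"
            )
        self.volumes.append(volume)


if __name__ == "__main__":
    data = Volume("data", "us-east-1a")
    primary = Instance("db-primary", "us-east-1a")
    standby = Instance("db-standby", "us-east-1b")

    primary.attach(data)      # same AZ: fine
    try:
        standby.attach(data)  # different AZ: rejected
    except ValueError as err:
        print(err)
```

In this model, the standby can only take over if a separate copy of the data already exists in its own AZ.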

One of the ways you can achieve this is with the Multi-AZ option of RDS. If you’re running a MySQL database and can do so within the constraints of RDS, the multi-AZ option lets you gain the necessary resiliency for your database without having to replicate EBS volumes between AZs.

As one final caveat, data transfer within a region is free and fast, since it’s basically over a LAN, after all. By contrast, Amazon charges you for transfers between regions, which go over the Internet and have the attendant cost and latency.

Consequently, there are lots of Amazon customers who are running in just a single region. A lot of those customers may be running in just a single AZ (because they didn’t architect their app to easily run in multiple AZs). And of the ones who are running in multiple AZs, a fair number are reliant upon the multi-AZ functionality of RDS.

That’s why today’s impacts were particularly severe. US-East-1 is Amazon’s most popular region. The problems with EBS impacted the entire region, as did the RDS problems (and multi-AZ RDS was particularly impacted), not just a single AZ, so if you were multiple-AZ but not multi-region, the resiliency you were theoretically getting was of no help to you. Today, people learned that it’s not necessarily adequate to run in multiple AZs. (Justin Santa Barbara has a good post about this.)

My perspective on this is pretty much exactly what I would tell a traditional Web hosting customer who’s running only in one data center: If you want more resiliency, you need to run in more than one data center. And on Amazon, if you want more resiliency, you need to not only be multi-AZ but also multi-region.

Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means that you should expect that you can have about 4.4 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was EBS and RDS which weren’t, and neither of those services has an SLA.
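For anyone who wants to check the downtime arithmetic, here is a minimal Python sketch. It assumes the SLA is measured over a 365-day year (8,760 hours); the function name is illustrative. On that basis, 99.95% leaves roughly 4.38 hours a year, in line with the rough figure quoted above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def allowed_downtime_hours(sla_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period without breaching the SLA."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("SLA must be a percentage between 0 and 100")
    return period_hours * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.95, 99.99):
        print(f"{sla}% -> {allowed_downtime_hours(sla):.2f} hours per year")
```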

So how did Amazon end up with a problem that affected all the AZs within the US-East-1 region? Well, according to their status dashboard, they had some sort of network problem last night in their east coast data center. That problem resulted in their automated resiliency mechanisms attempting to re-mirror a large number of EBS volumes. This impacted one of the AZs, but it also overloaded the control infrastructure for EBS in that region. My guess is that RDS also uses this same storage infrastructure, so the capacity shortages and whatnot created by all of this activity ended up also impacting RDS.

My colleague Jay Heiser, who follows, among other things, risk management, calls this “auto-immune disease” — i.e., resiliency mechanisms can sometimes end up causing you harm. (We’ve seen auto-immune problems happen before in a prior Amazon S3 outage, as well as a Google Gmail outage.) The way to limit auto-immune damage is isolation — ensuring limits to the propagation.

Will some Amazon customers pack up and leave? Will some of them swear off the cloud? Probably. But realistically, we’re talking about data centers, and infrastructure, here. They can and do fail. You have to architect your app to have continuous availability across multiple data centers, if it can never ever go down. Whether you’re running your own data center, running in managed hosting, or running in the cloud, you’re going to face this issue. (Your problems might be different — i.e., your own little data center isn’t going to have the kind of complex problem that Amazon experienced today — but you’re still going to have downtime-causing issues.)

There are a lot of moving parts in cloud IaaS. Any one of them going wrong can bork your entire site/application. Your real problem is appropriate risk mitigation — the risk of downtime and its attendant losses, versus the complications and technical challenges and costs created by infrastructure redundancy.

Cloud as a Business Executive Forum

On Wednesday, April 13th, I will be speaking at Joyent’s Cloud as a Business Executive Forum, an all-day event that they’re holding at the Le Meridien Hotel in San Francisco.

I’ll be delivering a presentation on the cloud computing opportunity for service providers, as follows:

Service providers face unprecedented opportunity to grow their revenues and profits and deepen their customer relationships as more and more SMBs and enterprises consider moving their applications to the cloud. But how big, exactly, is the cloud market? From which market segments is the growth coming? How can service providers capitalize most effectively on the growth in the market? How can service providers maximize their margins as cloud products and services rapidly commoditize?

(Per Gartner’s rules for analysts speaking at these kinds of events, my presentation is a market view only, and does not advocate for Joyent.)

The event also includes presentations by Joyent CEO David Young, and chief scientist Jason Hoffman, along with a roundtable discussion with Joyent’s customers and a look at Joyent’s SmartDataCenter software.

If you’re a service provider and are interested in attending, please contact Joyent Sales directly.

The event also anchors my usual monthly visit to the Bay Area, so if you’re interested in meeting in person, please contact your Gartner account executive. (I think my meeting slots have mostly been taken, but these are always somewhat fluid as executive schedules change, and even if I don’t see you this time around, I’ll be back in May.)
