
Do Amazon’s APIs matter?

For those who have been wondering where I personally stand in the brouhaha over Amazon, Citrix, Eucalyptus, CloudStack, OpenStack, Rackspace, HP, and so on, along with the broader competitive market that includes VMware, Microsoft, and the Four Horsemen of management tools… I should state up-front that I hold the optimistic viewpoint that I want everyone to be as successful as possible — service providers, commercial vendors, open-source projects, and the customers and users that depend upon them.

I feel that the more competent the competition in a market, the more that everyone in the ecosystem is motivated to do better, and the more customers benefit as a result. Customers benefit from better technology, lower costs, more responsive sales, and differentiated approaches to the market. Clearly, competition can hurt companies, but especially with emerging technology markets, competition often results in making the pie bigger for everyone, by expanding the range of customers that can be served — although yes, sometimes weaker competitors will be culled from the herd.

I believe that companies are best served by being the best they can be — you can target a competitor by responding on a tactical basis, and sometimes you want to, but for your optimal long-term success, you should strive to be great yourself. Obsessing over what your competitors are doing can easily distract companies from doing the right thing on a long-term strategic basis.

That said:

Dan Woods over on Forbes has written a blog post about questions around Amazon’s API strategy, and Jim Plamondon (Rackspace Developer Relations) has posted a comment on my blog about Amazon ecosystem zombiefication.

I’ve been thinking about the implications of Amazon API compatibility, and the degree to which it is or isn’t to Amazon’s advantage to encourage other people to build Amazon-compatible clouds.

I think it comes down to the following: If Amazon believes that they can innovate faster, drive lower costs, and deliver better service than all of their competitors that are using the same APIs (or, for that matter, enterprises who are using those same APIs), then it is to their advantage to encourage as many ways to “on-ramp” onto those APIs as possible, with the expectation that they will switch onto the superior Amazon platform over time.

But I would also argue that all this nattering about the basic semantics of provisioning bare resource elements is largely a waste of time for most people. None of the APIs for provisioning compute and storage (whether EC2/S3/EBS or their counterparts in other clouds) are complicated things at their core. They’re almost always wrapped with an abstraction layer, third-party library, or management tool. However, APIs may matter to people who are building clouds, because they implicitly express the underlying conceptual framework of the system, and the richness of the API semantics constrains what can be expressed and therefore what can be controlled via the API; the constraints of the Amazon APIs force everyone else to express richer concepts in some other way.
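To make concrete just how simple the core provisioning semantics are, here’s a minimal sketch using the boto Python library (the AMI ID and bucket name are placeholders, and credentials are assumed to come from the environment):

```python
# Minimal sketch: provisioning EC2/S3 resources via the boto library.
# The AMI ID and bucket name below are placeholders, not real resources.
import boto
import boto.ec2

# Launch a compute instance (EC2); credentials come from the environment.
ec2 = boto.ec2.connect_to_region('us-east-1')
reservation = ec2.run_instances('ami-12345678', instance_type='m1.small')
instance = reservation.instances[0]

# Create a storage bucket (S3).
s3 = boto.connect_s3()
bucket = s3.create_bucket('my-example-bucket')
```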

But the battle will increasingly not be fought at this very basic level of ‘how do I get raw resources’. I recognize that building a cloud infrastructure platform at scale and with a lot of flexibility is a very difficult problem (although a simple and rigid one is not an especially difficult problem, as you can see from the zillion CMPs out in the market). But it’s not where value is ultimately created for users.

Value for users is ultimately created at the layers above the core infrastructure. Everyone has to get core infrastructure right, but the real question is: How quickly can you build value-added services, and how well does the adaptability of your core infrastructure allow you to serve a broad range of use cases (or serve a narrow range of use cases in a fashion superior to everyone else) and to deliver new capabilities to your users?


Ecosystems in conflict – Amazon vs. VMware, and OpenStack

Citrix contributing CloudStack to the Apache Software Foundation isn’t so much a shot at OpenStack (it just happens to get caught in the crossfire), as it’s a shot against VMware.

There are two primary ecosystems developing in the world: VMware and Amazon. Other possibilities, like Microsoft and OpenStack, are completely secondary to those two. You can think of VMware as “cloud-out” and Amazon as “cloud-in” approaches.

In the VMware world, you move your data center (with its legacy applications) into the modern era with virtualization, and then you build a private cloud on top of that virtualized infrastructure; to get additional capacity, business agility, and so forth, you add external cloud IaaS, and hopefully do so with a VMware-virtualized provider (and, they hope, specifically a vCloud provider who has adopted the stack all the way up to vCloud Director).

In the Amazon world, you build and launch new applications directly onto cloud IaaS. Then, as you get to scale and a significant amount of steady-state capacity, you pull workloads back into your own data center, where you have Amazon-API-compatible infrastructure. Because you have a common API and set of tools across both, where to place your workloads is largely a matter of economics (assuming that you’re not using AWS capabilities beyond EC2, S3, and EBS). You can develop and test internally or externally, though if you intend to run production on AWS, you have to take its availability and performance characteristics into account when you do your application architecture. You might also adopt this strategy for disaster recovery.
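As a hedged illustration of that “common API and set of tools” point: the same client code can target either AWS or an Amazon-API-compatible private cloud just by swapping endpoints, so placement becomes a parameter driven by economics rather than by tooling. The host name, port, and path below are illustrative Eucalyptus-style values, not a guaranteed configuration:

```python
# Sketch: the same EC2-compatible client code targeting either AWS or an
# Amazon-API-compatible private cloud. Endpoint details are illustrative.
import boto
from boto.ec2.regioninfo import RegionInfo

def connect(target):
    if target == 'aws':
        # Public AWS endpoint; credentials come from the environment.
        return boto.connect_ec2()
    else:
        # Private Amazon-compatible cloud (Eucalyptus-style endpoint).
        region = RegionInfo(name='private', endpoint='cloud.example.internal')
        return boto.connect_ec2(region=region, port=8773,
                                path='/services/Eucalyptus', is_secure=False)

# Workload placement is now just an argument, decided on cost.
conn = connect('aws')
reservation = conn.run_instances('ami-12345678', instance_type='m1.small')
```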

While CloudStack has been an important CMP option for service providers — notably competing against the vCloud stack, OnApp, Hexagrid, and OpenStack — in the end, these providers are almost a decoration to the Amazon ecosystem. They’re mostly successful competing in places that Amazon doesn’t play — in countries where Amazon doesn’t have a data center, in the managed services / hosting space, in the hypervisor-neutral space (Amazon-style clouds built on top of VMware’s hypervisor, more specifically), and in a higher-performance, higher-availability market.

Where CloudStack has been more interesting is in its use as a “cloud-in” platform for organizations who are using AWS in a significant fashion, and who want their own private cloud that’s compatible with it. Eucalyptus fills this niche as well, although Eucalyptus customers tend to be smaller and Eucalyptus tends to compete in the general private-cloud-builder CMP space targeted at enterprises — against the vCloud stack, Abiquo, HP CloudSystem, BMC Cloud Lifecycle Manager, CA’s 3Tera AppLogic, and so on. CloudStack tends to be used by bigger organizations; while it’s in the general CMP competitive space, enterprises that evaluate it are more likely to be also evaluating, say, Nimbula and OpenStack.

CloudStack has firmly aligned itself with the Amazon ecosystem. But OpenStack is an interesting case of an organization caught in the middle. Its service provider supporters are fundamentally interested in competing against AWS (far more so than with the VMware-based cloud providers, at least in terms of whatever service they’re building on top of OpenStack). Many of its vendor contributors are afraid of a VMware-centric world (especially as VMware moves from virtualizing compute to also virtualizing storage and networks), but just as importantly they’re afraid of a world in which AWS becomes the primary way that businesses buy infrastructure. It is to their advantage to have at least one additional successful widely-adopted CMP in the market, and at least one service provider successfully competing strongly against AWS. Yet AWS has established itself as a de facto standard for cloud APIs and for the way that a service “should” be designed. (This is why OpenStack has an aptly named “Nova Feature Parity Team” playing catch-up to AWS, after all, and why debates about the API continue in the OpenStack community.)

But make no mistake about it. This is not about scrappy free open-source upstarts trying to upset an established vendor ecosystem. This is a war between vendors. As Simon Wardley put it, beware of geeks bearing gifts. CloudStack is Citrix’s effort to take on VMware and enlist the rest of the vendor community in doing so. OpenStack is an effort on the part of multiple vendors — notably Rackspace and HP — to pool their engineering efforts in order to take on Amazon. There’s no altruism here, and it’s not coincidental that the committers to the projects have an explicit and direct commercial interest — they are people working full-time for vendors, contributing as employees of those vendors, and by and large not individuals contributing for fun.

So it really comes down to this: Who can innovate more quickly, and choose the right ways to innovate that will drive customer adoption?

Ladies and gentlemen, place your bets.

Citrix, CloudStack, OpenStack, and the war for open-source clouds

There are dozens upon dozens of cloud management platforms (CMPs), sometimes known as “cloud stacks” or “cloud operating systems”, out in the wild, both commercial and open source. Two have been in the news recently — Eucalyptus and CloudStack — with implications for the third, OpenStack.

Last week, Eucalyptus licensed Amazon’s API, and just yesterday, Wired extolled the promise of OpenStack.

Now, today, Citrix has dropped a bombshell into the open-source CMP world by announcing that it is contributing CloudStack (the Amazon-API-compatible CMP it acquired via its staggeringly expensive Cloud.com acquisition) to the Apache Software Foundation (ASF). This includes not just the core components, which are already open-source, but also all of the currently closed-source commercial components (except any third-party things that were licensed from other technology companies under non-Apache-compatible licenses).

I have historically considered CloudStack a commercial CMP that happens to have a token open-source core, simply because anyone considering a real deployment of CloudStack buys the commercial version to get all the features — you just don’t really see people adopting the non-commercial version, which I consider a litmus test of whether or not an open-core approach is really viable. This did change under Citrix’s ownership, and the ASF move truly puts the whole thing out there as open source, so adopters have a genuine choice about whether or not they want to pay for commercial support, and it should spur more contributions from people and organizations that were opposed to the open-core model.

What makes this big news is the fact that OpenStack is a highly immature platform (it’s unstable and buggy and still far from feature-complete, and people who work with it politely characterize it as “challenging”), but CloudStack is, at this point in its evolution, a solid product — it’s production-stable and relatively turnkey, comparable to VMware’s vCloud Director (some providers who have lab-tested both even claim stability and ease of implementation are better than vCD). Taking a stable, featureful base, and adding onto it, is far easier for an open-source community to do than trying to build complex software from scratch.

Also, by simply giving CloudStack to the ASF, Citrix explicitly embraces a wholly-open, committer-driven governance model for an open-source CMP. Eucalyptus has already wrangled with its community over its open-core closed-extensions approach, and Rackspace is still struggling with governance issues even though it’s promised to put OpenStack into a foundation, because of the proposed commercial sponsorship of board seats. CloudStack is also changing from GPLv3 to the Apache license, which should remove some concerns about contributing. (OpenStack also uses the Apache license.)

Citrix, of course, stands to benefit indirectly — most people who choose to use CloudStack also choose to use Xen, and often purchase XenServer, plus Citrix will continue to provide commercial support for CloudStack. (It will just be commercial distribution and support, though, without any additional closed-source code.) And they rightfully see VMware as the enemy, so explicitly embracing the Amazon ecosystem makes a lot of sense. (Randy Bias has more thoughts on Citrix; read James Urquhart’s comment, too.)

Citrix has also explicitly emphasized Amazon compatibility with this announcement. OpenStack’s community has been waffling about whether or not they want to continue to support an Amazon-compatible API; at the moment, OpenStack has its own API but also secondarily supports Amazon compatibility. It’s an ecosystem question, as well as potentially an intellectual property issue if Amazon ever decides to get tetchy about its rights. (Presumably Citrix isn’t being this loud about compatibility without Amazon quietly telling them, “No, we’re not going to sue you.”)

I think this move is going to cause a lot of near-term soul-searching amongst the major commercial contributors to OpenStack. While clearly there’s value in working on multiple projects, each of the vendors still needs to place bets on where their engineering time and budgets are best spent. Momentum is with OpenStack, but it’s also got a long way to go.

HP has, in effect, recently doubled down on OpenStack; it’s not too late for them to change their mind, but for the moment, they’re committed to an OpenStack direction both for their public developer-centric cloud IaaS, and for where they’re going with their hybrid cloud and management software strategy. No doubt they’ll end up supporting every major CMP that sees significant success, but HP is typically a slow mover, and it’s taken them this long to get aligned on a strategy; I’m not personally expecting them to shift anytime soon.

But the other vendors are largely free to choose — likely to support both for the time being, but there may be a strong argument for primarily backing an ASF project that’s already got a decent core codebase and is ready for mainstream production use, over spending the next year to two years (depending on who you talk to) trying to get OpenStack to the point where it’s a real commercial product (defined as meeting enterprise expectations for stable, relatively maintenance-free software).

The absence of major supporting vendor announcements along with the Citrix announcement is notable, though. Most of the big vendors have made loud commitments to OpenStack, commitments that I don’t expect anyone to back down on, in public, even if I expect that there could be quiet repositioning of resources in the background. I’ve certainly had plenty of confidential conversations with a broad array of technology vendors around their concerns for the future of OpenStack, and in particular, when it will reach commercial readiness; I expect that many of them would prefer to put their efforts behind something that’s commercially ready right now.

There will undoubtedly be some people who say that Citrix’s move basically indicates that CloudStack has failed to compete against OpenStack. I don’t think that’s true. I think that CloudStack is gaining better “real world” adoption than OpenStack, because it’s actually usable in its current form without special effort (i.e., compared to other commercial software) — but the Rackspace marketing machine has done an outstanding job with hyping OpenStack, and they’ve done a terrific job building a vendor community, whereas CloudStack’s primary committers have been, to date, almost solely Cloud.com/Citrix.

Both OpenStack and CloudStack can co-exist in the market, but if Citrix wants to speed up the creation of Amazon-compatible clouds that can be used in large-scale production by enterprises trying to do Amazon hybrid clouds (or more precisely, who want freedom to easily choose where to place their workloads), it needs to persuade other vendors to devote their efforts to enhancing CloudStack rather than pouring more time into OpenStack.

Note that with this announcement, Citrix also cancels Project Olympus, its planned OpenStack commercial distribution, although it intends to continue contributing to OpenStack. (Certainly they need to, if they’re going to support folks like Rackspace who are trying to do XenServer with OpenStack; the OpenStack deployments to date have been KVM for stability reasons.)

But it’s certainly going to be interesting out there. At this stage of the CMP evolution, I think that the war is much more for corporate commitment and backing with engineers paid to work on the projects, than it is for individual committers from the broader world — although certainly individual engineers (the open-source talent, so to speak) will choose to join the companies who work on their preferred projects.

The Amazon-Eucalyptus partnership

Eucalyptus, a commercial open-source cloud management platform (“CMP”, software used to build cloud infrastructure), recently announced that it had signed a partnership with Amazon.

Eucalyptus began life as a university project to build a CMP that would create Amazon-API-compatible cloud infrastructure, but eventually turned into a commercial effort. However, like all other CMPs offering Amazon compatibility, Eucalyptus has always lived under the shadow of the threat that Amazon might someday try to enforce intellectual property rights related to its API.

With this partnership, Eucalyptus has formally licensed the Amazon API. There’s been a lot of speculation on what this means. My understanding is the following:

This is a non-exclusive technology partnership. Eucalyptus now has a formal license to build products that are compatible with the AWS APIs; at the moment, that’s EC2, S3, and EBS, but Eucalyptus can adopt the other APIs as well if they choose to. Amazon may enter into similar licensing agreements with others, enter into different sorts of partnerships, and so forth; this is a non-restrictive deal. Furthermore, this partnership is not a signal that Amazon is changing its stance towards other products/services with Amazon-compatible APIs, where it has to date adopted a laissez-faire attitude.

This is an API licensing deal, not a technology licensing deal. Amazon will provide Eucalyptus with API specifications, including related engineering specifications not provided in the public user-level documentation. However, Amazon will not be giving any technology away to Eucalyptus — this is not engineering assistance with the actual implementation. Eucalyptus will still need to do all of its own product engineering.

There is no coupling of Amazon and Eucalyptus’s development cycles. While Amazon will try to inform Eucalyptus of planned API changes so that Eucalyptus is able to release its own updates in a timely manner, Eucalyptus is on its own — if it can keep up with Amazon, fine, if it can’t, too bad. Eucalyptus is not obliged to remain Amazon-compatible, nor is Amazon obliged to ensure that it’s feasible for Eucalyptus to remain compatible.

Some people think that this deal will give Eucalyptus some much-needed life, since it has met with limited commercial interest, and its developer community has yet to really recover from the rifts created by a past licensing change.

I personally don’t agree. With people increasingly writing to libraries, or using third-party tools (RightScale, enStratus, etc.), developers tend to care less about what’s under the hood as long as their favorite tool supports it. Yes, Amazon’s API has become a de facto standard, but I haven’t seen Eucalyptus be the Amazon-compatible CMP of choice; instead, I see serious adopters choose CloudStack (Citrix, from the Cloud.com acquisition), and the vendors who want to be part of an open-source cloud project put their support primarily behind OpenStack. I’m not convinced that this licensing deal, however interesting, is going to significantly either shift buyer desires towards Eucalyptus, or improve their community support.

The challenge of hiring development teams

A recent blog post on Forbes by Venkatesh Rao, The Rise of Developernomics, has ignited a lot of controversy around the concept that some developers are as much as 10x more productive than others. It’s not a new debate; the assertion that some developers are 20x more productive than others has been around forever, and folks like Joel Spolsky have asserted that it’s not just a matter of productivity, but also a developer’s ability to hit the high notes of real breakthrough achievement that makes for greatness.

Worth reading out of all of these threads: Avichal Garg of Spool’s blog post on building 10x teams, which has a very nice dissection of the composition of great teams.

Also, for those of you who haven’t read it: Now Discover Your Strengths is a fantastic way to look at what people’s work-related strengths are, since it takes into account a broad range of personal and interpersonal traits. Rackspace turned me onto it a number of years ago; they actually hang a little sign with each employee’s strengths on their cube. (See mine for an example.)

Jon Evans of TechCrunch wrote a good blog post a few months ago, Why the New Guy Can’t Code, which illustrates the challenges of hiring good developers. (There are shocking numbers of developers out there who have never really produced significant code in their jobs. Indeed, I once interviewed a developer with five years of experience who had never written code in a work context — he kept being moved from project to project that was only in the formal requirements phase, so all he had was his five-years-stale student efforts from his CS degree.)

Even with the massive pile of unemployed developers out there, it’s still phenomenally challenging to hire good people. And if your company requires a narrow and specific set of things that the developer must have worked with before, rather than hiring a smart Swiss army knife of a developer who can pick up anything given a few days, you will have an even bigger problem, especially if you require multiple years of experience with brand-new technologies like AWS, NoSQL, Hadoop, etc.

With more and more Web hosters, systems integrators, and other infrastructure-specialist companies transforming themselves into cloud providers, and sometimes outright buying software companies (such as Terremark buying CloudSwitch, and Virtustream buying Enomaly), serious software development chops are becoming a key for a whole range of service providers who never really had significant development teams in the past. No one should underestimate how much of a shortage there is for great talent.

As a reminder, Gartner is hiring!

Performance can be a disruptive competitive advantage

All of us are used to going to travel sites, especially for airline tickets, and waiting a while for the appropriate results to be extracted and displayed to us. I recently saw Google Flight Search for the first time and was astonished by its raw speed — essentially instant.

I frequently talk to customers about acceleration solutions, and discuss the business value of performance. Specifically, this is a look at business metrics that measure the success of a website or application — time spent on your site, conversion rate, shopping basket value, page views, ad views, transactions processed, employee productivity, decline in call center volume, and so forth. You compare the money associated with these metrics, against the cost of the solutions, to look at comparative ROI.
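A back-of-the-envelope sketch of that comparison, with every number below invented purely for illustration:

```python
# Hypothetical ROI comparison for an acceleration solution; every figure
# here is an invented example, not a benchmark.
monthly_visits = 1_000_000
baseline_conversion = 0.020    # 2% of visits convert today
lifted_conversion = 0.022      # conversion after the performance improvement
average_order_value = 80.00    # dollars per converted visit

incremental_revenue = (monthly_visits
                       * (lifted_conversion - baseline_conversion)
                       * average_order_value)
solution_cost = 20_000.00      # monthly cost of the acceleration solution

print(f"Incremental revenue: ${incremental_revenue:,.2f}/month")
print(f"Net benefit:         ${incremental_revenue - solution_cost:,.2f}/month")
```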

The business value of performance is usually tied to industry in a narrow and specific way, because users have a particular set of expectations and needs. For instance, for travel sites, a certain amount of performance is necessary in order to make the site usable, but the long waits for searches are things that users are conditioned to, making their overall performance expectations relatively low. Travel sites usually discover that generalized site responsiveness improves the user experience and causes revenue per site visit to increase — but only up to a certain point, at which it plateaus, as the site has enough responsiveness that users aren’t discouraged from using it, and they’re going to buy what they came to buy.

Google Flight Search proves that you can “break through” the performance ceiling and entirely change the user experience. This is not the kind of incremental improvement you can achieve through acceleration techniques, though; instead, it’s a core change that affects the thing that is slowest, which is generally the back-end database and business logic, not the network. This can actually be a disruptive competitive advantage.

I typically ask my CDN clients, “What are the factors that make your site slow?” In many cases, they need to do something that goes beyond what edge caching or even network optimization (dynamic acceleration) can achieve. They need to reduce their page weight, or write better pages (and may benefit from front-end optimization techniques), or to improve the back-end responsiveness. Acceleration techniques are often used to band-aid a core problem with performance, just like CDN professional services to make a site cacheable are often used to band-aid a core problem with site structure. At some point in time it becomes more cost-effective to fix the core problem.

Too few businesses design their websites and applications with speed in mind.

Would you like to run Apache Wave, Grandma?

As many people already know, Google is sunsetting Google Wave. This has led to Google sending an email to people who previously signed up for Wave. The bit in the email that caught my eye was this:

If you would like to continue using Wave, there are a number of open source projects, including Apache Wave. There is also an open source project called Walkaround that includes an experimental feature that lets you import all your Waves from Google.

For an email sent to Joe Random Consumer, it’s remarkably clueless as to what consumers actually can comprehend. Grandma is highly unlikely to understand what the heck that means.

Tip for everyone offering a product or service to consumers: Any communication with consumers about that product or service should be in language that Grandma can understand.

Why developers make superior operators

Developers who deeply understand the arcana of infrastructure, and operators who can code and understand the interaction of applications and infrastructure, are better than developers and operators who understand only their own discipline. But it’s typically easier, from the perspective of training, for a developer to learn operations, than for an operator to learn development.

While there are a fair number of people who teach themselves on-the-job, most developers still come out of formal computer science backgrounds. The effectiveness of formal education in CS varies immensely, and you can get a good understanding by reading on your own, of course, if you read the right things — it’s the knowledge that matters, not how you got it. But ideally, a developer should accumulate the background necessary to understand the theory of operating systems, and then have a deeper knowledge of the particular operating system that they primarily work with, as well as the arcana of the middleware. It’s intensely useful to know how the abstract code you write actually turns out to run in practice. Even if you’re writing in a very high-level programming language, knowing what’s going on under the hood will help you write better code.

Many people who come to operations from the technician end of things never pick up this kind of knowledge; a lot of people who enter either systems administration or network operations do so without the benefit of a rigorous education in computer science, whether from college or self-administered. They can do very well in operations, but it’s generally not until you reach the senior-level architects that you commonly find people who deeply understand the interaction of applications, systems, and networks.

Unfortunately, historically, we have seen this division in terms of relative salaries and career paths for developers vs. operators. Operators are often treated like technicians; they’re often smart learn-on-the-job people without college degrees, but consequently, companies pay them accordingly and may limit their advancement paths, especially if the company has fairly strict requirements that managers have degrees. Good developers often emerge from college with minimum competitive salary requirements well above what entry-level operations people make.

Silicon Valley has a good collection of people with both development and operations skills because so many start-ups are founded by developers, who chug along, learning operations as they go, because initially they can’t afford to hire dedicated operations people; moreover, for more than a decade, hypergrowth Internet start-ups have deliberately run devops organizations, making the skillset both pervasive and well-paid. This is decidedly not the case in most corporate IT, where development and operations tend to have a hard wall between them, and people tend to be hired for heavyweight app development skills, more so than capabilities in systems programming and agile-friendly languages.

Here are my reasons for why developers make better operators, or perhaps more accurately, an argument for why a blended skillset is best. (And here I stress that this is personal opinion, and not a Gartner research position; for official research, check out the work of my esteemed colleagues Cameron Haight and Sean Kenefick. However, as someone who was formally educated as a developer but chose to go into operations, and who has personally run large devops organizations, this is a strongly-held set of opinions for me. I think that to be a truly great architect-level ops person, you also have to have a developer’s skillset, and I believe it’s important for mid-level people as well, which I recognize are controversial opinions.)

Understanding the interaction of applications and infrastructure leads to better design of both. This is an architect’s role, and good devops understand how to look at applications and advise developers how they can make them more operations-friendly, and know how to match applications and infrastructure to one another. Availability, performance, and security are all vital to understand. (Even in the cloud, sharp folks have to ask questions about what the underlying infrastructure is. It’s not truly abstract; your performance will be impacted if you have a serious mismatch between the underlying infrastructure implementation and your application code.)

Understanding app/infrastructure interactions leads to more effective troubleshooting. An operator who can CTrace, DTrace, sniff networks, read application code, and know how that application code translates to stuff happening on infrastructure, is in a much better position to understand what’s going wrong and how to fix it.

Being able to easily write code means less wasted time doing things manually. If you can code nearly as quickly as you can do something by hand, you will simply write it as a script and never have to think about doing it by hand again — and neither will anyone else, if you have a good method for script-sharing. It also means that forever more, this thing will be done in a consistent way. It is the only way to truly operate at scale.

Scripting everything, even one-time tasks, leads to more reliable operations. When working in complex production environments (and arguably, in any environment), it is useful to write out every single thing you are going to do, and your action plan for any stage you deem dangerous. It might not be a formal “script”, but a command-by-command plan can be reviewed by other people, and it means that you are not making spot decisions under the time pressure of a maintenance window. Even non-developers can do this, of course, but most don’t. (There’s a sketch of what this looks like after this list of reasons.)

Converging testing and monitoring leads to better operations. This is a place where development and operations truly cross. Deep monitoring converges into full test coverage, and given the push towards test-driven development in agile methodologies, it makes sense to make production monitoring part of the whole testing lifecycle.

Development disciplines also apply to operations. The systems development lifecycle is applicable to operations projects, and brings discipline to what can otherwise be unstructured work; agile methodologies can be adapted to operations. Writing the tests first, keeping things in a revision control system, and considering systems holistically rather than as a collection of accumulated button-presses are all valuable.

The move to cloud computing is a move towards software-defined everything. Software-defined infrastructure and programmatic access to everything inherently advantages developers, and it turns the hardware-wrangling skills into things for low-level technicians and vendor field engineering organizations. Operations becomes software-oriented operations, one way or another, and development skills are necessary to make this transition.

It is unfortunately easier to teach operations to developers, than it is to teach operators to code. This is especially true when you want people to write good and maintainable code — not the kind of script in which people call out to shell commands for the utilities that they need rather than using the appropriate system libraries, or splattering out the kind of program structure that makes re-use nigh-impossible, or writing goop that nobody else can read. This is not just about the crude programming skills necessary to bang out scripts; this is about truly understanding the deep voodoo of the interactions between applications, systems, and networks, and being able to neatly encapsulate those things in code when need be.
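Here, as promised above, is a minimal sketch of a one-time maintenance task written as a reviewable, repeatable script rather than a sequence of ad hoc commands. The host names, service name, and update command are all hypothetical placeholders:

```python
#!/usr/bin/env python
# Sketch: a maintenance-window plan expressed as code, so it can be
# reviewed in advance and re-run consistently. All names are hypothetical.
import subprocess
import sys

HOSTS = ['web01.example.internal', 'web02.example.internal']
SERVICE = 'exampled'

def run(host, command):
    """Run a command on a remote host; abort the window on any failure."""
    result = subprocess.run(['ssh', host, command])
    if result.returncode != 0:
        sys.exit(f'ABORT: {command!r} failed on {host}')

for host in HOSTS:
    run(host, f'sudo systemctl stop {SERVICE}')    # drain the node
    run(host, 'sudo yum -y update openssl')        # the actual change
    run(host, f'sudo systemctl start {SERVICE}')   # bring it back
    run(host, f'systemctl is-active {SERVICE}')    # verify before moving on
```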

Devops is a great place for impatient developers who want to see their code turn into results right now; code for operations often comes in a shorter form, producing tangible results in a faster timeframe than the longer lifecycles of app development (even in agile environments). As an industry, we don’t do enough to help people learn the zen of it, and to provide career paths for it. It’s an operations specialty unto itself.

Devops is not just a world in which developers carry pagers; in fact, it doesn’t necessarily mean that application developers carry pagers at all. It’s not even just about a closer collaboration between development and operations. Instead, it can mean that other than your most junior button-pushers and your most intense hardware specialists, your operations people understand both applications and infrastructure, and that they write code as necessary to highly automate the production environment. (This is more the philosophy of Google’s Site Reliability Engineering, than it is Amazon-style devops, in other words.)

But for traditional corporate IT, it means hiring a different sort of person, and paying differently, and altering the career path.

A little while back, I had lunch with a client from a mid-market business, who spent it telling me how efficient their IT had become, especially after virtualization — trying to persuade me that they didn’t need the cloud, now or ever. Curious, I asked how long it typically took to get a virtualized server up and running. The answer turned out to be three days — because while they could push a button and get a VM, all storage and networking still had to be manually provisioned. That led me to probe about a lot of other operations aspects, all of which were done by hand. The client eventually protested, “If I were to do the things you’re talking about, I’d have to hire programmers into operations!” I agreed that this was precisely what was needed, and the client protested that they couldn’t do that, because programmers are expensive, and besides, what would they do with their existing do-everything-by-hand staff? (I’ve heard similar sentiments many times over from clients, but this one really sticks in my mind because of how shocked this particular client was by the notion.)

Yes. Developers are expensive, and for many organizations, it may seem alien to use them in an operations capacity. But there’s a cost to a lack of agility and to unnecessarily performing tasks manually.

But lessons learned in the hot seat of hypergrowth Silicon Valley start-ups take forever to trickle into traditional corporate IT. (Even in Silicon Valley, there’s often a gulf between the way product operations works, and the way traditional IT within that same company works.)

To become like a cloud provider, fire everyone here

A recent client inquiry of mine involved a very large enterprise, who informed me that their executives had decided that IT should become more like a cloud provider — like Google or Facebook or Amazon. They wanted to understand how they should transform their organization and their IT infrastructure in order to do this.

There were countless IT people on this phone consultation, and I’d received a dizzying introduction to names and titles and job functions, but not one person in the room was someone who did real work, i.e., someone who wrote code or managed systems or gathered requirements from the business, or even did higher-level architecture. These weren’t even people who had direct management responsibility for people who did real work. They were part of the diffuse cloud of people who are in charge of the general principle of getting something done eventually, that you find everywhere in most large organizations (IT or not).

I said, “If you’re going to operate like a cloud provider, you will need to be willing to fire almost everyone in this room.”

That got their attention. By the time I’d spent half an hour explaining to them what a cloud provider’s organization looks like, they had decidedly lost their enthusiasm for the concept, as well as been poleaxed by the fundamental transformations they would have to make in their approach to IT.

Another large enterprise client recently asked me to explain Rackspace’s organization to them. They wanted to transform their internal IT to resemble a hosting company’s, and Rackspace, with its high degree of customer satisfaction and reputation for being a good place to work, seemed like an ideal model to them. So I spent some time explaining the way that hosting companies organize, and how Rackspace in particular does — in a very flat, matrix-managed way, with horizontally-integrated teams that service a customer group in a holistic manner, coupled with some shared-services groups.

A few days later, the client asked me for a follow-up call. They said, “We’ve been thinking about what you’ve said, and have drawn out the org… and we’re wondering, where’s all the management?”

I said, “There isn’t any more management. That’s all there is.” (The very flat organization means responsibility pushed down to team leads who also serve functional roles, a modest number of managers, and a very small number of directors who have very big organizations.)

The client said, “Well, without a lot of management, where’s the career path in our organization? We can’t do something like this!”

Large enterprise IT organizations are almost always full of inertia. Many mid-market IT organizations are as well. In fact, the ones that make me twitch the most are the mid-market IT directors who are actually doing a great job with managing their infrastructure — but constrained by their scale, they are usually just good for their size and not awesome on the general scale of things, but are doing well enough to resist change that would shake things up.

Business, though, is increasingly on a wartime footing — and the business is pressuring IT, usually in the form of the development organization, to get more things done and to get them done faster. And this is where the dissonance really gets highlighted.

A while back, one of my clients told me about an interesting approach they were trying. They had a legacy data center that was a general mess of stuff. And they had a brand-new, shiny data center with a stringent set of rules for applications and infrastructure. You could only deploy into the new shiny data center if you followed the rules, which gave people an incentive to toe the line, and generally ensured that anything new would be cleanly deployed and maintained in a standardized manner.

It makes me wonder about the viability of an experiment for large enterprise IT with serious inertia problems: Start a fresh new environment with a new philosophy, perhaps a devops philosophy, with all the joy of having a greenfield deployment, and simply begin deploying new applications into it. Leave legacy IT with the mess, rather than letting the morass kill every new initiative that’s tried.

Although this is hampered by one serious problem: IT superstars rarely go to work in enterprises (excepting certain places, like some corners of financial services), and they especially don’t go to work in organizations with inertia problems.

IT Operations and button-pushing

The fine folks at Nodeable gave me an informal introductory briefing today; they’ve got a pretty cool concept for a cloud-oriented monitoring and management SaaS-based tool that’s aimed at DevOps.

I’ve been having stray thoughts on DevOps and the future of IT Operations in the couple of hours that have passed since then, and reflecting on the following problem:

At an awful lot of companies, IT Operations, especially lower-level folks, are button-pushing monkeys — specifically, they are people who know how to use the vendor-supplied GUI to perform particular tasks. They may know the vendor-recommended ways to do things with a particular bit of hardware or software. But only a few of them have architect-level knowledge, the deep understanding of the esoterica of systems and how this stuff is actually built and engineered. (Some of this is a reflection of education; a lot of IT Operations people don’t come from a computer science background, but have learned what they’ve needed to know on the job.)

Today’s DevOps person is likely to have a skillset that we used to call systems programming. They understand systems architecture, they understand operating systems, they can write system-level code, including the scripting necessary for automation. The programmatic access to infrastructure exemplified by cloud IaaS providers has moved this up a layer of abstraction, so that you don’t have to be a deep-voodoo guy to do this kind of thing.

We’re moving towards a world where you have really low-level button-pushers — possibly where the button-pushing is so simple that you don’t need a specialist to do it any longer, anyone reasonably technical can do it — and senior architects who design things, and systems programmers who automate things. Whether those systems programmers work in application development and are “DevOps”, or whether those systems programmers work in IT Operations and just happen to be systems guys who program (mostly scripting), doesn’t really matter — the era of the button-pusher is drawing towards its close either way, at least for organizations that intend to increase the efficiency of IT Operations.

I want to share a story. It is, in some ways, a story about cruelty and unprofessionalism, but it’s funny in its own way.

About fifteen years ago, I was working as an engineer at Digex (the first real managed hosting company). We had a pretty highly skilled group of engineers there, and we never did anything using a GUI. We had hundreds of customers on dedicated Sun servers, and you’d either SSH into the systems or, in a pinch, go to the data center and log in on console. We were also the kind of people who would fix issues by making kernel modifications — for instance, the day that the SYN flood attack showed up, a bunch of customers went down hard, meaning that we could not afford to wait for Sun to come up with a patch, since we had customer SLAs to meet, so one of our security engineers rewrote the kernel’s queueing code for TCP accepts.

We were without a manager for some time, and they finally hired a guy who was supposedly a great Sun sysadmin. He didn’t actually get a technical interview, but he had a good work history of completed projects and happy teams and so forth. He was supposed to be both the manager and the technical lead for the team.

The problem was that he had no idea how to do anything that wasn’t in Sun’s administrator GUI. He didn’t even know how to attach a console cable to a server, much less log in remotely to a system. Since we did absolutely nothing with a GUI, this was a big problem. An even bigger problem was that he didn’t understand anything about the underlying technologies we were supporting. If he had a problem, he was used to calling Sun and having them tell him what to do. This, clearly, is a big problem in a managed hosting environment where you’re the first line of support for your customers, who may do arbitrary wacky things.

He also worked a nine-to-five day at a startup where engineers routinely spent sixteen hours at work. His team, and the other engineers at the company, had nothing but contempt for him. And one night, having dinner at 10 pm as a break before going right back into work, someone had an idea.

“Let’s recompile his kernel without mouse support.” (Like all the engineers, he had a Sun workstation at his desk.)

And so when he came to work the next morning, his mouse didn’t work — and every trace of the intrusion had been covered, thanks to the complicity of one of the security engineers.

Someone who had an idea of what he was doing wouldn’t have been fazed; they’d have verified the mouse wasn’t working, then done an L1-A to put the workstation into PROM mode, and easily done troubleshooting from there (although admittedly, nobody thinks, “I wonder if somebody recompiled my kernel without mouse support after I went home last night”). This poor guy couldn’t do anything other than pick up his mouse to make sure the underside hadn’t gotten dirty. It turned out that he had no idea how to do anything with the workstation if he couldn’t log in via the GUI.

It proved to be a remarkably effective demonstration to management that this guy was a yahoo and needed to be fired. (Fortunately, there were plenty of suspect engineers, and management never found out who was responsible. Earl Galleher, who ran that part of the business at the time, and is the chairman at Basho now, probably still wonders… It wasn’t me, Earl.)

But it makes me wonder what is the future of all the GUI masters in IT Operations, because the world is evolving to be more like the teams that I had before I came to Gartner — systems programmers with strong systems and operations skills, who could also code.

DevOps: Now you know how to deal with the IT Operations guy who can only use a GUI…
