Why developers make superior operators

Developers who deeply understand the arcana of infrastructure, and operators who can code and understand the interaction of applications and infrastructure, are better than developers and operators who understand only their own discipline. But it’s typically easier, from the perspective of training, for a developer to learn operations than for an operator to learn development.

While a fair number of people teach themselves on the job, most developers still come out of formal computer science backgrounds. The effectiveness of formal education in CS varies immensely, and you can of course get a good understanding by reading on your own, if you read the right things; it’s the knowledge that matters, not how you got it. But ideally, a developer should accumulate the background necessary to understand the theory of operating systems, and then have a deeper knowledge of the particular operating system that they primarily work with, as well as the arcana of the middleware. It’s intensely useful to know how the abstract code you write actually turns out to run in practice. Even if you’re writing in a very high-level programming language, knowing what’s going on under the hood will help you write better code.

Many people who come to operations from the technician end of things never pick up this kind of knowledge; a lot of people who enter either systems administration or network operations do so without the benefit of a rigorous education in computer science, whether from college or self-administered. They can do very well in operations, but it’s generally not until you reach the senior-level architects that you commonly find people who deeply understand the interaction of applications, systems, and networks.

Unfortunately, this division has historically shown up in the relative salaries and career paths of developers vs. operators. Operators are often treated like technicians. Many are smart learn-on-the-job people without college degrees, so companies pay accordingly and may limit advancement paths accordingly, especially if the company has fairly strict requirements that managers have degrees. Good developers often emerge from college with minimum competitive salary requirements well above what entry-level operations people make.

Silicon Valley has a good collection of people with both development and operations skills because so many start-ups are founded by developers, who chug along, learning operations as they go, because initially they can’t afford to hire dedicated operations people; moreover, for more than a decade, hypergrowth Internet start-ups have deliberately run devops organizations, making the skillset both pervasive and well-paid. This is decidedly not the case in most corporate IT, where development and operations tend to have a hard wall between them, and people tend to be hired for heavyweight app development skills, more so than capabilities in systems programming and agile-friendly languages.

Here are my reasons why developers make better operators, or perhaps more accurately, an argument for why a blended skillset is best. (And here I stress that this is personal opinion, and not a Gartner research position; for official research, check out the work of my esteemed colleagues Cameron Haight and Sean Kenefick. However, as someone who was formally educated as a developer but chose to go into operations, and who has personally run large devops organizations, this is a strongly-held set of opinions for me. I think that to be a truly great architect-level ops person, you also have to have a developer’s skillset, and I believe it’s important to mid-level people as well, which I recognize as a controversial opinion.)

Understanding the interaction of applications and infrastructure leads to better design of both. This is an architect’s role, and good devops understand how to look at applications and advise developers how they can make them more operations-friendly, and know how to match applications and infrastructure to one another. Availability, performance, and security are all vital to understand. (Even in the cloud, sharp folks have to ask questions about what the underlying infrastructure is. It’s not truly abstract; your performance will be impacted if you have a serious mismatch between the underlying infrastructure implementation and your application code.)

Understanding app/infrastructure interactions leads to more effective troubleshooting. An operator who can CTrace, DTrace, sniff networks, read application code, and know how that application code translates to stuff happening on infrastructure, is in a much better position to understand what’s going wrong and how to fix it.

Being able to easily write code means less wasted time doing things manually. If you can code nearly as quickly as you can do something by hand, you will simply write it as a script and never have to think about doing it by hand again — and neither will anyone else, if you have a good method for script-sharing. It also means that forever more, this thing will be done in a consistent way. It is the only way to truly operate at scale.

Scripting everything, even one-time tasks, leads to more reliable operations. When working in complex production environments (and arguably, in any environment), it is useful to write out every single thing you are going to do, and your action plan for any stage you deem dangerous. It might not be a formal “script”, but a command-by-command plan can be reviewed by other people, and it means that you are not making spot decisions under the time pressure of a maintenance window. Even non-developers can do this, of course, but most don’t.
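As a rough illustration of a reviewable, command-by-command plan, here is a minimal Python sketch. The `Step` type, the `render_plan` helper, and the example steps are all hypothetical and invented for this illustration. The one rule it enforces is the one described above: any stage marked dangerous must come with a written fallback before the plan can even be printed for review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One line of a maintenance plan (hypothetical structure)."""
    command: str
    dangerous: bool = False
    fallback: Optional[str] = None  # what to do if this step goes wrong


def render_plan(steps: list[Step]) -> str:
    """Render the plan as numbered text that colleagues can review."""
    lines = []
    for number, step in enumerate(steps, start=1):
        # Refuse to produce a plan with an unplanned dangerous stage.
        if step.dangerous and step.fallback is None:
            raise ValueError(f"step {number} is dangerous but has no fallback")
        marker = "!" if step.dangerous else " "
        lines.append(f"{number:2d}. [{marker}] {step.command}")
        if step.fallback:
            lines.append(f"       if it fails: {step.fallback}")
    return "\n".join(lines)


# Example steps are placeholders, not real commands.
PLAN = [
    Step("take a snapshot of the database volume"),
    Step("apply schema migration", dangerous=True, fallback="restore the snapshot"),
    Step("restart the application tier"),
]

print(render_plan(PLAN))
```

The plan is plain data, so it can sit in revision control and be reviewed like any other change before the maintenance window opens.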

Converging testing and monitoring leads to better operations. This is a place where development and operations truly cross. Deep monitoring converges into full test coverage, and given the push towards test-driven development in agile methodologies, it makes sense to make production monitoring part of the whole testing lifecycle.
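One way to picture this convergence is a single check that serves as both a pre-release test and a production monitor. The sketch below is only an illustration under assumptions: `check_login_page`, `fake_fetch`, `monitor_once`, and the `/login` path are hypothetical names, and the `fetch` argument stands in for whatever HTTP client an organization actually uses.

```python
from dataclasses import dataclass
from typing import Callable

# A fetcher takes a path and returns (HTTP status, response body).
Fetch = Callable[[str], tuple[int, str]]


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str


def check_login_page(fetch: Fetch) -> CheckResult:
    """One assertion about the service, independent of where it runs."""
    status, body = fetch("/login")
    if status != 200:
        return CheckResult("login_page", False, f"unexpected status {status}")
    if "password" not in body:
        return CheckResult("login_page", False, "login form missing")
    return CheckResult("login_page", True, "ok")


def fake_fetch(path: str) -> tuple[int, str]:
    """Stand-in for a real HTTP client pointed at staging (hypothetical)."""
    return 200, "<form><input name='password'></form>"


def test_login_page() -> None:
    # Before release: the check runs as an ordinary test.
    assert check_login_page(fake_fetch).ok


def monitor_once(fetch: Fetch, alert: Callable[[CheckResult], None]) -> None:
    # In production: the same check runs on a schedule and alerts on failure.
    result = check_login_page(fetch)
    if not result.ok:
        alert(result)
```

Because the check takes its fetcher as a parameter, the test suite and the monitor can share the same assertion while pointing at different environments.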

Development disciplines also apply to operations. The systems development lifecycle is applicable to operations projects, and brings discipline to what can otherwise be unstructured work; agile methodologies can be adapted to operations. Writing the tests first, keeping things in a revision control system, and considering systems holistically rather than as a collection of accumulated button-presses are all valuable.

The move to cloud computing is a move towards software-defined everything. Software-defined infrastructure and programmatic access to everything inherently advantages developers, and it turns the hardware-wrangling skills into things for low-level technicians and vendor field engineering organizations. Operations becomes software-oriented operations, one way or another, and development skills are necessary to make this transition.

It is unfortunately easier to teach operations to developers than it is to teach operators to code. This is especially true when you want people to write good, maintainable code: not scripts that call out to shell commands for the utilities they need rather than using the appropriate system libraries, not program structure splattered out in a way that makes re-use nigh-impossible, and not goop that nobody else can read. This is not just about the crude programming skills necessary to bang out scripts; this is about truly understanding the deep voodoo of the interactions between applications, systems, and networks, and being able to neatly encapsulate those things in code when need be.
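To make the shell-out versus system-library contrast concrete, here is a small Python sketch that gets the free space on a filesystem both ways. The function names are hypothetical, and the choice of `df` as the example utility is mine, not the post's. The first version depends on an external binary and on parsing its text output; the second asks the operating system directly through the standard library.

```python
import shutil
import subprocess


def free_bytes_via_shell(path: str) -> int:
    """Fragile style: run df and pick a column out of its text output."""
    # -P requests the POSIX output format, -k reports 1024-byte blocks.
    out = subprocess.run(
        ["df", "-Pk", path], capture_output=True, text=True, check=True
    ).stdout
    # Breaks if the filesystem name contains spaces or df is missing.
    fields = out.splitlines()[1].split()
    return int(fields[3]) * 1024  # fourth column is "Available"


def free_bytes_via_library(path: str) -> int:
    """Sturdier style: use the standard library, with no text parsing."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    print(free_bytes_via_library("."))
```

The library version is shorter, has no dependency on what is installed on the host, and returns a typed value that other code can reuse.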

Devops is a great place for impatient developers who want to see their code turn into results right now; code for operations often comes in a shorter form, producing tangible results in a faster timeframe than the longer lifecycles of app development (even in agile environments). As an industry, we don’t do enough to help people learn the zen of it, and to provide career paths for it. It’s an operations specialty unto itself.

Devops is not just a world in which developers carry pagers; in fact, it doesn’t necessarily mean that application developers carry pagers at all. It’s not even just about a closer collaboration between development and operations. Instead, it can mean that other than your most junior button-pushers and your most intense hardware specialists, your operations people understand both applications and infrastructure, and that they write code as necessary to highly automate the production environment. (This is more the philosophy of Google’s Site Reliability Engineering, than it is Amazon-style devops, in other words.)

But for traditional corporate IT, it means hiring a different sort of person, and paying differently, and altering the career path.

A little while back, I had lunch with a client from a mid-market business, who spent the meal telling me how efficient their IT had become, especially after virtualization, and trying to persuade me that they didn’t need the cloud, now or ever. Curious, I asked how long it typically took to get a virtualized server up and running. The answer turned out to be three days, because while they could push a button and get a VM, all storage and networking still had to be manually provisioned. That led me to probe about a lot of other operations aspects, all of which were done by hand. The client eventually protested, “If I were to do the things you’re talking about, I’d have to hire programmers into operations!” I agreed that this was precisely what was needed, and the client protested that they couldn’t do that, because programmers are expensive, and besides, what would they do with their existing do-everything-by-hand staff? (I’ve heard similar sentiments many times over from clients, but this one really sticks in my mind because of how shocked this particular client was by the notion.)

Yes. Developers are expensive, and for many organizations, it may seem alien to use them in an operations capacity. But there’s a cost to a lack of agility and to unnecessarily performing tasks manually.

But lessons learned in the hot seat of hypergrowth Silicon Valley start-ups take forever to trickle into traditional corporate IT. (Even in Silicon Valley, there’s often a gulf between the way product operations works, and the way traditional IT within that same company works.)

Posted on November 21, 2011, in Industry. 3 Comments.

  1. Academically, your points are sound, but in practice, there are many reasons why developers are the worst operators.

    1) Developers are focused on the creative process. Operations is all about repetition and process. Creative individuals feel stymied by the need to live within a regimented, redundant and routine environment.

    2) Developers never want to break their own code. This is a broad generalization and some developers are awesome at unit testing. However, most unit tests barely test more than 35% of the possible conditions. The truth is, as hard as we try, no one wants to fail. It’s the reason why we have QA. Operations is a production environment. Using developers for operations opens the door to circumvent the testing process and put broken code into production. It seems there is a philosophy in the SaaS universe that due to its size, the only real way to test is to put it in production, cross your fingers and be able to roll back quickly. In actuality, what we need is access to environments like SOASTA, which provides large-scale stress testing.

    3) An operations professional will not let anything into their production environment that might produce a less than optimal outcome. That level of control is critical to ensuring that the production environment is as pristine as possible. This includes patches and software updates. The separation of “church and state” here must be maintained in an enterprise environment.

    4) Most hardcore software engineers that I have worked with (and I include myself among that group, having also hired over 75 in my career) would not enjoy the automation scripting aspects of operations. It’s fine from the perspective of deploying their software to reduce repetitive tasks as part of a build environment, but there’s a reason why successful development teams have a separate individual managing their build environment.

    I’m sure that others could expand this list ad infinitum, but the key point is that using developers to manage operations is a mistake. Driving better collaboration between development and operations staff from project inception through operations management is critical to success.


  2. Because I’ve actually done this before, I think I have a different perspective on this than you do.

    I’m not talking about converging app/product development and operations. I’ve had to manage in that kind of environment, and I think that it sucked — developers hate carrying pagers, operators hate weird stuff happening in production, and so forth. I am talking about putting people with development skills into operations. This is, as I noted, the style of something like Google’s Site Reliability Engineering.

    You are not going to hire the same kind of developer into Operations that you do into Applications Development. They are going to have a different skill set, and a different set of interests, than your traditional applications developer. Their strength is going to be systems programming, with at least one deep strength in a scripting language, as well as the ability to at least read C. (Python plus C/C++ is probably ideal, but you could substitute Ruby or a similar language for Python.)

    This person is not an operator in the traditional sense of the word. His job is not repetitive, because it’s his role to make sure, in fact, that he doesn’t have to do things twice; indeed, that he doesn’t need to do mundane work at all, because the tools he writes allow that work to devolve to junior operators.

    This person is also not someone who does work on whatever your application is. So you don’t have issues with developers pushing broken code into production, circumventing QA, etc. There is still a hard wall between what is in development, what is being staged, and what is being pushed into production.

    This person is a resource to be called upon when stuff breaks in ways that other people can’t figure out, at that intersection of applications, systems, and networks, but their job is not firefighting. Their job is to write code that ensures that things do not break in production, or if they do, that they get automatically corrected. And their job is to make everyone else more efficient, by automating everything that can possibly be automated (which may include integrating commercial as well as open-source tools).


  1. Pingback: Why operations will be transitioning to managing clouds and not applications… - Jared Wray's Blog
