Introducing the new Magic Quadrant for Public Cloud IaaS
I’m happy to announce that the new Gartner Magic Quadrant for Public Cloud Infrastructure as a Service has been published. (Client-only link. Non-clients can read a reprint.)
This is a brand-new Magic Quadrant; our previous Magic Quadrant has essentially been split into two MQs: this new Public Cloud IaaS MQ, which focuses on self-service, and an updated, more tightly scoped iteration of the previous MQ, the Managed Hosting and Cloud IaaS MQ, which focuses on managed services.
It’s been a long and interesting and sometimes controversial journey. Threaded throughout this whole Magic Quadrant are the fundamental dichotomies of the market, like IT Operations vs. developer buyers, new applications vs. existing workloads, “virtualization plus” vs. the fundamental move towards programmatic infrastructure, and so forth. We’ve tried hard to focus on a pragmatic view of the immediate wants and needs of Gartner clients, which also reflect these dichotomies.
This is a Magic Quadrant unlike the ones we have historically done in our services research; it focuses on capabilities and features, in a manner much more comparable to the way we compare software companies than to the way we evaluate network services, managed hosting, or data center outsourcing. This reflects the fact that public cloud IaaS goes far beyond just self-service VMs, creating significant disparities in provider capabilities.
In fact, for this Magic Quadrant, we tried just about every provider hands-on, which is highly unusual for Gartner’s evaluation approach. However, because Gartner’s general philosophy isn’t to do the kind of lab evaluations that we consider to be the domain of journalists, the hands-on work was primarily to confirm that providers had particular features, and the specifics of what they had, without having to constantly pepper them with questions. Consequently, this also involved reading a lot of documentation, community forums, and so on. This wasn’t full-fledged serious trialing. (The expense of the trials was paid on my personal credit card. Fortunately, since this was the cloud, it amounted to less than $150 all told.)
However, like all Magic Quadrants, there’s a heavy emphasis on business factors and not just technology — we are evaluating the positions of companies in the market, which are a composite of many things not directly related to comparable functionality of the services.
Like other Magic Quadrants, this one is targeted at the typical Gartner client — a mid-market company or an enterprise, but also our many tech company clients who range from tiny start-ups to huge monoliths. We believe that cloud IaaS, including the public cloud, is being used to run not only new applications, but also existing workloads. We don’t believe that public cloud IaaS is only for apps written specifically for the cloud, and we certainly don’t believe that it’s only for start-ups or leading-edge companies. It’s a nascent market, yes, but companies can use it productively today as long as they’re thoughtful about their use cases and deployment approach. We also don’t believe that cloud IaaS is solely the province of mass-scale providers; multi-tenancy can be cost-effectively delivered on a relatively small scale, as long as most of the workloads are steady-state (which legacy workloads often are).
Service features, sales, and marketing are all impacted by the need to serve two different buying constituencies, IT Operations and developers. Because we believe that developers are the face of business buyers, though, we believe that addressing this audience is just as important as addressing the traditional IT Operations audience. We do, however, emphasize a fundamentally corporate audience — this is definitely not an MQ aimed at, say, an individual building an iPhone app, or even non-technology small businesses.
Nowhere are those dichotomies better illustrated than by two of the Leaders in this MQ — Amazon Web Services and CSC. Amazon excels at addressing a developer audience and new applications; CSC excels at addressing a mid-market IT Operations audience on the path towards data center transformation and automation of IT operations management, by migrating to cloud IaaS. Both companies address audiences and use cases beyond that expertise, of course, but they have enormously different visions of their fundamental value proposition, both of which are valid. (For those of you who are going, “CSC? Really?” — yes, really. And they’ve been quietly growing far faster than any other VMware-based provider, so for all you vendors out there, if they’re not on your competitive radar screen, they should be.)
Of course, this means that no single provider in the Magic Quadrant is a fantastic fit for all needs. Furthermore, the right provider is always dependent upon not just the actual technical needs, but also the business needs and corporate culture, like the way that the company likes to engage with its vendors, its appetite for risk, and its viewpoint on strategic vs. tactical vendors.
Gartner has asked its analysts not to debate published research in public (per our updated Public Web Participation policy), especially Magic Quadrants. Consequently, I’m willing to engage in a certain amount of conversation about this MQ in public, but I’m not going to get into the kinds of public debates that I got into last year.
If you have questions about the MQ or are looking for more detail than is in the text itself, I’m happy to discuss. If you’re a Gartner client, please schedule an inquiry. If you’re a journalist, please arrange a call through Gartner’s press office. Depending on the circumstances, I may also consider a discussion in email.
This was a fascinating Magic Quadrant to research and write, and within the limits of that “no public debates” restriction, I may end up blogging more about it in the future. Also, as this is a fast-moving market, we’re highly likely to target an update for the middle of next year.
Five reasons you should work at Gartner with me
Gartner is hiring again! We’ve got a number of open positions, actually, and we’re somewhat flexible about how we use the headcount; we’re looking for great people, and the jobs can adapt to some extent based on what they know. This also means we’re flexible on seniority level — anywhere from about five years of experience to “I have been in the industry forever” is fine. We’re very flexible on background, too; as long as you have a solid grasp of technology, along with an understanding of business, we don’t care if you’re currently an engineer, IT manager, product manager, marketing person, journalist, etc.
First and foremost, we’re looking for an analyst to cover the colocation market, and preferably also data center leasing. Someone who knows one or more other adjacent spaces as well would be great — peering, IP transit, hosting, cloud IaaS, content delivery networks, network services, etc.
We could also use an analyst who can cover some of the things that I cover — cloud IaaS, managed hosting, CDNs, and general Internet topics (managed DNS, domain registration, peering, and so on).
These positions will primarily serve North American clients, but we don’t care where you’re located as long as you can accommodate normal US time zones; these positions are work-from-home.
I love my job. You’ve got to have the right set of personality traits to enjoy it, but if the following five things sound awesome to you, you should come work at Gartner:
1. It is an unbeatably interesting job for people who thrive on input. You will spend your days talking to IT people from an incredibly diverse array of businesses around the globe, who all have different stories to tell about their environments and needs. Vendors will tell you about the cool stuff that they’re doing. You will be encouraged to inhale as much information as you can, reading and researching on your own. You will have one-on-one meetings with hundreds of clients each year (our busiest analysts do over 1,500 one-on-one interactions!), and get to meet countless more in informal interactions. Many of the people you talk to will make you smarter, and all of them will make you more knowledgeable.
2. You get to help people in bite-sized chunks. People will tell you their problems and you will try your best to help them in thirty minutes. After those thirty minutes, their problem is no longer yours; they’re the ones who are going to have to go back and fight through their politics and tangled snarl of systems to get things done. It’s hugely satisfying if you enjoy that kind of thing, especially since you do often get long-term feedback about how much you helped them. You’ll help IT buyer clients choose the right strategy, pick the right vendors, and save tons of money by smart contract negotiation. You’ll help vendors with their strategy, design better products, understand the competition, and serve their customers better. You’ll help investors understand markets and companies and trends, which translates directly into helping them make money. Hopefully, you’ll get to influence the market in a way that’s good for everyone.
3. You get to work with great colleagues. Analysts here are smart and self-motivated. There’s no real hierarchy; we work collaboratively and as equals, regardless of our titles, with ad-hoc leadership as needed. Also, analysts are articulate, witty, and opinionated, which always makes for fun interactions. Your colleagues will routinely provide you with new insights, challenge your thinking, and provide amazing amounts of expertise in all kinds of things. There’s almost always someone who is deeply expert in whatever you want to talk about. Analysts are Gartner’s real product; research and events are a result of the people. Our turnover is extremely low.
4. Your work is self-directed. Nobody tells you what to do here beyond some general priorities and goals; there’s very little management. You’re expected to figure out what you need to do with some guidance from your manager and input from your peers, manage your time accordingly, and go do it. You mostly get to figure out how to cover your market, and aim towards what clients are interested in. Your research agenda and coverage are flexible, and you can expand into whatever you can be expert in. You set your own working hours. Most people work from home.
5. We don’t do any pay-for-play. Integrity is a core value at Gartner, so you won’t be selling your soul. About 80% of our revenue comes from IT buyers, not vendors. Unlike most other analyst firms, we don’t do commissioned white papers, or anything else that could be perceived as an endorsement of a vendor; also, unlike some other analyst firms, analysts don’t have any sales responsibility for bringing in vendor sales or consulting engagements, or being quoted in press releases, etc. You will neither need to know nor care which vendors are clients or what they’re paying (any vendor can do briefings, though only clients get inquiry). Analysts must be unbiased, and management fiercely defends your right to write and say anything you want, as long as it’s backed up by solid evidence and is presented professionally, no matter how upset it makes a vendor. (Important downside: We don’t allow side work like participation in expert nets, and we don’t allow you or your immediate family to have any financial interest in the areas you cover, including employment or stock ownership in related companies. If your spouse works in tech, this can be a serious limiter.)
Poke me if you’re interested. I have a keen interest in seeing great people hired into these roles fast, since they’re going to be taking a big chunk of my current workload.
Beware misleading marketing of “private clouds”
Many cloud IaaS providers have been struggling to articulate their differentiation for a while now, and many of them labor under the delusion that “not being Amazon” is differentiating. But it also tends to lead them into misleading marketing, especially when it comes to trying to label their multi-tenant cloud IaaS “private cloud IaaS”, to distinguish it from Those Scary And Dangerous Public Cloud Guys. (And now that we have over four dozen newly-minted vCloud Powered providers in the early market-entrance stage, the noise is only going to get worse, as these providers thrash about trying to differentiate.)
Even providers who are clear in their marketing material that the offering is a public, multi-tenant cloud IaaS, sometimes have salespeople who pitch the offering as private cloud. We also find that customers are sometimes under the illusion that they’ve bought a private cloud, even when they haven’t.
I’ve seen three common variants of provider rationalization for why they are misleadingly labeling a multi-tenant cloud IaaS as “private cloud”:
We use a shared resource pool model. These providers claim that because customers buy by the resource pool allocation (for instance, “100 vCPUs and 200 GB of RAM”) and can carve that capacity up into VMs as they choose, that capacity is therefore “private”, even though the infrastructure is fully multi-tenant. However, there is always still contention for these resources (even if neither the provider nor the customer deliberately oversubscribes capacity), as well as any other shared elements, like storage and networking. It also doesn’t alter any of the risks of multi-tenancy. In short, a shared resource pool, versus a pay-by-the-VM model, is largely just a matter of the billing scheme and management convenience, possibly including the nice feature of allowing the customer to voluntarily self-oversubscribe his purchased resources. It’s certainly not private. (This is probably the situation that customers most commonly confuse as “private”, even after long experience with the service — a non-trivial number of them think the shared resource pool is physically carved off for them.)
Our customers don’t connect to us over the Internet. These providers claim that private networking makes them a private cloud. But in fact, nearly all cloud IaaS providers offer multiple networking options other than plain old Internet, ranging from IPsec VPN over the Internet to a variety of private connectivity options from the carrier of your choice (MPLS, Ethernet, etc.). This has been true for years now, as I noted when I wrote about Amazon’s introduction of VPC back in 2009. Even Amazon essentially offers private connectivity these days, since you can use AWS Direct Connect to get a cross-connect at select Equinix data centers, and from there, buy any connectivity that you wish.
We don’t allow everyone to use our cloud, so we’re not really “public”. These providers claim to have a “private cloud” because they vet their customers and only allow “real businesses”, however they define that. (The ones who exclude net-native companies as not being “real businesses” make me cringe.) They claim that a “public cloud” would allow anyone to sign up, and it would be an uncontrolled environment. This is hogwash. It can also lead to a false sense of complacency, as I’ve written before — the assumption that their customers are good guys means that they might not adequately defend against customer compromises or customer employees who go rogue.
The NIST definition of private cloud is clear: “Private cloud. The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.” In other words, NIST defines private cloud as single-tenant.
Given the widespread use of NIST cloud definitions, and the reasonable expectation that customers have that a provider’s terminology for its offering will conform to those definitions, calling a multi-tenant offering “private cloud” is misleading at best. And at some point in time, the provider is going to have to fess up to the customer.
I do fully acknowledge that by claiming private cloud, a provider will get customers into the buying cycle that they wouldn’t have gotten if they admitted multi-tenancy. Bait-and-switch is unpleasant, though, and given that trust is a key component of provider relationships as businesses move into the cloud, customers should use providers that are clear and up-front about their architecture, so that they can make an accurate risk assessment.
Cloud IaaS is not magical, and the Amazon reboot-a-thon
Randy Bias has blogged about Amazon mandating instance reboots for hundreds, perhaps thousands, of instances (Amazon’s term for VMs). Affected instances seem to be scheduled for reboots over the next couple of weeks. Speculation is that the reboots are to patch a recently-reported vulnerability in the Xen hypervisor, which is the virtualization technology that underlies Amazon’s EC2. The GigaOm story gives some links, and the CRN story discusses customer pain.
Maintenance reboots are not new on Amazon, and are detailed on Amazon’s documentation about scheduled maintenance. The required reboots this time are instance reboots, which are easily accomplished — just point-and-click to reboot on your own schedule rather than Amazon’s (although you cannot delay past the scheduled reboot). Importantly, instance reboots do not result in a change of IP address nor do they erase the data in instance storage (which is normally non-persistent).
For some customers, of course, a reboot represents a headache, and it results in several minutes of downtime for that instance. Also, since this is peak retail season, it is already a sensitive, heavy-traffic time for many businesses, so the timing of this widespread maintenance is problematic for many customers.
However, cloud IaaS isn’t magical. If these customers were using dedicated hosting, they would still be subject to mandated reboots for security patches — hosting providers generally offer some flexibility on scheduling such reboots, but not a lot (and sometimes none at all if there’s an exploit in the wild). If these customers were using a provider that uses live migration technology (like VMotion on a VMware-virtualized cloud), they might be spared reboots for system reasons, but they might still be subject to reboots for mandated operating system patches.
Given that EC2 runs on ordinary physical servers, virtualized without any live migration technology, customers should reasonably expect that they will be subject to reboots — server-level (what Amazon calls a system reboot) as well as instance-level — and should also anticipate that they may sometimes need to reboot for their own guest OS patches and the like (assuming that they don’t simply patch their AMIs and re-launch their instances, arguably a more “cloudy” way to approach this problem).
What makes this rolling scheduled maintenance remarkable is its sheer scale. Hosting providers typically have a few hundred customers and a few thousand servers. Mass-market VPS hosters have lots of VPS containers, but there’s a roughly 1:1 VPS-to-customer ratio and a small-business-centricity that doesn’t lead to this kind of hullabaloo. Only the largest cloud IaaS providers have more than 2,000 VMs, and Amazon’s largest competitor is estimated to be around the 100,000 VM mark. Consequently, this involves a virtually unprecedented number of customers and mission-critical systems.
Amazon has actually been very good about not taking down its cloud customers for extended maintenance windows. (I can think of one major Amazon competitor that took down one whole data center for an eight-hour maintenance window, evidently involving a total outage, this past weekend, and which regularly has long-downtime maintenance windows in general.) A reboot is an inconvenience, but if you are running production infrastructure, you should darn well think about how to handle the occasional reboot, including reboots that affect a significant percentage of your infrastructure, because reboots are not likely to go away in IaaS anytime soon.
To hammer on the point again: Cloud IaaS is not magical. It still requires management, and it still has some of the foibles of both physical servers and non-cloud virtualization. Being able to push a button and get infrastructure is nice, but the responsibility to manage that infrastructure doesn’t go away — it’s just that many cloud customers manage to delay the day of reckoning when the attention they haven’t paid to management comes back to bite them.
If you run infrastructure, regardless of whether it’s in your own data center, in hosting, or in cloud IaaS, you should have a plan for “what happens if I need to mass-reboot my servers?” because it is something that will happen. And add “what if I have to do that immediately?” to the list, because that is also something that will happen, because mass exploits and worms certainly have not gone away.
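To make that plan concrete, here is a minimal sketch of a controlled mass reboot against EC2 using the boto3 SDK (which postdates this post). The tag name, batch size, and settle time are hypothetical, and a real plan would also need health checks and application-level draining before each batch.

```python
# Hypothetical sketch: reboot a fleet of EC2 instances in small batches,
# on your own schedule, rather than all at once on the provider's schedule.
# Assumes boto3 credentials/region are configured; the tag name is made up.
import time

import boto3

ec2 = boto3.client("ec2")


def tagged_instance_ids(tag_key="needs-maintenance-reboot", tag_value="true"):
    """Find running instances flagged for the maintenance reboot."""
    ids = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                ids.append(instance["InstanceId"])
    return ids


def rolling_reboot(instance_ids, batch_size=5, settle_seconds=300):
    """Reboot in batches so only a fraction of capacity is down at once."""
    for i in range(0, len(instance_ids), batch_size):
        batch = instance_ids[i : i + batch_size]
        ec2.reboot_instances(InstanceIds=batch)
        print(f"Rebooted {batch}; waiting {settle_seconds}s before next batch")
        time.sleep(settle_seconds)


if __name__ == "__main__":
    rolling_reboot(tagged_instance_ids())
```

The same skeleton works for the "do it immediately" case; you just shrink the settle time and accept the larger simultaneous capacity hit.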
(Gartner clients only: Check out a note by my security colleagues, “Address Concentration Risk in Public Cloud Deployments and Shared-Service Delivery Models to Avoid Unacceptable Losses”.)
Cloud IaaS feature sets and target buyers
As I noted previously, cloud IaaS is a lot more than just self-service VMs. As service providers strive to differentiate themselves from one another, they enter a software-development rat race centered around “what other features can we add to make our cloud more useful to customers”.
However, cloud IaaS providers today have to deal with two different constituencies — developers (developers are the face of business buyers) and IT Operations. These two groups have different priorities and needs, and sometimes even different use cases for the cloud.
IaaS providers may be inclined to cater heavily towards one group or the other, and selectively add features that are critical to the other group, in order to ease buying frictions. Others may decide to try to appeal to both — a strategy likely to be available only to those with a lot of engineering resources at their disposal. Over time (years), there will be convergence in the market, as all providers reach a certain degree of feature parity on the critical bits, and then differentiation will be on smaller bits of creeping featurism.
Take a feature like role-based access control (RBAC). For the needs of a typical business buyer — where the developers are running the show on a project basis — RBAC is mostly a matter of roles on the development team, likely in a fairly minimalistic way, but fine-grained security may be desired on API keys so that any script’s access to the API is strictly limited to just what that script needs to do. For IT Operations, though, RBAC needs tend to get blown out into full-fledged lab management — having to manage a large population of users (many of them individual developers) who need access to their own pools of infrastructure and who want to be segregated from one another.
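As a purely illustrative sketch of the "fine-grained API keys" case, here is what binding a key to just the actions one script needs might look like. The key names, action strings, and policy structure are hypothetical, not any specific provider's API.

```python
# Hypothetical sketch of fine-grained, per-key authorization: each API key
# is bound to exactly the actions (and resource prefixes) one script needs.
from dataclasses import dataclass


@dataclass
class ApiKeyPolicy:
    allowed_actions: set
    resource_prefix: str = ""   # e.g. restrict the key to one project's resources


POLICIES = {
    # A nightly backup script can only list and snapshot volumes.
    "key-backup-script": ApiKeyPolicy({"volume.list", "volume.snapshot"}, "proj-billing/"),
    # A deploy script can create and destroy VMs, but only in its own project.
    "key-deploy-script": ApiKeyPolicy({"vm.create", "vm.delete"}, "proj-web/"),
}


def authorize(api_key: str, action: str, resource: str) -> bool:
    """Return True only if this key may perform this action on this resource."""
    policy = POLICIES.get(api_key)
    if policy is None:
        return False
    return action in policy.allowed_actions and resource.startswith(policy.resource_prefix)


# Even if the backup script is compromised, its key cannot delete VMs.
assert authorize("key-backup-script", "volume.snapshot", "proj-billing/vol-123")
assert not authorize("key-backup-script", "vm.delete", "proj-web/vm-7")
```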
Some providers like to think of the business buyer vs. IT Operations buyer split as a “new applications” vs. “legacy applications” split instead. I think there’s an element of truth to that, but it’s often articulated as “commodity components that you can self-assemble if you’re smart enough to know how to architect for the cloud” vs. “expensive enterprise-class gear providing a safe familiar environment”. This latter distinction will become less and less relevant as an increasing number of providers offer multi-tiered infrastructure at different price points within the same cloud. Similarly, the “new vs. legacy apps” distinction will fade with feature-set convergence — a broad-appeal cloud IaaS offering should be able to support either type of workload.
But the buying constituencies themselves will remain split. The business and IT will continue to have different priorities, despite the best efforts of IT to try to align itself closer to what the business needs.
Performance can be a disruptive competitive advantage
All of us are used to going to travel sites, especially for airline tickets, and waiting a while for the appropriate results to be extracted and displayed to us. I recently saw Google Flight Search for the first time and was astonished by its raw speed — essentially completely instant.
I frequently talk to customers about acceleration solutions, and discuss the business value of performance. Specifically, this is a look at business metrics that measure the success of a website or application — time spent on your site, conversion rate, shopping basket value, page views, ad views, transactions processed, employee productivity, decline in call center volume, and so forth. You compare the money associated with these metrics, against the cost of the solutions, to look at comparative ROI.
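As a back-of-the-envelope illustration of that comparison, here is a tiny sketch with entirely made-up numbers; the point is only that the revenue attributable to a performance improvement can be weighed directly against the cost of the acceleration solution.

```python
# Entirely hypothetical numbers: weigh the revenue lift attributed to a
# performance improvement against the annual cost of the acceleration solution.
monthly_visits = 2_000_000
baseline_conversion = 0.020      # 2.0% of visits convert today
improved_conversion = 0.022      # assume faster pages lift this to 2.2%
average_order_value = 80.00      # dollars per converted visit

extra_orders_per_month = monthly_visits * (improved_conversion - baseline_conversion)
extra_revenue_per_year = extra_orders_per_month * average_order_value * 12

solution_cost_per_year = 150_000.00   # hypothetical CDN/acceleration spend

print(f"Incremental revenue/year: ${extra_revenue_per_year:,.0f}")
print(f"Solution cost/year:       ${solution_cost_per_year:,.0f}")
print(f"Simple ROI multiple:      {extra_revenue_per_year / solution_cost_per_year:.1f}x")
```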
The business value of performance is usually tied to industry in a narrow and specific way, because users have a particular set of expectations and needs. For instance, for travel sites, a certain amount of performance is necessary in order to make the site usable, but users are conditioned to long waits for searches, which keeps their overall performance expectations relatively low. Travel sites usually discover that generalized improvements in site responsiveness improve the user experience and cause revenue per site visit to increase — but only up to a certain point, at which it plateaus: the site is responsive enough that users aren’t discouraged from using it, and they’re going to buy what they came to buy.
Google Flight Search proves that you can “break through” the performance ceiling and entirely change the user experience. This is not the kind of incremental improvement you can achieve through acceleration techniques; instead, it’s a core change to the thing that is slowest, which is generally the back-end database and business logic, not the network. This can actually be a disruptive competitive advantage.
I typically ask my CDN clients, “What are the factors that make your site slow?” In many cases, they need to do something that goes beyond what edge caching or even network optimization (dynamic acceleration) can achieve. They need to reduce their page weight, or write better pages (and may benefit from front-end optimization techniques), or to improve the back-end responsiveness. Acceleration techniques are often used to band-aid a core problem with performance, just like CDN professional services to make a site cacheable are often used to band-aid a core problem with site structure. At some point in time it becomes more cost-effective to fix the core problem.
Too few businesses design their websites and applications with speed in mind.
Cotendo’s potential acquisition
Thus far, merger-watchers eyeing the rumored bidding for Cotendo seem to be asking: Why this high a valuation compared to the rest of the CDN industry? Who are the potential suitors and why? What if anything does Cotendo offer that other CDNs don’t? How do the various dynamic offerings in the market compare? Who else might be ripe for acquisition? What is the general trend of M&A activity in the CDN industry going forward? Do I agree with Dan Rayburn’s commentary on this deal?
However, for various reasons, I am not currently publicly commenting further on Twitter or my blog, or really in general with non-Gartner-clients, regarding the potential acquisition of Cotendo by Akamai (or AT&T, or Juniper, or anyone else who might be interested in buying them).
If you are a Gartner client, and you want to discuss the topic, you may request a written response or a phone call through the usual mechanisms for inquiry.
Private clouds aren’t necessarily more secure
Eric Domage, an analyst over at IDC, is being quoted as saying, “The decision in the next year or two will only be about the private cloud. The bigger the company, the more they will consider the private cloud. The enterprise cloud is locked down and totally managed. It is the closest replication of virtualisation.” The same article goes on to quote Domage as cautioning against the dangers of the public cloud, and, quoting the article: “He urged delegates at the conference to ‘please consider more private cloud than public cloud.'”
I disagree entirely, and I think Domage’s comments ignore the reality of what is going on within enterprises, both in terms of their internal private cloud initiatives, as well as their adoption of public cloud IaaS. (I assume Domage’s commentary is intended to be specific to infrastructure, or it would be purely nonsensical.)
While not all IaaS providers build to the same security standards, nearly all build a high degree of security into their offering. Furthermore, end-to-end encryption, which Domage claims is unavailable in public cloud IaaS, is available in multiple offerings today, presuming that it refers to end-to-end network encryption along with encryption of storage both in motion and at rest. (Obviously, computation has to occur either on unencrypted data, or your app has to treat encrypted data as a passthrough.)
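For the "passthrough" point, here is a minimal sketch of client-side encryption, so the provider only ever stores ciphertext; it uses the Python cryptography library's Fernet recipe, and the storage call is a stand-in for whatever object-store API you actually use.

```python
# Minimal sketch: encrypt on the client, so the cloud provider only ever
# stores ciphertext (data at rest stays opaque to the provider).
# Requires: pip install cryptography
from cryptography.fernet import Fernet

# In practice the key lives in your own key-management system, not in the cloud.
key = Fernet.generate_key()
f = Fernet(key)


def put_object(bucket: str, name: str, ciphertext: bytes) -> None:
    """Stand-in for a real object-store upload call (made over TLS)."""
    print(f"storing {len(ciphertext)} encrypted bytes as {bucket}/{name}")


record = b"customer=acme; card=4111-XXXX; balance=1023.50"
put_object("backups", "records/acme.bin", f.encrypt(record))

# To compute on the data, you decrypt it on infrastructure you trust:
assert f.decrypt(f.encrypt(record)) == record
```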
And for the truly paranoid, you can adopt something like Harris Trusted Cloud — private or public cloud IaaS built with security and compliance as the first priority, where each and every component is checked for validity. (Wyatt Starnes, the guiding brain behind this, founded Tripwire, so you can guess where the thinking comes from.) Find me an enterprise that takes security to that level today.
I’ve found that the bigger the company, the more likely they are to have already adopted public cloud IaaS. Yes, it’s tactical, but their businesses are moving faster than they can deploy private clouds, and the workloads they have in the public cloud are growing every day. Yes, they’ll also build a private cloud (or in many cases, already have), but they’ll use both.
The idea that the enterprise cloud is “locked down and totally managed” is a fantasy. Not only do many enterprises struggle with managing the security within private clouds, many of them have practically surrendered control to the restless natives (developers) who are deploying VMs within that environment. They’re struggling with basic governance, and often haven’t extended their enterprise IT operations management systems successfully into that private cloud. (Assuming, as the article seems to imply, that private cloud is being used to refer to self-service IaaS, not merely virtualized infrastructure.)
The head-in-the-sand “la la la public cloud is too insecure to adopt, only I can build something good enough” position will only make an enterprise IT manager sound clueless and out of touch both with reality and with the needs of the business.
Organizations certainly have to do their due diligence — hopefully before, and not after, the business is asking what cloud infrastructure solutions can be used right this instant. But the prudent thing to do is to build expertise with public cloud (or hosted private cloud) and, for organizations that intend to continue running data centers long-term, to simultaneously build out a private cloud.
Would you like to run Apache Wave, Grandma?
As many people already know, Google is sunsetting Google Wave. This has led to Google sending an email to people who previously signed up for Wave. The bit in the email that caught my eye was this:
If you would like to continue using Wave, there are a number of open source projects, including Apache Wave. There is also an open source project called Walkaround that includes an experimental feature that lets you import all your Waves from Google.
For an email sent to Joe Random Consumer, it’s remarkably clueless as to what consumers actually can comprehend. Grandma is highly unlikely to understand what the heck that means.
Tip for everyone offering a product or service to consumers: Any communication with consumers about that product or service should be in language that Grandma can understand.
Why developers make superior operators
Developers who deeply understand the arcana of infrastructure, and operators who can code and understand the interaction of applications and infrastructure, are better than developers and operators who understand only their own discipline. But it’s typically easier, from the perspective of training, for a developer to learn operations, than for an operator to learn development.
While there are a fair number of people who teach themselves on the job, most developers still come out of formal computer science backgrounds. The effectiveness of formal education in CS varies immensely, and you can get a good understanding by reading on your own, of course, if you read the right things — it’s the knowledge that matters, not how you got it. But ideally, a developer should accumulate the background necessary to understand the theory of operating systems, and then have a deeper knowledge of the particular operating system that they primarily work with, as well as the arcana of the middleware. It’s intensely useful to know how the abstract code you write actually runs in practice. Even if you’re writing in a very high-level programming language, knowing what’s going on under the hood will help you write better code.
Many people who come to operations from the technician end of things never pick up this kind of knowledge; a lot of people who enter either systems administration or network operations do so without the benefit of a rigorous education in computer science, whether from college or self-administered. They can do very well in operations, but it’s generally not until you reach the senior-level architects that you commonly find people who deeply understand the interaction of applications, systems, and networks.
Unfortunately, historically, we have seen this division in terms of relative salaries and career paths for developers vs. operators. Operators are often treated like technicians; they’re often smart learn-on-the-job people without college degrees, but consequently, companies pay accordingly and may limit advancement paths accordingly, especially if the company has fairly strict requirements that managers have degrees. Good developers often emerge from college with minimum competitive salary requirements well above what entry-level operations people make.
Silicon Valley has a good collection of people with both development and operations skills because so many start-ups are founded by developers, who chug along, learning operations as they go, because initially they can’t afford to hire dedicated operations people; moreover, for more than a decade, hypergrowth Internet start-ups have deliberately run devops organizations, making the skillset both pervasive and well-paid. This is decidedly not the case in most corporate IT, where development and operations tend to have a hard wall between them, and people tend to be hired for heavyweight app development skills, more so than capabilities in systems programming and agile-friendly languages.
Here are my reasons why developers make better operators, or perhaps more accurately, an argument for why a blended skillset is best. (And here I stress that this is personal opinion, and not a Gartner research position; for official research, check out the work of my esteemed colleagues Cameron Haight and Sean Kenefick. However, as someone who was formally educated as a developer but chose to go into operations, and who has personally run large devops organizations, this is a strongly held set of opinions for me. I think that to be a truly great architect-level ops person, you also have to have a developer’s skillset, and I believe it’s important for mid-level people as well, which I recognize is a controversial opinion.)
Understanding the interaction of applications and infrastructure leads to better design of both. This is an architect’s role, and good devops understand how to look at applications and advise developers how they can make them more operations-friendly, and know how to match applications and infrastructure to one another. Availability, performance, and security are all vital to understand. (Even in the cloud, sharp folks have to ask questions about what the underlying infrastructure is. It’s not truly abstract; your performance will be impacted if you have a serious mismatch between the underlying infrastructure implementation and your application code.)
Understanding app/infrastructure interactions leads to more effective troubleshooting. An operator who can strace or DTrace a process, sniff networks, read application code, and know how that application code translates to stuff happening on infrastructure, is in a much better position to understand what’s going wrong and how to fix it.
Being able to easily write code means less wasted time doing things manually. If you can code nearly as quickly as you can do something by hand, you will simply write it as a script and never have to think about doing it by hand again — and neither will anyone else, if you have a good method for script-sharing. It also means that forever more, this thing will be done in a consistent way. It is the only way to truly operate at scale.
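As a trivial illustration of the "script it once, share it, never do it by hand again" point, here is a sketch of the kind of chore that tends to live as tribal point-and-click knowledge until somebody writes it down; the path and retention policy are made up.

```python
# Hypothetical example: prune old application log files past a retention window,
# the kind of task that gets done by hand until someone scripts it once.
import os
import time

RETENTION_DAYS = 14            # made-up retention policy
LOG_DIR = "/var/log/myapp"     # made-up path


def prune_old_logs(directory: str = LOG_DIR, days: int = RETENTION_DAYS) -> None:
    """Delete files in `directory` whose modification time is older than `days`."""
    cutoff = time.time() - days * 86400
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            print(f"removed {path}")


if __name__ == "__main__":
    prune_old_logs()
```

Once this exists in a shared repository, the task is done the same way every time, by anyone, on any number of machines.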
Scripting everything, even one-time tasks, leads to more reliable operations. When working in complex production environments (and arguably, in any environment), it is useful to write out every single thing you are going to do, and your action plan for any stage you deem dangerous. It might not be a formal “script”, but a command-by-command plan can be reviewed by other people, and it means that you are not making spot decisions under the time pressure of a maintenance window. Even non-developers can do this, of course, but most don’t.
Converging testing and monitoring leads to better operations. This is a place where development and operations truly cross. Deep monitoring converges into full test coverage, and given the push towards test-driven development in agile methodologies, it makes sense to make production monitoring part of the whole testing lifecycle.
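One concrete way to converge the two is to write health checks so that the same assertion runs both in the test suite and as a production monitor. A minimal sketch follows; the URLs, endpoint, and thresholds are hypothetical.

```python
# Sketch: one check, two uses. The same function runs under the test suite
# at deploy time and under a scheduler or monitoring agent in production.
# Requires: pip install requests
import requests


def check_checkout_healthy(base_url: str, max_latency_s: float = 2.0) -> None:
    """Raise AssertionError if the checkout endpoint is broken or too slow."""
    resp = requests.get(f"{base_url}/checkout/health", timeout=max_latency_s)
    assert resp.status_code == 200, f"unexpected status {resp.status_code}"
    assert resp.elapsed.total_seconds() < max_latency_s, "checkout is too slow"


# As a test (e.g. pytest picks this up against a staging environment):
def test_checkout_healthy():
    check_checkout_healthy("https://staging.example.com")


# As a monitor: a cron job or agent calls the same check against production
# and pages someone if it raises.
if __name__ == "__main__":
    check_checkout_healthy("https://www.example.com")
```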
Development disciplines also apply to operations. The systems development lifecycle is applicable to operations projects, and brings discipline to what can otherwise be unstructured work; agile methodologies can be adapted to operations. Writing the tests first, keeping things in a revision control system, and considering systems holistically rather than as a collection of accumulated button-presses are all valuable.
The move to cloud computing is a move towards software-defined everything. Software-defined infrastructure and programmatic access to everything inherently advantages developers, and it turns the hardware-wrangling skills into things for low-level technicians and vendor field engineering organizations. Operations becomes software-oriented operations, one way or another, and development skills are necessary to make this transition.
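To give a sense of what "programmatic access to everything" means in practice, here is a minimal sketch of provisioning capacity in code rather than by filing a ticket, again using boto3; the AMI ID, instance type, and tag values are made up.

```python
# Sketch of software-defined infrastructure: capacity comes from an API call,
# not a ticket. The AMI ID, instance type, and tags here are hypothetical.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # made-up AMI baked by your build pipeline
    InstanceType="t3.medium",
    MinCount=2,
    MaxCount=2,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "service", "Value": "web-frontend"}],
    }],
)

for instance in response["Instances"]:
    print("launched", instance["InstanceId"])
```

The same call sits naturally inside a deployment pipeline or an auto-scaling script, which is exactly why development skills matter for this style of operations.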
It is unfortunately easier to teach operations to developers than it is to teach operators to code. This is especially true when you want people to write good and maintainable code — not the kind of script in which people call out to shell commands for the utilities they need rather than using the appropriate system libraries, splatter out the kind of program structure that makes re-use nigh-impossible, or write goop that nobody else can read. This is not just about the crude programming skills necessary to bang out scripts; this is about truly understanding the deep voodoo of the interactions between applications, systems, and networks, and being able to neatly encapsulate those things in code when need be.
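To illustrate the "shelling out instead of using system libraries" complaint in miniature, here is a small, contrived contrast; both versions work, but one is the kind of goop the paragraph above is warning about.

```python
# Both of these count the entries in a directory.
import os
import subprocess


def count_entries_goop(path: str) -> int:
    # Shells out to ls and wc: breaks on odd filenames, hides errors,
    # and depends on whatever shell happens to be available.
    out = subprocess.run(f"ls {path} | wc -l", shell=True, capture_output=True, text=True)
    return int(out.stdout.strip())


def count_entries_clean(path: str) -> int:
    # Uses the standard library directly: no shell, no parsing, clear errors.
    return len(os.listdir(path))
```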
Devops is a great place for impatient developers who want to see their code turn into results right now; code for operations often comes in a shorter form, producing tangible results in a faster timeframe than the longer lifecycles of app development (even in agile environments). As an industry, we don’t do enough to help people learn the zen of it, and to provide career paths for it. It’s an operations specialty unto itself.
Devops is not just a world in which developers carry pagers; in fact, it doesn’t necessarily mean that application developers carry pagers at all. It’s not even just about a closer collaboration between development and operations. Instead, it can mean that other than your most junior button-pushers and your most intense hardware specialists, your operations people understand both applications and infrastructure, and that they write code as necessary to highly automate the production environment. (This is more the philosophy of Google’s Site Reliability Engineering, than it is Amazon-style devops, in other words.)
But for traditional corporate IT, it means hiring a different sort of person, and paying differently, and altering the career path.
A little while back, I had lunch with a client from a mid-market business, who spent the time telling me about how efficient their IT had become, especially after virtualization — trying to persuade me that they didn’t need the cloud, now or ever. Curious, I asked how long it typically took to get a virtualized server up and running. The answer turned out to be three days — because while they could push a button and get a VM, all storage and networking still had to be manually provisioned. That led me to probe about a lot of other operations aspects, all of which were done by hand. The client eventually protested, “If I were to do the things you’re talking about, I’d have to hire programmers into operations!” I agreed that this was precisely what was needed, and the client protested that they couldn’t do that, because programmers are expensive, and besides, what would they do with their existing do-everything-by-hand staff? (I’ve heard similar sentiments many times over from clients, but this one really sticks in my mind because of how shocked this particular client was by the notion.)
Yes. Developers are expensive, and for many organizations, it may seem alien to use them in an operations capacity. But there’s a cost to a lack of agility and to unnecessarily performing tasks manually.
But lessons learned in the hot seat of hypergrowth Silicon Valley start-ups take forever to trickle into traditional corporate IT. (Even in Silicon Valley, there’s often a gulf between the way product operations works, and the way traditional IT within that same company works.)