dev2ops: delivering application change

Tuesday, September 29, 2009

dev2ops blog has a new home!

We've moved the dev2ops blog to a new service!

http://dev2ops.org is now the main address for the blog. If you've been accessing the blog via that address, the change should be seamless for you.

(The dev2ops RSS feed is still being provided through FeedBurner and the feed address will not change)

Monday, September 28, 2009

Q&A: Lee Thompson, former Chief Technologist of E*TRADE Financial

I recently caught up with Lee Thompson to discuss a variety of Dev to Ops topics including deployment, testing, and the conflict between dev and ops.

Lee recently left E*TRADE Financial where he was VP & Chief Technologist. Lee's 13 years at E*TRADE saw two major boom and bust cycles and dramatic expansion in E*TRADE's business.

Damon:
You've had large scale ops and dev roles... what lessons have you learned that required the perspective of both?

Lee:
I was heavy into ops during the equity bubble of 1998 to 2000 and during that time we scaled E*TRADE from 40,000 trades a day to 350,000 trades a day. After going through that experience, all my software designs changed. Operability and deployability became large concerns for any kind of infrastructure I was designing. Your development and your architecture staff have to hand the application off to your production administrators so the architects can get some sleep. You don't want your developers involved in running the site. You want them building the next business function for the company. The only way that is going to happen is to have the non-functional requirements --deployability, scalability, operability-- already built into your design. So that experience taught me the importance of non-functional requirements in the design process.

Damon:
You use the phrase "wall of confusion" a lot... can you explain the nature of the dev and ops conflict?

Lee:
When dealing with a large distributed compute infrastructure you are going to have applications that are difficult to run. The operations and systems engineering staff who are trying to keep the business functions running are going to say "oh these developers don't get it". And then back over on the developer side they are going to say "oh these ops guys don't get it". It's just a totally different mindset. The developers are all about changing things very quickly and the ops team is all about stability and reliability. One company, but two very different mindsets. Both want the company to succeed, but they just see different sides of the same coin. Change and stability are both essential to a company's success.

Damon:
How can we break down the wall of confusion and resolve the dev and ops conflicts?

Lee:
The first step is being clear about non-functional requirements and avoiding what I call "peek-a-boo" requirements.

Here's a common "peek-a-boo" scenario:
Development (glowing with pride in the business functions they've produced): "Here's the application"
Operations: "Hey, this doesn't work"
Development: "What do you mean? It works just fine"
Operations: "Well it doesn't run with our deployment stack"
Development: "What deployment stack?"
Operations: "The stuff we use to push all of the production infrastructure"

The non-functional requirements become late cycle "peek-a-boo" requirements when they aren't addressed early in development. Late cycle requirements violate continuous integration and agile development principles. The production tooling and requirements have to be accounted for in the development environment, but most enterprises don't do that. Since the deployment requirements aren't covered in dev, the operations staff receiving the application has to innovate out of necessity. They end up writing a number of tools that, over time, become bigger and bigger and more of a problem, which Alex described last year in the Stone Axe post. Deployability is a business requirement and it needs to be accounted for in the development environment just like any other business requirement.

Damon:
Deployment seems to be a topic of discussion that is increasing in popularity... why is that?

Lee:
Deployability and the other non-functional requirements have always been there, they were just often overlooked. You just made do. But a few things have happened.

1. Complexity and commodity computing, both in the hardware and software, have meant that your deployment is getting to the point where automation is mandatory. When I started at E*TRADE there were 3 Solaris boxes. When I left, the number of servers was orders and orders of magnitude larger (the actual number is proprietary). Since the operations staff can't afford to log into every box, they end up writing tools because the developers didn't give them any.

2. Composite applications, where applications are composed of other applications, mean that every application has to be deployed and scaled independently. Versioning further complicates matters. Before the Internet, in the PC and mainframe industries, you were used to delivering and maintaining multiple versions of software simultaneously. In the early days of the Internet, a lot of that went away and you only had two versions -- the one deployed and the one about to be deployed. Now with the componentization of services and the mixing and matching of components you'll find that typically you have several versions of a piece of infrastructure facing different business functions. So you might be running three or four independently deployed and managed versions of the same application component within your firewall.

3. Cloud computing takes both complexity and the need for deployability up another notch. Now you are looking to remote a portion of your infrastructure into the Internet. Almost everyone who is starting up a company right now is not talking about building a datacenter, they are all talking about pushing to the cloud. So deployability is very much at the forefront of their thinking about how to deliver their business functions. And the cloud story is only beginning. For example, what happens when you get a new generation of requirements like the ability to automate failover between cloud vendors?

Damon:
Testing is one of those things that everyone knows is good, but seems to rarely get adequately funded or properly executed. Why is that?

Lee:
Well, like many things it's often simply a poor understanding of what goes into doing it right and an oversimplification of what the business value really is.

Just like deployment infrastructure, proper testing infrastructure is a distributed application in and of itself. You have to coordinate the provisioning of a mocked up environment that mimics your production conditions and then boot up a distributed application that actually runs the tests. The level of thought and effort that has to go into properly doing that can't be overlooked. Well, not if you are serious about delivering on quality of service.

While integration testing should be a very important piece of your infrastructure, the importance of antagonistic testing also can't be overlooked. For example, the CEO is going to want to know what happens when you double the load on your business. The only way to really know that is to have a good facsimile of your business application mocked-up and those exact scenarios tested. That is a large scale application in and of itself and takes investment.

Beyond service quality, there is business value in proper testing infrastructure that is often overlooked. When you start to build up a large quantity of high fidelity tests those tests actually represent knowledge about your business. Knowledge is always an asset to your business. It's pretty clear that businesses who know a lot about themselves tend to do well and those who lack that knowledge tend not to be very durable.

Damon:
The culture of any large company is going to be restrictive. Large financial institutions are, by their very nature, going to be more restrictive. Coming out of that culture, what are you most excited to do?

Lee:
Punditry, blogging and using social media to start! You really can't do that from behind the firewall in the FI world. There are legitimate reasons for the restrictions, and I understand that. Because you have to contend with a lot of regulatory concerns, you just aren't going to see a lot of financial institution technologists ranting online about what is going on behind the firewall. So I'm excited about becoming a producer of social media content rather than just a consumer.

I also find consulting exciting. It's been fun getting out there and seeing a variety of companies and how similar their problems are to each other and to what I worked on at E*TRADE. It reminds me how advanced E*TRADE really is and what we had to contend with. I enjoy applying the lessons I've learned over my career to helping other companies avoid pitfalls and helping them position their IT organization for success.

Friday, September 25, 2009

"Stability anti-patterns" highlight importance of tracking non-functional requirements

Michael Nygard, author of the influential Release It!, delivered a fantastic speech at QCon London where he dives into the various flavors of "stability anti-patterns" that can (and probably will) plague your web operations.

The explicit lessons alone are worth watching his presentation. However, there is also a more subtle lesson delivered over the course of the speech. Whether it was intentional or not, Michael illustrates the importance (and difficulty) of tracking non-functional requirements across the application lifecycle.

Some of the knowledge of where your next failure could come from lives in Development. Some of it lives in Operations. Sometimes it even lives in the well-intentioned plans of your marketing department.

What are you doing about sharing and tracking that knowledge? If you are like most organizations, you are probably relying on tribal knowledge to avoid these pitfalls. Development handles what they know about, Operations handles what they know about, and everyone crosses their fingers and hopes it all works out. Unlike the well understood processes and tooling used to track business requirements, non-functional requirements all too often fly under the radar or are afterthoughts to be handled during production deployment.

Monday, August 31, 2009

Are sys admins soon to be relics?

One of the ideas that can be extrapolated from the positions of the "infrastructure as code" crowd is that the future of systems administration will look dramatically different than it does today.*

The extreme view of the future is that you'll have a set of domain experts (application architects/developers, database architects, storage architects, performance management, platform security, etc.) who produce the infrastructure code and everything else happens automatically. The image of today's workhorse, pager wearing, fire extinguishing sys admin doesn't seem to have a role in that world.

Of course, the reality will be somewhere in the pragmatic middle. But a glimpse of that possible future should make sys admins question which direction they are taking their job skills.

I finally got around to digging into the conference wrap up report that O'Reilly publishes after its annual web operations conference, Velocity. Most of it was the standard self-serving kudos. However, the table below really caught my eye and inspired me to write this post.

Attendee Job Titles (multiple answers accepted)
Software Developer 60%
IT Management/Sys Admin 27%
Computer Programmer 20%
CXO/Business Strategist 19%
Web/UI Design 17%
Business Manager 16%
Product Manager 10%
Consultant 9%
Entrepreneur 8%
Business Development 4%
Community Activist 3%
Marketing Professional 2%
Other 5%

Now of course you have to look at this data with a cautious eye. People were asked to self-describe, you could select multiple titles, some people were attending to learn about design tricks for faster page load times, and most people blow through marketing surveys without much care. However, it did catch my eye that somewhere between 60% and 80% described themselves as having a development role. Only 27% described themselves as having a sys admin role.
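The 60% to 80% range follows from the fact that respondents could pick more than one title. Here is a minimal Python sketch of that arithmetic, using only the two development-related figures from the table (Software Developer 60%, Computer Programmer 20%). The function name `union_bounds` is made up for illustration; the survey itself does not say how much the two groups overlap, so only bounds can be computed.

```python
def union_bounds(shares):
    """Return (low, high) bounds, in percent, on the share of respondents
    who picked at least one of the given options in a multi-select survey."""
    low = max(shares)             # full overlap: smaller groups sit inside the largest
    high = min(100, sum(shares))  # no overlap: groups are disjoint, capped at 100%
    return low, high


# Figures from the table above: Software Developer 60%, Computer Programmer 20%.
dev_low, dev_high = union_bounds([60, 20])
print(f"Development roles: between {dev_low}% and {dev_high}%")
```

If every "Computer Programmer" also ticked "Software Developer", the combined share is 60%. If nobody ticked both, it is 80%.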

Now is it a big leap to point to this data as an early warning signal of the demise of the traditional sys admin role? Probably... but it fully jibes with the anecdotal evidence we saw around the conference halls. From large .com employees (Facebook, Twitter, Google, Flickr, Shopzilla, etc.) to the open source tools developers, the thought (and action) leaders were developers who happened to focus on systems administration, not systems administrators who happened to have development skills.

* Disclosure: I'm a member of the infrastructure as code crowd

Friday, July 3, 2009

Tools are easy. Changing your operating culture is hard.

Did you ever notice that our first inclination is to reach for a tool when we want to change something? What we always seem to forget is that web operations, as a discipline, is only partially about technology.

The success of your web operations depends more on the processes and culture your people work within than it does on any specific tooling choices. Yet, as soon as you start talking about changing/improving operations the conversation quickly devolves into arguments about the merits of various tools.

We see this repeatedly in our consulting business. Time after time we are called in to do a specific automation project and wind up spending the bulk of the effort as counselors and coaches helping the organization make the cultural shift that was the real intention of the automation project.

This article from InfoQ on how difficult it is to get development organizations to adopt an agile culture is a superb encapsulation of the difficulty of cultural change. Switch the word "development" to "web operations" and switch "Agile" to any cultural change you want to make and the article still holds up.

This condition really shouldn't be a surprise to any of us. After all, how much time do we really spend, as an industry, discussing cultural, process, and organizational issues? Compare the number of books and articles written about the people side of web operations vs. the number of books and articles written about the technical side of web operations. The ratio is abysmal, especially when you compare it to other types of business operations (manufacturing, finance, service industries, etc.)

UPDATE: The Agile Executive brings up the point that tools are valuable for enforcing new behavior. I definitely agree with that... but still maintain that adopting new tools without a conscious decision to change behavior and culture is, more often than not, an approach that will fail.

Friday, June 26, 2009

Automated Infrastructure enables Agile Operations

"Agile" been applied to such unanticipated domains as enterprises, start ups, investing, etc. Agile encompasses several generic common sense principles (eg: simple design, sustainable pace, many incremental changes, action over bureaucracy, etc.) so the desire to bestow its virtues on all kinds of endeavors is understandable.

But why contemplate the idea of Agile Operations? Why would Agile Operations even make sense?

Let's start by playing devil's advocate. Some of the Agile principles appear to contradict well established and accepted systems administration goals, namely stability and availability. Traditional culture in operations leans towards risk-aversion and stasis in an attempt to assure maximum service levels. Many operations groups play a centralized role serving multiple business lines and have evolved to follow a top-down directed, command and control style management structure that wants to limit access to change. From their point of view, change is the enemy of stability and availability. With stability and availability being the primary goals of operations, it's easy to see where the skepticism towards Agile Operations comes from.

The calls for Agile Operations have initially been driven by product development groups that employ Agile practices. These groups churn out frequent, small improvements to their software systems on a daily basis. The difference in change management philosophy has been the cause of a growing clash between development and operations. The clash intensifies when the business wants to drive these rapid product development iterations all the way through to production (even 10+ times a day).

So, if operations is to avoid being a bottleneck to this Agile empowered flow of product changes, how can they do it in a way that won't create unmanageable chaos?

To apply Agile to the world of operations, one must first see all infrastructure as programmable. Rather than see infrastructure as islands of equipment that were setup by reading a manual and typing commands into a terminal, one sees infrastructure as a set of components that are bootstrapped and maintained through programs. In other words, infrastructure is managed by executing code not by directly applying changes manually at the keyboard.
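To make "managed by executing code" concrete, here is a minimal Python sketch of one common approach: declare the desired state and let a program converge the machine toward it. Everything in it is an assumption for illustration. The `Host` class is an in-memory stand-in for a server, and `converge`, the package names, and the service names are hypothetical. A real tool would inspect and change an actual machine.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """In-memory stand-in for a server (hypothetical, for illustration only)."""
    packages: set = field(default_factory=set)
    services: set = field(default_factory=set)


def converge(host, desired_packages, desired_services):
    """Bring the host to the desired state and return the list of changes made."""
    changes = []
    for pkg in sorted(desired_packages - host.packages):
        host.packages.add(pkg)        # a real tool would call the package manager here
        changes.append(f"install {pkg}")
    for svc in sorted(desired_services - host.services):
        host.services.add(svc)        # a real tool would start the service here
        changes.append(f"start {svc}")
    return changes


web = Host(packages={"openssl"})
first = converge(web, {"openssl", "nginx"}, {"nginx"})
second = converge(web, {"openssl", "nginx"}, {"nginx"})
print(first)   # changes needed on the first run
print(second)  # nothing left to do on the second run
```

Because the program only acts on the difference between the desired and actual state, running it twice is safe. That property is what lets the same code be versioned, reviewed, and tested like application code.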

Replacing manual tasks with executable code is the crucial enabler to sharing a common set of change management principles between development and operations. This alignment is truly the key first step in allying development and operations to support the business' time to market needs. This shared change management model also facilitates a few additional beneficial practices.

Shared code bases: Store and control application and infrastructure code in the same place so both dev and ops staff have clear visibility into everything needed to create a running service.
Collaborative configuration management: Application and infrastructure configuration management code can be jointly developed early in the development cycle and tested in development integration environments. Code and configuration become the currency between dev and ops.
Skill transfer: App and ops engineers can transfer knowledge about the inner workings of the runtime application system and develop skills around tooling to maintain them.
Reproducibility: Reproducing a running application from source and a build specification is vital to managing a business at scale. (http://www.itpi.org/home/visibleops.php)

While some may argue that "Agile" in its entirety does not completely apply to the world of operations, an automated infrastructure based on principles like code sharing as a form of collaboration between dev and ops is a sound basis to enable business agility.

Tuesday, June 23, 2009

10+ Deploys Per Day: Dev and Ops Cooperation at Flickr

The Flickr guys, John Allspaw and Paul Hammond, gave an entertaining and validating presentation at O'Reilly Velocity (slides).

The talk began with a brief description about how Flickr's technology group enabled the business to deliver features and update their online service frequently (10+ deploys per day) but it really turned out to be a success story about how Dev and Ops can align and work together without falling into age-old traditional cross organizational conflicts.

Here are a few (paraphrased) quotes:

Ops' job is to enable the business. (not to keep the site stable and fast)

The business requires change... so lower the risk of change through tools and culture. Make more frequent, but smaller changes ... through push button (and hands off) deployment tools.

Plan fire drills to make sure everyone (junior guys included) knows how to solve production problems because failure will happen.

Ops who think like devs. Devs who think like ops

The talk really boiled down to two ingredients to enable the close dev and operations collaboration (tools + culture):

Tools

1. Automated infrastructure

2. Shared version control

3. One step build and deploy

4. Feature flags

5. Shared metrics

6. IRC and IM robots
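For readers who haven't met feature flags (item 4 above), here is a minimal Python sketch of the general idea: code ships to production switched off, and a configuration value decides who sees it. This is not Flickr's implementation. The flag names, the `FLAGS` table, and the percentage rollout scheme are all assumptions made for illustration.

```python
import zlib

# Hypothetical flag table: flag name -> percentage of users who see the feature.
FLAGS = {
    "new_uploader": 0,     # deployed but dark
    "photo_notes": 100,    # fully launched
    "beta_search": 25,     # partial rollout
}


def is_enabled(flag, user_id, flags=FLAGS):
    """Return True if the flag is on for this user. Unknown flags are off."""
    percent = flags.get(flag, 0)
    # Hash flag + user into a stable bucket from 0 to 99 so each user
    # gets the same answer on every request.
    bucket = zlib.crc32(f"{flag}:{user_id}".encode()) % 100
    return bucket < percent


print(is_enabled("new_uploader", "alice"))
print(is_enabled("photo_notes", "alice"))
```

The design point is that deploying code and releasing a feature become separate decisions, which is one way to make frequent, small deploys lower risk.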

Culture

1. Respect

2. Trust

3. Healthy attitude about failure

4. Avoiding blame

I think for some, the real validation was hearing that it's just as much making a cultural shift as it is a mixture of choosing and using the right kind of tools. Anybody who has worked in the trenches will realize that of course.

Monday, June 15, 2009

Continuous Deployment Really Means Continuous Testing

On Twitter and on web operations focused blogs, the concept of Continuous Deployment is a topic that is gaining momentum. Across our consulting clients, we've also seen a significant uptick in discussion around the concept of Continuous Deployment (some calling it "Agile Deployment").

The extreme example of Continuous Deployment that has sparked the most polarizing discussions is from Timothy Fitz's posts on doing production deployments up to fifty times per day.

While it's a fascinating read, many people for whom the essay is their first exposure to the idea of Continuous Deployment overlook the real value. The value is not how Fitz gets code all the way into production on a sub-daily basis. The value is in achieving a state of continuous automated testing.

If you understand the concept of "the earlier you find a bug, the cheaper it is", the idea of continuous testing is as good as it gets. Every time a build executes, your full suite of unit, regression, user/functional, and performance tests is automatically run. In a mature operation this could quite literally mean millions of automated tests being executed every day. As your application development makes even the smallest moves forward, the application is being rigorously tested inside and out.
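As a rough illustration of "every build runs the full suite", here is a minimal Python sketch of a post-build hook. The suite names come from the paragraph above; the function names and the stand-in lambdas are hypothetical. In practice each check would invoke a real test runner and report its result.

```python
def run_suites(suites):
    """Run every suite for a build and record pass/fail per suite."""
    return {name: bool(check()) for name, check in suites.items()}


def ready_for_qa(results):
    """A build is handed to the QA environment only if every suite passed."""
    return all(results.values())


# Stand-in checks; a real hook would call the actual test runners.
suites = {
    "unit": lambda: True,
    "regression": lambda: True,
    "functional": lambda: False,   # simulated failure
    "performance": lambda: True,
}

results = run_suites(suites)
print(results)
print("promote to QA:", ready_for_qa(results))
```

Running all suites, rather than stopping at the first failure, gives developers the complete picture for each build.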

Another common misconception is that Continuous Deployment means that human-powered QA cycles are a thing of the past or are somehow less important. This belief is probably a byproduct of those extreme practitioners of Continuous Deployment who are doing hot deployments to production after every build. In most business scenarios there is not much benefit to continuous production deployment. The value of a human-powered QA team sensing if the look, feel, and functionality match the requirements can't, and shouldn't, be overlooked.

Most of our consulting clients just aren't interested in sub-daily deployments to live production environments. What they want out of Continuous Deployment is to have a constant state of broad automated testing and an always up-to-date QA environment for human-powered testing and business review.

In addition to deploying a broad suite of automated testing tools, Fully Automated Provisioning provides the linchpin that makes Continuous Deployment a reality.

Tuesday, May 12, 2009

Clouds, Virtualization, and Continuous Deployment all share the same Achilles heel

Recently, there are 3 "hot trends" that we regularly get asked about in our day jobs as web operations consultants:

Cloud Computing (meaning elastic computing resources paid for on-demand)
Virtualization in Production (meaning using virtual machines for non-development or QA uses)
Continuous Deployment (meaning the ability to automatically deploy and test a full environment after each and every build that comes out of their Continuous Integration driven build process)

There is a common thread that ties all of these together -- Fully Automated Provisioning**. You can't achieve the full benefit of any of these advances without Fully Automated Provisioning.

In this previous post, we covered the reasons why efforts to harness the power of cloud or virtualized infrastructure will fail without Fully Automated Provisioning.

Continuous Deployment suffers from a similar weakness. If you don't have Fully Automated Provisioning in place, the efforts required to provision your applications and sort out the resulting problems will outweigh whatever benefits you set out to gain.

IT automation may not be the sexiest field. However, IT automation (and specifically Fully Automated Provisioning) is the necessary foundation that lets you continually reap the benefits of the latest headline grabbing initiatives.

**To read about what the criteria are for achieving Fully Automated Provisioning, check out this blog post and whitepaper.

Tuesday, April 7, 2009

The CIO / Ops Perception Gap

Every IT manager should read this article: Making Business Service Management a Reality. I think the original title, "BSM Evolution - The CIO / Ops Perception Gap", more accurately reflects the essence of the issues it draws out.
* CIOs prematurely believing they have a handle around running their software services
* VPs of Ops afraid to admit that they've just begun a long journey that assumes continuous improvement approaches and no one time fixes
* No clear visibility from the biz level into the level of quality of service operations delivered by the CIO on down through the tech management ranks
* The need to focus on fundamentals

The article made me think of the strategies put forth in the Visible Ops book. But, even more so, it really indicates the need for true visibility into how ops is conducted at every level (no obfuscation tolerated).