Friday, July 3, 2009

Tools are easy. Changing your operating culture is hard.

Did you ever notice that our first inclination is to reach for a tool when we want to change something? What we always seem to forget is that web operations, as a discipline, is only partially about technology.

The success of your web operations depends more on the processes and culture your people work within than it does on any specific tooling choices. Yet, as soon as you start talking about changing or improving operations, the conversation quickly devolves into arguments about the merits of various tools.



We see this repeatedly in our consulting business. Time after time we are called in to do a specific automation project and wind up spending the bulk of the effort as counselors and coaches helping the organization make the cultural shift that was the real intention of the automation project.

This article from InfoQ on how difficult it is to get development organizations to adopt an agile culture is a superb encapsulation of the difficulty of cultural change. Switch the word "development" to "web operations" and switch "Agile" to any cultural change you want to make and the article still holds up.

This condition really shouldn't be a surprise to any of us. After all, how much time do we really spend, as an industry, discussing cultural, process, and organizational issues? Compare the number of books and articles written about the people side of web operations vs. the number of books and articles written about the technical side of web operations. The ratio is abysmal, especially when you compare it to other types of business operations (manufacturing, finance, service industries, etc.).

UPDATE: The Agile Executive brings up the point that tools are valuable for enforcing new behavior. I definitely agree with that... but still maintain that adopting new tools without a conscious decision to change behavior and culture is, more often than not, an approach that will fail.

Friday, June 26, 2009

Automated Infrastructure enables Agile Operations

"Agile" has been applied to such unanticipated domains as enterprises, startups, investing, etc. Agile encompasses several generic common sense principles (e.g., simple design, sustainable pace, many incremental changes, action over bureaucracy, etc.) so the desire to bestow its virtues on all kinds of endeavors is understandable.

But why contemplate the idea of Agile Operations? Why would Agile Operations even make sense?

Let's start by playing devil's advocate. Some of the Agile principles appear to contradict well-established and accepted systems administration goals, namely stability and availability. Traditional culture in operations leans towards risk-aversion and stasis in an attempt to assure maximum service levels. Many operations groups play a centralized role serving multiple business lines and have evolved to follow a top-down, command-and-control style management structure that wants to limit access to change. From their point of view, change is the enemy of stability and availability. With stability and availability being the primary goals of operations, it's easy to see where the skepticism towards Agile Operations comes from.


The calls for Agile Operations have initially been driven by product development groups that employ Agile practices. These groups churn out frequent, small improvements to their software systems on a daily basis. The difference in change management philosophy has been the cause of a growing clash between development and operations. The clash intensifies when the business wants to drive these rapid product development iterations all the way through to production (even 10+ times a day).


So, if operations is to avoid being a bottleneck to this Agile empowered flow of product changes, how can they do it in a way that won't create unmanageable chaos?

To apply Agile to the world of operations, one must first see all infrastructure as programmable. Rather than see infrastructure as islands of equipment that were set up by reading a manual and typing commands into a terminal, one sees infrastructure as a set of components that are bootstrapped and maintained through programs. In other words, infrastructure is managed by executing code, not by directly applying changes manually at the keyboard.
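To make "infrastructure managed by executing code" a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the component names, versions, and the in-memory "host state" dictionaries are hypothetical, and real tools do far more than this. The idea it shows is that the desired state lives in code, and a program works out and applies the changes instead of a person typing them.

```python
# Hypothetical sketch: component names and versions are invented for
# illustration; "actual" stands in for whatever a real tool would discover.

def plan_changes(desired, actual):
    """Return the ordered actions needed to move 'actual' to 'desired'."""
    actions = []
    for name, version in sorted(desired.items()):
        if name not in actual:
            actions.append(("install", name, version))
        elif actual[name] != version:
            actions.append(("upgrade", name, version))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("remove", name, actual[name]))
    return actions


def apply_changes(actual, actions):
    """Apply the actions to a copy of the state and return the new state."""
    new_state = dict(actual)
    for verb, name, version in actions:
        if verb == "remove":
            del new_state[name]
        else:
            new_state[name] = version
    return new_state


# Desired state is data kept under version control, not steps in someone's head.
desired = {"app": "1.4", "web-server": "2.0"}
actual = {"web-server": "1.9", "legacy-daemon": "0.3"}

actions = plan_changes(desired, actual)
converged = apply_changes(actual, actions)
```

Running the plan a second time against the converged state produces no actions, which is the property that makes this kind of change safe to repeat.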


Replacing manual tasks with executable code is the crucial enabler to sharing a common set of change management principles between development and operations. This alignment is truly the key first step in allying development and operations to support the business' time to market needs. This shared change management model also facilitates a few additional beneficial practices.

  • Shared code bases: Store and control application and infrastructure code in the same place so both dev and ops staff have clear visibility into everything needed to create a running service.
  • Collaborative configuration management: Application and infrastructure configuration management code can be jointly developed early in the development cycle and tested in development integration environments. Code and configuration become the currency between dev and ops.
  • Skill transfer: App and ops engineers can transfer knowledge about the inner workings of the runtime application system and develop skills around tooling to maintain them.
  • Reproducibility: Reproducing a running application from source and a build specification is vital to managing a business at scale. (http://www.itpi.org/home/visibleops.php)
While some may argue that "Agile" in its entirety does not completely apply to the world of operations, an automated infrastructure based on principles like code sharing as a form of collaboration between dev and ops is a sound basis to enable business agility.

Tuesday, June 23, 2009

10+ Deploys Per Day: Dev and Ops Cooperation at Flickr

The Flickr guys, John Allspaw and Paul Hammond, gave an entertaining and validating presentation at O'Reilly Velocity (slides).


The talk began with a brief description of how Flickr's technology group enabled the business to deliver features and update their online service frequently (10+ deploys per day), but it really turned out to be a success story about how Dev and Ops can align and work together without falling into age-old cross-organizational conflicts.



Here are a few (paraphrased) quotes:


Ops' job is to enable the business. (not to keep the site stable and fast)


The business requires change... so lower the risk of change through tools and culture. Make more frequent, but smaller changes ... through push button (and hands off) deployment tools.


Plan fire drills to make sure everyone (junior guys included) knows how to solve production problems because failure will happen.


Ops who think like devs. Devs who think like ops


The talk really boiled down to two ingredients to enable the close dev and operations collaboration (tools + culture):


Tools

1. Automated infrastructure

2. Shared version control

3. One step build and deploy

4. Feature flags

5. Shared metrics

6. IRC and IM robots


Culture

1. Respect

2. Trust

3. Healthy attitude about failure

4. Avoiding blame


I think for some, the real validation was hearing that it's just as much about making a cultural shift as it is about choosing and using the right kind of tools. Anybody who has worked in the trenches will realize that, of course.



Monday, June 15, 2009

Continuous Deployment Really Means Continuous Testing

On Twitter and on web operations focused blogs, the concept of Continuous Deployment is a topic that is gaining momentum. Across our consulting clients, we've also seen a significant uptick in discussion around the concept of Continuous Deployment (some calling it "Agile Deployment").

The extreme example of Continuous Deployment that has sparked the most polarizing discussions is from Timothy Fitz's posts on doing production deployments up to fifty times per day.

While it's a fascinating read, many people for whom the essay is their first exposure to the idea of Continuous Deployment overlook the real value. The value is not how Fitz gets code all the way into production on a sub-daily basis. The value is in achieving a state of continuous automated testing.

If you understand the concept of "the earlier you find a bug, the cheaper it is", the idea of continuous testing is as good as it gets. Every time a build executes, your full suite of unit, regression, user/functional, and performance tests is automatically run. In a mature operation this could quite literally mean millions of automated tests being executed every day. As your application development makes even the smallest moves forward, the application is being rigorously tested inside and out.
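As a rough sketch of what "every build runs the full suite" might look like, consider the Python below. It assumes each test is simply a function that returns True or False; the suite names mirror the ones mentioned above, while the individual tests and the run_build_checks helper are hypothetical stand-ins rather than any particular test framework's API.

```python
# Hypothetical sketch: the tests are trivial placeholders, and a real build
# would call out to actual test runners instead of plain functions.

def test_cart_total():
    return sum([5, 10]) == 15


def test_homepage_has_title():
    page = "<html><title>Shop</title></html>"
    return "<title>" in page


def test_checkout_within_budget():
    simulated_response_ms = 180  # stand-in for a measured response time
    return simulated_response_ms < 200


SUITES = {
    "unit": [test_cart_total],
    "functional": [test_homepage_has_title],
    "performance": [test_checkout_within_budget],
}


def run_build_checks(suites):
    """Run every suite for a build; return (passed, failures per suite)."""
    results = {}
    for suite_name, tests in suites.items():
        results[suite_name] = [t.__name__ for t in tests if not t()]
    passed = all(not failures for failures in results.values())
    return passed, results


build_passed, build_results = run_build_checks(SUITES)
```

Every suite runs on every build, even if an earlier suite fails, so each build yields a complete picture of what broke rather than just the first problem found.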



Another common misconception is that Continuous Deployment means that human-powered QA cycles are a thing of the past or are somehow less important. This belief is probably a byproduct of those extreme practitioners of Continuous Deployment who are doing hot deployments to production after every build. In most business scenarios there is not much benefit to continuous production deployment. The value of a human-powered QA team sensing if the look, feel, and functionality match the requirements can't, and shouldn't, be overlooked.

Most of our consulting clients just aren't interested in sub-daily deployments to live production environments. What they want out of Continuous Deployment is to have a constant state of broad automated testing and an always up-to-date QA environment for human-powered testing and business review.

In addition to deploying a broad suite of automated testing tools, Fully Automated Provisioning provides the linchpin that makes Continuous Deployment a reality.

Tuesday, May 12, 2009

Clouds, Virtualization, and Continuous Deployment all share the same Achilles heel

Recently, there are 3 "hot trends" that we regularly get asked about in our day jobs as web operations consultants:

  • Cloud Computing (meaning elastic computing resources paid for on-demand)
  • Virtualization in Production (meaning using virtual machines for non-development or QA uses)
  • Continuous Deployment (meaning the ability to automatically deploy and test a full environment after each and every build that comes out of their Continuous Integration driven build process)

There is a common thread that ties all of these together -- Fully Automated Provisioning**. You can't achieve the full benefit of any of these advances without Fully Automated Provisioning.

In this previous post, we covered the reasons why efforts to harness the power of cloud or virtualized infrastructure will fail without Fully Automated Provisioning.

Continuous Deployment suffers from a similar weakness. If you don't have Fully Automated Provisioning in place, the efforts required to provision your applications and sort out the resulting problems will outweigh whatever benefits you set out to gain.



IT automation may not be the sexiest field. However, IT automation (and specifically Fully Automated Provisioning) is the necessary foundation that lets you continually reap the benefits of the latest headline grabbing initiatives.

**To read about what the criteria are for achieving Fully Automated Provisioning, check out this blog post and whitepaper.

Tuesday, April 7, 2009

The CIO / Ops Perception Gap

Every IT manager should read this article: Making Business Service Management a Reality. I think the original title, "BSM Evolution - The CIO / Ops Perception Gap", more accurately reflects the essence of the issues it draws out.
* CIOs prematurely believing they have a handle on running their software services
* VPs of Ops afraid to admit that they've just begun a long journey that assumes continuous improvement approaches and no one-time fixes
* No clear visibility from the biz level into the level of quality of service operations delivered by the CIO on down through the tech management ranks
* The need to focus on fundamentals

The article made me think of the strategies put forth in the Visible Ops book. But, even more so, it really indicates the need for true visibility into how ops is conducted at every level (no obfuscation tolerated).

Monday, April 6, 2009

The real reason why enterprises aren't moving to the clouds

Visit any of the cloud obsessed blogs, discussion forums, or conferences and you'll hear the same "reasons" as to why cloud computing isn't catching on within enterprise IT shops. It's always something about interoperability, service level agreements, security risks, data formats, APIs, or hypothetical legal implications.

Interesting issues, but they are all red herrings.

The inability of enterprises to take advantage of the clouds isn't due to the shortcomings of the cloud offerings available today. The shortcoming is with the state of today's enterprise IT shops. The real reason is that those millions of applications currently running within enterprises are hardwired into a particular environment.

I'm not talking about the application code itself. After all, Linux is Linux and Windows is Windows no matter if it's running on native hardware, a local VM, or somewhere in a cloud. The true problem is with the way today's applications are configured, deployed, and managed. Very few folks in enterprise IT are willing to admit to the hairballs that decades of shortsighted IT Management techniques have created.

The often offered up excuse of there being "a lack of cloud skills within the enterprise" is really just code for a general abundance of poor to outrageously awful IT Management techniques.

Let's look at a basic and often quoted use case for cloud computing.

With server utilization rates in the 3-15% range, there is obviously room for significant reduction in capital expenditures by taking advantage of cloud-based elastic computing resources. Why isn't there a rush of enterprises running to take advantage of this? No, the answer isn't the often quoted fear of vendor lock-in. The answer is that these enterprises are locked into themselves.

Think of how difficult it is for an enterprise to switch datacenters. Months of effort go into planning and executing the move, but have you ever heard of one going smoothly? If it takes months to move between your own datacenters and you still can't get it right, what hope do you have for making it into the clouds?

It's no coincidence that the local virtualization vendors like VMware and Parallels are facing this same denial of such an obvious business case as they attempt to push their offerings out of development and testing environments and into production environments.

Rob England (TheITSkeptic) is one of the few pundits to point out the fundamental disconnect between enterprise IT and the clouds. He attributes the largest hurdle to migration costs. I would take it a couple of steps further. Migration is just a symptom of the problem. If migration was the only problem, using cloud infrastructure would be a slam dunk for new application projects. But, of course, the same old broken management techniques ingrained in an enterprise plague new and old projects alike.

The bottom line is that what passes for the status quo in IT Management is crippling enterprises. Enterprise IT can't take full advantage of such fundamental advances as virtualization and elastic computing until an "abstracted administration" paradigm becomes standard operating procedure.

Abstracted administration means the ability to work from a point of view that is independent of any particular server instances or specific software deployments. Within the abstracted administration paradigm, an administrator manages deployment and ongoing operations from a higher level and lets the underlying framework coordinate operations across the actual physical environment. Once you've achieved abstracted administration, moving datacenters or re-deploying an application to virtualized servers in the cloud is as simple as updating one part of the specification that drives the abstraction. Your tools will then handle the rest.
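To illustrate the "update one part of the specification" idea, here is a small Python sketch. Everything in it is hypothetical: the environment names, node names, spec layout, and the render_plan function are invented for illustration and are not taken from ControlTier, Puppet, or any other tool. It only shows the shape of the idea, where the administrator edits a specification and a program maps it onto concrete machines.

```python
# Hypothetical sketch: environment names, node names, and the spec layout
# are assumptions made for this example only.

ENVIRONMENTS = {
    "datacenter-east": ["east-01", "east-02", "east-03"],
    "cloud": ["vm-a", "vm-b", "vm-c", "vm-d"],
}

SPEC = {
    "environment": "datacenter-east",
    "services": {"web": 2, "db": 1},  # service name -> instance count
}


def render_plan(spec, environments):
    """Map each service in the spec onto nodes of the chosen environment."""
    nodes = environments[spec["environment"]]
    needed = sum(spec["services"].values())
    if needed > len(nodes):
        raise ValueError("environment has too few nodes for this spec")
    node_iter = iter(nodes)
    plan = {}
    for service, count in sorted(spec["services"].items()):
        plan[service] = [next(node_iter) for _ in range(count)]
    return plan


datacenter_plan = render_plan(SPEC, ENVIRONMENTS)

# "Moving to the cloud" is a one-line change to the spec, not a project.
cloud_spec = dict(SPEC, environment="cloud")
cloud_plan = render_plan(cloud_spec, ENVIRONMENTS)
```

The service definitions never mention a specific server. Only the environment entry changes between the two plans, which is the property that makes a datacenter move or a cloud redeployment a specification edit.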

Of course, achieving abstracted administration means that the provisioning and management of your entire application stack -- from OS install to running integrated application services -- must be fully automated using tools that support this specification-driven, abstracted administration paradigm.

If you look at what passes for state of the art in many IT shops, it might seem like the ability to achieve abstracted administration and fully automated provisioning is a long way off.

That is simply not the case. The tools to get this done are already here, they work well, and they are all open source. Below is a diagram of an open source toolchain that can provide fully automated provisioning.



Still not convinced it can be done? Check out this joint whitepaper from ControlTier and Reductive Labs (the team behind Puppet). The paper lays out how fully automated provisioning can be (and has been) achieved using these standard open source tools.

If you are interested in a more detailed explanation of the abstracted administration paradigm, check out this detailed post by Alex on the ControlTier Blog.

Sunday, March 22, 2009

Web Operations: the canary in the IT Management coal mine

Rob England (The IT Skeptic), recently wrote some very nice things about this blog.

After I got over the fact that one of my favorite bloggers is writing about this blog, I realized that his post does raise a good question: If good IT Management is good IT Management no matter what business you are in, why does this blog focus so much on the Web Operations perspective?

Part of the reason is that Web Operations is the world that Alex and I live in on a daily basis (via ControlTier... helping e-commerce and SaaS companies improve the efficiency and reliability of their operations).

The other part of the reason is that we see Web Operations as the canary in the coal mine for IT Operations. When a company's entire business is operating software as a revenue-producing service, the shortcomings and the successes of your IT Operations go right to your bottom line. The tolerance for the status quo dissipates a lot quicker and there is stronger political will to think outside of the box.



Put it this way: pretend you're the CEO of a Fortune 100 size company that makes aircraft engines or automobiles. Where is improving the efficiency and reliability of your IT Operations going to fall on the list of things you worry about every day? 32 on a top 50 list might be generous.



Now pretend you are the CEO of an online company whose sole source of revenue comes from what you can generate through your website. Suddenly the efficiency and reliability of your IT Operations jumps to near the top of the list.




Update: While people point out to me that I'm stretching the "canary in a coal mine" metaphor a bit far... I'm loading The Police's Zenyatta Mondatta album into my iTunes.

Monday, March 2, 2009

Web Operations: Are you developing an asset or a liability?

"Buy vs. Build". It's a term you hear repeatedly when it comes to businesses weighing their options for application and systems management solutions. But as anyone who spends time in the web operations trenches knows, the reality is always something closer to "build vs. build". Buy something from a software vendor, use open source tools, develop something from scratch - in each situation there just isn't a one-size-fits-all option and there is always going to be custom integration involved. This reality was previously covered in Alex's "Stone Axes" post.

So being resigned to the fact that there is a "build" aspect to any solution, the next critical choice then becomes what guidelines you impose on your organization to steer their design choices. The most pervasive design criterion seems to be technical completeness or elegance. From a technical architect's purist point of view this makes sense, but what this often fails to take into account is the business impact of those technical decisions.

While many technical design options might seem to have identical business impact on day 1 (they cost roughly x to develop and provide feature y), what is the true cost of those decisions down the road? Have those decisions put the company in a position to continuously leverage those design choices into increasingly greater returns? Or have those decisions placed an anchor around the company's neck that they will be weighed down by, and paying for, well into the future? To put it into loose economic terms: have you developed an asset or a liability for your company?

What would be an example of building an asset? Using off the shelf open source tools and only developing thin layers of integration where they need to plug into your existing systems.

What would be an example of building a liability? Writing a custom system that mirrors the available functionality of existing off the shelf tools, thereby saddling your company with the sole responsibility for the forward progress of the design and maintenance of that tooling.

The asset vs. liability concept is one that obviously needs to be fleshed out quite a bit more. In any case, it's shocking how infrequently companies actually analyze the long-term business impact of the technical design decisions made about their tooling.

(Note: Thanks to Lee Thompson for framing this as an asset vs liability debate)

Friday, December 19, 2008

Checklists: the most unsexy way to save millions




The New Yorker has a great article on the success of using checklists to tame extremely complex systems.

The primary example used in the article is intensive care units in hospitals. Anywhere you see the term "intensive care" substitute "data center" and anywhere you see a name of a medical procedure substitute the name of a technical procedure and the lessons are essentially the same.

What are the lessons?

1. Where checklists have been formalized and rigidly enforced (as a means of documenting and enforcing best practices), millions of dollars have been saved and many deaths (the ultimate "system outage") have been avoided.

2. The concept of checklists is so simple and unsexy that their awesome saving power is often overlooked. Admit it, your inner geek yawns just thinking about checklists.

How can checklists immediately improve IT operations?

First, agree on your best practices and document them. Second, strictly enforce the rule that all operations activities must follow those procedures. Third, record the completion of each step of the procedure for troubleshooting and analysis.
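Those three steps can be sketched in a few lines of Python. This is just an illustration under assumptions: the Checklist class, the step names, and the operator name are all hypothetical, and a real shop would keep the log somewhere more durable than memory. The documented procedure is the list of steps, enforcement is refusing any step taken out of order, and the record is a timestamped log of who completed what.

```python
from datetime import datetime, timezone

# Hypothetical sketch: class name, step names, and operator are invented.


class Checklist:
    """A documented procedure whose steps must be completed in order."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = list(steps)
        self.log = []  # one (step, operator, UTC timestamp) entry per step

    def next_step(self):
        """Return the next required step, or None when finished."""
        if len(self.log) < len(self.steps):
            return self.steps[len(self.log)]
        return None

    def complete(self, step, operator):
        """Record a step, refusing anything other than the next one."""
        expected = self.next_step()
        if expected is None:
            raise RuntimeError("checklist already finished")
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.log.append((step, operator, datetime.now(timezone.utc)))

    def is_done(self):
        return len(self.log) == len(self.steps)


deploy = Checklist(
    "deploy web tier",
    ["take backup", "drain traffic", "push release", "verify health"],
)
deploy.complete("take backup", "alice")
```

Trying to call deploy.complete("push release", "alice") at this point raises a ValueError because "drain traffic" has not been recorded yet, and the log keeps the evidence needed for later analysis.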

Sounds like such common sense, doesn't it? If it is, then why do most IT operations fail at implementing such a simple culture of orderly change management?