DevOps at dealnews.com

Mon, Apr 5, 2010 08:00 AM
I was telling someone how we roll changes to production at dealnews and they seemed really amazed by it. I have never really thought it was that impressive. It just made sense. It has kind of happened organically here over the years. Anyhow, I thought I would share.

Version Control

So, to start with, everything is in SVN: PHP code, Apache configs, DNS, and even the scripts we use to deploy code. That is huge. We even have a misc directory in SVN where we keep any useful scripts we use on our laptops for managing our code base. Everyone can share that way, and everyone can see what changed when. We can roll things back, branch if we need to, etc. I don't know how anyone lives without it. We did, way back when. It was bad. People were stepping on each other. It was a mess. We quickly decided it did not work.

For our PHP code, we have trunk and a production branch. There are also a couple of developers (me) who like to have their own branch because they break things for weeks at a time. But everything goes from my branch into trunk before going into production. We have a PHP script that can merge from a developer branch into trunk, with conflict resolution assistance built in. It is also capable of merging changes from trunk back into a branch. Once code is in trunk, we use our staging environment to put it into production.
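Our merge script is internal, but the two directions it automates can be sketched with plain svn commands. This is a hypothetical sketch, not the actual tool; the repository URL, branch name, and working-copy paths are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the two merge directions the tool automates.
# Repository URL, branch name, and working-copy paths are assumptions.
REPO="http://svn.example.com/repo"
TRUNK="$REPO/trunk"
BRANCH="$REPO/branches/dev-brian"

# merge_cmd <direction>: print the svn command for that direction
merge_cmd() {
  case "$1" in
    # refresh a developer branch with recent trunk changes
    from-trunk) echo "svn merge $TRUNK wc-branch" ;;
    # fold a finished branch back into trunk (svn 1.5+ reintegrate)
    to-trunk)   echo "svn merge --reintegrate $BRANCH wc-trunk" ;;
    *)          echo "usage: merge_cmd from-trunk|to-trunk" >&2; return 1 ;;
  esac
}
```

The real tool wraps this with conflict-resolution prompts and the syntax checks described in the next section.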

Staging/Testing

Everything has a staging point. For our PHP code, it is a set of staging servers in our home office that have a checkout of the production branch. To roll code, the developer working on the project logs in to a staging server via ssh as a restricted user and uses a tool we created that is similar to the Python-based svnmerge.py. Ours is written in PHP and tailored to our directory structure and rollout procedures. It also runs php -l on all .php and .html files as a last check for errors. Once the merge is clean, the developers use the staging servers just as they would our public web site. The database on the staging server is updated nightly from production, so it is as close to a production view of our site as you can get without being on production. Assuming the application performs as expected, the developer uses the merge tool to commit the changes to the production branch. They then use the production staging servers to deploy.
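A minimal version of that lint pass can be sketched in shell. The LINT variable defaults to php -l as described above but is overridable; the function name and layout are assumptions for illustration, not our actual tool.

```shell
#!/bin/sh
# Sketch of the pre-roll syntax check: run a linter over every .php
# and .html file in a tree and report each failure. LINT defaults to
# "php -l" per the post, but can be overridden (e.g. for testing).
LINT="${LINT:-php -l}"

lint_tree() {
  dir="$1"
  fail=0
  # paths containing whitespace are omitted for brevity in this sketch
  for f in $(find "$dir" -name '*.php' -o -name '*.html'); do
    if ! $LINT "$f" >/dev/null 2>&1; then
      echo "lint failed: $f"
      fail=1
    fi
  done
  return $fail
}
```

Returning nonzero on any failure lets the merge tool abort the roll before anything reaches the production branch.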

Rolling to Production

For deploying code and hands-on configuration changes to our production systems, we have a staging server in our primary data center. The developer (that is key, IMO) logs in to the production staging server as a restricted user and uses our Makefile to update the checkout and rsync the changes to the servers. Each configuration environment has an accompanying nodes file that lists the servers that are to receive code from that checkout. This ensures that code is rolled to servers in the correct order: if an application server gets new markup before the supporting CSS or images are loaded onto the CDN source servers, you can get an ugly page. The Makefile is also capable of copying files to a single node. We will often do this for big changes. We can remove a node from service, check code out to it, and access that server directly via VPN to review how the changes worked.
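The nodes-file idea can be sketched as a small shell loop. The file format, rsync flags, and the RSYNC override are assumptions for illustration; our actual rolls go through the Makefile.

```shell
#!/bin/sh
# Sketch of rolling a checkout to the hosts in a nodes file, in file
# order (CDN source servers listed before app servers, per the post).
# RSYNC is overridable so the loop can be dry-run with "echo rsync".
RSYNC="${RSYNC:-rsync}"

roll() {
  nodes_file="$1"   # one hostname per line, '#' comments allowed
  src="$2"          # local checkout, e.g. a trailing-slash source dir
  dest="$3"         # remote document root
  while read -r node; do
    case "$node" in ''|'#'*) continue ;; esac
    echo "rolling to $node"
    $RSYNC -az --delete "$src" "$node:$dest" || return 1
  done < "$nodes_file"
}
```

Because the loop walks the file top to bottom, ordering the nodes file is all it takes to guarantee the CDN sources get assets before any app server gets markup.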

For some services (cron, syslog, ssh, snmp and ntp) we use Puppet to manage configuration and to ensure the packages are installed. Puppet and Gentoo get along great. If someone mistakenly uninstalls cron, Puppet will put it back for us. (I don't know how that could happen, but ya never know). We hope to deploy more and more Puppet as we get comfortable with it.
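As a sketch, a minimal Puppet manifest for the cron case might look like the following. This is not our actual manifest, and the package name is an assumption (vixie-cron was the usual Gentoo cron at the time).

```puppet
# Ensure cron is installed and running; Puppet puts it back if it
# ever goes missing. Package name is a Gentoo-era assumption.
package { 'vixie-cron':
  ensure => installed,
}

service { 'vixie-cron':
  ensure  => running,
  enable  => true,
  require => Package['vixie-cron'],
}
```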

Keeping Everyone in the Loop

Having everyone know what is going on is important. To do that, we start with Trac for ticketing. Second, we use the OpenFire XMPP server throughout the company. The devops team has a channel that everyone is in all day. When someone rolls code to production, the scripts mentioned above that sync code out to the servers send a message via an XMPP bot that we wrote in Ruby (Ruby has the best multi-user chat libraries for XMPP). The bot interfaces with Trac via HTTP and tells everyone what changesets were just rolled and who committed them. So, if something breaks five minutes later, we can go back and look at what just rolled.
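The bot itself is Ruby, but the shape of its announcement can be sketched with a toy shell formatter. The function name and message format are made up; the real bot pulls changeset details from Trac over HTTP.

```shell
#!/bin/sh
# Toy sketch of formatting a roll announcement for the chat channel.
# Reads "revision author" pairs on stdin (in real life these come
# from Trac via HTTP) and prints one announcement block.
format_announcement() {
  roller="$1"
  echo "production roll by $roller:"
  while read -r rev author; do
    [ -n "$rev" ] || continue
    echo "  changeset r$rev committed by $author"
  done
}
```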

In addition to bots telling us things, there is a cultural requirement. Often before a big rollout, we will discuss it in chat. That is the part that cannot be scripted or programmed. You have to get your developers and operations talking to each other about things.

Final Thoughts

There are some subtle concepts in this post that may not be clear. One is that the code that is written on a development server is the exact same code that is used on a production server. It is not massaged in any way. Things like database server names, passwords, etc. are kept in configuration files on each node, tailored to the data center that server lives in. Another, which I want to point out again, is that the person who wrote the code is responsible for it all the way through to production. While at first this may make some developers nervous, it eventually gives them a sense of ownership. Of course, we don't hire someone off the street and give them that access. But it is expected that all developers will have that responsibility eventually.
16 comments

Rehan Says:

"We have a PHP script that can merge from a developer branch into trunk with conflict resolution assistance built in."

Sounds cool, any chance you could share it? Does it differ significantly from svnmerge.py?


Mark R Says:

I guess that kind of thing might be acceptable in a startup or a free public web site where problems don't lose you actual customers.

However, without any kind of change control, it sounds like a nightmare in practice.

Having developers actually deploy stuff to production is definitely a recipe for disaster, and will distract them from doing development.

A proper system test cycle, with tests for the actual changes in a release that has been scheduled and signed off by management, sounds like a sound principle, at least.

For an important application, this is a joke of a release process.


Patrick Debois Says:

Approach Mark R: Give the engineers a fish (require sign-off from management) and things will improve a bit (they comply with the process).

Approach Brian: Teach engineers how to fish (by being responsible for everything) and things will vastly improve.

To me, Brian's team is way more efficient than Mark R's. Way to go Brian!


John Allspaw Says:

Mark R:

The process/culture/tools/methodology Brian is blogging about:
- Does work, and works well (very similar at Flickr)
- Does not at all prevent change control/management from happening
- Can (and does, proven elsewhere as well) actually *increase* availability and *decrease* change-related incidents


Mark R Says:

There are two basic ways of release management, in my understanding:

1. Commit changes into the head, then take a branch off head for a release, stabilise that release (backporting bugfixes as necessary), system test, and release. If hotfixes are required, make them on a release branch.

2. Commit changes to a release branch, and merge back into head just before a release.

Now you appear to be doing neither (unless I have misunderstood).

System testing the same build that you intend to roll to production seems like a reasonable idea, and apart from some sanity checks, your process does not seem to involve doing this.

As far as actually DOING a release of a tested build, this should be done by Ops engineers, not developers. Having developers throw any old thing out without the knowledge or consent of Ops is going to make their life impossible. Ops engineers need to know what software set is currently running and be aware of a change when trying to diagnose tricky operational issues; having things magically change underneath them will not make life easier.

Mark


Brian Moon Says:

That sounds like a great model for compiled, installed software, and that is where it comes from. The web simply does not work that way. Companies that try to shoehorn that model onto the web will find themselves developing slowly and will always be behind their competitors.


John Allspaw Says:

Mark:

I think you're under the assumption that developers pushing code equals keeping ops in the dark about it, and that code in that situation equals 'any old thing'. Neither is true at Flickr and Etsy, and I'm willing to bet it's not true at Dealnews either.

In fact, part of this whole idea is to actually *increase* communication (about everything, really, not just change) between development and operations.

I'm agreeing with you that not communicating = bad. No one's suggesting that. :)


Michael Mucha Says:

I was in a "devops" environment at Exodus from '99 to 2003, without us knowing that it was some sort of unusual thing. It was just the organic, obvious right way to do things. So I'm scratching my head a bit right now at it being called a "movement" and such. But I'm glad to see the notion gaining ground. No more throwing code over the wall to the next team.


br41n Says:

"Puppet and Gentoo get along great" ??
How come? I mean, reading http://projects.puppetlabs.com/projects/puppet/wiki/Puppet_Gentoo it looks like there are some serious issues, especially with the USE flags.
So how do you handle that?
Personally, I'm a bit surprised that after so much time they haven't found a way to make that work, but I guess there aren't many crazy admins who use Gentoo on their production systems like we do :)


Brian Moon Says:

@br41n Well, I guess it just works great for us. We don't use Puppet to install our systems or maintain portage. We take a little more care when doing world emerges. I am still a little wary of letting robots emerge stuff.
