So, to start with, everything is in SVN. PHP code, Apache configs, DNS and even the scripts we use to deploy code. That is huge. We even have a misc directory in SVN where we put any useful scripts we use on our laptops for managing our code base. Everyone can share that way. Everyone can see what changed when. We can roll things back, branch if we need to, etc. I don't know how anyone lives with out. We did way back when. It was bad. People were stepping on each other. It was a mess. We quickly decided it did not work.
For our PHP code, we have trunk and a production branch. There are also a couple of developers (me) that like to have their own branch because they break things for weeks at a time. But, everything goes into trunk from my branch before going into production. We have a PHP script that can merge from a developer branch into trunk with conflict resolution assistance built in. It is also capable of merging changes from trunk back into a branch. Once it is in trunk we use our staging environment to put it into production.
Everything has a staging point. For our PHP code, it is a set of test staging servers in our home office that have a checkout of the production branch. To roll code, the developer working on the project logs in via ssh to a staging server as a restricted user and uses a tool we created that is similar to the Python based svnmerge.py. Ours is written in PHP and tailored for our directory structure and roll out procedures. It also runs php -l on all .php and .html files as a last check for any errors. Once the merge is clean, the developer(s) use the staging servers just as they would our public web site. The database on the staging server is updated nightly from production. It is as close to a production view of our site as you can get without being on production. Assuming the application performs as expected, the developer uses the merge tool to commit the changes to the production branch. They then use the production staging servers to deploy.
Rolling to Production
For deploying code and hands on configuration changes into our production systems, we have a staging server in our primary data center. The developer (that is key IMO) logs in to the production staging servers, as a restricted user, and uses our Makefile to update the checkout and rsync the changes to the servers. Each different configuration environment has an accompanying nodes file that lists the servers that are to receive code from the checkout. This ensures that code is rolled to servers in the correct order. If an application server gets new markup before the supporting CSS or images are loaded onto the CDN source servers, you can get an ugly page. The Makefile is also capable of copying files to a single node. We will often do this for big changes. We can remove a node from service, check code out to it, and via VPN access that server directly to review how the changes worked.
For some services (cron, syslog, ssh, snmp and ntp) we use Puppet to manage configuration and to ensure the packages are installed. Puppet and Gentoo get along great. If someone mistakenly uninstalls cron, Puppet will put it back for us. (I don't know how that could happen, but ya never know). We hope to deploy more and more Puppet as we get comfortable with it.
Keeping Everyone in the Loop
Having everyone know what is going on is important. To do that, we start with Trac for ticketing. Secondly, we use OpenFire XMPP server throughout the company. The devops team has a channel that everyone is in all day. When someone rolls code to production, the scripts mentioned above that sync code out to the servers sends a message via an XMPP bot that we wrote using Ruby (Ruby has the best multi-user chat libraries for XMPP). It interfaces with Trac via HTTP and tells everyone what changesets were just rolled and who committed them. So, in 5 minutes if something breaks, we can go back and look at what just rolled.
In addition to bots telling us things, there is a cultural requirement. Often before a big roll out, we will discuss it in chat. That is the part than can not be scripted or programmed. You have to get your developers and operations talking to each other about things.
There are some subtle concepts in this post that may not be clear. One is that the code that is written on a development server is the exact same code that is used on a production server. It is not massaged in any way. Things like database server names, passwords, etc. are all kept in configuration files on each node. They are tailored for the data center that server lives in. Another I want to point out again is that the person that wrote the code is responsible all the way through to production. While at first this may make some developers nervous, it eventually gives them a sense of ownership. Of course, we don't hire someone off the street and give them that access. But it is expected that all developers will have that responsibility eventually.