DevOps at dealnews.com

I was telling someone how we roll changes to production at dealnews and they seemed really amazed by it. I have never really thought it was that impressive. It just made sense. It has kind of happened organically here over the years. Anyhow, I thought I would share.

Version Control

So, to start with, everything is in SVN: PHP code, Apache configs, DNS and even the scripts we use to deploy code. That is huge. We even have a misc directory in SVN where we put any useful scripts we use on our laptops for managing our code base. Everyone can share that way. Everyone can see what changed when. We can roll things back, branch if we need to, etc. I don't know how anyone lives without it. We did, way back when. It was bad. People were stepping on each other. It was a mess. We quickly decided it did not work.

For our PHP code, we have trunk and a production branch. There are also a couple of developers (me included) who like to have their own branch because they break things for weeks at a time. But everything goes from a developer branch into trunk before going into production. We have a PHP script that can merge from a developer branch into trunk with conflict resolution assistance built in. It is also capable of merging changes from trunk back into a branch. Once code is in trunk, we use our staging environment to put it into production.

Staging/Testing

Everything has a staging point. For our PHP code, it is a set of test staging servers in our home office that have a checkout of the production branch. To roll code, the developer working on the project logs in via ssh to a staging server as a restricted user and uses a tool we created that is similar to the Python based svnmerge.py. Ours is written in PHP and tailored for our directory structure and roll out procedures. It also runs php -l on all .php and .html files as a last check for any errors. Once the merge is clean, the developer(s) use the staging servers just as they would our public web site. The database on the staging server is updated nightly from production. It is as close to a production view of our site as you can get without being on production. Assuming the application performs as expected, the developer uses the merge tool to commit the changes to the production branch. They then use the production staging servers to deploy.
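For the curious, the php -l pass is simple to reproduce. Here is a minimal sketch, not our actual tool; the checkout path is made up:

<?php

// Walk the checkout and run php -l on every .php and .html file.
// The path is made up for this example.
$dir = new RecursiveDirectoryIterator("/var/checkouts/production");

foreach (new RecursiveIteratorIterator($dir) as $file) {
    if (preg_match('/\.(php|html)$/', $file->getFilename())) {
        // php -l exits non-zero when the file has a syntax error
        exec("php -l " . escapeshellarg($file->getPathname()), $output, $ret);
        if ($ret !== 0) {
            echo "Syntax error in " . $file->getPathname() . "\n";
            exit(1);
        }
    }
}

?>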

Rolling to Production

For deploying code and hands-on configuration changes to our production systems, we have staging servers in our primary data center. The developer (that is key, IMO) logs in to a production staging server as a restricted user and uses our Makefile to update the checkout and rsync the changes to the servers. Each configuration environment has an accompanying nodes file that lists the servers that are to receive code from the checkout. This ensures that code is rolled to servers in the correct order. If an application server gets new markup before the supporting CSS or images are loaded onto the CDN source servers, you can get an ugly page. The Makefile is also capable of copying files to a single node. We will often do this for big changes. We can remove a node from service, check code out to it, and access that server directly via VPN to review how the changes worked.

For some services (cron, syslog, ssh, snmp and ntp) we use Puppet to manage configuration and to ensure the packages are installed. Puppet and Gentoo get along great. If someone mistakenly uninstalls cron, Puppet will put it back for us. (I don't know how that could happen, but ya never know). We hope to deploy more and more Puppet as we get comfortable with it.

Keeping Everyone in the Loop

Having everyone know what is going on is important. To do that, we start with Trac for ticketing. Second, we use the OpenFire XMPP server throughout the company. The devops team has a channel that everyone is in all day. When someone rolls code to production, the scripts mentioned above that sync code out to the servers send a message via an XMPP bot that we wrote in Ruby (Ruby has the best multi-user chat libraries for XMPP). It interfaces with Trac via HTTP and tells everyone what changesets were just rolled and who committed them. So, if something breaks five minutes later, we can go back and look at what just rolled.

In addition to bots telling us things, there is a cultural requirement. Often, before a big roll out, we will discuss it in chat. That is the part that cannot be scripted or programmed. You have to get your developers and operations people talking to each other about things.

Final Thoughts

There are some subtle concepts in this post that may not be clear. One is that the code that is written on a development server is the exact same code that is used on a production server. It is not massaged in any way. Things like database server names, passwords, etc. are all kept in configuration files on each node, tailored for the data center that server lives in. Another thing I want to point out again is that the person who wrote the code is responsible for it all the way through to production. While at first this may make some developers nervous, it eventually gives them a sense of ownership. Of course, we don't hire someone off the street and give them that access. But it is expected that all developers will have that responsibility eventually.

Logging with MySQL

I was reading Dathan's post, "Cassandra is my NoSQL solution but..". In the post, Dathan explains that he uses Cassandra to store clicks because it can write a lot faster than MySQL. However, he runs into problems with the read speed when he needs to get a range of data back from Cassandra. This is the number one problem I have with NoSQL solutions.

SQL is really good at retrieving a set of data based on a key or a range of keys, whereas NoSQL products are really good at writing things and retrieving single items from storage. When looking at redoing our architecture a few years ago to be more scalable, I had to consider these two issues. For what it is worth, the NoSQL market was not nearly as mature then as it is now, so my choices were much more limited. In the end, we decided to stick with MySQL. It turns out that a primary or unique key lookup on a MySQL/InnoDB table is really fast. It is sort of like having a key/value storage system. And I can still do range based queries against it.

But, back to Dathan's problem: clicks. We store clicks at dealnews. Lots of clicks. We also store views. We store more views than we do clicks. So, lots of views and lots of clicks. (Sorry for the vague numbers, company secrets and all. We are a top 1,000 Compete.com site during peak shopping season.) And we do it all in MySQL. And we do it all with one server. I should disclose we are deploying a second server, but it is more for high availability than processing power. Like Dathan, we only use about the last 24 hours of data at any given time. There are three keys to doing this kind of logging in MySQL.

Use MyISAM

MyISAM supports concurrent inserts. With concurrent inserts, rows can be added to the end of a table while selects are being performed on other parts of the data set. This is exactly the use case for our logging. There are caveats with range queries, as pointed out by the MySQL Performance Blog.
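For reference, the concurrent_insert server variable controls this behavior. The default of 1 enables concurrent inserts as long as the table has no holes from deleted rows; since we never delete (see below), that is enough. Setting it to 2 allows concurrent inserts even with holes:

show variables like 'concurrent_insert';
set global concurrent_insert = 2;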

Rotating tables

MySQL (and InnoDB in particular) really sucks at deleting rows. Like, really sucks. Deleting causes locks. Bleh. So, we never delete rows from our logging tables. Instead, we rotate the tables nightly. RENAME TABLE is a (nearly) atomic operation in MySQL. So, we just create a new table:
create table clicks_new like clicks;
rename table clicks to clicks_2010032500001, clicks_new to clicks;

Tada! We now have an empty table for today's clicks. We then drop any table with a date stamp more than x days old. Drops are fast; we like drops.

For querying these tables, we use UNION. It works really well. We just issue a SHOW TABLES LIKE 'clicks%' and union a query across all the tables. Works like a charm.
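Here is a rough sketch of how that looks. It assumes a mysqli connection in $db; the created column is made up for the example:

<?php

// Find all the rotated click tables.
$tables = array();
$result = $db->query("SHOW TABLES LIKE 'clicks%'");
while ($row = $result->fetch_row()) {
    $tables[] = $row[0];
}

// Build one select per table and union them together.
$selects = array();
foreach ($tables as $table) {
    $selects[] = "(select * from `$table` where created > now() - interval 1 day)";
}

$clicks = $db->query(implode(" union all ", $selects));

?>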

Gearman

So, I get a lot of flak at work for my outright lust for Gearman. It is my new duct tape. When you have a scalability problem, there is a good chance you can solve it with Gearman. So, how does this help with logging to MySQL? Well, sometimes MySQL can get backed up with inserts. It happens to the best of us. So, instead of letting that pile up in our web requests, we let it pile up in Gearman. Instead of having our web scripts write to MySQL directly, we have them fire Gearman background jobs with the logging data in them. The Gearman workers can then write to the MySQL server when it is available. Under normal conditions, that happens in near real time. But if the MySQL server does get backed up, the jobs just queue up in Gearman and are processed when the MySQL server is available.
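Here is a minimal sketch of both halves using the pecl/gearman extension. The log_click job name, the servers and the table layout are all made up:

<?php

// Web request side: fire and forget a background job.
$client = new GearmanClient();
$client->addServer("127.0.0.1", 4730);
$client->doBackground("log_click", json_encode(array("deal_id" => 123)));

// Worker side: runs as a daemon and writes to MySQL as it is able.
$worker = new GearmanWorker();
$worker->addServer("127.0.0.1", 4730);
$worker->addFunction("log_click", "log_click_to_mysql");
while ($worker->work());

function log_click_to_mysql(GearmanJob $job) {
    $data = json_decode($job->workload(), true);
    $db   = new MySQLi("loghost", "user", "pass", "logs");
    $stmt = $db->prepare("insert into clicks (deal_id, created) values (?, now())");
    $stmt->bind_param("i", $data["deal_id"]);
    $stmt->execute();
}

?>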

BONUS! Insert Delayed

This is our old trick from before we used Gearman. MySQL (MyISAM) has a neat feature where you can have inserts delayed until the table is available. The query is sent to the MySQL server and it immediately answers the client with success. This means your web script can continue on and not get blocked waiting for the insert. But MySQL will only queue up so many delayed inserts before it starts erroring out, so it is not as foolproof as a job processing system like Gearman.
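The syntax is just a normal insert with one extra keyword. The table and columns here are made up:

insert delayed into clicks (deal_id, created) values (123, now());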

Summary

To log with MySQL:
  • Use MyISAM with concurrent inserts
  • Rotate tables daily and use UNION to query
  • Use delayed inserts with MySQL or a job processing agent like Gearman
Happy logging!

PS: You may be asking, "Brian, what about partitioned tables?" I asked myself that before deploying this solution. More importantly, I asked Brian Aker about MySQL partitioned tables in IRC. I am paraphrasing, but he said that if I ever thought I might alter that table, he would not trust it with MySQL's partitions. So, that kind of turned me off of them.

Separating Apache logs by virtualhost with Lua

By default, most distributions use logrotate to rotate Apache logs. Or worse, they don't rotate them at all. I find the idea of a cron job restarting my web server every night very disturbing. So, years ago, we started using cronolog. Cronolog separates logs using a date/time picture, so you get nice logs per day.
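If you have not seen cronolog, it runs as a piped log command in the Apache configuration. Something like this (paths made up) starts a new log file each day with no server restarts:

CustomLog "|/usr/sbin/cronolog /var/log/apache/%Y/%m/%d/access.log" combined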

But what if you are running five or six virtual hosts on the server? Do you really want all those logs in one file? You might. I don't. So, we ended up running one cronolog command per virtual host. At one time, this was 10 cronolog processes. Granted, they are tiny, at about 500k of resident memory each. But it still seemed like a waste. Enter vlogger. Vlogger can take a virtual host name in its file name picture, and it will create the directories if they do not exist. So now we could have logs separated by virtual host and date. All was good.

But vlogger has not been updated in a while. It started spitting out errors, right into my access logs, and I could not find a solution. The incoming log data did not change. My best guess is that some Perl library it used changed and broke it. So, here I am again with cronolog.

I decided I could just write one. So, I started thinking about the problem. It needed to be small. PHP would be a stupid choice; one PHP process would use more memory than 10 cronolog processes. I decided on Lua.

"Lua is a powerful, fast, lightweight, embeddable scripting language." It is also usable as a shell scripting language, which is what I needed. So, I got to hacking and came up with a script that does the job quite well. When running, it uses about 800k of resident memory. You can download the script here on my site.

vlualogger - 3.7k

Developers should write code for production

Having development and staging environments that reflect production is a key component of DevOps.  An example for us is dealing with our CDN.

I can imagine that in some dysfunctional, fragmented company, a developer works on a web application and sticks all the images in the local directory with his scripts. Then some operations/deployment guy has to first move the images where they need to be and then change all the code that references them. If he is lucky, he has a script that does it for him. This is a needless exercise. If you have a development environment that looks and acts like production, this is all handled for you.

Here is an example of how it works for us. We use a CDN for all images, javascript, CSS and more. Those files come from a set of domains: s1.dlnws.com - s5.dlnws.com. So, our dev environments have similar domains. somedev.s5.dev.dlnws.com points to a virtual server. We then use mod_substitute in Apache to rewrite those URLs on the dev machine. Each developer and staging instance will have an Apache configuration such as:
Substitute "s|http://s1.dlnws.com|http://somedev.s1.dev.dlnws.com|in"
Substitute "s|http://s2.dlnws.com|http://somedev.s2.dev.dlnws.com|in"
Substitute "s|http://s3.dlnws.com|http://somedev.s3.dev.dlnws.com|in"
Substitute "s|http://s4.dlnws.com|http://somedev.s4.dev.dlnws.com|in"
Substitute "s|http://s5.dlnws.com|http://somedev.s5.dev.dlnws.com|in"
So our developers put the production URLs for images into our code. When they test on the development environment, they get URLs that point to their instance, not production. No fussing with images after the fact.

In addition to this, we use mod_proxy to emulate our production load balancers. Special request routing happens in production. We need to see that when developing so we don't deploy code that does not work in that circumstance. If the load balancers send all requests coming in to /somedir to a different set of servers, we have mod_proxy do the same thing to a different VirtualHost in our Apache configuration, as sketched below. It is not always perfect, but it gets us as close to production as we can get without buying very expensive load balancers for our development environments.
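As a sketch, such a rule in the dev Apache configuration might look like this; the hostname is made up:

ProxyPass        /somedir http://somedev.app2.dev.dlnws.com/somedir
ProxyPassReverse /somedir http://somedev.app2.dev.dlnws.com/somedir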

Of course, we did not come to this overnight. It took us years to get to this point. Maybe it won't take you that long. Keep in mind when creating your development environments to make them work like production. It is neat to be able to write code on your laptop. I did it for years. But at some point before you send code out to production, the developer should run it in a production-like environment. Then deploying will be much easier.

DevOps is the only way it has ever made sense

DevOps is the label being given to the way we have always done things. This is not the first time this has happened. As it says on my About Me page,

Brian Moon has been working with the LAMP platform since before it was called LAMP.

At some point, not sure when, someone came up with LAMP. I started working on what is now considered LAMP in 1996. I have seen lots of acronyms come and some go. We started using "Remote Scripting" after hearing Terry Chay talk about it at OSCON. The next OSCON, AJAX was all the rage. Technically, we never used AJAX. The X stands for XML. We didn't use XML. What made sense for us was to send back javascript arrays and objects that the javascript interpreter could deal with easily. We wrote a PHP function called to_javascript that converted a PHP array into a javascript array. Sound familiar? Yeah, two years later, JSON was all the rage.

We also have seen the same thing with how we run our development process.  We always considered our team to be an agile development team. That is agile with little a. Nowadays, "Agile" with the big A is usually all about how you develop software and not about actually delivering the software. So, I am always perplexed when people ask me if we use "Agile" development. Are they talking little a or big A?

Today I came across the term DevOps on twitter (there is no Wikipedia page yet). We have always had an integrated development and operations team. I could be writing code in the morning and configuring servers in the afternoon. Developers all have some level of responsibility for managing their development environment. They update their Apache configurations from SVN and make changes as needed for their applications. The development environments simulate production as closely as possible. Developers roll code to the production servers. It is their responsibility to make sure it works on production. They also roll it when it is ready rather than letting it sit around for days. This means that if there is an unforeseen issue, the code is fresh on their minds and the problem can be quickly solved. We have done things this way since 1998. We are not the only ones. The guys at Flickr gave a great talk last year at Velocity about their DevOps environment. People were amazed at how their teams worked together.

One of the huge benefits of being a DevOps team is that we can utilize the full stack in our application. If we can use the load balancers, Apache or our proxy servers to do something that offloads our application servers, we plan for that in the development cycle. It is a forethought instead of an afterthought. I see lots of PHP developers that do everything in code. Their web servers and hardware are just there to run their code. Don't waste those resources. They can do lots of things for you.

One cool thing about this is that I now have a label to use when people ask us about our team. I can now say we are an agile DevOps team. They can then go look that up and see what it means. Maybe it will lead to less explanation of how we work too. And if we are lucky, maybe we can find people to hire that have been in a similar environment.

So, I welcome all the new people into the "DevOps movement". Adopt the ideas and avoid any rules that books, blogs, "experts" may want to come up with. The first time I see someone list themselves as a DevOps management specialist, I will die a little on the inside. It is not a set of rules, it is a way of thinking, even a way of life. If the process prevents you from doing something, you are using it wrong, IMO.

State of the Browsers and ad blocking

In my last post about CSS layout and ads, a commenter brought up that the dealnews.com web site did not handle extensions like Ad Block very gracefully. To which I responded that I don't care. To which he responded with download counts. Well, the reason I don't care is that ad impressions on dealnews.com are within 2% of page views. So, at best, less than 2% of users are blocking ads. In reality, that gap also includes DNS failures, network issues, and the like. I would bet our logo graphic shows about the same difference. The reality is that normal people don't block ads. In my opinion, if you make your money by working on the web, you shouldn't either. I should add that this site (my geeky blog) had ad views about 16% lower than recorded page views. So, geeks block ads more, I guess. But geeks have dominated the web for a long time.

This got me thinking that I had not looked at the browser stats much lately. dealnews has a very odd graph on browser statistics. We do not follow the industry averages. Our audience is predominantly tech savvy (that does not mean geeks). Our users don't just use the stuff that is installed on the computer when they get it. This kind of proves my point about ad blocking even more. We have non-moron users and they still don't block ads.



Browser             % of Visits
Internet Explorer   42.34%
Firefox             36.94%
Safari               9.55%
Chrome               8.34%
Mozilla              1.46%
Opera                0.68%
Netscape             0.41%
Avant                0.08%
Camino               0.06%
IE Mobile            0.02%

As you can see, Firefox is very prevalent on our site. We generally test in IE7/8, Firefox 3, Safari and Chrome. I will occasionally test a major change in Opera. Typically, well-formed HTML and CSS work fine in Opera, so everything is all good.

As for operating systems, Windows still dominates, but we have more Macs than the average site I would guess.



OS          % of Visits
Windows     82.95%
Macintosh   11.27%
iPhone       3.80%
Linux        1.19%
Android      0.17%

Interesting that iPhone beats out Linux. That is just another sign to me that Linux is still not a real choice for real people. Whether that is a product issue from OEMs or user choice is debatable. It is notable that most of our company uses Macs. I don't think we make up a speck of that traffic though. If we did, our home state of Alabama would be our most dominant state. It isn't. We are very typical in that regard: California is number one. We only have one employee there.

Tables for layout

I just read A case for table-based design and was thrilled to know I am not the only one that drowns in div soup from time to time. For the most part, I do not use tables for layout, but there are some cases where I just can't make a set of divs do my bidding. The classic example is a two column layout where the left column LOADS FIRST and is elastic and the right column is a fixed size. The "loads first" part is important in a world where the rendering time of pages matters. Ideally, the most important content on any page would render first for the user.

In my case, the fixed column is an ad. As a web developer, I don't care when the ad loads. The ads are a necessary evil in my page layout. I must ensure that they load in an acceptable time frame, but certainly not as the first thing on the page. The specific layout I am talking about is the top of dealnews.com. It has a fixed size 300x250 ad on the right of the page, and the left side is elastic. I fiddled with divs for hours to get that to act the way I wanted. We use the grid CSS from OOCSS.org, and a wonderful piece of CSS it is. But even with that in hand, I could not get the elements to behave, in all browsers, the way I could with a simple two column table where the left column's width is set to 100% and the right column contains a div 300 pixels wide. It was so easy to pull that off. Maybe CSS3 will solve this problem? I don't know. If you have the magic CSS that can do what this page does, let me know.
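For the curious, here is roughly the markup I mean, trimmed down with the class names and real content omitted:

<table width="100%">
<tr>
    <td width="100%">
        <!-- elastic column, first in source order, renders first -->
    </td>
    <td>
        <div style="width: 300px;"><!-- fixed 300x250 ad --></div>
    </td>
</tr>
</table>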

ob_start and HTTP headers

I was helping someone in IRC deal with some "headers already sent" issues and told them to use ob_start. Very diligently, the person went looking for why that was the right answer. He did not find a good explanation. I looked around and did not find one either. So, here is why this happens and why ob_start can fix it.

How HTTP works

HTTP is the communication protocol used between your web server and the user's browser. Without getting into too much detail, the data is broken into two pieces: headers and the body. The body is the HTML you send. But before the body is sent, the HTTP headers are sent. Here is an example of an HTTP response, including headers:
HTTP/1.1 200 OK
Date: Fri, 29 Jan 2010 15:30:34 GMT
Server: Apache
X-Powered-By: PHP/5.2.12-pl0-gentoo
Set-Cookie: WCSESSID=xxxxxxxxxxxxxxxxxxxxxxxxxxxx; expires=Sun, 28-Feb-2010 15:30:34 GMT; path=/
Content-Encoding: gzip
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Ramblings of a web guy</title>
.
.
So, all those header lines have to come before the HTML starts. HTTP headers are where things like cookies and redirection occur. When a PHP script starts to send HTML out to the browser, the headers are closed off and the body begins. When your code tries to set a cookie after this has started, you get the "headers already sent" error message.

How ob_start works

So, how does ob_start help? The ob in ob_start stands for output buffering. ob_start will buffer the output (HTML) until the page is completely done. Once the page is completely done, the headers are sent and then the output is sent. This means any calls to setcookie or the header function will not cause an error and will be sent to the browser properly. You do need to call ob_start before any output occurs. If you start output, it is too late.
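Here is a trivial example, using the session cookie name from the response above:

<?php

// Buffer all output from here on instead of sending it right away.
ob_start();

echo "<html><body>Hello, world!</body></html>";

// Without the buffer, this would trigger "headers already sent"
// because output has started. With the buffer, nothing has gone
// to the browser yet, so the cookie header is sent just fine.
setcookie("WCSESSID", "xxxxxxxx", time() + 86400 * 30, "/");

// Send the headers, then the buffered output.
ob_end_flush();

?>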

The down side

The down side of doing this is that the output is buffered and sent all at once. That means that the time between the user request and the time the first byte gets back to the user is longer than it has to be. However, in modern PHP application design, this is often already the case. An MVC framework for example would do all the data gathering before any presentation is done. So, your application may not have any issue with this.

Another down side is that you (or someone) could get lazy and start throwing setcookie calls in any old place. This should be avoided. It is simply not good programming design. In a perfect world, we would not need output buffering to solve this problem for us.

Using ini files for PHP application settings

At dealnews we have three tiers of servers. First is our development servers, then staging and finally production. The complexity of the environment increases at each level. On a development server, everything runs on the localhost: mysql, memcached, etc. At the staging level, there is a dedicated MySQL server. In production, it gets quite wild with redundant services and two data centers.

One of the challenges of this is where and how to store the connection information for all these services. We have done several things in the past. The most common thing is to store this information in a PHP file. It may be per server or there could be one big file like:

<?php

if(DEV){
    $server = "localhost";
} else {
    $server = "10.1.1.25";
}

?>


This gets messy quickly. Option two is to deploy a single file that has the settings in a PHP array. And that is a good option. But, we have taken that one step further using some PHP ini trickeration. We use ini files that are loaded at PHP's startup and therefore the information is kept in PHP's memory at all times.

When compiling PHP, you can specify --with-config-file-scan-dir to tell PHP to look in that directory for additional ini files. Any ini files it finds will be parsed when PHP starts up. Some distros (Gentoo, I know for sure) use this for enabling/disabling PHP extensions via configuration. For our purposes, we put our custom configuration files in this directory. FWIW, you could just put these settings into php.ini, but that gets quite messy, IMO.

To get to this information, you can't use ini_get() as you might think. No, you have to use get_cfg_var() instead. get_cfg_var() returns the setting found in php.ini or any other ini file parsed when PHP started. ini_get() will only return values that are registered by an extension or the PHP core. Likewise, you can't use ini_set() on these variables. Also, get_cfg_var() will always reflect the initial value from the ini file and not anything changed with ini_set().

So, let's look at an example.

; db.ini
[myconfig]
myconfig.db.mydb.db     = mydb
myconfig.db.mydb.user   = user
myconfig.db.mydb.pass   = pass
myconfig.db.mydb.server = host


This is our ini file. The group name in brackets is just for looks; it has no impact on our usage. Because this file is parsed along with the rest of our php.ini, the settings need a unique namespace within the ini scope. That is what the myconfig prefix is for. We could have used a DSN style here, but it would have required more parsing in our PHP code.

<?php

/**
 * Creates a MySQLi instance using the settings from ini files
 *
 * @author     Brian Moon <brianm@dealnews.com>
 * @copyright  1997-Present dealnews.com, Inc.
 *
 */

class MyDB {

    /**
     * Namespace for my settings in the ini file
     */
    const INI_NAMESPACE = "myconfig";

    /**
     * Creates a MySQLi instance using the settings from ini files
     *
     * @param   string  $group  The group of settings to load.
     * @return  object
     *
     */
    public static function init($group) {

        static $dbs = array();

        if(!is_string($group)) {
            throw new Exception("Invalid group requested");
        }

        if(empty($dbs[$group])){

            $prefix = MyDB::INI_NAMESPACE.".db.$group";

            $db   = get_cfg_var("$prefix.db");
            $host = get_cfg_var("$prefix.server");
            $user = get_cfg_var("$prefix.user");
            $pass = get_cfg_var("$prefix.pass");

            $port = get_cfg_var("$prefix.port");
            if(empty($port)){
                $port = null;
            }

            $sock = get_cfg_var("$prefix.socket");
            if(empty($sock)){
                $sock = null;
            }

            $dbs[$group] = new MySQLi($host, $user, $pass, $db, $port, $sock);

            if(!$dbs[$group] || $dbs[$group]->connect_errno){
                throw new Exception("Invalid MySQL parameters for $group");
            }
        }

        return $dbs[$group];

    }

}

?>


We can now call MyDB::init("mydb") and get a mysqli object that is connected to the database we want. No file IO is needed to load these settings except when the PHP process starts initially. They are truly constant and will not change while the process is running.
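Usage then looks like this (the query is just for illustration):

<?php

$db  = MyDB::init("mydb");
$res = $db->query("select * from some_table");

?>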

Once this was working, we created separate ini files for our different datacenters. That is now simply configuration information just like routing or networking configuration. No more worrying in code about where we are.

We extended this to all our services: memcached, gearman, whatever. We keep all our configuration in one file rather than having lots of them. It just makes administration easier. Each location has unique settings, but every server in that location has the same configuration.

Here is a more realistic example of how we set up our files.

[myconfig.db]
myconfig.db.db1.db         = db1
myconfig.db.db1.server     = db1hostname
myconfig.db.db1.user       = db1username
myconfig.db.db1.pass       = db1password

myconfig.db.db2.db         = db2
myconfig.db.db2.server     = db2hostname
myconfig.db.db2.user       = db2username
myconfig.db.db2.pass       = db2password

[myconfig.memcache]
myconfig.memcache.app.servers    = 10.1.20.1,10.1.20.2,10.1.20.3
myconfig.memcache.proxy.servers  = 10.1.20.4,10.1.20.5,10.1.20.6

[myconfig.gearman]
myconfig.gearman.workload1.servers = 10.1.20.20
myconfig.gearman.workload2.servers = 10.1.20.21