ob_start and HTTP headers

I was helping someone on IRC deal with some "headers already sent" issues and told them to use ob_start. Very diligently, the person went looking for why that was the right answer but did not find a good explanation. I looked around and did not find one either. So, here is why this happens and why ob_start can fix it.

How HTTP works

HTTP is the communication protocol used between your web server and the user's browser.  Without going into too much detail, the data is broken into two pieces: the headers and the body.  The body is the HTML you send. But, before the body is sent, the HTTP headers are sent. Here is an example of an HTTP response, including the headers:
HTTP/1.1 200 OK
Date: Fri, 29 Jan 2010 15:30:34 GMT
Server: Apache
X-Powered-By: PHP/5.2.12-pl0-gentoo
Set-Cookie: WCSESSID=xxxxxxxxxxxxxxxxxxxxxxxxxxxx; expires=Sun, 28-Feb-2010 15:30:34 GMT; path=/
Content-Encoding: gzip
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Ramblings of a web guy</title>
.
.
So, all those lines before the HTML starts have to come first. HTTP headers are where things like cookies and redirects happen. Once a PHP script starts sending HTML to the browser, the headers are finalized and sent, and the body begins. When your code tries to set a cookie after this has happened, you get the "headers already sent" error message.

How ob_start works

So, how does ob_start help? The ob in ob_start stands for output buffering. ob_start will buffer the output (HTML) until the page is completely done. Once the page is completely done, the headers are sent and then the output is sent. This means any calls to setcookie or the header function will not cause an error and will be sent to the browser properly. You do need to call ob_start before any output occurs. If you start output, it is too late.
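
Here is a minimal sketch of the pattern (the cookie and header values are just placeholders):

<?php

// Buffer all output so headers can still be sent later.
ob_start();

echo "<html><body>";
echo "Some page content...";

// Without the ob_start() above, these calls would trigger
// "headers already sent" warnings because output has begun.
setcookie("visited", "1", time() + 86400, "/");
header("X-Example: demo");

echo "</body></html>";

// The headers go out first, then the buffered body.
ob_end_flush();

?>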

The downside

The downside of doing this is that the output is buffered and sent all at once. That means the time between the user's request and the first byte getting back to them is longer than it has to be. However, in modern PHP application design, this is often already the case. An MVC framework, for example, would do all the data gathering before any presentation is done. So, your application may not have any issue with this.

Another downside is that you (or someone) could get lazy and start throwing setcookie calls in any old place. This should be avoided; it is simply not good design. In a perfect world, we would not need output buffering to solve this problem for us.

Using ini files for PHP application settings

At dealnews we have three tiers of servers. First are our development servers, then staging, and finally production. The complexity of the environment increases at each level. On a development server, everything runs on localhost: MySQL, memcached, etc. At the staging level, there is a dedicated MySQL server. In production, it gets quite wild, with redundant services and two data centers.

One of the challenges of this is where and how to store the connection information for all of these services. We have done several things in the past. The most common is to store this information in a PHP file. It may be one file per server, or one big file like:

<?php

// DEV would be a constant defined elsewhere based on the environment
if(DEV){
    $server = "localhost";
} else {
    $server = "10.1.1.25";
}

?>


This gets messy quickly. Option two is to deploy a single file that has the settings in a PHP array, and that is a good option. But we have taken it one step further using some PHP ini trickery. We use ini files that are loaded at PHP's startup, so the information is kept in PHP's memory at all times.

When compiling PHP, you can specify --with-config-file-scan-dir to tell PHP to look in that directory for additional ini files. Any it finds will be parsed when PHP starts up. Some distros (Gentoo, I know for sure) use this for enabling/disabling PHP extensions via configuration. For our uses, we put our custom configuration files in this directory. FWIW, you could just put the settings shown below into php.ini itself, but that is quite messy, IMO.

To get to this information, you can't use ini_get() as you might think. You have to use get_cfg_var() instead. get_cfg_var() returns any setting that was present in php.ini, or in any other ini file loaded when PHP started. ini_get() will only return values that are registered by an extension or by the PHP core. Likewise, you can't use ini_set() on these variables, and get_cfg_var() will always reflect the initial value from the ini file, not anything changed later with ini_set().

So, let's look at an example.

; db.ini
[myconfig]
myconfig.db.mydb.db     = mydb
myconfig.db.mydb.user   = user
myconfig.db.mydb.pass   = pass
myconfig.db.mydb.server = host


This is our ini file. The group name in the brackets is just for looks; it has no impact on our usage. Because this file is parsed along with the rest of our php.ini, the settings need a unique namespace within the ini scope. That is what the myconfig prefix is for. We could have used a DSN-style value here, but it would have required more parsing in our PHP code.
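
As a quick illustration of the get_cfg_var() versus ini_get() behavior described above (assuming the db.ini file shown here has been loaded):

<?php

// Custom settings are not registered with the PHP core, so ini_get()
// does not know about them and returns false.
var_dump(ini_get("myconfig.db.mydb.db"));     // bool(false)

// get_cfg_var() reads what was in the ini files when PHP started.
var_dump(get_cfg_var("myconfig.db.mydb.db")); // string(4) "mydb"

?>

Now, here is the class we use to create MySQLi connections from these settings: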

<?php

/**
 * Creates a MySQLi instance using the settings from ini files
 *
 * @author     Brian Moon <brianm@dealnews.com>
 * @copyright  1997-Present dealnews.com, Inc.
 *
 */

class MyDB {

    /**
     * Namespace for my settings in the ini file
     */
    const INI_NAMESPACE = "myconfig";

    /**
     * Creates a MySQLi instance using the settings from ini files
     *
     * @param   string  $group  The group of settings to load.
     * @return  object
     *
     */
    public static function init($group) {

        static $dbs = array();

        if(!is_string($group)) {
            throw new Exception("Invalid group requested");
        }

        if(empty($dbs["group"])){

            $prefix = MyDB::INI_NAMESPACE.".db.$group";

            $db   = get_cfg_var("$prefix.db");
            $host = get_cfg_var("$prefix.server");
            $user = get_cfg_var("$prefix.user");
            $pass = get_cfg_var("$prefix.pass");

            $port = get_cfg_var("$prefix.port");
            if(empty($port)){
                $port = null;
            }

            $sock = get_cfg_var("$prefix.socket");
            if(empty($sock)){
                $sock = null;
            }

            $dbs[$group] = new MySQLi($host, $user, $pass, $db, $port, $sock);

            if(!$dbs[$group] || $dbs[$group]->connect_errno){
                throw new Exception("Invalid MySQL parameters for $group");
            }
        }

        return $dbs[$group];

    }

}

?>


We can now call MyDB::init("mydb") and get a MySQLi object that is connected to the database we want. No file IO was needed to load these settings, other than when the PHP process started initially.  They are truly constant and will not change while the process is running.

Once this was working, we created separate ini files for our different datacenters. That is now simply configuration information just like routing or networking configuration. No more worrying in code about where we are.

We extended this to all of our services: memcached, Gearman, whatever. We keep all of our configuration in one file rather than having lots of them; it just makes administration easier. That is not an issue for us, since each location has its own unique settings, but every server in that location has the same configuration.

Here is a more realistic example of how we set up our files.

[myconfig.db]
myconfig.db.db1.db         = db1
myconfig.db.db1.server     = db1hostname
myconfig.db.db1.user       = db1username
myconfig.db.db1.pass       = db1password

myconfig.db.db2.db         = db2
myconfig.db.db2.server     = db2hostname
myconfig.db.db2.user       = db2username
myconfig.db.db2.pass       = db2password

[myconfig.memcache]
myconfig.memcache.app.servers    = 10.1.20.1,10.1.20.2,10.1.20.3
myconfig.memcache.proxy.servers  = 10.1.20.4,10.1.20.5,10.1.20.6

[myconfig.gearman]
myconfig.gearman.workload1.servers = 10.1.20.20
myconfig.gearman.workload2.servers = 10.1.20.21
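
As a sketch of how a setting like the memcached server lists above might be consumed (the helper function here is made up for illustration, not part of our library):

<?php

// Hypothetical helper: read a comma separated server list from the ini
// settings that were loaded at PHP startup.
function get_memcache_servers($pool) {
    $servers = get_cfg_var("myconfig.memcache.$pool.servers");
    if (empty($servers)) {
        throw new Exception("No memcached servers configured for $pool");
    }
    return array_map("trim", explode(",", $servers));
}

$memcache = new Memcache();
foreach (get_memcache_servers("app") as $server) {
    // 11211 is the default memcached port.
    $memcache->addServer($server, 11211);
}

?>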

Autoloading for fun and profit

So, as I stated in a previous post, the code base here at dealnews is going through some changes.  I have been working heavily on those changes since then, testing and benchmarking to see what works best. One of those changes is the heavy use of autoloading.

During the holidays, I always like to read the PHP Advent.  One of the posts this year was by Marcel Esser, titled You Don’t Need All That. It was a great post and echoes many things I have said about PHP and web development. In his post, Marcel benchmarks the difference between using an autoloader and using straight require statements. I was not surprised by the result. The autoloading overhead and class overhead are well known to me. They are what kept me from using any heavy OOP (we banned classes on our user-facing pages for a long time) in PHP 4 and PHP 5.0. It has gotten a lot better, however. Class overhead is very, very small now, especially when using a class as a namespace for static functions. However, there is one word of caution I would like to add to his statements. When you use require/include statements instead of autoloading, you end up with a file like this:
<?php

require "DB.php";
require "Article.php";


function myfunc1() {
    DB::somemethod();
}

function myfunc2(){
    Article::somemethod();
}

?>
That file needs to require two files, but each one is only needed by one function in the file.  This is the dilemma we have found ourselves in.  We have a file that is filled with functions for building, taking apart, repairing, fetching, or anything else you can think to do with a URL. So, at the top of that file are 13 require statements. This all happened organically over time. But now we are in a situation where we load lots of files that may not even be needed for a given request. By moving to an autoloading system, we will only include the code we need. This saves cycles and file IO.
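
A minimal sketch of the kind of autoloader that makes that possible (the directory layout and naming convention are assumptions for illustration, not our actual setup):

<?php

// Map a class name to a file in an assumed flat library directory and
// load it only when the class is first used.
function my_autoloader($class) {
    $file = "/path/to/lib/" . str_replace("_", "/", $class) . ".php";
    if (is_file($file)) {
        require $file;
    }
}

spl_autoload_register("my_autoloader");

// With this in place, DB.php and Article.php are only read from disk
// if DB or Article is actually used during the request.

?>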

Again, Marcel's post was dead on. Our front end is written with multiple entry points.  We use auto_prepend files to do any common work that all requests need (sessions, loading the application settings, etc).  The front page of dealnews.com is about 600 lines of PHP that does the job as quickly as possible. But, we are moving it to use autoloaded code because its require list has grown larger than I would like and some of those requires are not always "required" to generate the page.

One point Marcel did make was about how modern MVC frameworks are part of the problem. We don't have a highly structured, traditional MVC layout for our code. We use a very wide, flat directory structure rather than deep nesting of classes and objects. We also don't use much inheritance: maybe 1% of our classes extend another, and never more than one level deep. All of that makes for much quicker class loading than some of the packaged frameworks like Zend Framework.

So, as Marcel warned, be aware of both your use of autoloading and require statements. Both can be bad when used the wrong way.

Code Organization Dilemma

So, we have been building up our code library at dealnews for nine years. It was started at the end of PHP 3 and the beginning of PHP 4. So, we did not have autoloading, static functions, and all that jazz. Classes had lots of overhead in early PHP 4, so we started down a purely procedural road in 2000. And for a long time, it was very maintainable. We had 2 or 3 developers for most of this time. We now have 5 or 6, depending on whether we have contractors. There are starting to be too many files and too many functions. We find ourselves adding new files when a new function is created, instead of adding it to an existing file, because we don't want huge files with 100 functions in them. File names and function names are getting longer and more ambiguous. For example, we have a file called url_functions.php. It contains functions to generate URLs for different types of pages on the site, functions to fetch URLs from the web, and functions to parse URLs out of an article. Those probably don't all belong in one file. But they got nickel-and-dimed in there over time. So now we are inclined not to add anything to that file and to make new files for new semi-URL-related functions. Ugh.

It is time to start thinking about a reorganization. There are 1,900+ functions in 400+ files in our code library. And that is just our library. It does not include the code that actually builds a page and generates output. It does not include our cron jobs or system administration scripts. Yeah, that is a lot. So, where do we go from here? Some things are easy. For example, we have a file called string.php. Most of the functions in that file could easily be moved to a String class with static functions that can be accessed via an autoloader.
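
As a sketch of that kind of move (the method shown is made up for illustration):

<?php

class String {

    /**
     * Truncate a string to a maximum length, adding an ellipsis.
     */
    public static function truncate($str, $len) {
        if (strlen($str) <= $len) {
            return $str;
        }
        return substr($str, 0, $len - 3) . "...";
    }

}

// Old style: include string.php everywhere and call str_truncate($title, 80);
// New style: the autoloader finds String.php and we call:
// String::truncate($title, 80);

?>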

Then we have the various ways we deal with the articles on the web site. I have written about our front end vs. back end system before. What this means for our code base is that we have two ways to deal with an article. One is in our highly relational back end system. The other is in our optimized front end database servers. So, one Article object won't really do. We already have an Article object that serves as an ORM interface for the back end. To access the front end data, we currently have a library of functions (fetch_article for a single article, fetch_articles for a set, etc.), but it does not fit with an autoloading environment. It is also named for where the data is stored rather than for the object itself (the article). New developers don't grok the server infrastructure, so the code organization may not make sense to them. We have about 10 different objects that need both a back end and a front end interface.

On the other hand, I really don't want to end up with classes named FrontEndArticle and BackEndArticle. Much less do we want stuff like BackEnd_Article, where the file actually lives in BackEnd/Article.php somewhere. The verbosity becomes overwhelming and hard to read, IMO.

So, what are others doing with huge code bases? I see lots of projects with 100 or so functions/methods in 20-30 files.  Frameworks have it easy because they don't have a CEO that wants something on this one page to be different than it is on every other page where that data is used. We have to deal with those types of hacks in an elegant way that can be maintained.

Forums are crap. Can we get some help?

Amy Hoy has written a blog post about why forums are crap. And she is right. Forum software does not always do a good job of helping people communicate. I have worked with Amy. She did a great analysis of dealnews.com that led to our new design. So, she is not to be ignored.

However, as the developer of forum software (Phorum), I see a lot of problems and no answers.  And it is not all on the software.  Web site owners use forums to solve problems that forums really, really suck at solving.  Ideally, every web site would be unique to its audience.  It would use a custom solution that fits a number of patterns and best solves its problem.  However, most web site owners don't want to take the time to do such things.  They want a one-stop, drop-in solution. See the monolith that is vBulletin. Scary.

And what if a forum is the best solution? Well, software developers, in general, are not good designers. They don't think like normal people, and they don't see their applications as a whole, but as pieces that do jobs. The forum software market has been run by software developers for over 10 years. Most of them are still copies of what UBB was 13 years ago. And software (like Phorum) that has tried to be different is shunned by the online communities of the world because it doesn't work/look/feel like every other forum software on the planet.

So, as software developers, what are we to do? We want to make great software. We want to help our users help their users. But, what we have been doing for 10+ years has only been adequate. As the leader of an open source forum software project, I am open to any and all ideas.

Memcached: What is it and what does it do?

I spoke at CodeWorks in Atlanta, GA this week.  I totally dropped the ball promoting it on my blog.  It was a neat venue.  Rather than one large conference, they are doing a traveling show: seven cities in 14 days.  Many of the presenters are working in every city.  Crazy.  I was just in Atlanta; it is close to home and easy for me to get to.

I spoke about memcached.  I tried to dig a bit deeper into how memcached works.  On the mailing list we get a lot of new people that make assumptions about memcached.  Most talks I have seen focus on why caching is good, how to use memcached, the performance gain.  I kind of assumed everyone knew that stuff already.  I guess you could say I gave a talk that was the real FAQs of the project.

Here are the slides.  Derick Rethans took video of the talk.  When he gets that online I will add it to this post.

Forking PHP!

We use PHP everywhere in our stack. For us, it makes sense because we have hired a great staff of PHP developers. So, we leverage that talent by using PHP everywhere we can.

One place where people seem to stumble with PHP is long-running PHP processes or parallel processing. The pcntl extension gives you the ability to fork PHP processes and run lots of children, like many other Unix daemons do. We use this for various things. Most notably, we use it to run Gearman worker processes. While at the O'Reilly Open Source Convention in 2009, we were asked how we pulled this off. So, we are releasing the two scripts that handle the forking, along with some instructions on how we use them.

This is not a detailed post about long running PHP scripts.  Maybe I can get to the dos and don'ts of that another time.  But, these are the scripts we use to manage long running processes.  They work great for us on Linux.  They will not run on Windows at all.  We also never had any trouble running them on Mac OS X.

The first script, prefork.php, is for forking a given function from a given file and running n children that will execute that function. There can be a startup function that is run before any forking begins and a shutdown function to run when all the children have died.
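
A stripped-down sketch of that forking pattern (this is just the core idea with made-up names, not the actual prefork.php):

<?php

// Fork $num_children processes that each run $work_function, then wait
// for all of them to exit.
function prefork($num_children, $work_function) {
    $pids = array();

    for ($i = 0; $i < $num_children; $i++) {
        $pid = pcntl_fork();
        if ($pid == -1) {
            die("Could not fork\n");
        } elseif ($pid == 0) {
            // Child process: do the work, then exit so we never fall
            // back into the parent's loop.
            call_user_func($work_function);
            exit(0);
        }
        // Parent process: remember the child's pid and keep forking.
        $pids[] = $pid;
    }

    // The parent waits for every child to die.
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
    }
}

function do_work() {
    echo "Child " . getmypid() . " doing work\n";
    sleep(5);
}

prefork(4, "do_work");

?>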

The second script, prefork_class.php, uses a class with defined methods instead of relying on the command line for function names. This script has the added benefit of having functions that can be run just before each fork and after each fork. This allows the parent process to farm work out to each child by changing the variables that will be present when the child starts up. This is the script we use for managing our Gearman workers. We have a class that controls how many workers are started and what functions they provide. I may release a generic class that does that soon. Right now it is tied to our code library structure pretty tightly.
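
And a rough sketch of the class-based idea, with hooks the parent uses to hand each child different work (the class and method names are made up, not prefork_class.php's actual interface):

<?php

class WorkerManager {

    // Set by the parent before each fork so the child starts up with it.
    public $workload;

    private $workloads = array("workload1", "workload2", "workload3");

    // Runs in the parent just before a fork: decide what this child will do.
    public function pre_fork($child_number) {
        $this->workload = $this->workloads[$child_number % count($this->workloads)];
    }

    // Runs in the child after the fork.
    public function work() {
        echo getmypid() . " handling " . $this->workload . "\n";
        // e.g. register the Gearman functions for this workload and loop
    }

}

// Roughly what a managing script would do for each child:
$manager = new WorkerManager();
$manager->pre_fork(0);
// ... pcntl_fork() happens here; then, in the child:
$manager->work();

?>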

We have also included two examples. They are simple, but do work to show you how the scripts work.

You can download the code from the dealnews.com developers' page.

UPDATE: I have released a Gearman Worker Manager on Github.