Separating Apache logs by virtualhost with Lua

By default, most distributions use logrotate to rotate Apache logs. Or worse, they don't rotate them at all. I find the idea of a cron job restarting my web server every night to be very disturbing. So, years ago, we started using cronolog. Cronolog separates logs using a date/time picture. So, you get nice logs per day.

But, what if you are running 5 or 6 virtual hosts on the server? Do you really want all those logs in one file? You might. But, I don't. So, we ended up running a cronolog command per virtual host. At one time, this was 10 cronolog processes. Granted, they are tiny, at about 500k of resident memory each when running. But still, it seemed like a waste. Enter vlogger. Vlogger could take a virtual host name in its file name picture, and it would create the directories if they did not exist. So, now, we could have logs separated by virtual host and date. All was good.

But, vlogger has not been updated in a while. It started spitting out errors, right into my access logs, and I could not find a solution. The incoming log data had not changed. My best guess is that some Perl library it depends on changed and broke it. So, here I am again with cronolog.

I decided I could just write my own. So, I started thinking about the problem. It needs to be small. PHP would be a stupid choice; one PHP process would use more memory than 10 cronolog processes. I decided on Lua.

"Lua is a powerful, fast, lightweight, embeddable scripting language." It is also usable as a shell scripting language, which is what I needed. So, I got to hacking and came up with a script that does the job quite well. When running, it uses about 800k of resident memory. You can download the script here on my site.

vlualogger - 3.7k
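
To give you an idea of the approach, here is a minimal sketch. This is not the actual vlualogger script; the log format, paths, and names are just assumptions for illustration. It expects Apache to prepend the virtual host name to each line (%v in the LogFormat) and to pipe the log to the script with a CustomLog directive.

-- Minimal sketch of the idea, not the actual vlualogger script.
-- Assumes Apache prepends the virtual host name to each log line
-- (%v in the LogFormat) and pipes the log to this script, e.g.:
--   CustomLog "|/path/to/this/script /var/log/apache" vhost_combined

local logroot = arg[1] or "."
local handles = {}  -- open file handles keyed by log file path

for line in io.lines() do
    -- peel off the leading vhost name; the rest is a normal log line
    local vhost, rest = line:match("^(%S+)%s+(.*)$")
    if vhost then
        local dir  = logroot .. "/" .. vhost
        local path = dir .. "/" .. os.date("%Y%m%d") .. "-access.log"
        local fh = handles[path]
        if not fh then
            os.execute("mkdir -p '" .. dir .. "'")  -- create the vhost directory if needed
            fh = io.open(path, "a")
            handles[path] = fh
        end
        if fh then
            fh:write(rest, "\n")
            fh:flush()
        end
    end
end

A real version would also want to close handles when the date rolls over, but that is the core of the idea.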

The rise of the GLAMMP stack

First there was LAMP.  But are you using GLAMMP?  You have probably not heard of it because we just coined the term while chatting at work.  You know LAMP (Linux, Apache, MySQL and PHP or Perl and sometimes Python). So, what are the extra letters for?

The G is for Gearman - Gearman is a system to farm out work to other machines, dispatching function calls to machines that are better suited to do work, to do work in parallel, to load balance lots of function calls, or to call functions between languages.

The extra M is for Memcached - memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.

More and more these days, you can't run a web site on just LAMP.  You need these extra tools (or ones like them) to do all the cool things you want to do.  What other tools do we need to work into the acronym?  PostgreSQL replaces MySQL in lots of stacks to form LAPP.  I guess Drizzle may replace MySQL in some stacks soon.  For us, it will likely be added to the stack.  Will that make it GLAMMPD?  We need more vowels!  If you are starting the next must-use tool for running web sites on open source software, please pick a name that starts with a vowel.

mod_substitute is cool. But, be careful with mod_proxy

For our development servers, we have always used output buffering to replace the URLs (dealnews.com) with the URL for that development environment.  Where we run into problems is with CSS and JavaScript.  If those files contain URLs for images (CSS) or AJAX calls (JS), the URLs would not get replaced.  Our solution has been to parse those files as PHP (on the dev boxes only) and have some output buffering replace the URLs in those files.  That has caused various problems over the years and even some confusion for new developers.  So, I got to looking for a different solution.  Enter mod_substitute for Apache 2.2.
"mod_substitute provides a mechanism to perform both regular expression and fixed string substitutions on response bodies." - Apache Documentation
Cool!  I put in the URL mappings and VOILA!  All was right in the world.

Fast forward a day.  Another developer was testing some new code and found that his XML was getting munged.  At first we blamed libxml, because we had just been through an ordeal with a bad combination of a libxml compile option and PHP a while back.  Maybe we missed that box when we fixed it.  We recompiled everything on the dev box, but there was no change.  So I started to think about what had recently changed on the dev boxes.  I turned off mod_substitute.  Dang, that fixed it.  I looked at my substitution strings and everything looked fine.  After cursing and being depressed that such a cool tool was not working, I took a break to let it settle in my mind.

I came back to the computer and decided to try a virgin Apache 2.2 build.  I downloaded the source from the web site instead of building from Gentoo's Portage.  Sure enough, a simple test worked fine.  No munging.  So, I loaded up the dev box Apache configuration into the newly compiled Apache.  Sure enough, munged XML.  ARGH!!

Up until this point, I had configured the substitutions globally and not in a particular virtual host.  So, I moved it all into one virtual host configuration.  Still broken.

A little more background on our config.  We use mod_proxy to emulate some features that we get in production with our F5 BIG-IP load balancers.  So, all requests to a dev box hit a mod_proxy virtual host and are then directed to the appropriate virtual host via a proxied request. 
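
Roughly speaking, the setup looks something like the sketch below. The host names, ports, and paths here are made up for illustration, not our actual config: a front virtual host emulates the load balancer and proxies everything to the real virtual host, which listens on another port.

# front virtual host that emulates the load balancer (illustrative only)
<VirtualHost *:80>
    ServerName somedevbox.dealnews.com
    ProxyPreserveHost On
    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/
</VirtualHost>

# the "real" virtual host the proxy forwards to
# (assumes a Listen 8080 elsewhere in the config)
<VirtualHost *:8080>
    ServerName somedevbox.dealnews.com
    DocumentRoot /var/www/dealnews
</VirtualHost>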

So, I got the idea to hit the virtual host directly on its port and skip mod_proxy.  What do you know, it worked fine.  So, something about the output of the backend request and mod_proxy was not playing nice.  Hmm.  Then I got the idea to move the mod_substitute directives into the mod_proxy virtual host's configuration.  Tested and working fine.  Basically, this ensures that the substitution filtering is applied only after the proxied request, and everything else, has been processed.  I am no Apache developer, so I have not dug any deeper.  I have a working solution, and maybe this blog post will reach someone who can explain it.  As for mod_substitute, here is the way my config looks.

In the VirtualHost that is our global proxy, I have this:

# Declare a filter that runs the SUBSTITUTE provider only when the
# response Content-Type contains one of these strings.
FilterDeclare DN_REPLACE_URLS
FilterProvider DN_REPLACE_URLS SUBSTITUTE resp=Content-Type $text/
FilterProvider DN_REPLACE_URLS SUBSTITUTE resp=Content-Type $/xml
FilterProvider DN_REPLACE_URLS SUBSTITUTE resp=Content-Type $/json
FilterProvider DN_REPLACE_URLS SUBSTITUTE resp=Content-Type $/javascript
FilterChain DN_REPLACE_URLS


Elsewhere, in a file that is local to each dev host, I keep the actual mappings for that particular host:

Substitute "s|http://dealnews.com|http://somedevbox.dealnews.com|in"
Substitute "s|http://dealmac.com|http://somedevbox.dealmac.com|in"
# etc....


I am trying to think of other really cool uses for this.  Any ideas?

ForceType for nice URLs with PHP

This has been covered before, but I was just setting up a new force type on our servers and thought I would mention it for the fun of it. You see lots of stuff about using mod_rewrite to make friendly URLs or SEO friendly URLs. But, if you are using PHP (and I guess other Apache modules) you can do it without mod_rewrite.  We have been doing this for a while at dealnews.  Even before SEO was an issue.

Setting up Apache

From the docs, the ForceType directive "forces all matching files to be served as the content type given by media type." Here is an example configuration:

<Location /deals>
    # serve anything requested under /deals as PHP
    ForceType application/x-httpd-php
</Location>


Now any URL like http://dealnews.com/deals/Cubicle-Warfare/186443.html will attempt to run a file called deals that is in your document root.

Making the script

First, save a file called deals without the .php extension. Modern editors will look for the <?php tag at the top of the file and highlight it correctly. Normally you take input to your PHP scripts with the $_SERVER["QUERY_STRING"] or $_GET variables. But, in this case, those are not filled by the URL above. They will still be filled if there is a query string, but the path part is not included.  We need to use $_SERVER["PATH_INFO"]. For the URL above, $_SERVER["PATH_INFO"] will be filled with /Cubicle-Warfare/186443.html. So, you will have to parse the data yourself. In my case, all I need is the numeric ID toward the end.

$id = (int)basename($_SERVER["PATH_INFO"]);

Now I have an id that I can use to query a database or whatever to get my content.

Avoid "duplicate content"

The bad part of my use case is that any URL that starts with /deals/ and ends in 186443.html will work. So, now we have duplicate content on our site. You may have a more exact URL pattern and not have this issue.  But, to work around this in my case, we should verify that the $_SERVER["PATH_INFO"] is the proper data for the content requested. This code will vary depending on your URLs. In my code, I generate the URL for the content and see if it matches. Kind of a reverse lookup on the URI.  If it does not match, I issue a 301 redirect to the proper location.

header("HTTP/1.1 301 Moved Permanently");
header("Location: $new_url");
exit();
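
Put together, the check looks something like this. The lookup_deal_url() function is made up for this example; it stands in for however you build the canonical URL for a given id, and the path format follows the example URL above.

// parse the id out of the path, as above
$id = (int)basename($_SERVER["PATH_INFO"]);

// made-up helper: looks up the canonical path for this id,
// e.g. "/deals/Cubicle-Warfare/186443.html"
$new_url = lookup_deal_url($id);

// the path that was actually requested
$requested = "/deals" . $_SERVER["PATH_INFO"];

// if they do not match, send the visitor (and search engines) to the right place
if ($requested !== $new_url) {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: $new_url");
    exit();
}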


Returning 404

Now, you have to be careful to always return meaningful data when using this technique. Search engines won't like you if you return status 200 for every possible random URL that falls under /deals. I know that Yahoo! will put random things on your URLs to see if you are doing the right thing. So, if you get your id and decide this is not a valid URL, you can return a 404.  In my case, I have a 404 file in my document root.  So, I just send the proper headers and include my regular 404 page.

header('HTTP/1.1 404 Not Found');
header('Status: 404 Not Found');
include $_SERVER["DOCUMENT_ROOT"]."/404.html";
exit();