mod_substitute is cool. But, be careful with mod_proxy

Tue, Apr 7, 2009 08:03 PM
For our development servers, we have always used output buffering to replace the URLs (dealnews.com) with the URL for that development environment.  Where we run into problems is with CSS and JavaScript.  If those files contain URLs for images (CSS) or AJAX (JS), the URLs would not get replaced.  Our solution has been to parse those files as PHP (on the dev boxes only) and have some output buffering replace the URLs in those files.  That has caused various problems over the years and even some confusion for new developers.  So, I got to looking for a different solution.  Enter mod_substitute for Apache 2.2.
mod_substitute provides a mechanism to perform both regular expression and fixed string substitutions on response bodies. - Apache Documentation
Cool!  I put in the URL mappings and VOILA!  All was right in the world.

Fast forward a day.  Another developer is testing some new code and finds that his XML is getting munged.  At first we blamed libxml because we had just been through an ordeal with a bad combination of a libxml compile option and PHP a while back.  Maybe we missed that box when we fixed it.  We recompiled everything on the dev box but there was no change.  So I started to think about what had recently changed on the dev boxes.  So, I turned off mod_substitute.  Dang, that fixed it.  I looked at my substitution strings and everything looked fine.  After cursing and being depressed that such a cool tool was not working, I took a break to let it settle in my mind.

I came back to the computer and decided to try a virgin Apache 2.2 build.  I downloaded the source from the web site instead of building from Gentoo's Portage.  Sure enough, a simple test worked fine.  No munging.  So, I loaded up the dev box Apache configuration into the newly compiled Apache.  Sure enough, munged XML.  ARGH!!

Up until this point, I had configured the substitutions globally and not in a particular virtual host.  So, I moved it all into one virtual host configuration.  Still broken.

A little more background on our config.  We use mod_proxy to emulate some features that we get in production with our F5 BIG-IP load balancers.  So, all requests to a dev box hit a mod_proxy virtual host and are then directed to the appropriate virtual host via a proxied request. 

So, I got the idea to hit the virtual host directly on its port and skip mod_proxy.  Dang, what do you know.  It worked fine.  So, something about the output of the backend request and mod_proxy was not playing nice.  So, hmm.  I got the idea to move the mod_substitute directives into the mod_proxy virtual hosts configuration.  Tested and working fine.  So, basically, this ensures that the substitution filtering is done only after the proxy and all other requests have been processed.  I am no Apache developer, so I have not dug any deeper.  I have a working solution and maybe this blog post will reach someone that can explain it.  As for mod_substitute, here is the way my config looks.

In the VirtualHost that is our global proxy, I have this:

FilterDeclare DN_REPLACE_URLS
FilterProvider DN_REPLACE_URLS SUBSTITUTE resp=Content-Type $text/
FilterProvider DN_REPLACE_URLS SUBSTITUTE resp=Content-Type $/xml
FilterProvider DN_REPLACE_URLS SUBSTITUTE resp=Content-Type $/json
FilterProvider DN_REPLACE_URLS SUBSTITUTE resp=Content-Type $/javascript
FilterChain DN_REPLACE_URLS


Elsewhere, in a file that is local to each dev host, I keep the actual mappings for that particular host:

Substitute "s|http://dealnews.com|http://somedevbox.dealnews.com|in"
Substitute "s|http://dealmac.com|http://somedevbox.dealmac.com|in"
# etc....
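
In case the "in" at the end of those lines looks cryptic: per the mod_substitute docs, "i" makes the match case-insensitive and "n" treats the pattern as a fixed string instead of a regular expression.  Here is a rough Python sketch of the same rewrite, just to show the effect on a bit of CSS.  The function name and the sample CSS are made up for the example, and this is in no way how mod_substitute works internally.

```python
import re

# Hypothetical mappings, mirroring the Substitute lines above.
MAPPINGS = [
    ("http://dealnews.com", "http://somedevbox.dealnews.com"),
    ("http://dealmac.com", "http://somedevbox.dealmac.com"),
]

def replace_urls(body, mappings=MAPPINGS):
    """Apply each mapping as a case-insensitive, fixed-string replacement."""
    for old, new in mappings:
        # re.escape makes the pattern a literal string (like the "n" flag);
        # re.IGNORECASE mimics the "i" flag.  Using a function as the
        # replacement keeps `new` literal as well.
        body = re.sub(re.escape(old), lambda m, new=new: new, body,
                      flags=re.IGNORECASE)
    return body

css = "body { background: url(HTTP://dealnews.com/img/bg.png); }"
print(replace_urls(css))
```

The point of doing this in a filter instead of in the application is that the CSS and JS files never have to know about it.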


I am trying to think of other really cool uses for this.  Any ideas?

Best practices for escaping HTML

Fri, Mar 20, 2009 10:55 PM
I am working on Wordcraft, trying to get the last annoying HTML validation errors worked out.  Things like ampersands in URLs.  In doing so, I am asking myself where the escaping should take place. In the case of Wordcraft, there are several parts to it.
  1. The code that pulls data from the database.  Obviously not the right place.
  2. The code that formats data like dates and such.  It also organizes data from several data sources into one nice tidy array.  Hmm, maybe.
  3. The parts of the code that set up the output data for the templates.
  4. The templates themselves.
Now, I am sure 1 is not the place.  And I really would not want 4 to be the place.  That would make for some ugly templating.  Plus, the templates, IMO, should assume the data is ready to be output.  So, that leaves the code that does the formatting and the code that does the data setup.

Of those two, I guess the place to do this job is in the data setup.  Wordcraft has a $WCDATA array that is available in the scope of the templates.  I suppose anything that goes into that array should be escaped as appropriate.
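
To make that concrete, here is a minimal sketch of escaping everything on its way into the template data.  It is in Python rather than Wordcraft's PHP (where htmlspecialchars() would do the work), and the function name and sample keys are invented for illustration.  It is not Wordcraft code.

```python
from html import escape

def escape_for_html(value):
    """Recursively escape every string in a nested dict/list structure."""
    if isinstance(value, str):
        return escape(value, quote=True)
    if isinstance(value, dict):
        return {key: escape_for_html(item) for key, item in value.items()}
    if isinstance(value, list):
        return [escape_for_html(item) for item in value]
    return value  # numbers, None, etc. pass through untouched

# Hypothetical stand-in for the $WCDATA array.
wcdata = escape_for_html({
    "title": "Tips & Tricks",
    "url": "/index.php?page=2&sort=date",
    "tags": ["<b>bold</b>"],
    "comment_count": 3,
})
print(wcdata["url"])
```

Doing it once at this step means the templates can assume every string they get is ready for output.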

I largely wrote this blog post as a teddy bear exercise.  But, I am curious.  Where and when do you escape your data for use in HTML documents?

HTML vs. XHTML and validation

Tue, Mar 10, 2009 05:08 PM
There is no shortage of pages on the internet that talk about HTML vs. XHTML.  The vast majority of these (in the first few pages of Google) seem to favor XHTML.  I don't really have an agenda, so I thought I would post my thoughts on the topic.

I have stated on this blog that I use HTML 4.01 Transitional.  I do so because it is easiest for me.  Some people argue that XHTML is easier because there are set rules and if you violate those rules, the documents will not render.  Is that a good thing?  Perhaps my time in the late 90's has made my mind work differently than newcomers to the World Wide Web.

The browser wars were ugly.  And I mean literally ugly.  If you wanted to do anything fancy, it required lots of images or compromise.  I learned early on that it was ok that the spacing in IE on my PC was larger than IE on the Mac.  The fonts were all different sizes from browser to browser and OS to OS.  I learned that graceful fallback was part of the web.  Even now, dealnews.com looks "adequate" in IE 6.  I could make it look perfect.  But, the declining traffic from IE6 does not merit my time to fix the errors in IE 6.

So, when I start thinking about HTML vs. XHTML, I want the more flexible of the two.  I find syntax like nowrap='nowrap' very annoying in XHTML.  Especially since I can't say nowrap='yeswrap' and it mean anything.  nowrap=1 I could handle.  But, no, it has to be nowrap='nowrap'.  Geez.

Ok, ok, this is turning into an XHTML hate post.  I don't want to do that.  There are some things about XHTML that I do like.  I like the self closing tags.  My OCD (which I have brought up before) has never liked having an open tag without a closing tag.  So, the <br /> format is appealing to me in that sense.  I love that XHTML elements should always be lower case.  I hate upper case HTML.  It just reads funny.  Like camel case function names.  Some folks on our content team used to use Adobe PageMaker to write up deals.  They would copy and paste the HTML from there into our CMS.  The output would be pretty ugly.

So, I like parts of both.  What is interesting to me is the fact that the "big sites" on the internet don't seem concerned with document types or validation.

Site                  DocType                  Validates
Google                None                     No
Yahoo                 HTML 4.01 Strict         No
Live.com (Microsoft)  XHTML 1.0 Transitional   No
MSN.com               XHTML 1.0 Strict         Yes
Facebook              XHTML 1.0 Strict         No
eBay                  HTML 4.01 Transitional   No
YouTube               HTML 4.01 Transitional   No
Amazon.com            None                     No
Wikipedia             XHTML 1.0 Strict         Yes
MySpace               XHTML 1.0 Transitional   No

So, of the 10 most popular sites on the internet (according to Compete.com), two don't include a document type in their front page at all.  Only two of the sites validate according to the W3C.  MSN and Wikipedia both validated on their front page with XHTML 1.0 Strict.  However, neither is sending a Content-Type of application/xhtml+xml.  According to this page, that is a bad thing.  And the search results page for XHTML on MSN.com did not validate.  Kudos to Wikipedia.  Their page on XHTML does validate.  Interestingly, they switch to XHTML 1.0 Transitional for that page.
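
If you want to run the same check on your own pages, the two things I looked at were the DOCTYPE and the Content-Type header.  Here is a rough Python sketch of that check, assuming you already have the header value and the HTML in hand.  The function name and sample page are made up, and it only understands the W3C public identifiers.

```python
import re

def describe_page(content_type, html):
    """Return (doctype name or None, True if served as application/xhtml+xml)."""
    # Look for a public identifier such as "-//W3C//DTD XHTML 1.0 Strict//EN".
    match = re.search(
        r'<!DOCTYPE\s+html\s+PUBLIC\s+"-//W3C//DTD\s+([^/]+)//',
        html,
        re.IGNORECASE,
    )
    doctype = match.group(1).strip() if match else None
    # Drop any "; charset=..." parameter before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return doctype, media_type == "application/xhtml+xml"

sample = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html></html>'
)
print(describe_page("text/html; charset=utf-8", sample))
```

A page like the sample claims XHTML 1.0 Strict but is served as text/html, which is the situation MSN and Wikipedia were in.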

So, is the internet broken?  No.  The most important validation is that of your users.  Can they use the site?  Does the site look right in their browser?  Most sites have much bigger navigation and content issues than they do document structure.

So, my idea of validation is this:   Does it render the same (or damn near) in the browsers that cover 90% of the internet users?  If so, then your page validates.  The only way to check that is (most likely without SkyNet) the human eye.

Open Source Web Design

Mon, Nov 17, 2008 12:00 PM
So, my wife told me that my site design was boring.  Yeah, she was right.  I am no designer.  I just don't have that gene.  But, during my work on Wordcraft, I came across some cool places to find designs that are released under Open Source licenses.
  • Open Designs - This is arguably the prettiest of the three. The search, however, is painfully slow because all results return on one page.  I guess if you can wait, this is a plus as browsing is easier.  Also, you can pick multiple colors and choose by license.  They only list XHTML templates (at least as search options).  That could be a turn off if you like HTML 4 like me.
  • Open Web Design - The site itself could use a design overhaul.  But, the content is good.  The search lets you choose primary and secondary color, a unique feature among these sites.  Thumbnails are a bit small though.
  • Open Source Web Design - Their search is not as powerful as the others, but it does return very fast.  The thumbnails are a nice size.
You will find the same content on all three sometimes.  But, it comes down to browsing and searching.

I found my new design at one of those.  Not sure which, I looked at a lot of them.  I did not use the template's HTML exactly as I like HTML 4.0 and wanted a different sidebar than the original author.  But, the design is the hard part.  So, thanks for Deep Red.

Google Chrome and privacy

Tue, Sep 2, 2008 02:39 PM
So, Google Chrome is out. If you don't know, it's Google's new browser. I downloaded it on my Windows XP machine and tried it out. I found this curious thing in the options.

Google Chrome Spying on you?

So, I thought, I will click "Learn more" to see what they are watching. I get this.

Uh OH! 404!

So, I unchecked the box. Let's hope the premature launch is the reason there is no more information out there.

UPDATE: The page comes up now and says:
Information that's sent to Google includes crash reports and statistics on how often you use Google Chrome features. When you choose to accept a suggested query or URL in the address bar, the text you typed and the corresponding suggestion is sent to Google. Google Chrome doesn't send other personal information, such as name, email address, or Google Account information.

So, if you use their suggestions, they know it.  And it tracks what features you use.  Hmm, I think I will disable it.

Forums are the red headed step child of a web site

Wed, Feb 20, 2008 06:07 PM
I have seen it time and time again. And yet, every time, it irritates me to no end. You are on a professional web site. You are navigating around and at some point you hit the link for their forums. And just like that you feel transported to another place. The whole site design just changes. Colors, layout, navigation... everything. Here are some examples, including the new C7Y site from php|Architect which inspired this post. (I really do love you guys on the podcast I promise =)

  • php|architect's C7Y - main site - forums

  • Zend's Developer Zone - main site - forums
    Zend's forums do at least use the Zend.com header, but you can't get to the forums from the main Zend.com site. You have to go to the Developer Zone.

  • TextPad (great windows editor) - main site - forums
    The header is kind of the same. Fonts and link colors change slightly though which is worse in some ways than a wholesale change. It looks like they just wedged in their HTML into the phpBB template.


I could continue to list some here, but you get the idea. So, what is the problem? Does most message board software make it too hard to edit their templates? Are forums an afterthought and some underling is given the task to make them work and not allowed access to the main site's templates?

Some people do better at it. MySQL for example. Theirs is still not perfect. An ad awkwardly appears in the forums in a way that makes it look like an error. However, thanks to Phorum (cha-ching), MySQL was able to make their own log in system work with their forums. Heck, even at dealnews I have not done that. Mostly because our forum logins predate our site accounts for email alerts and newsletters. I am not asking for perfection though. I would just like to feel like the company/entity gave some love to making their forums part of their site and not an afterthought.

So, I call for all web sites to start treating their forums like real pages. Give them the same love and attention you give that front page or any other page. And, if your message board software makes that hard, give Phorum a try.

My editor of choice

Wed, Oct 10, 2007 12:57 AM
So, I was listening to the Pro PHP Podcast on the way home from work today.  They were talking about Komodo a lot.  I figured I would give my favorite editor a plug.  Believe it or not, it's jEdit.

I keep trying all the latest and greatest editors out there.  I fought with Eclipse and have tried the newer more PHP centric offerings built on Eclipse.  I recently tried out Komodo Edit for a week.  I had tried the Komodo IDE when it came out for Mac a while back.  But, I just keep coming back to jEdit.

What I like about it

The main thing that I like about jEdit over the other top contenders of the new generation is that it has a simple file browser.  It does not have the concept of "projects".  Eclipse and Komodo both have these concepts.  But, when I really got to looking at the projects in Komodo, you basically set a point in your filesystem and tell it that everything in this dir is Project Foo.  So, really, you have to have your code organized on disk anyway.  It also bugged me (in Komodo Edit at least) that my project file had to live in the same dir with my project's code.  That just seemed awkward.  Not everyone that shares my SVN is gonna want that and it's gonna be sitting there in my svn status as an unknown file.

Another thing I like about jEdit is the rather large plugin repository.  Now, it's an older project, so that is something that you would hope any established application would have.  But, if I am thinking about switching today, I have to give the nod to jEdit here.  The list is a bit Java-centric of course.  It's a Java application after all.  But, there are some good ones in there like a PHP code structure browser.  I can't live without that.  Makes finding functions or methods really easy in large libraries.

What I don't like

It's Java, so it's not quite like working with a native application.  The dialogs are funny and the UI is just a bit off even with the Mac plugin that makes it more Mac looking.  Having said that, I don't want a truly "Mac like" editor.  BBEdit and XCode are not my kind of editors.  I like tabbed interfaces vs. multi windowed UIs.

It's not an IDE, it's an editor.  There is no debugging, at least, not easily.  There looks to be some ability to hook in debugging tools, but I have not gone through the trouble.  Of course, that could be said of many of the IDEs out there.  PHP has never had the ease of debugging that say Visual Basic had (still has?) back in 1998 when that was my full time job.  That was one thing about VB I loved.  The language was "eh".  But the IDE was really nice.

Things I don't care about that you might

jEdit does not have an SVN plugin that I can find.  I like my command line.  I know one coworker is addicted to the Eclipse real time SVN diff highlighting.  There is a CVS plugin, but I don't know how good it is.  I am not aware of any PHP code completion, but it may be there.  I have an odd knack for remembering stuff like that and those little pop ups just annoy me.  Oh, and did I mention it's Java?  That put me off for a long time.  But, it won me over.

O'Reilly Open Source Conference Day One

Thu, Jul 26, 2007 02:52 AM
Day one is complete.  Portland is great as always.  It's really day 1 1/2 since we got in at 1PM yesterday.  That allowed us to go to the MySQL/Zend party last night.  Great party by those guys.  Touched base with old friends and made some new ones.

I kind of session hopped today.  Of note, I attended Andi Gutmans' PHP Security talk, which really had little to do with PHP.  Like Larry Wall's onion metaphor, Andi presented an onion metaphor for security.  I stopped in for a while on the SOLR talk.  It looks neat.  I like that it is a REST interface to Lucene.  If we were not using Sphinx already I might take a longer look.  But, we like Sphinx, and SOLR and Lucene are Java.  Not that there is anything wrong with that, we just don't use Java a lot, so it's just one more thing that would be out of the norm.  I admit I spent a good bit of time in what is being called the "hallway track" working on some code.  Work does not stop just because you are at a conference.

I got to hang out with Jay Pipes of the MySQL Community team a good bit.  We talked about the MySQL forums (which of course run Phorum) and how they want to improve them.  They would like to see tagging, user and post rating and some other things.  Some good things will come out of that.  Hopefully they have some of the tagging stuff done already at MySQL Forge and can contribute that code to Phorum, saving us time.

I hosted the Caching for fun and profit BoF.  It was not packed, but it was a good time.  The MySQL BoF was at the same time, so we lost some folks to that I am sure.  They had beer and pizza.  Brad Fitzpatrick did come by and contribute.  Thanks Brad.  It was mostly the same stuff you get on the memcached mailing list.  "How do we expire lots of cache at once?"  Questions about different clients.  Stuff like that.  It kind of turned into a memcached BoF, but I tried to share the dealnews experience with the attendees including our MySQL Cluster pushed caching.

I have met many readers of both dealnews and this blog (hi to you) while here.  Glad to know that both my professional work and my personal work are of use to folks.  The demographic at this conference is dead on for dealnews.  Maybe I can get them to sponsor it next year.  That would be cool.

I say every year that I want to present "next year".  Something always keeps me from doing it.  Usually it's just not having time to prep for it.  By the time I think about it, the call for papers has passed.  I really want to get it done this time.  We shall see I suppose.

We went to the Sun party tonight.  It was a good time.  There was beer that was free as in beer.  More hanging with friends and talking about all kinds of stuff.  Now, all you Slashdotters sit down.  I saw people from the PostgreSQL and MySQL teams drinking beer and having fun together.  OMGWTFBBQ!!!1!!  See, the people that really matter in those projects don't bicker and fight about which is better.  They just drink beer and have a good time together.

Anyhow, I will blog more after day 2.  There won't be a day 3 as I have to catch an 11:30 flight back home.  That is usually how it goes.  Not sure why they book anything on Friday really.  Even O'Reilly has its "after party" on Thursday night.  It's late, and I need sleep.

HTML Purifier and Phorum

Thu, Jun 28, 2007 11:12 PM
There have been several posts about HTML Purifier 2.0 lately. I did not look too closely at it until I saw this post on our Phorum support forum. Seems the creator of HTML Purifier has chosen Phorum for his site. I hope that means it met his standards for HTML and security. He has posted some questions about the Phorum core. We always welcome a fresh mind.

He is writing a module for Phorum to allow straight HTML in Phorum posts. We have an HTML module already, but it's quite basic compared to what you can do with his library. Several people have wanted to use the WYSIWYG text editors that are out there. This should/could open that up to people. I don't see the Phorum core ever having one, but that is what modules are for.

Quick script to check user bandwidth usage

Sun, Apr 15, 2007 01:51 AM
A buddy needed a quick report to see if one of his users was slamming his site. I got a little carried away and wrote a PHP script (plus some awk and grep) to make a little report for him. I am sure it is full of bugs and will bring your server crashing down. So, use at your own risk.

$ ./bwreport.php -h
Usage: bwreport.php [-d YYYYMMDD] [-u URI] [-i HOST/IP] [-r REGEXP] [-v]
-d YYYYMMDD Date of the logs to parse. If no date provided, yesterday assumed.
-i IP/HOST Only report log lines with IP/HOST for host part of log line
-r REGEXP Only report log lines that match REGEXP. Should be a valid grep regexp
-u URI Only report log lines with URI match to URI
-v Verbose mode

http://www.phorum.org/downloads/bwreport.php.gz
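
If you just want the gist without downloading the script, the core idea is small: read the access log, pull out the host and the bytes sent, and add them up.  Here is a rough Python sketch of that idea only.  It is not the bwreport.php code; it assumes Apache's common/combined log format, and the function name and sample log lines are invented.

```python
import re
from collections import Counter

# host ident user [time] "request" status bytes ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-)')

def bytes_by_host(lines):
    """Sum the bytes-sent field per client host; skips lines that don't parse."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        host, size = match.groups()
        # Apache logs "-" when no body was sent (e.g. a 304).
        totals[host] += 0 if size == "-" else int(size)
    return totals

sample = [
    '10.0.0.1 - - [14/Apr/2007:10:00:00 -0500] "GET /a.jpg HTTP/1.1" 200 5000',
    '10.0.0.2 - - [14/Apr/2007:10:00:01 -0500] "GET / HTTP/1.1" 200 1200',
    '10.0.0.1 - - [14/Apr/2007:10:00:02 -0500] "GET /b.jpg HTTP/1.1" 304 -',
]
for host, total in bytes_by_host(sample).most_common():
    print(host, total)
```

Filtering by URI or regexp, like the -u and -r options above, would just be an extra test on each line before it gets counted.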