You have to be really smart to code PHP!

In Terry Chay's brilliant blog post about Rails and PHP, a comment read:
I agree, PHP is great, but you have to be really smart to remember 3000+ core functions, totally inconsistent function names, and writing non object-oriented nonsense like $x = some_function('x', 'y', $x) .. some of us just need something simple and logical.

I agree.  So, if you are not smart, then stop coding PHP.  It's a good thing that C developers are smart.  They have to remember all those functions.  Assembly language developers don't have to remember functions.  I guess they are not smart?

Ok, that last paragraph was tongue in cheek.  Knowing 3000+ functions does not make you smart.  It means you have a good memory.  I will take intelligence over smarts or a good memory any day.  Intelligent people can use a manual to find the function they need right now and then forget it until they need it again.  If they need it enough, it will be set in their long-term memory.

I also want to comment on "non object-oriented nonsense".  It's good that working with objects means never having to call a function.  See, they call them methods.  And they don't return stuff a lot of the time.  It's like magic.  Grow a set, people.  How hard is $x = some_function($x, $y) to understand?  That is about the simplest programming concept in existence after $x = 1.  OOP appears to make coding easier on the front end.  However, it almost always makes it more complicated on the back end.

As for inconsistent function names, I have no solid defense.  I understand why they are what they are.  But, that does not make it any easier on a newcomer to the language.  Having experience in C, the functions that are direct ports of C functions (e.g. strcmp) make sense.  The PCRE functions make sense because I know that regular expressions have always been expressed in /search/replace/ format.  ImageMagick functions are the way they are because the API calls they are making are in the same format.  That is what a lot of people don't get about PHP.  Many of the functions you are calling are (or at some point were) direct ports of some C library function.  Therefore it makes sense to keep the function names and parameters in the same order as the API they are wrapping.  But, like I said, that is a weak defense.
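For instance, the C-port functions line up one for one with their libc counterparts, down to the return convention:

```php
<?php
// PHP's strcmp is a direct port of C's strcmp(3): same name, same
// argument order, same negative/zero/positive return convention.
var_dump(strcmp("apple", "banana") < 0);  // true: "apple" sorts first
var_dump(strcmp("apple", "apple") === 0); // true: equal strings
?>
```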

Quick script to check user bandwidth usage

A buddy needed a quick report to see if one of his users was slamming his site. I got a little carried away and wrote a PHP script (plus some awk and grep) to make a little report for him. I am sure it is full of bugs and will bring your server crashing down. So, use at your own risk.

$ ./bwreport.php -h
Usage: bwreport.php [-d YYYYMMDD] [-u URI] [-i HOST/IP] [-r REGEXP] [-v]
-d YYYYMMDD Date of the logs to parse. If no date provided, yesterday assumed.
-i IP/HOST Only report log lines with IP/HOST for host part of log line
-r REGEXP Only report log lines that match REGEXP. Should be a valid grep regexp
-u URI Only report log lines with URI match to URI
-v Verbose mode

http://www.phorum.org/downloads/bwreport.php.gz
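The real script is in the download above; purely as an illustration of the idea, here is a minimal sketch that tallies response bytes per client host from combined-format log lines (the sample lines are made up):

```php
<?php
// Minimal sketch: sum response bytes per client host from Apache
// combined-format access log lines. Sample lines are fabricated.
$lines = array(
    '10.0.0.1 - - [01/Mar/2007:10:00:00 -0600] "GET /a HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Mar/2007:10:00:01 -0600] "GET /b HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Mar/2007:10:00:02 -0600] "GET /a HTTP/1.1" 200 512',
);

$usage = array();
foreach ($lines as $line) {
    $parts = explode(' ', $line);
    $host  = $parts[0];                       // client host/IP field
    $bytes = (int)$parts[count($parts) - 1];  // bytes sent is the last field
    if (!isset($usage[$host])) {
        $usage[$host] = 0;
    }
    $usage[$host] += $bytes;
}
arsort($usage); // biggest consumers first
?>
```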

Vixie Cron and the new US DST

So, the new DST changes in the US caused a small stir among system administrators recently. We got all of our servers updated and verified they were working before the event. Or so we thought.

I noticed today that our 3PM Eastern newsletters arrived in my inbox at 3PM alright. However, I am in Central time. My immediate assumption was that we missed the server that sends that email out. Logging in, I found the time correct on the server. It had received the appropriate updates thanks to Portage. So, what happened? I looked at /etc/crontab and all was fine. I then looked at the system log where cron jobs are logged. Oddly, that log line said the job started at 15:00. I knew that was not correct. I started looking around at other cron jobs on other servers, especially ones that wrote files to disk. Sure enough, every server I checked was doing things an hour behind except one. It just so happens that we had restarted cron on that one server last week. We had to shut it down to keep it from causing errors while we updated the server.

So, long story short, we restarted cron on all the servers. That seems to be the only thing needed. These servers (and crond with them) had been running since long before the DST change was even announced. I guess vixie cron can't handle time zone rules changing after it starts. For the record, we are using the latest stable version in Gentoo's Portage.
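For what it's worth, the fix itself was one command per box; the init script name below is Gentoo's and will differ on other distros:

```shell
# Restart cron so it picks up the updated zoneinfo rules.
/etc/init.d/vixie-cron restart
```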

Big arrays in PHP

Update: Terry Chay has answered my question about why this is happening. In a nutshell, PHP is using 33,160 opcodes and 33,157 registers for the verbose code. In comparison, the serialized array only uses 5 opcodes and 2 registers. He used something called VLD, which I had not heard of, to figure all this out.

So, at dealnews, we have a category tree. To make life easy, we dump it to an array in a file that we can include on any page. It has 420 entries. Expanded, one entry may look like:
$CATEGORIES[202]['id'] = "202";
$CATEGORIES[202]['name'] = "clothing & accessories";
$CATEGORIES[202]['parent'] = "0";
$CATEGORIES[202]['standalone'] = "";
$CATEGORIES[202]['description'] = "clothing";
$CATEGORIES[202]['precedence'] = "0";
$CATEGORIES[202]['preferred'] = "0";
$CATEGORIES[202]['searchable'] = "1";
$CATEGORIES[202]['product'] = "1";
$CATEGORIES[202]['aliased_id'] = "0";
$CATEGORIES[202]['path'] = "clothing & accessories";
$CATEGORIES[202]['url_safe_name'] = "clothing-accessories";
$CATEGORIES[202]['child_count'] = "6";
$CATEGORIES[202]['childlist'][0] = 202;
$CATEGORIES[202]['childlist'][1] = 2;
$CATEGORIES[202]['childlist'][2] = 275;
$CATEGORIES[202]['childlist'][3] = 4;
$CATEGORIES[202]['childlist'][4] = 481;
$CATEGORIES[202]['childlist'][5] = 446;
$CATEGORIES[202]['childlist'][6] = 454;
$CATEGORIES[202]['childlist'][7] = 436;
$CATEGORIES[202]['childlist'][8] = 205;
$CATEGORIES[202]['childlist'][9] = 227;
$CATEGORIES[202]['childlist'][10] = 203;
$CATEGORIES[202]['childlist'][11] = 280;
$CATEGORIES[202]['childlist'][12] = 204;
$CATEGORIES[202]['children'][2] = &$CATEGORIES[2];
$CATEGORIES[202]['children'][275] = &$CATEGORIES[275];
$CATEGORIES[202]['children'][4] = &$CATEGORIES[4];
$CATEGORIES[202]['children'][481] = &$CATEGORIES[481];
$CATEGORIES[202]['children'][446] = &$CATEGORIES[446];
$CATEGORIES[202]['children'][454] = &$CATEGORIES[454];
$CATEGORIES[202]['children'][436] = &$CATEGORIES[436];
$CATEGORIES[202]['children'][205] = &$CATEGORIES[205];
$CATEGORIES[202]['children'][227] = &$CATEGORIES[227];
$CATEGORIES[202]['children'][203] = &$CATEGORIES[203];
$CATEGORIES[202]['children'][280] = &$CATEGORIES[280];
$CATEGORIES[202]['children'][204] = &$CATEGORIES[204];

So, I was curious how efficient this was. I noticed that some code that was using this array was jumping in memory usage as soon as I ran the script. So, I devised a little piece of code:
<?php

echo "Memory used: ".number_format(memory_get_usage())." bytes\n\n";

include_once "./cat_code.php";

echo "type: ".gettype($CATEGORIES)."\n";

echo "count: ".count($CATEGORIES)."\n\n";

echo "Memory used: ".number_format(memory_get_usage())." bytes\n\n";

?>

The output was very surprising:
Memory used: 41,772 bytes

type: array
count: 420

Memory used: 4,951,248 bytes

Um, whoa. 5MB of memory just for including this file. The file itself is just 326k. Needless to say, that is bad. We include that file quite liberally. I decided to see if other methods of storing that would be better. First I tried var_export format.
Memory used: 41,784 bytes

type: array
count: 420

Memory used: 1,212,076 bytes

Well, that is much better. But, it took some fiddling to get it right. var_export() does not emit reference notation. The children arrays were being fully expanded rather than made references. Without the references, the code was using 8MB of memory. That was much worse. Also, this is not nearly as readable as the raw code version. If we can't read it, it may as well be serialized. So, I tried it serialized.
Memory used: 41,764 bytes

type: array
count: 420

Memory used: 907,668 bytes

That is by far the best result. FWIW, timing the code showed that the var_export format was fastest and serializing was slowest. However, it was just .04 seconds faster (.047 vs. .089) including the PHP start up time. I will take that for the memory savings and ease of creation.
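The serialized approach boils down to a couple of calls; the file name and the trimmed-down array below are stand-ins for the real 420-entry category dump:

```php
<?php
// Stand-in for the real $CATEGORIES array.
$CATEGORIES = array(
    202 => array('id' => '202', 'name' => 'clothing & accessories'),
);

// Written once, e.g. whenever the category tree changes:
file_put_contents('/tmp/cat_cache.ser', serialize($CATEGORIES));

// Pages then do this instead of including a big block of PHP code:
$CATEGORIES = unserialize(file_get_contents('/tmp/cat_cache.ser'));
?>
```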

I am going to pose a question about this on the internals list and see if this is expected behavior or if it is a shocker to them as well.

I wish I was as cool as DJB

I should throw up a fanboy alert right here.  You have been warned. =)

I was reading a heated discussion about security (no link, MARC is read only right now) on the PHP internals list this past week.  In the middle of it, Zeev Suraski writes: "No remotely accessible software has a perfect track record, perhaps other than qmail."  For those that don't know, qmail is the second most used MTA (Mail Transfer Agent) on the internet.  It was written by Dan J. Bernstein (DJB).  DJB, as I like to refer to him around the office, is a professor at the University of Illinois at Chicago.  You can read all about him at his web site.

The basis for Zeev's comments is DJB's qmail security guarantee.  As Dan writes, he was fed up with security holes in sendmail.  So, he decided to do something about them.  He just avoided the whole app and wrote his own.  Besides being rock solid, the application takes a very intuitive (to me) approach to internet mail.  DJB believes in separating jobs into separate daemons that run with separate users and permissions.  One daemon accepts incoming mail and puts it in a queue.  Another reads that queue and then decides if it is an internal or external delivery.  It then hands that to a local or remote daemon responsible for those jobs.  Everything has its job.  Nice and neat.

DJB did not stop there.  He also wrote (IMO) the best darn DNS server ever in djbdns.  Like qmail, it has a security guarantee.  It uses the same logical design as qmail.  Honestly, DNS propagation is a bit of a mystery to me.  Bind zone files confused the hell out of me.  But, djbdns is easy as pie to use.

I have been lucky enough to use qmail for my entire career.  The first host I ever signed up with used qmail and it was all I ever wanted to use.  When our current systems administrator, a lifelong sendmail and bind user, came to work for us, I showed him qmail and djbdns.  It took a little while, but now he will never go back.  Even with the occasional annoyance, it's better than the alternative to him.

You do have to adjust to the DJB style.  His applications don't have the normal configure, make, make install setup.  He is a FreeBSD user.  At times there are errors on non-FreeBSD systems that are, in his opinion, flaws of those systems and not qmail.  He is usually right.  At the least, you can't say he is wrong.  djbdns, for example, does not propagate data between hosts "automatically" like bind does.  You have to rsync the data somehow yourself.  That is a turn-off at first for some.  Then they realize how much more control that will give them.
He is very diligent when it comes to sticking strictly to whatever RFCs exist for each daemon he writes.  One guy I know complains that qmail is the only MTA that requires the \r\n at the end of emails.  qmail will reject them straight away.  As you soon discover, there is a huge community of "patches" to make qmail do all sorts of things.  There is a patch for that "feature" as well.

For more on qmail, see qmail.org, a collection of patches, documents and add-ons.  The most popular of those documents is likely Life with qmail.  It is sort of a noobs guide to qmail.

For more on djbdns, see DJB's page about it.

Is Yahoo!ed a word?

Everyone has heard of being slashdotted or maybe dugg. But have you ever been Yahoo!ed?

Phones started beeping, mayhem ensued. The first thing we looked at was the database. Is some MyISAM table locked? Is there a hung log processor running? The database was busy, but it looked odd. The web servers were going nuts.

As we soon discovered, we (dealnews.com) were mentioned in an article on Yahoo!. At 5PM Eastern, that article became the featured article on the Yahoo! front page. It was there for an hour. We went from our already high Christmas traffic of about 80 req/s for pages and 200 req/s for images to 130 req/s for pages and 500 req/s for images. We survived with a little tinkering. We had been working on a proxy system and this sounded like as good a time as any to try it out. Thanks to the F5 BIG-IP load balancers, we could send all the traffic from Yahoo! to the proxy system. That allowed us to handle the traffic. Just after 6PM, Yahoo! changed the featured article and things returned to normal.

Until 9PM. It seems the earlier posting by Yahoo! must not have gone out to all their users. Because at 9PM the connections came back with a vengeance. We started hitting bottleneck after bottleneck. We would up one limit and another bottleneck would appear. The site was doing ok during this time. Some things like images were loading slowly. That was a simple underestimation of having our two image servers set to only 250 MaxClients. Their load was nothing. We upped that and images flowed freely once again. Next we realized that all our memcached daemons were maxed out on connections. So, again, we up that and restart them. That's fixed now. Oh, now that we are not waiting on memcached, the Apache/PHP servers are hitting their MaxClients. We check the load and the servers are not stressed. So, up those limits go. The proxy servers were not doing well using a pool of memcached servers. So, we set them to use just one server each. This means several copies of the same cache, but better access to the data for each server. After all that, we were handling the Yahoo! load.

In the end, it was 300 req/s for pages and 3000 req/s for images. It lasted for over 2 hours. The funny thing is, we have been talking all week about how to increase our capacity before next Christmas. Given our content, this is our busy time. Our traffic has doubled each December for the last 3 years. At one point, during the Yahoo! rush, the incoming traffic was 10MB/s. A year and a half ago, that was the size of our whole pipe with our provider. Luckily we increased that a while back.

The silver lining is that I got to see this traffic first hand for over 2 solid hours. This will help us to design our systems to handle this load and then some all the time in the future. In some ways it was a blessing.

Digg? Slashdot? They can bring traffic for sure. We have been on both several times. But wow, just getting in the third paragraph of an article that is one page deep from the Yahoo! front page can bring you to your knees if you are not ready. But, in this business, I will do it again tomorrow. Bring it on.

Update:  Yahoo! put the article on their front page again on the 26th.  Both our head sys admin and I were off.  No phones went off.  We handled 400 req/s for the front pages and 1500 req/s for images.  This lasted for 3 hours.  Granted, some things were not working.  You could not change your default settings for the front page for example.  But, all in all, the site performed quite well.

Browser KeepAlive Secrets

So, at dealnews, we are getting ready to launch a super secret thing (redesign beta preview) that requires us to use some cookie tricks.  What we decided to do was to give our users a link to a page that would set a cookie.  Then we configured our F5 BIG-IP load balancers to direct those users with the cookie set to a different pool (back-end IP/port pairs).  It's not an original idea.  Yahoo! was doing something similar with their recent front page beta.  In fact, that is where I got the idea.

Well, it worked great in testing with mod_rewrite (buying a $40k device for testing is not in the budget right now) on my local machine and on the test servers.  We had no problems.  However, when we turned it all on in production using the BIG-IP we got some unexpected results.  We could go to the URL to set our cookie and our site would change.  On the redesigned page, there is another link to switch you back.  It simply deleted the cookie and redirected you.  Since the cookie was gone, you would be back to the old design, right?  WRONG!  You were stuck.  But, if you did not click on any links on the site for about, oh, 15 seconds, you would get back to the old design.  I should say at this point that Safari was the only browser that did not do this.  IE, Mozilla and Opera all had this problem.

Hmm, 15 seconds.  That is the default KeepAliveTimeout in Apache.  I took a chance and disabled keep-alive in Firefox (about:config, search for keep, set to false).  BAM!  It all worked like a charm.  It seems that IE, FF and Opera all keep the keep-alive connection open even after the page is done loading.  And because the BIG-IP determined which pool you were connected to at connection time, you stayed connected to the new pool rather than switching back.  And, as long as you kept clicking around on our site, you would keep that connection open.

As for a solution, we decided to let Apache do the work for us.  We didn't want to tell the BIG-IP to start disconnecting users on every request.  Instead, we used a Location directive and SetEnv to set the nokeepalive environment variable only when users access the page that sets/unsets the cookie.  Now Apache sends the Connection: close header and the browsers comply.  You can see an immediate difference too.  Firefox for example has a noticeable pause while it closes the connection and makes a new one.  I am going to dig around in the BIG-IP manual some more to see if there is anything we can do to make this work at the load balancer layer.  But, I don't really want my load balancers spending CPU cycles on something that will not be an issue once this redesign is launched.
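The Apache side of that is tiny; the path below is a made-up stand-in for our real cookie-switch page:

```apache
# Force "Connection: close" only on the page that sets/unsets the
# beta cookie, so the browser must reconnect and be re-balanced.
<Location /beta/switch.php>
    SetEnv nokeepalive 1
</Location>
```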

Getting all SOAPY

So, we (dealnews.com) rolled out a new site this month, metaprice.com.  It's young and lacking the features of many of the other price comparison sites, but it has great potential.  Our hope is to bring together the best features of all the other players in the market in one great application.
Part of this project required using web services with several different data suppliers.  Most support simple REST and SOAP, but some only offer SOAP.  So, given that, I bit the bullet and enabled the SOAP extension for PHP5.  Wow!  I was happily surprised.  The last SOAP code I had looked at was the old PEAR code.  It was not that attractive to me.  It required a lot of work, IMO, to talk SOAP.

Now, with just 3 lines of code, I can get back a nice object that has all the data I need.  Kudos to Brad Lafountain, Shane Caraveo, Dmitry Stogov and anyone else that worked on this extension.  It definitely made my life easier.  It's so easy, I am actually looking forward to making a SOAP server with some of our data.
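A sketch of what those few lines look like; the WSDL URL and method name here are hypothetical, not any real supplier's API:

```php
<?php
// Hypothetical endpoint and method, for illustration only.
// In WSDL mode, SoapClient reads the service description and exposes
// the remote operations as regular PHP methods.
$client = new SoapClient("http://api.example.com/products?wsdl");
$result = $client->searchProducts(array("keyword" => "laptop"));
// $result comes back as a plain PHP object, ready to use.
?>
```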

On another note, I have been a little disappointed with the MySQL FullText relevance matching.  I know that single term searches are not really easy to deal with.  But, sometimes, even multiword searches don't yield what I would hope.  For example, a search for Windows XP yields 2 systems that include Windows XP as 2 of the top 4 matches.  There are two other matches there that are not good matches.  And, yes, I do have my min length set to 2.  I am thinking about giving Sphinx a shot to see if its relevance ranking is any better.

Anyone have a good home grown algorithm for relevance?

One million dollars (picture Dr. Evil)

I followed the lead of others on the PHP Planet and added Phorum to Ohloh.com.  They have completed their analysis and have determined that Phorum would cost $1 million to produce.  It probably would have calculated even more, but we lost our CVS repository in January due to a hard drive issue.  So, we had to start from scratch.

According to their stats, I am the 3rd ranked contributor on my own project.  For this year, that has probably been true.  Maurice Makaay has brought lots of new ideas this year.

Initializing & typing variables with settype()

These days, the way to develop is to have E_ALL and maybe even throw in E_STRICT if you are really hard core. That of course means having all your variables initialized before they are used. You could just do something like this:
<?php

$var1 = 0;
$var2 = "";
$var3 = 0.00;
$var4 = array();
$var5 = new stdClass;

?>

That works fine and is pretty clear. However, I think a more elegant solution is to use settype() like this:
<?php

settype($var1, "int");
settype($var2, "string");
settype($var3, "float");
settype($var4, "array");
settype($var5, "object");

?>

IMHO, this is much more clear. It's a standard way to set the type and default value of your variables. You see, settype() assigns a default value to a variable if it does not exist already.
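A quick sketch of that initializing behavior, run with full error reporting (the variable names are arbitrary):

```php
<?php
// With E_ALL on, reading an undefined variable raises a notice, but
// passing one to settype() quietly creates it with the type's default.
error_reporting(E_ALL);

settype($count, "int");    // $count is now int(0)
settype($name, "string");  // $name is now string("")
settype($list, "array");   // $list is now array()

var_dump($count, $name, $list);
?>
```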

Another handy use is in a function.
<?php

function myfunc($someint, $somestring) {

    settype($someint, "int");
    settype($somestring, "string");

    if ($someint > 0) {
        echo $somestring;
    }
}

?>

Now this example does not do much for the string, but it ensures the int variable is an integer. Without the settype() line, the if() statement could evaluate to true if a string with a leading number (say, "5 widgets") were passed in for $someint.

The one down side that some may not like is with arrays and objects. If the variable holds a scalar value (int, string, float, boolean), setting it to "object" turns it into a stdClass object whose scalar member contains the old value, and setting it to "array" wraps the old value as element 0.
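A short demonstration of that conversion behavior on an existing scalar:

```php
<?php
// What settype() does to a variable that already holds a scalar:
$a = 5;
settype($a, "array");   // wraps the value: array(0 => 5)

$b = 5;
settype($b, "object");  // stdClass with a ->scalar property of 5
?>
```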

There is no note in the official docs (there is one in the comments) about settype() initializing a variable. I think I will submit a documentation update request in the bug system. I am also considering writing a test to ensure this little-known, but IMHO very useful, feature stays in PHP.