MemProxy 0.1

MemProxy 0.1 is out!  It has taken me a while, but I have finally gotten around to releasing the code that I credited with saving us during a Yahoo! mention.  It is a caching proxy "server" that uses memcached for storing the cache.  I put server in quotes because it is really just a PHP script that handles the caching and talking to the application servers.  Apache and other HTTP servers already do a good job talking HTTP to a vast myriad of clients.  I did not see any reason to reinvent the wheel.  Here are some of the features that make it different from anything I could find:

  • Uses memcached for storage

  • Serves cache headers to clients based on TTL of cached data

  • Uses custom headers to assemble multiple pieces of cache into one object

  • Minimal dependencies.  Only PHP and pecl/memcached needed.

  • Small code base.  It is just two files, one when settings are cached.

  • Application agnostic.  If the backend is hosted on an HTTP server this can cache it.


Some other things it does that you might expect:

  • Handles HTTP 1.1 requests to the backend

  • Allows TTLs set by the standard Cache-Control header

  • Appears transparent to the client.

  • Sends proper HTTP error codes relating to proxies/gateways

  • Allows pages to be refreshed or removed from cache

  • Allows a page to be viewed from the application server without caching it

  • more....


You can find the code on Google Code.  The code (or something like it rather) has been in use at dealnews for well over a year.  But, this is a new code base.  It had to be refactored for public consumption.  So, there may be bugs.
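
To make the "cache headers based on TTL" feature above a bit more concrete, here is a minimal sketch of the idea. This is not MemProxy's actual code. The function name `cacheHeadersForEntry()` and its parameters are made up for illustration, and it assumes the proxy knows when an entry was stored and what TTL it was given.

```php
<?php
/**
 * Hypothetical helper: build client cache headers from the time left on a cached entry.
 *
 * @param int $storedAt Unix timestamp when the entry was cached (assumed to be known)
 * @param int $ttl      TTL in seconds the entry was cached with
 * @param int $now      Current Unix timestamp
 */
function cacheHeadersForEntry(int $storedAt, int $ttl, int $now): array
{
    // Seconds until the cached copy goes stale, never negative.
    $remaining = max(0, ($storedAt + $ttl) - $now);

    if ($remaining === 0) {
        // Nothing left on the clock, so ask the client to revalidate.
        return ['Cache-Control: no-cache'];
    }

    return [
        'Cache-Control: max-age=' . $remaining,
        'Expires: ' . gmdate('D, d M Y H:i:s', $now + $remaining) . ' GMT',
    ];
}

// Usage: an entry cached 100 seconds ago with a 300 second TTL.
foreach (cacheHeadersForEntry(time() - 100, 300, time()) as $line) {
    echo $line, "\n"; // in a real script these would go through header()
}
```

The point is simply that the client should never be told to keep a page longer than the proxy itself will.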

OOP does not equal portable or shareable

So, just now, I was reading a good Rails post by Stuart Herbert and nodding my head along.  I have not gotten into the Rails bashing fun on my blog, but I do poke fun at it around the office.  Then I got to this part:
The OO in Rails continues to leave PHP for dead, and OO brings many advantages to a thriving development community.  There are real advantages to being able to share code between both the must-be-real-time web front-end and the non-real time backends, and to be able to easily reuse whatever external open-source libraries save you time and effort.

Now, I have no idea about the first part.  I am not an OOP guy.  But, what I take issue with is the idea that for code to be reusable, it has to be OOP.  So, if I am a college kid or young PHP developer, I would read this and think "Oh, so, to reuse code or share it, I have to be using OOP".  Man, this is just so dead wrong and irresponsible.  Can someone tell me why only OOP can be reused?  Why can't people write sane functions that can be reused?  I do it every day.  They do it in C all the time.  Our front end web servers run the same code base as the cron jobs that do a wide variety of things.  They use the same libraries.  They use the same objects (yeah, I use them when they are a good idea).

Please, someone explain this to me.

(I have a half written post about how you can write good, maintainable, reusable code without OOP.  I have not finished it yet, but I guess I need to.  It seems the world is going to OOP hell otherwise.)

in_array is quite slow

So, we had a cron job hanging for hours.  No idea why.  So, I started debugging.  It all came down to a call to in_array().  See, this job is importing data from a huge XML file into MySQL.  After it is done, we want to compare the data we just added/updated to the data in the table so we can deactivate any data we did not update.  We were using a mod_time field in MySQL in the past.  But, that proved to be an issue when we wanted to start skipping rows from the XML that were present but unchanged.  Doing that saved a lot of MySQL writes and sped up the process.

So, anyhow, we have this huge array of ids accumulated during the import.  An IN clause with 2 million parts would suck.  So, we suck back all the ids that exist in the database and stick them into an array.  We then compared the two arrays by looping over one array and using in_array() to check if each value was in the second array.  Here is a pseudo example that shows the idea:

[sourcecode language='php']

foreach($arr1 as $key=>$i){
    if(in_array($i, $arr2)){
        unset($arr1[$key]);
    }
}

[/sourcecode]

So, that was running for hours with about 400k items.  Our data did not have the value as the array key, but it could, since the values were unique.  So, I added it.  So, now, the code looks like:

[sourcecode language='php']

foreach($arr1 as $key=>$i){
    if(isset($arr2[$i])){
        unset($arr1[$key]);
    }
}

[/sourcecode]

Yeah, that runs in 0.8 seconds.  Much better.

So, why were we using in_array to start with if in_array is clearly not the right solution to this problem?  Well, it was basic code evolution.  Originally, these imports would be maybe 100 items.  But, things changed.

FWIW,  I tried array_diff() as well.  It took 25 seconds.  Way better than looping and calling in_array, but still not as quick as a simple isset check.  There was refactoring needed to put the values into the keys of the array.
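
For what it is worth, here is a small self-contained sketch of that "values into keys" refactoring.  The function name `findStaleIds()` and the sample ids are made up for illustration.  Our real code builds the arrays during the import, so treat this as one way to do it, not the way we did it.  The assumption is the same as above: the ids are unique, so array_flip() is safe.

```php
<?php
/**
 * Return the ids that exist in the database but were not seen during the import.
 * Assumes ids are unique integers or strings, so they can safely be used as array keys.
 */
function findStaleIds(array $databaseIds, array $importedIds): array
{
    // array_flip() turns values into keys, so each check below is a hash
    // lookup instead of the linear scan that in_array() has to do.
    $importedLookup = array_flip($importedIds);

    $stale = [];
    foreach ($databaseIds as $id) {
        if (!isset($importedLookup[$id])) {
            $stale[] = $id;
        }
    }
    return $stale;
}

// Hypothetical sample data.
$databaseIds = [1, 3, 5, 7, 8, 9];
$importedIds = [3, 5, 8];

print_r(findStaleIds($databaseIds, $importedIds)); // ids 1, 7 and 9
```

If both arrays are already keyed by id, array_diff_key() should give the same answer without the explicit loop.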

UPDATE: I updated this post to properly reflect that there is nothing wrong with in_array, but simply that it was not the right solution to this problem.  I wrote this late and did not properly express this.  Thanks to all those people in the comments that helped explain this.

Stupid PHP Tricks: Normalizing SimpleXML Data

SimpleXML is neat.  Some people don't think it is so simple.  Boy, they should try using the old stuff.  The DOM-XML stuff.

Anyhow, one annoying thing about SimpleXML has to do with caching.  When using web services, we often cache the contents we get back.  We were having a problem where we would get an error about a SimpleXML node not existing.  We were caching the data in memcached which serializes the variable.  So, when it unserialized the variable, there were references in there to some SimpleXML nodes that we did not take care of.  Basically, a tag like:

<foo>bar</foo>

is a string.  But a tag like:

<foo></foo>

is an empty SimpleXML Object.  That is a little annoying, but I don't feel like digging into the C code and figuring out why.  So, we just work around it.  We made a recursive function to do the dirty work for us.

function makeArray($obj) {
    $arr = (array)$obj;
    if(empty($arr)){
        $arr = "";
    } else {
        foreach($arr as $key=>$value){
            if(!is_scalar($value)){
                $arr[$key] = makeArray($value);
            }
        }
    }
    return $arr;
}

That will turn whatever you pass it into an array or empty string if it is empty.

But, while I was hacking around tonight, I came up with another idea.  Check out this hackery:

$data = json_decode(json_encode($data));

Yeah!  One liner.  That converts all the SimpleXML elements into stdClass objects.  All other vars are left intact.
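
Here is a tiny, self-contained sketch of the one-liner in action.  The XML payload is made up for illustration, and serialize()/unserialize() stands in for what the memcached client does when it stores a variable.  I have not checked every corner case, so things like attributes or mixed content may not come through the JSON round trip the way you expect.  Check it against your own payloads.

```php
<?php
// Hypothetical web service payload with one populated tag and one empty tag.
$xml = simplexml_load_string('<response><foo>bar</foo><empty></empty></response>');

// Round trip through JSON so the SimpleXML nodes become plain stdClass objects.
$data = json_decode(json_encode($xml));

// Plain objects survive serialize()/unserialize(), which is roughly what
// happens when the variable goes into the cache and comes back out.
$restored = unserialize(serialize($data));

var_dump($restored instanceof stdClass); // a plain object, no SimpleXML left
var_dump($restored->foo);                // the text of <foo>
```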

Ok, so this is where someone in the comments can tell me about the magic SimpleXML method or magic OOP function I have missed to take care of all this.  Go ahead, please make my code faster.  I dare you.

Short Array Syntax for PHP

So, I was asked in IRC today about the proposed short array syntax for PHP. For those that don't know, I mean the same syntax that other languages (javascript, perl, python, ruby) all have. Currently in PHP we have this:

$var = array(1,2,3);

The proposed additional syntax is:

$var = [1,2,3];

So, I voted +1 for this feature on the PHP Internals list. A colleague asked me why I voted +1. At first I had no good answer other than it was just a gut feeling. It just feels like a good addition to the language. It is common among web languages and therefore users coming into PHP from other languages may find it more comfortable.

The best thing I could tell him was that it would make arrays fall in line with other data types in PHP. For example, you never write:

$var = int(1);

$var = string(foo);

So, why oh why do we have to have what looks like a function, but in reality is not, for creating an array? It is a language construct and should look like a language construct. I think the [ ] syntax makes more sense when you think about it in those terms.

I say commit it Andi. That seems to be what everyone else does. =)

PHP session cookie refresh

I have always had an issue with PHP Sessions. Albeit, a lot of my issues are now invalid. When they were first implemented, they had lots of issues. Then the $_SESSION variable came to exist and it was better. Then memcached came to exist and you could store sessions there. That was better. But, still, after all this time, there is one issue that still bugs me.

When you start a session, if the user had no cookie, they get a new session id and they get a cookie. You can configure that cookie to last for n seconds via php.ini or session_set_cookie_params(). But, and this is a HUGE but for me, that cookie will expire in n seconds no matter what. Let me explain further. For my needs, the cookie should expire in n seconds from last activity. So, each page load where sessions are used should reset the cookie's expiration. This way, if a user leaves the site, they have n seconds to come back and still be logged in.

Consider an application that sets the cookie expiration to 5 minutes. The person clicks around on the site, gets a phone call that lasts 8 minutes and then gets back to using the site. Their session has expired!!!! How annoying is that? The only sites I know that do that are banks. They have good reason. I understand that.

My preference would be to either set an ini value that tells PHP sessions to keep the session active as long as the user is using the site. Or give me access to the internal function php_session_send_cookie(). That is the C function that sends the cookie to the user's browser. Hmm, perhaps a patch is in my future.

In the short term, this is what I do:

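// Call after session_start() and before any output is sent.
// Note: if session.cookie_lifetime is 0, this makes the cookie expire right away.
setcookie(
    ini_get("session.name"),
    session_id(),
    time() + (int)ini_get("session.cookie_lifetime"),
    ini_get("session.cookie_path"),
    ini_get("session.cookie_domain"),
    (bool)ini_get("session.cookie_secure"),
    (bool)ini_get("session.cookie_httponly")
);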


That will set the session cookie with a fresh ttl.
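
If you want something a little more defensive, here is a sketch of the same idea wrapped in a function.  The name `buildRefreshedSessionCookie()` is made up for illustration.  It assumes PHP 7.3 or later for the options-array form of setcookie(), and it skips the refresh when the lifetime is 0, since a browser-session cookie has nothing to extend.

```php
<?php
/**
 * Hypothetical helper: work out the arguments for a refreshed session cookie.
 * Returns null when there is no session id or the cookie is a browser-session cookie.
 *
 * @param array $params Same shape as the array returned by session_get_cookie_params()
 */
function buildRefreshedSessionCookie(string $name, string $id, array $params, int $now): ?array
{
    if ($id === '' || $params['lifetime'] <= 0) {
        return null;
    }

    return [
        'name'    => $name,
        'value'   => $id,
        'options' => [
            // Expire n seconds from this request, not from the first one.
            'expires'  => $now + $params['lifetime'],
            'path'     => $params['path'],
            'domain'   => $params['domain'],
            'secure'   => $params['secure'],
            'httponly' => $params['httponly'],
        ],
    ];
}

// Usage: only makes sense with an active session and before output starts.
if (session_status() === PHP_SESSION_ACTIVE && !headers_sent()) {
    $cookie = buildRefreshedSessionCookie(
        session_name(),
        session_id(),
        session_get_cookie_params(),
        time()
    );
    if ($cookie !== null) {
        setcookie($cookie['name'], $cookie['value'], $cookie['options']);
    }
}
```

Keeping the math in its own function means it can be checked without a browser or a live session.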

Ok, going to dig into some C code now and see if I can make a patch for this.

Thoughts on the 2008 MySQL Conference and Expo

Well, it has been almost a month.  I know I am late to the blogosphere on my thoughts.  Just been busy.

Again this year, the Phorum team was invited to be a part of the DotOrg Pavilion.  What is that?  Basically they just give expo floor space to open source projects.  It is cool.  We had a great location this year.  We were right next to the area where they served food and drinks during the breaks.  We had lots of traffic and met some of our power users.  IMVU.com is getting 1.5 million messages per month in their Phorum install.  They did have to customize it to fit into their sharding.  But, that is expected.  A guy (didn't catch his name) from Innobase came by and told us that they just launched InnoDB support forums on their site using Phorum.  Cool.  So now MySQL and Innobase use Phorum.  I am humbled by the message that sends to me about Phorum.

Speaking of our booth, we were right next to the phpMyAdmin guys.  Wow, that product has come a long way.  I was checking out the visual database designer they have now.  It was neat.  I also met the Gentoo MySQL package maintainer.  He was in the phpMyAdmin booth.

I was interviewed by WebDevRadio as I already posted.  I was also asked to do a short Q&A with the Sun Headlines video team.  They used one part of my clip.  I won't link to that.  No, if you find it, good for you.  I need to be interviewed some more or something.  I did not look comfortable at all.

There were lots of companies with open in their name or slogan.  I guess this is expected pandering.

I attended part of the InnoDB talk given by Mark Callaghan of Google.  It appears that Google is serious about improving InnoDB on large machines.  That is, IMO, good news for anyone that likes InnoDB.  If I counted right, they had more than 5 people for whom at least part of their job is improving InnoDB.

I gave my two talks.  The first had low attendance, but the feedback was nice.  It was just after the snack break in the expo hall and I was in the farthest room from the expo hall.  That is what I keep telling myself. =)  The second was better attended and the feedback seemed good there.  I was told by Maurice (Phorum Developer) that I talked too fast and at times sounded like Mr. Mackey from South Park by repeating the word bad a lot.  I will have to work on that in the future.  I want to do more speaking.

On the topic of my second talk, there seemed to be a lot of "This is how we scaled our site" talks.  I for one found them all interesting.  Everyone solves the problem differently.

Next year I am thinking about getting more specific with my talk submissions.  Some ideas include: PHP, MySQL and Large Data Sets, When is it ok to denormalize your data?, Using memcached (not so much about how it works), Index Creation (tools, tips, etc.).

In closing, I want to give a big thanks to Jay Pipes and Lenz Grimmer from MySQL.  Despite Jay's luggage being lost he was still a big help with some registration issues among other things.  Both of them helped out the Phorum team a great deal this year.  Thanks guys.

Amazon MP3 Store has holes

A coworker found out how secure Amazon's MP3 store is.  Even big guys like Amazon make errors in their web site security.
So, I clicked purchase and the album immediately started downloading. It was at this point that I had the thought cross my mind: "Did I update my credit card info?"

Well, no, I didn't. Before the album finished downloading, I was trying to change the method of payment. Turns out, for a digital purchase, you can't do such a thing. So, I waited and wondered what was going to come of this...

Example my.cnf files

NEW UPDATE: MySQL Forge is being end of lifed. And honestly, there were never that many examples there. It just never took off. But, there is good news. Percona, the company in my opinion that is leading the way for binary compatible MySQL progress, has an online tool for creating a my.cnf file based on your input about your server, your data and your workload. It seems to work very well. I recommend using it for creating your my.cnf file.

UPDATE: There are some examples being added at the MySQL Forge now.

When I first started installing MySQL for myself, it was quite handy to have the example my.cnf files in the source package. I was a noob to the MySQL configuration. Even after I became more experienced, I would use them as a starting point. However, I now find that they are so behind the times they are not as useful. Here are some of the comments from the files.

my-small.cnf

# This is for a system with little memory (<= 64M) where MySQL is only used
# from time to time and it's important that the mysqld daemon
# doesn't use much resources.

my-medium.cnf

# This is for a system with little memory (32M - 64M) where MySQL plays
# an important part, or systems up to 128M where MySQL is used together with
# other programs (such as a web server)

my-large.cnf

# This is for a large system with memory = 512M where the system runs mainly
# MySQL.

my-huge.cnf

# This is for a large system with memory of 1G-2G where the system runs mainly
# MySQL.

I end up using the large or huge files as a starting point for every server I set up by hand. The small and medium should be renamed underpowered and teeny-tiny. Who has less than 64MB of RAM on a server now? Can you even buy sticks of memory that small in any modern system? Most come with 256MB sticks minimum. And they never come with just one stick.

I will use the large example as a starting point for a server that has 2GB of RAM and will be running an entire site on one server. I use huge for any server that runs only MySQL. And even then, most of them have 4GB of RAM or more.

I don't know if anyone at MySQL has plans on tweaking these files or not. Perhaps those good guys at the MySQL Performance Blog or Percona could create some example my.cnf files. I could put some out there, but I fear their sole purpose would be for someone to point out what I am doing wrong. =P Hey, they work for me. Hmm, maybe this would make a good MySQL Forge section. A whole area of user contributed my.cnf files. They could be architecture specific and everything. What runs best on Solaris? Linux? BSD? Windows? 32-bit? 64-bit?

One thing I would for sure like to see is example files for InnoDB dominant servers. Most of our servers run primarily InnoDB tables. None of the examples above covers InnoDB. They have comments, but no preconfigured values. I have seen more than one server using InnoDB tables without any custom configuration in their my.cnf. In the end that is the fault of the server admin/owner no doubt.

What do you say? Anyone up for a MySQL Forge section for my.cnf files?

Interview with WebDevRadio

While I was at the MySQL Conference, I sat down with Michael Kimsal of WebDevRadio and recapped the two talks that I gave at the conference.  I have uploaded the slides so you can follow along if you want.

One to a Cluster - The evolution of the dealnews.com architecture.

MySQL Tips and Tricks - Some simple tips and some of the more advanced SQL we use in Phorum.

Thanks Michael.  Any time you need a guest, just let me know.