Sharing gotchas on Facebook and Twitter

I have been working on adding some sharing features to dealnews.com. Dealing with Facebook and Twitter has been nothing if not frustrating. Neither one seems to understand how to properly deal with escaping a URL. At best they do it one way, but not all ways. At worst, they flat out don't do it right. I thought I would share what we found out so that someone else may be helped by our research.

Facebook

Facebook has two main ways to encourage sharing of your site on Facebook. The older way is to "Share" a page. The second, newer, cooler way to promote your page/site on Facebook is with Facebook's Like Button. Both have the same bug. I will focus on Share as it is easier to show examples of sharing. To do this, you make a link that sends the user to a special landing page on Facebook's site. But, let's say my URL has a comma in it. If it does, Facebook just blows up in horrible fashion. The users of Phorum have run into this problem too. In Phorum, we dealt with register_globals in a unique way long ago. We just don't use traditional query strings on our URLs. Instead of the traditional var1=1&var2=2 format, we decided to use a comma delimited query string. 1,2,3,var4=4 is a valid Phorum URL query string.

According to RFC 3986, a query string is made up of:
query = *( pchar / "/" / "?" )
where pchar is defined as:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
and finally, sub-delims is defined as:
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" /
             "*" / "+" / "," / ";" / "="
That is RFC talk for "A query string can have an un-encoded comma in it as a delimiter." So, in Phorum we have URLs like http://www.phorum.org/phorum5/read.php?61,145041,145045. That is the post in Phorum talking about Facebook's problem. It is a valid URL. The commas do not need to be escaped. They are delimiters much like an & would be in a traditional URL. So, what happens when you share this URL on Facebook? Well, a share link would look like http://www.facebook.com/share.php?u=http%3A%2F%2Fwww.phorum.org%2Fphorum5%2Fread.php%3F61%2C146887%2C146887. If I go to that share page and then look in my Apache logs I see this:
66.220.149.247 - - [18/Nov/2010:00:47:51 -0600] "GET /phorum5/read.php?61%2C146887%2C146887 HTTP/1.1" 302 26 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
Facebook sent %2C instead of a comma? It decoded everything else in the URL: the slashes, the question mark, all of it. So, what is their deal with commas? Well, maybe I can hack around Facebook and not send an encoded URL to the share page. Nope, same thing. So, they are proactively encoding commas in URLs' query strings.
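For reference, here is roughly how that share link gets built in PHP. This is just an illustration using the Phorum URL from above and standard PHP encoding functions, not actual Facebook or dealnews code:

```php
<?php

// Build a Facebook share link for an RFC 3986-valid URL with commas.
// rawurlencode() percent-encodes the whole URL, commas included, which
// is correct for a value embedded inside another URL's query string.
$url   = "http://www.phorum.org/phorum5/read.php?61,145041,145045";
$share = "http://www.facebook.com/share.php?u=" . rawurlencode($url);

echo $share . "\n";
// When Facebook decodes the u parameter it should get the commas back,
// but as the Apache log above shows, it re-encodes them as %2C before
// requesting the page.
```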

This has two effects. The first is that the share app attempts to pull in the title, description, etc. from the page. In this case, we redirect the request because the query string is invalid for a Phorum message page. So, they end up getting the main Phorum page. In the case of dealnews, we usually throw a 400 HTTP error when we get an invalid query string. Neither of these gets the user what he wanted. The second problem is that the URL that is clickable once the user has shared it is not valid. So, the whole thing was just a huge waste of time.

I have submitted this to the Facebook Bugzilla. The only workaround is to use a URL shortener or to not use commas in your URLs. Just make sure the shortener does not use commas. I guess you could use special URLs for Facebook that use something besides a comma and are then redirected to the real URL with commas. I don't know what that character would be, I am just guessing.

Twitter

Twitter's issues stem from their transition from their old interface to their new one. Twitter is in the process of (or is done with) rolling out a new UI on their site. The link on the old site to share something on Twitter was something like: http://twitter.com/home?status=[URL encoded text here]. This worked pretty darn well. You could put any valid URL encoded text in there and it worked. However, that now redirects you to the new interface's way of updating your status, and they don't encode things right.

If I want to tweet "I love to eat pork & beans" I would make the URL http://twitter.com/home?status=I+love+to+eat+pork+%26+beans. Twitter then takes that, decodes the query string and redirects me to http://twitter.com/?status=I%20love%20to%20eat%20pork%20&%20beans. The problem is that they did not re-encode the &. It sits bare in the URL. So, when I land on my Twitter page, my status box just says "I love to eat pork ". Which, while true, is not what I meant to tweet. This bug has been submitted to Twitter, but has yet to be fixed.

The second problem is with the new site and how it deals with validly encoded spaces. Spaces can be escaped two ways in a URL. The first, older way (which the PHP function urlencode uses) is to encode spaces as a plus (+) sign. This comes from the standard for how forms submit (or used to submit) data. It is understood by all browsers. The second way comes from the later RFCs written about URLs. They state that spaces in a URL should be escaped like other characters, by replacing a space with %20. The old Twitter UI would accept either one just fine. And, if you send that to the old status update URL, it will redirect you (see above) with %20 in the URL instead of +. However, if you send + to the new Twitter UI, as above, you get "I+love+to+eat+pork+&+beans" in your status box. The only solution is to not send + as an encoding for space to Twitter. In PHP you can use the function rawurlencode to do this. It conforms to the RFC(s) on URL encoding. Doing so with the new linking pattern generates the URL http://twitter.com/?status=I%20love%20to%20eat%20pork%20%26%20beans which works great. This was also reported to Twitter as a bug by our team.
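The difference between the two encodings is easy to see in PHP. This is only an illustration of the two functions, using the example tweet from above:

```php
<?php

$status = "I love to eat pork & beans";

// urlencode(): form-style encoding, spaces become "+"
echo urlencode($status) . "\n";    // I+love+to+eat+pork+%26+beans

// rawurlencode(): RFC 3986 encoding, spaces become "%20"
echo rawurlencode($status) . "\n"; // I%20love%20to%20eat%20pork%20%26%20beans

// So a link that is safe for the new Twitter UI looks like:
echo "http://twitter.com/?status=" . rawurlencode($status) . "\n";
```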

So, maybe that will help someone out that is having issues with sharing your site on the two largest social networks. Good luck with your social media development.

Monitoring PHP Errors

PHP errors are just part of the language. Some internal functions throw warnings or notices that seem unavoidable. A good case is parse_url. The point of parse_url is to take apart a URL and tell me the parts. Until recently, the only way to validate a URL was a regex. You can now use filter_var with the FILTER_VALIDATE_URL filter. But, in the past, I would use parse_url to validate the URL. It worked, as the function returns false if the value is not a URL. But, if you give parse_url something that is not a URL, it throws a PHP warning. The result is that I would use the evil @ to suppress errors from parse_url. Long story short, you get errors on PHP systems. And you shouldn't just ignore them.
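A quick sketch of the old and new approaches; the sample URLs here are made up for the example:

```php
<?php

$good = "http://www.example.com/page.php?a=1";
$bad  = "http://"; // malformed enough to make parse_url() give up

// Old approach: parse_url() returns false on failure, but it can also
// emit a PHP warning, hence the evil @ suppression.
$parts = @parse_url($bad);
var_dump($parts); // bool(false)

// Newer approach: filter_var() validates without the warning. It
// returns the URL itself on success and false on failure.
var_dump(filter_var($good, FILTER_VALIDATE_URL) !== false); // bool(true)
var_dump(filter_var($bad, FILTER_VALIDATE_URL));            // bool(false)
```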

In the past we just logged them and then had the log emailed daily. On some irregular schedule we would look through them and fix stuff to be more solid. But, when you start having lots of traffic, one notice on one line of code can cause 1,000 error messages in a short time. So, we had to find a better way to deal with them.

Step 1: Managing the aggregation

The first thing we did was modify our error handler to log human readable entries to the defined PHP error log and to write a serialized (json_encode actually) version of each error to a second file. This second file makes parsing of the error data super easy. Now we could aggregate the logs on a schedule, parse them and send reports as needed. We have fatal errors monitored more aggressively than non-fatal errors. We also get daily summaries. The one gotcha is that PHP fatal errors do not get sent to the error handler. So, we still have to parse the human readable file for fatal errors. A next step may be to have the logs aggregated to a central server via syslog or something. Not sure where I want to go with that yet.
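A stripped-down sketch of that kind of handler is below. The file paths and record fields are invented for the example; our real handler does more:

```php
<?php

// Log a human readable line to one file and a json_encode'd record of
// the same error to a second, machine readable file.
function dual_log_error_handler($errno, $errstr, $errfile, $errline) {
    $human = date("c") . " [$errno] $errstr in $errfile on line $errline\n";
    file_put_contents("/tmp/php_errors.log", $human, FILE_APPEND);

    $record = json_encode(array(
        "time"  => time(),
        "errno" => $errno,
        "error" => $errstr,
        "file"  => $errfile,
        "line"  => $errline,
    ));
    file_put_contents("/tmp/php_errors.json.log", $record . "\n", FILE_APPEND);

    return false; // let PHP's normal error handling continue
}

set_error_handler("dual_log_error_handler");
```

Remember that fatal errors never reach a handler registered with set_error_handler, which is why the human readable log still has to be parsed for those.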

Step 2: Monitoring

Even with the logging, I had no visibility into how often we were logging errors. I mean, I could grep logs, do some scripting to calculate dates, etc. Bleh, lots of work. We recently started using Circonus for all sorts of metric tracking. So, why not PHP errors? Circonus has a data format that allows us to send any random metrics we want to track to them. They will then graph them for us. So, in my error handler, I just increment a counter in memcached every time I log an error. When Circonus polls me for data, I just grab that value out of memcached and give it to them. I can then have really cool graphs that show me my error rate. In addition, I can set up alerts. If we start seeing more than 1 error per second, we get an email. If we start seeing more than 5 per second, we get paged. You get the idea.
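The counter logic is tiny. Here is a sketch; the key name is my choice, and bump_counter() takes any object with increment() and add() methods just so the pattern is clear (in production that object would be a memcached client instance):

```php
<?php

// Increment an error counter, creating it on first use. memcached's
// increment() fails if the key does not exist yet, hence the add().
function bump_counter($mc, $key = "php_error_count") {
    if ($mc->increment($key) === false) {
        $mc->add($key, 1); // first error; start the counter at 1
    }
}

// In the error handler:  bump_counter($mc);
// In the page Circonus polls, just report the current value:
//     echo "php_error_count: " . (int)$mc->get("php_error_count");
```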

The Twitterverse:

I polled my friends on Twitter and two different services were mentioned. I don't know that I would actually send my errors to a service, but it sounds like an interesting idea. One was HopToad. It appears to be specifically for application error handling. The other was loggly, which appears to be a more generic log service. Both are very interesting concepts. The control freak in me would have issues with it. But, that is a personal problem.

Most others either manually reviewed things or had some similar notification system in place. Anyone else have a brilliant solution to monitoring PHP errors? Or any application level errors for that matter?

P.S. For what it is worth, our PHP error rate is about 0.004 per second. This includes some notices. Way too high for me. I want to approach absolute zero. Let's see if all these new tools can help us get there.

Boycott NuSphere. They are spammy spammers

It took a lot for me to finally write this post. I tweeted about it a while back. It seems NuSphere needs more business. They have resorted to spamming people to promote their PhpED product. I have gotten emails to email addresses that:
  1. I know are not on any mailing list
  2. Are on web pages as plain mailto: anchor tags for good reasons.
In one case, it was the security@phorum.org address we have on the site to make it easy for people to report any security related issues. We get all kinds of spam because of this, but it is worth it to have an easy access address for security issues.

In the other case, the email addresses used were on the dealnews.com jobs page. It was the addresses that are used to accept resumes.

Now, today, I started getting them to my personal inboxes.



Apparently, they sent it out the first time with a misspelling, so they had to send it out again!?!?

Now, the NuSphere people posted a tweet that claims the emails are not coming from them. That it's an independent marketer that "have permission to sell our products". First, the links in the email go to the NuSphere site, not a third party's. I would think an independent marketer would want to take me to their own site. Second, if that is the case, revoke the permission you have given them to sell your products. Don't use a marketing/sales agreement as a shield to allow people in the PHP Community to be spammed in your name.

PHP Community, please do not support NuSphere. They are spammers, directly or indirectly. If they do this to promote their product, how will they find a way to hose their user base later on?

PHP 5.3 and mysqlnd - Unexpected results

I have started seriously using PHP 5.3 recently due to it finally making it into Portage. (Gentoo really isn't full of bleeding edge packages, people.) I have used mysqlnd a little here and there in the past, but until it was really coming to my servers I did not put too much time into it.

What is mysqlnd?

mysqlnd is short for MySQL Native Driver. In short, it is a MySQL driver for PHP that uses internal functions of the PHP engine rather than the externally linked libmysqlclient that has been used in the past. There are two reasons for this. The first reason is licensing. MySQL is a GPL project. The GPL and the PHP License don't play well together. The second is better memory management and hopefully better performance. Being a performance junky, this is what piqued my interest. Enabling mysqlnd means it is used by the older MySQL extension, the newer MySQLi extension and the MySQL PDO driver.

New Key Feature - fetch_all

One new feature of mysqlnd is the fetch_all method on MySQLi result objects. At both dealnews.com and in Phorum I have written a function to simply run a query, fetch all the results into an array and return it. It is a common operation when writing API or ORM layers. mysqlnd introduces a native fetch_all method that does all of this in the extension. No PHP code needed. PDO already offers a fetchAll method, but PDO comes with a little more overhead than the native extensions, and I have been using the mysql functions for 14 years. I am very happy using them.

Store Result vs. Use Result

I have spoken in the past (see my slides and interview: MySQL Tips and Tricks) about using mysql_unbuffered_query or using mysqli_query with the MYSQLI_USE_RESULT flag. Without going into a whole post about that topic, it basically allows you to stream the results from MySQL back into your PHP code rather than having them buffered in memory. In the case of libmysqlclient, they could be buffered twice. So, my natural thought was that using MYSQLI_USE_RESULT with fetch_all would yield the most awesome performance ever. The data would not be buffered and it would get put into a PHP array in C instead of userland PHP code. The code I had hoped to use would look like:
$res = $db->query($sql, MYSQLI_USE_RESULT);
$rows = $res->fetch_all(MYSQLI_ASSOC);
But, I quickly found out that this does not work. For some reason, it is not supported. fetch_all only works with the default, which is MYSQLI_STORE_RESULT. I filed a bug which was marked bogus. I put it back to new because I really don't see a reason this should not work, other than a complete oversight by the mysqlnd developers. So, I started doing some tests in hopes I could show the developers how much faster using MYSQLI_USE_RESULT could be. What happened next was not expected. I ended up benchmarking several different options for fetching all the rows of a result into an array.

Test Data

I tested using PHP 5.3.3 and MySQL 5.1.44 using InnoDB tables. For test data I made a table that has one varchar(255) column. I filled that table with 30k rows of random lengths between 10 and 255 characters. I then selected all rows and fetched them using 4 different methods.
  1. mysqli_result::fetch_all*
  2. PDOStatement::fetchAll
  3. mysqli_query with MYSQLI_STORE_RESULT followed by a loop
  4. mysqli_query with MYSQLI_USE_RESULT followed by a loop
In addition, I ran this test with mysqlnd enabled and disabled. For mysqli_result::fetch_all, only mysqlnd was tested as the method is only available with mysqlnd. I ran each test 6 times and threw out the best and worst result for each test. FWIW, the best and worst did not show any major deviation for any of the tests. For measuring memory usage, I read the VmRSS value from Linux's /proc data. memory_get_usage() does not show the hidden memory used by libmysqlclient and does not seem to show all the memory used by mysqlnd either.
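The timing side of the harness was nothing fancy; something along these lines, where time_it() wraps each fetch method. The table name and connection details are placeholders, not the actual test setup:

```php
<?php

// Time one fetch method and report how long it took and how many rows
// came back. $fn is a closure that runs the query and returns the rows.
function time_it($label, $fn) {
    $start = microtime(true);
    $rows  = $fn();
    printf("%-12s %.4f sec, %d rows\n", $label, microtime(true) - $start, count($rows));
    return $rows;
}

// Example usage (needs a live MySQL server, so it is commented out):
// $db = new mysqli("localhost", "user", "pass", "test");
// time_it("fetch_all", function () use ($db) {
//     return $db->query("select * from test_data")->fetch_all(MYSQLI_ASSOC);
// });
// time_it("use+loop", function () use ($db) {
//     $res  = $db->query("select * from test_data", MYSQLI_USE_RESULT);
//     $rows = array();
//     while ($row = $res->fetch_assoc()) { $rows[] = $row; }
//     return $rows;
// });
```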



So, that is what I found. The memory usage graphs are all what I thought they would be. PDO has more overhead by its nature. Storing the result always uses more memory than using it. mysqli_result::fetch_all uses less memory than the loop, but more than directly using the results.

There are some very surprising things in the timing graphs, however. First, the tried and true method of using the result followed by a loop is clearly still the right choice with libmysqlclient. However, it is a horrible choice for mysqlnd. I don't really see why this is so. It is nearly twice as slow. There is something really, really wrong with MYSQLI_USE_RESULT in mysqlnd. There is no reason it should ever be slower than storing the result and then reading it again. This is also evidenced in the poor performance of PDO (since even PDO uses mysqlnd when it is enabled). PDO uses an unbuffered query for its fetchAll method and it too got slower. It is noticeably slower than libmysqlclient. The good news, I guess, is that if you are using mysqlnd, the fetch_all method is the best option for getting all the data back.

Next Steps

My next steps from here will be to find some real workloads that I can test this on. Phorum has several places where I can apply real world page loads to these different methods and see how they perform. Perhaps the test data is too small. Perhaps the number of columns would have a different effect. I am not sure.

If you are reading this and have worked on or looked at the mysqlnd code and can explain any of it, please feel free to comment.

Engineers Work In Their Sleep

My grandmother worked at the Marshall Space Flight Center. She worked with a lot of engineers. I have always remembered when she told my cousin and me that we should be engineers. Her reasoning was that if you saw a janitor sitting in a chair with his arms crossed and his eyes closed, you pretty much knew he was not working. But, if you see an engineer in the same position, you cannot prove he is not working. Now, was she saying she wanted us to sleep on the job? No, of course not. The message was to have a job where you could use your mind. Although I think watching our grandfather work hard every day as a construction foreman may have shaped her opinion. You know what is funny about that? He was the healthiest man I have ever known. Maybe I don't need to be sitting behind a desk with my arms crossed and eyes closed?

For what it is worth, I don't believe that all jobs that use your mind are desk jobs. Some of the smartest people I know have jobs in fields that would be considered "manual" labor. I believe that smart people will always rise to the top no matter what your profession.

Velocity 2010 In Review

I just got back from Velocity for the third straight year. I have been to all three of them which is kind of a neat little club to be in. The first one only had maybe 300 people. This year there were over 1,000 attendees. Registration was shut down by the fire code for the rooms we were using. Most sessions had standing room only. It was awesome.

The people that talk at Velocity are really smart. I am always humbled by the likes of John Allspaw. He and I see eye to eye on a lot, but he is so much better at explaining to people and showing them how to make the ideas work. I wish I had his charisma when at the podium. I was lucky enough to write a chapter in a book for John this year. He and Velocity co-chairperson Jesse Robbins organized and authored a book titled Web Operations that debuted at the conference. I basically just told and expanded on my Yahoo story. John loves that story for some reason. I was happy to be a part of it. So many smart people in that book.

The IE9 technology preview dropped while we were there. HTML 5, CSS 3 and more in there. One feature where Microsoft is actually ahead of the curve is in a new DOM level measurement feature. Basically they expose statistics via the DOM about the time it takes to do different things in the page. The other browser vendors in attendance (Google and Mozilla) vowed to support the same data. Another big advancement of IE9 is the heavy use of the GPU for rendering the pages. They have a real advantage here. They are the only browser vendor that is now locked to one operating system. IE9 will require Vista or higher. They can really max out the system for faster rendering.

As usual, some of the best content was in the hallways and at the bar. We hung out with Theo Schlossnagle from OmniTI and talked about Reconnoiter. It is a kind of Cacti/Ganglia/Nagios all in one. I got to see the Six Apart guys again this year. That is becoming an annual thing. I shared our new Gearman assisted proxy with them. They do some similar stuff for TypePad. More on that proxy later this year. I met a guy from CloudTest. It sounds like a really good use of on-demand cloud resources. I am gonna talk with them about some possible testing.

Membase also dropped while we were there. Most of the persistent key/value stores I have used have disappointed me or just been way too complex for our needs. We don't want a memcached replacement. It does its job damn well. I just need a place to store ad hoc data for various applications. Membase is promising because the guys that wrote it are core memcached contributors. There is a company behind it, so it is not as inviting as Drizzle. But, the code is on GitHub so it is more open than say MySQL. Time will tell.

If you have not been to Velocity, I encourage you to go next year. It is right for all types of people in the web business. Developers can learn about performance in new ways that will change the way they write code. Operations can learn techniques to make their work day much less painful. Everyone will learn how to empower their business to achieve its goals.

PHP and Memcached: The state of things

Memcached is the de facto standard for caching in dynamic web sites. PHP is one of the most widely used languages on the web. So, naturally there is lots of interest in using the two together. There are two choices for using memcached with PHP: PECL/memcache and PECL/memcached. Great names, huh? But as of this writing there are issues with the two most popular memcached libraries for PHP. This is a summary of those issues that I hope will help people being hurt by them and may bring about some change.

PECL/memcache

This is the older of the two and was the first C based extension for using memcached with PHP. Before this, all the options were native PHP code. While they worked, they were slower of course. C > PHP. That is just fact. However, there has not been much active development on this code in some time. Yes, they have fixed bugs, but support for new features in the memcached server has not been added to this extension. Newer versions of the server support a more efficient binary protocol and many new features. In addition, there are some parts of the extension that simply don't work anymore.

The most glaring one is the delete() function. It takes a second parameter that is documented as: "the item will expire after timeout seconds". In fact, that was never a feature of memcached. It was completely misunderstood by the original extension authors. When that parameter was supported, it locked the key for timeout seconds and would not allow a new add operation on that key. On top of that, the feature was completely removed in memcached 1.4.0. So, if you send a timeout to the delete function, you simply get a failure. This is creating a huge support issue in the memcached community (not the PHP/PECL community) with people not being able to delete keys because of bad documentation and unsupported behavior. I sent a documentation patch for this function to the PHP Docs list. I then modified it based on feedback. But since then I have heard nothing about it getting merged. I have a PHP svn account, but even if I do have karma on the docs repository, I don't want to hijack them without the support of the people that work on the docs all the time. If you are reading this and can change the docs, please make the documentation for that function say "DON'T USE THIS PARAMETER, IT HAS BEEN DEPRECATED IN THE MEMCACHED SERVER!" or something.
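Until the docs change, the safest thing is to never pass that second argument. A trivial guard like the one below makes the rule hard to break; it works with anything exposing a delete($key) method (a Memcache instance in production, a stub in tests) and is just a sketch of the rule, not dealnews code:

```php
<?php

// Always call delete() with the key only. Passing a timeout fails
// outright against memcached 1.4.0+ and locked the key on older
// servers, which is almost never what anyone wants.
function safe_delete($mc, $key) {
    return $mc->delete($key);
}
```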

Not too long ago the extension was officially abandoned by its original maintainers. Some people stepped up and claimed they wanted to see it continue. But, since that time, there have been no releases. Bugs are being closed though, so maybe there are good things coming.

A very problematic issue with this extension is the 3.0 beta release. It needs to just die. It has huge bugs that, IMO, were introduced by the previous maintainers in an effort to bring it up to speed; they never saw the work through to make sure the new code worked. In their defense, it is marked as beta on PECL. But, thanks to Google, people don't see the word beta anymore. There are lots of people using this version, and when they get bad results they blame the whole memcached world. Really, the new maintainers would do the world a favor if they just removed the 3.x releases from the PECL site.

PECL/memcached

This extension was started by Andrei Zmievski while working at Digg.com as an open source developer. It uses libmemcached, a memcached client library that does all the memcached work. This made it quite easy to support the new features in the memcached daemon, as the library was being developed at the same time as the new server. However, it has not had a stable release on PECL in nearly a year, except for releases to make it compatible with new versions of libmemcached. No bug fixes and no new features. There are currently 28 open bugs on PECL for this extension. Not all of them are bugs; some are feature requests. The ironic thing is that the GitHub repository for this extension has seen a lot of development. But, none of those bug fixes have made it into the official PECL channel. And some of these bugs are major, while others are just a huge WTF for a developer.

The most major bug is one that I found. If you use persistent connections with this extension, it basically leaks those connections, not reusing an existing connection but also not closing the ones already made. This uses up the memory in your processes until they crash, and it creates an ever-growing number of connections to your memcached server until it has no more connections available. Andrei does report in the bug that it is fixed in his GitHub repository. But, for now, you can't use persistent connections.

The big WTF for a developer is that some functions don't take a numeric key and use it properly.
<?php

$mc = new Memcached();
$mc->addServer("localhost", 11211);

$mc->set(123, "yes!");

var_dump($mc->getMulti(array(123)));

?>
The above code should generate:
array(1) {
  [123]=>
  string(4) "yes!"
}
But, in reality, it generates:
bool(false)
The set succeeds, but the getMulti fails. Again, this is being fixed in GitHub, but is not available for release on PECL.

Compatibility

One big issue with these two extensions is that they are not drop-in replacements for each other. If you want to move from one to the other, you have to budget some time to convert your code. For instance, the older extension's set() function takes flags as the third parameter. The newer extension does not take a flags parameter at all. The older uses get() with an array to do a multi-get, while the newer extension has a separate method for that called getMulti(). There are other differences as well.
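If you do have to migrate, a thin shim can paper over the differences while you convert. This is only a sketch: the class name is mine, and the backend can be any object with PECL/memcached-style set(), get() and getMulti() methods:

```php
<?php

// Adapts PECL/memcache style calls to a PECL/memcached style backend.
class MemcacheShim {
    private $backend;

    public function __construct($backend) {
        $this->backend = $backend;
    }

    // Old extension: set($key, $value, $flags, $expire).
    // The new extension has no flags parameter, so it is dropped.
    public function set($key, $value, $flags = 0, $expire = 0) {
        return $this->backend->set($key, $value, $expire);
    }

    // Old extension: get() with an array does a multi-get. The new
    // extension splits that into get() and getMulti().
    public function get($key) {
        if (is_array($key)) {
            return $this->backend->getMulti($key);
        }
        return $this->backend->get($key);
    }
}
```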

Summary

So, what should you do as a PHP developer? If you are deploying memcached today, I would use the 2.2.x branch of PECL/memcache. It is the most stable. Just avoid the delete bug. It is not as fast and does not have the new features. But, it is very reliable for set, get, add.... the basics of memcached. For the long term, it is a bit unclear. PECL/memcached looks to fix a lot of things in 2.0. But, I have not used it yet. In addition, the long term growth of the project is a bit in question. Will there be a 2.1, 2.2, etc.? I hope so. The other unknown is whether the people fixing the PECL/memcache bugs will keep it up and release a good stable product that supports the new features of the server. Again, I hope so. The best scenario would be to have a choice between two fully compatible and feature rich extensions. Keep your fingers crossed.

PHP generated code tricks

Something that is great about PHP is that you can write code that generates more PHP code to be used later. Now, I am not saying this is a best practice. I am sure it violates some rule in some book somewhere. But, sometimes you need to be a rule breaker.

A simple example is taking a database of configuration information and dumping it to an array. We do this for each publication we operate. We have a publication table. It contains the name, base URL and other stuff that is specific to that publication. But, why query the database for something that only changes once in a blue moon? We could cache it, but that would still require an on demand database hit. The easy solution is to just dump the data to a PHP array and put it on disk.
<?php

$sql = "select * from publications";
$res = $mysqli->query($sql);

while ($row = $res->fetch_assoc()) {
    $pubs[$row["publication_id"]] = $row;
}

$pubs_ser = str_replace("'", "\\'", serialize($pubs));

$php_code = "<?php global \$PUBLICATIONS; \$PUBLICATIONS = unserialize('$pubs_ser'); ?>";

file_put_contents("/some/path/publications.php", $php_code);

?>
Now you can include the publications.php file and have a global variable named $PUBLICATIONS that holds the publication settings. But, how do we load a single publication without knowing numeric ids? Well, you could make some constants.
<?php

$sql = "select * from publications";
$res = $mysqli->query($sql);

while ($row = $res->fetch_assoc()) {
    $pubs[$row["publication_id"]] = $row;
    $constants[$row["publication_id"]] = strtoupper($row["name"]);
}

$pubs_ser = str_replace("'", "\\'", serialize($pubs));

$php_code = "<?php\n";
$php_code.= "global \$PUBLICATIONS;\n";
$php_code.= "\$PUBLICATIONS = unserialize('$pubs_ser');\n";

foreach ($constants as $id => $const) {
    $php_code.= "define('$const', $id);\n";
}

$php_code.= "?>";

file_put_contents("/some/path/publications.php", $php_code);

?>

So, now, we have constants. We can do stuff like:
<?php

// load a publication
require_once "publications.php";

echo $PUBLICATIONS[DEALNEWS]["name"];

?>
But, how about autoloading? It would be nice if I could just autoload the constants.
<?php

$sql = "select * from publications";
$res = $mysqli->query($sql);

while ($row = $res->fetch_assoc()) {
    $pubs[$row["publication_id"]] = $row;
    $constants[$row["publication_id"]] = strtoupper($row["name"]);
}

$pubs_ser = str_replace("'", "\\'", serialize($pubs));

$php_code = "<?php\n";
$php_code.= "class PUB_DATA {\n";

foreach ($constants as $id => $const) {
    $php_code.= "    const $const = $id;\n";
}

$php_code.= "    protected \$pubs_ser = '$pubs_ser';\n";
$php_code.= "}";
$php_code.= "?>";

file_put_contents("/some/path/pub_data.php", $php_code);

?>
Then we create a class in our autoloading directory that extends that object.
<?php

require_once "pub_data.php";

class Publication extends PUB_DATA {

    private $pub;

    public function __construct($pub_id) {
        $pubs = unserialize($this->pubs_ser);
        $this->pub = $pubs[$pub_id];
    }

    public function __get($var) {
        if (isset($this->pub[$var])) {
            return $this->pub[$var];
        } else {
            // Exception
        }
    }

}

?>
Great, now we can do things like:
$pub = new Publication(Publication::DEALNEWS);
echo $pub->name;
The only problem that remains is getting the generated code to all your servers. We use rsync. It works quite well. You may have a different solution for your team. Back when we ran our own in-house ad server, we did all the ad work this way. None of the ad calls ever hit the database to get ads. We stored stats on disk in logs and processed them on a schedule. It was a very solid solution.

One more benefit of using generated files on disk is that they can be cached by APC or XCache. This means you don't have to actually hit disk for them all the time.

Want to understand MySQL indexes?

I have introduced some new people to MySQL recently and had to think back over the years to figure out how I learned what I know about MySQL indexes. A quick way to get up to speed on MySQL indexes is these three podcasts by Sheeri Cabral.
  1. OurSQL Episode 13: The Nitty Gritty of Indexes
  2. OurSQL Episode 17: Hashing it out
  3. OurSQL Episode 18: De-myth-tifying Indexes
Those three episodes do a good job of explaining how indexes work so that you have a better understanding of how MySQL indexes find your data.

MySQL Conference Review

I am back home from a good week at the 2010 O'Reilly MySQL Conference & Expo. I had a great time and got to see some old friends I had not seen in a while.

Oracle gave the opening keynote and it went pretty much like I thought it would. Oracle said they will keep MySQL alive. They talked about the new 5.5 release. It was pretty much the same keynote Sun gave last year. Time will tell what Oracle does with MySQL.

The expo hall was sparse. Really sparse. There were a fraction of the booths compared to past years. I don't know why the vendors did not come. Maybe they don't want to compete with Oracle/Sun? In the past you would see HP or Intel have a booth at the conference. But, with Oracle/Sun owning MySQL, why even try. Or maybe they are not allowed? I don't know. It was just sad.

I did stop by the Maatkit booth and was embarrassed to tell Baron (its creator) that I was not already using it. I had heard people talk about it in the past, but never stopped to see what it does. It would have saved me hours and hours of work over the last few years. Needless to say, it is now being installed on our servers. If you use MySQL, just go install Maatkit now and start using it. Don't be like me. Don't wait for years, writing the same code over and over to do simple maintenance tasks.

Gearman had a good deal of coverage at the conference. There were three talks and a BoF. All were well attended. Some people seemed to have an AHA! moment where they saw how Gearman could help their architecture. I also got to sit down with the PECL/gearman maintainers and discuss the recent bug I found that is keeping me from using it.

I spoke about Memcached as did others. Again, there was a BoF. It was well attended and people had good questions about it. There seemed to be some FUD going around that memcached is somehow inefficient or not keeping up with technology. However, I have yet to see numbers or anything that proves any of this. They are just wild claims by people that have something to sell. Everyone wants to be the caching company since there is no "Memcached, Inc.". There is no company in charge. That is a good thing, IMO.

That brings me to my favorite topic for the conference, Drizzle. I wrote about Drizzle here on this blog when it was first announced. At the time MySQL looked like it was moving forward at a good pace. So, I had said that it would only replace MySQL in one part of our stack. However, after what, in my opinion, has been a lack of real change in MySQL, I think I may have changed my mind. Brian Aker echoed this sentiment in his keynote address about Drizzle. He talked about how MySQL AB and later Sun had stopped focusing on the things that made MySQL popular and started trying to be a cheap version of Oracle. That is my interpretation of what he said, not his words.

Why is Drizzle different? Like Memcached and Gearman, there is no "Drizzle, Inc.". It is an Open Source project that is supported by the community. It is being supported by companies like Rackspace who hired five developers to work on it. The code is kept on Launchpad and is completely open. Anyone can create a branch and work on the code. If your patches are good, they will be merged into the main branch. But, you can keep your own branch going if you want to. Unlike the other forks, Drizzle has started over in both the code and the community. I personally see it as the only way forward. It is not ready today, but my money is on Drizzle five or ten years from now.