MySQL Conference Review
Oracle gave the opening keynote and it went pretty much like I thought it would. Oracle said they will keep MySQL alive. They talked about the new 5.5 release. It was pretty much the same keynote Sun gave last year. Time will tell what Oracle does with MySQL.
The expo hall was sparse. Really sparse. There were a fraction of the booths compared to the past. I don't know why the vendors did not come. Maybe because they don't want to compete with Oracle/Sun? In the past you would see HP or Intel have a booth at the conference. But, with Oracle/Sun owning MySQL, why even try. Or maybe they are not allowed? I don't know. It was just sad.
I did stop by the Maatkit booth and was embarrassed to tell Baron (its creator) I was not already using it. I had heard people talk about it in the past, but never stopped to see what it does. It would have only saved me hours and hours of work over the last few years. Needless to say it is now being installed on our servers. If you use MySQL, just go install Maatkit now and start using it. Don't be like me. Don't wait for years, writing the same code over and over to do simple maintenance tasks.
Gearman had a good deal of coverage at the conference. There were three talks and a BoF. All were well attended. Some people seemed to have an AHA! moment where they saw how Gearman could help their architecture. I also got to sit down with the PECL/gearman maintainers and discuss the recent bug I found that is keeping me from using it.
I spoke about Memcached as did others. Again, there was a BoF. It was well attended and people had good questions about it. There seemed to be some FUD going around that memcached is somehow inefficient or not keeping up with technology. However, I have yet to see numbers or anything that proves any of this. They are just wild claims by people that have something to sell. Everyone wants to be the caching company since there is no "Memcached, Inc.". There is no company in charge. That is a good thing, IMO.
That brings me to my favorite topic for the conference, Drizzle. I wrote about Drizzle here on this blog when it was first announced. At the time MySQL looked like it was moving forward at a good pace. So, I had said that it would only replace MySQL in one part of our stack. However, after what, in my opinion, has been a lack of real change in MySQL, I think I may have changed my mind. Brian Aker echoed this sentiment in his keynote address about Drizzle. He talked about how MySQL AB and later Sun had stopped focusing on the things that made MySQL popular and started trying to be a cheap version of Oracle. That is my interpretation of what he said, not his words.
Why is Drizzle different? Like Memcached and Gearman, there is no "Drizzle, Inc.". It is an Open Source project that is supported by the community. It is being supported by companies like Rackspace who hired five developers to work on it. The code is kept on Launchpad and is completely open. Anyone can create a branch and work on the code. If your patches are good, they will be merged into the main branch. But, you can keep your own branch going if you want to. Unlike the other forks, Drizzle has started over in both the code and the community. I personally see it as the only way forward. It is not ready today, but my money is on Drizzle five or ten years from now.
Logging with MySQL
SQL is really good at retrieving a set of data based on a key or range of keys. Whereas NoSQL products are really good at writing things and retrieving one item from storage. When looking at redoing our architecture a few years ago to be more scalable, I had to consider these two issues. For what it is worth, the NoSQL market was not nearly as mature as it is now. So, my choices were much more limited. In the end, we decided to stick with MySQL. It turns out that a primary or unique key lookup on a MySQL/InnoDB table is really fast. It is sort of like having a key/value storage system. And, I can still do range based queries against it.
But, back to Dathan's problem: clicks. We store clicks at dealnews. Lots of clicks. We also store views. We store more views than we do clicks. So, lots of views and lots of clicks. (Sorry for the vague numbers, company secrets and all. We are a top 1,000 Compete.com site during peak shopping season.) And we do it all in MySQL. And we do it all with one server. I should disclose we are deploying a second server, but it is more for high availability than processing power. Like Dathan, we only use about the last 24 hours of data at any given time. There are three keys for us doing logging like this in MySQL.
Use MyISAM
MyISAM supports concurrent inserts. Concurrent inserts means that inserts can add rows to the end of a table while selects are being performed on other parts of the data set. This is exactly the use case for our logging. There are caveats with range queries as pointed out by the MySQL Performance Blog.
Rotating tables
MySQL (and InnoDB in particular) really sucks at deleting rows. Like, really sucks. Deleting causes locks. Bleh. So, we never delete rows from our logging tables. Instead, nightly we rotate the tables. RENAME TABLE is an (near) atomic process in MySQL. So, we just create a new table.
create table clicks_new like clicks;
rename table clicks to clicks_2010032500001, clicks_new to clicks;
Tada! We now have an empty table for today's clicks. We now drop any table with a date stamp that is longer than x days old. Drops are fast, we like drops.
For querying these tables, we use UNION. It works really well. We just issue a SHOW TABLES LIKE 'clicks%' and union the query across all the tables. Works like a charm.
Gearman
So, I get a lot of flack at work for my outright lust for Gearman. It is my new duct tape. When you have a scalability problem, there is a good chance you can solve it with Gearman. So, how does this help with logging to MySQL? Well, sometimes, MySQL can become backed up with inserts. It happens to the best of us. So, instead of letting that pile up in our web requests, we let it pile up in Gearman. Instead of having our web scripts write to MySQL directly, we have them fire Gearman background jobs with the logging data in them. The Gearman workers can then write to the MySQL server when it is available. Under normal operating procedure, that is in near real time. But, if the MySQL server does get backed up, the jobs just queue up in Gearman and are processed when the MySQL server is available.
BONUS! Insert Delayed
This is our old trick before we used Gearman. MySQL (MyISAM) has a neat feature where you can have inserts delayed until the table is available. The query is sent to the MySQL server and it answers with success immediately to the client. This means your web script can continue on and not get blocked waiting for the insert. But, MySQL will only queue up so many before it starts erroring out. So, it is not as fool proof as a job processing system like Gearman.
Summary
To log with MySQL:
- Use MyISAM with concurrent inserts
- Rotate tables daily and use UNION to query
- Use delayed inserts with MySQL or a job processing agent like Gearman
PS: You may be asking, "Brian, what about Partitioned Tables?" I asked myself that before deploying this solution. More importantly, in IRC I asked Brian Aker about MySQL partitioned tables. I am paraphrasing, but he said that if I ever think I might alter that table, I would not trust it with the partitions in MySQL. So, that kind of turned me off of them.
Using ini files for PHP application settings
One of the challenges of this is where and how to store the connection information for all these services. We have done several things in the past. The most common thing is to store this information in a PHP file. It may be per server or there could be one big file like:
<?php
if(DEV){
$server = "localhost";
} else {
$server = "10.1.1.25";
}
?>
This gets messy quickly. Option two is to deploy a single file that has the settings in a PHP array. And that is a good option. But, we have taken that one step further using some PHP ini trickeration. We use ini files that are loaded at PHP's startup and therefore the information is kept in PHP's memory at all times.
When compiling PHP, you can specify the --with-config-file-scan-dir to tell PHP to look in that directory for additional ini files. Any it finds will be parsed when PHP starts up. Some distros (Gentoo I know) use this for enabling/disabling PHP extensions via configuration. For our uses we put our custom configuration files in this directory. FWIW, you could just put the above settings into php.ini, but that is quite messy, IMO.
To get to this information, you can't use ini_get() as you might think. No, you have to use get_cfg_var() instead. get_cfg_var returns you the setting, in php.ini or any other .ini file when PHP was started. ini_get will only return values that are registered by an extension or the PHP core. Likewise, you can't use ini_set on these variables. Also, get_cfg_var will always reflect the initial value from the ini file and not anything changed with ini_set.
So, lets look at an example.
; db.ini
[myconfig]
myconfig.db.mydb.db = mydb
myconfig.db.mydb.user = user
myconfig.db.mydb.pass = pass
myconfig.db.mydb.server = hostThis is our ini file. the group in the braces is just for looks. It has no impact on our usage. Because this is parsed along with the rest of our php.ini, it needs a unique namespace within the ini scope. That is what myconfig is for. We could have used a DSN style here, but it would have required more parsing in our PHP code.
<?php
/**
* Creates a MySQLi instance using the settings from ini files
*
* @author Brian Moon <brianm@dealnews.com>
* @copyright 1997-Present dealnews.com, Inc.
*
*/
class MyDB {
/**
* Namespace for my settings in the ini file
*/
const INI_NAMESPACE = "dealnews";
/**
* Creates a MySQLi instance using the settings from ini files
*
* @param string $group The group of settings to load.
* @return object
*
*/
public static function init($group) {
static $dbs = array();
if(!is_string($group)) {
throw new Exception("Invalid group requested");
}
if(empty($dbs["group"])){
$prefix = MyDB::INI_NAMESPACE.".db.$group";
$db = get_cfg_var("$prefix.db");
$host = get_cfg_var("$prefix.server");
$user = get_cfg_var("$prefix.user");
$pass = get_cfg_var("$prefix.pass");
$port = get_cfg_var("$prefix.port");
if(empty($port)){
$port = null;
}
$sock = get_cfg_var("$prefix.socket");
if(empty($sock)){
$sock = null;
}
$dbs[$group] = new MySQLi($host, $user, $pass, $db, $port, $sock);
if(!$dbs[$group] || $dbs[$group]->connect_errno){
throw new Exception("Invalid MySQL parameters for $group");
}
}
return $dbs[$group];
}
}
?>
We can now call DB::init("myconfig") and get a mysqli object that is connected to the database we want. No file IO was needed to load these settings except when the PHP process started initially. They are truly constant and will not change while this process is running.
Once this was working, we created separate ini files for our different datacenters. That is now simply configuration information just like routing or networking configuration. No more worrying in code about where we are.
We extended this to all our services like memcached, gearman or whatever. We keep all our configuration in one file rather than having lots of them. It just makes administration easier. For us it is not an issue as each location has a unique setting, but every server in that location will have the same configuration.
Here is a more real example of how we set up our files.
[myconfig.db]
myconfig.db.db1.db = db1
myconfig.db.db1.server = db1hostname
myconfig.db.db1.user = db1username
myconfig.db.db1.pass = db1password
myconfig.db.db2.db = db2
myconfig.db.db2.server = db2hostname
myconfig.db.db2.user = db2username
myconfig.db.db2.pass = db2password
[myconfig.memcache]
myconfig.memcache.app.servers = 10.1.20.1,10.1.20.2,10.1.20.3
myconfig.memcache.proxy.servers = 10.1.20.4,10.1.20.5,10.1.20.6
[myconfig.gearman]
myconfig.gearman.workload1.servers = 10.1.20.20
myconfig.gearman.workload2.servers = 10.1.20.21
Forums are crap. Can we get some help?
However, as a software developer (Phorum), I see a lot of problems and no answers. And it is not all on the software. Web site owners use forums to solve problems that they really, really suck at. Ideally, every web site would be very unique for their audience. They would use a custom solution that fits a number of patterns that best solves their problem. However, most web site owners don't want to take the time to do such things. They want a one stop, drop in solution. See the monolith that is vBulletin, scary.
And what if a forum is the best solution? Well, software developers, in general, are not good designers. They don't think like normal people. And they don't see their applications as a whole, but as pieces that do jobs. The forum software market has been run by software developers for over 10 years. Most of them all are still copies of what UBB was 13 years ago. And software (like Phorum) that has tried to be different is shunned by the online communities of the world because they don't work/look/feel like every other forum software on the planet.
So, as software developers, what are we to do? We want to make great software. We want to help our users help their users. But, what we have been doing for 10+ years has only been adequate. As the leader of an open source forum software project, I am open to any and all ideas.
Wordcraft 0.10 available
The rise of the GLAMMP stack
The G is for Gearman - Gearman is a system to farm out work to other machines, dispatching function calls to machines that are better suited to do work, to do work in parallel, to load balance lots of function calls, or to call functions between languages.
The extra M is for Memcached - memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
More and more these days, you can't run a web site on just LAMP. You need these extra tools (or ones like them) to do all the cool things you want to do. What other tools do we need to work into the acronym? PostgreSQL replaces MySQL in lots of stacks to form LAPP. I guess Drizzle may replace MySQL in some stacks soon. For us, it will likely be added to the stack. Will that make it GLAMMPD? We need more vowels! If you are starting the next must use tool for running web sites on open source software, please use a vowel for the first letter.
Scaling for the Expected and Unexpected - Speaking at Velocity
Last year, I was on a panel with some great guys. I was able to share just a bit about my experience dealing with the instant success of a web site. This year, my proposal was accepted to talk more about dealing with success of a web site. The talk will be focused on my experience at dealnews.com and from working with power users for Phorum. Here is the summary:
Lots of people talk about scaling and performance. But, are they preparing for all the things that could happen? There are multiple problems and there is not one solution to solve them all.
Everything is running fine and BAM! – your site is linked from the front page of Yahoo! What do you do? How can you handle that sudden rush of traffic. Requests per second are running 5x normal levels. Servers have CPU spikes. Daemons are hitting the maximums. You are running out of bandwidth. How could you have been prepared for this? What are the tools and techniques for this type of sudden rush?
Or, lets say you have just come out of a meeting where everyone discovered that your site is growing in traffic 70% – 80% year over year. That means that 1 million page views this month will be nearly 3 million this time in 2 years. How can you plan for that? You don’t want to redesign the whole architecture every 2 years. What methods could be used to deal with this constant long term growth?
While there is no magic bullet for either of these scenarios, there are techniques used by many sites out there to help you get through these situations. This session will cover some of these techniques and talk about their pros and cons.
I must admit, this if the first time since 2000 that I am a little intimidated to speak at a conference. The people that present and attend Velocity are so awesome. I just hope I don't disappoint.
Wordcraft 0.9.1 available
- Tokens on post forms in the admin to help ward off CSRF attacks.
- Database schema updates automated.
In addition to those two big ones, there were some notable small ones:
- HTML 4.01 validation fixes
- Ensuring UTF-8 on all encoding function calls
- Protection against hitting the back button when writing a post (most annoying on Macs as the back button and the beginning of line keystroke is the same).
I will or course need many more testers and users before I can ever declare this software as stable. If you need a simple blog, give it a try.
About Wordcraft
Wordcraft aims to be a simple, lightweight blogging application. Wordcraft is written exclusively for PHP 5+ and MySQL 5.0+ using only the PHP mysqli extension, UTF-8, and HTML 4.01 to achieve that simpleness.
Wordcraft 0.8 available
I have decided to go back to YUI's Editor. I tried TinyMCE in the last release. But, using it full time I found it messed with my HTML too much for my liking. When I would switch to raw HTML mode and add something like a <code> tag, it would be lost when saving the data back into the WYSIWYG editor.
I also converted the admin HTML to HTML 4.01 Transitional. I never use XHTML anymore these days. So, I was writing invalid XHTML inadvertantly.
I worked on the session handling some more in this release. Users should stay logged in to the admin better now.
I put comment blocks in all the files and documented every function. This should help anyone wanting to dig in and help out.
I fixed several bugs reported by users (or maybe just testers, not sure). Thanks for that and keep the feedback coming.