Playing with MySQL's index merge

So, I mentioned before that I found out about index_merge at the MySQL Conference. I was wondering why I had not heard more about it since it came out in 5.0.3. When I talked with some MySQL people about it, I got mixed opinions. So, I decided to run my own tests on some data and see what I could figure out.

I apologize for Wordpress' bad output. =(

The Data

I created a table with 5 million rows. Early tests, with MySQL's Harrison Fisk (HarrisonF) looking over my shoulder, showed that on small data sets MySQL would optimize out the indexes in favor of table scans. I wanted to avoid that. This is my table schema:


CREATE TABLE `test2` (
`id1` int(10) unsigned NOT NULL default '0',
`id2` int(10) unsigned NOT NULL default '0',
`id3` int(10) unsigned NOT NULL default '0',
`dt` datetime NOT NULL default '0000-00-00 00:00:00',
`somevar` varchar(255) NOT NULL default '',
KEY `id1` (`id1`),
KEY `id2` (`id2`)
) ENGINE=MyISAM


The field id1 was filled with random values between 1 and 5000. I filled id2 with random values between 1 and 100, except that about half the rows have the value 999 in them. This was to emulate the issue we were seeing on the smaller table. We found that if a value was in more than n% of the rows, the optimizer would skip the index. I wanted to test that on larger data sets. id3 was filled with random values between 1 and 1000000. dt was a random date/time between 1999 and 2008. And somevar was a random string of characters.
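
For reference, a load like that can be done in pure SQL. This is just a hypothetical sketch, not the actual script I used: seed the table with one row, then run this insert repeatedly; each run doubles the row count until you pass 5 million.


INSERT INTO test2 (id1, id2, id3, dt, somevar)
SELECT
    FLOOR(1 + RAND() * 5000),
    -- about half the rows get the magic value 999
    IF(RAND() < 0.5, 999, FLOOR(1 + RAND() * 100)),
    FLOOR(1 + RAND() * 1000000),
    -- random datetime between 1999-01-01 and 2008-01-01
    FROM_UNIXTIME(UNIX_TIMESTAMP('1999-01-01 00:00:00') +
        FLOOR(RAND() * (UNIX_TIMESTAMP('2008-01-01 00:00:00') -
                        UNIX_TIMESTAMP('1999-01-01 00:00:00')))),
    -- random string of characters
    SUBSTRING(MD5(RAND()), 1, 16)
FROM test2;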

Intersect Merges


mysql> explain select count(*) from test2 where id2=99 and id1=4795;
+----+-------------+-------+-------------+---------------+---------+---------+------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+---------------+---------+---------+------+------+----------------------------------------------------+
| 1 | SIMPLE | test2 | index_merge | id1,id2 | id1,id2 | 4,4 | NULL | 3 | Using intersect(id1,id2); Using where; Using index |
+----+-------------+-------+-------------+---------------+---------+---------+------+------+----------------------------------------------------+


This is the most basic example. MySQL uses the two indexes, finds where they intersect and merges the data together. This query is quite fast, although a composite key on the two columns together would be faster. If you see this showing up a lot, you probably need to combine the two keys into one. I should also note that in this example, only the keys are needed, no data from the table. This is important.
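
Combining them would look something like this (the key name is just my choice):


ALTER TABLE test2 ADD KEY `id1_id2` (`id1`, `id2`);


With that composite key in place, the same query becomes a plain ref lookup on a single index instead of an index_merge.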


mysql> explain select sql_no_cache somevar from test2 where id2=99 and id1=4795;
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
| 1 | SIMPLE | test2 | ref | id1,id2 | id1 | 4 | const | 930 | Using where |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+


As you can see, as soon as we ask for data that is not in the indexes, our intersect is dropped in favor of using the key that matches the fewest rows and simply scanning those to match the rest of the where clause. This was the case pretty much every time I tried it. I was never able to get an index_merge with intersect when requesting data not available in the keys.
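
If you really need a query like that answered from indexes alone, the workaround is not index_merge at all. A covering composite key that includes the selected column, something like this untested sketch, lets the whole query be satisfied from one index:


ALTER TABLE test2 ADD KEY `id1_id2_somevar` (`id1`, `id2`, `somevar`);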

Union Merges


mysql> explain select sql_no_cache somevar from test2 where id2=99 or id1=4795;
+----+-------------+-------+-------------+---------------+---------+---------+------+-------+-----------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+---------------+---------+---------+------+-------+-----------------------------------+
| 1 | SIMPLE | test2 | index_merge | id1,id2 | id2,id1 | 4,4 | NULL | 27219 | Using union(id2,id1); Using where |
+----+-------------+-------+-------------+---------------+---------+---------+------+-------+-----------------------------------+


mysql> select sql_no_cache somevar from test2 where id2=99 or id1=4795;
26237 rows in set (0.20 sec)

This merge type takes the two keys involved in an OR and then merges the data much like a UNION statement would. As you can see, in this case, it did use the indexes even though we requested `somevar`, which is not in either index.

To show the alternative to this, I selected using id3 instead of id1. id3 has no index.


mysql> explain select sql_no_cache somevar from test2 where id2=99 or id3=266591;
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | test2 | ALL | id2 | NULL | NULL | NULL | 5000000 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+


mysql> select sql_no_cache somevar from test2 where id2=99 or id3=266591;
25252 rows in set (26.01 sec)

As you can see, this does a table scan even though there is a key on id2. It does you no good.
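
The obvious fix is to index id3 as well (again, a sketch; I did not need it for these tests):


ALTER TABLE test2 ADD KEY `id3` (`id3`);


With keys on both columns, the optimizer can go back to the union merge shown in the previous example.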

Sort Union Merge


mysql> explain select sql_no_cache id1, id2 from test2 where id2=99 or id1 between 4999 and 5000;
+----+-------------+-------+-------------+---------------+---------+---------+------+-------+----------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+---------------+---------+---------+------+-------+----------------------------------------+
| 1 | SIMPLE | test2 | index_merge | id1,id2 | id2,id1 | 4,4 | NULL | 44571 | Using sort_union(id2,id1); Using where |
+----+-------------+-------+-------------+---------------+---------+---------+------+-------+----------------------------------------+


mysql> select sql_no_cache somevar from test2 where id2=99 or id1 between 4999 and 5000;
27295 rows in set (0.19 sec)

This behaves much like the union merge. However, because one index is used for a range, MySQL must first sort the results from that index and then merge the two. Again, if I switch this to an AND instead of an OR, index_merge is dropped in favor of scanning the id2 indexed data for matches to the rest of the where clause.

Conclusion

Hmm, after all this, I see why this was not a big announcement. It can only make bad SQL and tables better. Tables and queries that are already optimized using composite indexes will see no benefit from this. At best this will help me with some one-off queries or reports that are only run monthly, where I don't want to pollute the indexes with special cases just for those queries.

2008 MySQL Conference, part 1

It is always surprising what I learn when I go to a conference these days. Years ago, I could go to any talk and just suck it all in. Now, it is the little nuggets. The topics as a whole do more to confirm what I have already developed while running the Phorum project and building the infrastructure for dealnews.com. That confirmation is still nice. You know you are not the only one that thought a particular solution was a good idea.

One of the confirmations I have had is that the big sites like Flickr, Wikipedia, Facebook and others don't use exotic setups when it comes to their hardware and OS. During a keynote panel, they all commented that they did not do any virtualization on their servers. Most did not use SANs. Some ran older MySQL versions, but some were running quite recent ones. I have kept thinking that I did not have the desire to get too fancy with that stuff, and clearly I am not the only one.

One of the little nuggets that will likely change my world is index_merge in MySQL. I feel silly as this has been around since 5.0.3 but I was not aware of it. Basically MySQL will now use more than one key to resolve a where clause and possibly an order by depending on the query. This could lead to me removing several keys from tables in both Phorum and at dealnews.

There were others, but I am tired and trying to get OpenID into the Phorum trunk right now so I will have to think of more later.

Local: Best practices for SQL backed web applications

When
Tuesday, March 11, 2008 at 12:00 PM

Where
BizTech
515 Sparkman Drive
Huntsville, AL 35816

Details
Brian Moon of dealnews.com will be discussing best practices for writing database backed web based applications. Many users teach themselves SQL and programming on the web. Other developers may have experience in enterprise desktop applications. No matter what your background, there are common mistakes made when deploying web based applications that use a database.

Also, at this event, we will be giving away two copies of NuSphere's PhpED. Plus, everyone who attends can purchase any NuSphere product at 50% off.

Lunch will be served at this event.

People really do run PHP on Windows

One of my favorite restaurants these days is Buffalo Wild Wings. They show the UFC fights. It is cheaper to go there than to throw a party at the house. When I went to their web site tonight to get some nutritional information on last night's snacks, I got this:

Fatal error: Maximum execution time of 30 seconds exceeded in C:\Inetpub\wwwroot\index.php on line 3

Dang.  I hate that for them.  I am sure they just pay someone to host their site.  Maybe it will clear up soon.  Someone should clue them in on how to turn display_errors off.

Forums are the red-headed stepchild of a web site

I have seen it time and time again. And yet, every time, it irritates me to no end. You are on a professional web site. You are navigating around and at some point you hit the link for their forums. And just like that, you feel transported to another place. The whole site design just changes. Colors, layout, navigation... everything. Here are some examples, including the new C7Y site from php|Architect, which inspired this post. (I really do love you guys on the podcast, I promise =)

  • php|architect's C7Y - main site - forums

  • Zend's Developer Zone - main site - forums
    Zend's forums do at least use the Zend.com header, but you can't get to the forums from the main Zend.com site. You have to go to the Developer Zone.

  • TextPad (great windows editor) - main site - forums
    The header is kind of the same. Fonts and link colors change slightly though, which is worse in some ways than a wholesale change. It looks like they just wedged their HTML into the phpBB template.


I could continue to list more here, but you get the idea. So, what is the problem? Does most message board software make it too hard to edit its templates? Are forums an afterthought, where some underling is given the task to make them work and not allowed access to the main site's templates?

Some people do better at it. MySQL for example. Theirs is still not perfect. An ad awkwardly appears in the forums in a way that makes it look like an error. However, thanks to Phorum (cha-ching), MySQL was able to make their own log in system work with their forums. Heck, even at dealnews I have not done that. Mostly because our forum logins predate our site accounts for email alerts and newsletters. I am not asking for perfection though. I would just like to feel like the company/entity gave some love to making their forums part of their site and not an afterthought.

So, I call for all web sites to start treating their forums like real pages. Give them the same love and attention you give that front page or any other page. And, if your message board software makes that hard, give Phorum a try.

Speaking at MySQL Conference 2008

I had mentioned a while back that I submitted three proposals for the 2008 MySQL Conference.  Well, two were accepted.

From one server to a cluster

In the last 10 years, dealnews.com has grown from a single shared hosting account to an entire rack of equipment. Luckily, we started using PHP and MySQL very early in the company's history.

From the early days of growing a forum to surviving Slashdotting, Digging and even a Yahoo! front page mention, we have had to adapt both our hardware and software many times to keep up with the growth.

I will discuss the traps, bottlenecks, and even some big wins we have encountered along the way using PHP and MySQL. From the small scale to using replication and even some MySQL Cluster.  We have done many interesting things to give our readers (and our content team) a good experience when using our web site.

MySQL hacks and tricks to make Phorum fast

Phorum is the message board software used by MySQL. One reason they chose Phorum was because of its speed. We have to use some tricks and fancy SQL to make this happen. Things we will talk about in this session include:

  • Putting temporary tables to good use.

  • Why PHP and MySQL can be a bad mix with large data sets.

  • What mysqlnd will bring to the table for the future of PHP and MySQL.

  • How Phorum uses full text indexing and some fancy SQL to make our search engine fast.

  • Forcing MySQL to use indexes to ensure proper query performance.


You can find my conference page here.  (as Terry would say, me, me, me!)

Apache Worker and PHP

The PHP manual basically tells you not to use Apache 2 with a threaded MPM and PHP as an Apache module. In general, it may be good advice. But, at dealnews.com, we have found it very valuable.

Apache threaded MPMs

Well, first, what is an MPM? It stands for Multi-Processing Module. It is the process model that Apache uses for its child processes. Each request that comes in is handed to a child. Apache 1 used only one model for this, the prefork model. That uses one process per Apache child, and each child handles one request at a time. The most commonly used threaded MPM is the worker MPM. In this MPM, you have several processes that each run multiple threads. This is the one I will be talking about. You can read more on Apache MPMs at the Apache web site.
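
If you are not sure which MPM your Apache build uses, the binary will tell you (the binary may be named httpd, apache2 or similar depending on your platform):


httpd -V | grep MPM


A worker build prints a line like "Server MPM: Worker".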

Huge memory savings

With the Apache prefork MPM, or even FastCGI, each apache/php process allocates its own memory. Most healthy sites I have worked on use about 15MB of memory per apache process. Code that has problems will use even more than this. I have seen some use as much as 50MB of RAM. But, let's stick with healthy. So, a server with 1GB of RAM will only realistically be able to run 50 Apache processes (or 50 PHP children for FastCGI) if each uses 15MB of RAM. That is 750MB total, which leaves only about 270MB for the OS and other applications. Now, if you are Yahoo! or someone else with lots of money and lots of equipment, you can just keep adding hardware. But, most of us can't do that.

As I wrote above, the worker MPM uses children (processes) and threads. If you configure it to use 10 child processes, each with 10 threads, you would have 100 total threads, or clients, to answer requests. The good news is, because 10 threads are in one process, they can reuse memory that is allocated by other threads in the same process. At dealnews, our application servers use 25 threads per child. In our experience, each child process uses about 35MB of RAM. So, that works out to about 1.4MB per thread. That is 10% of the per-client usage of a prefork server.
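
For reference, a worker configuration along those lines would look something like this. The numbers are only an illustration of the 25-threads-per-child layout, not our exact production settings:


<IfModule mpm_worker_module>
    # 4 children x 25 threads = 100 clients
    ServerLimit          4
    ThreadsPerChild     25
    MaxClients         100
</IfModule>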

Some say that you will run out of CPU way before RAM. That was not what we experienced before switching to worker. Machines with 2GB of RAM were running out of memory before we hit CPU as a bottleneck due to having just 100 Apache clients running. Now, with worker, I am happy to say that we don't have that problem.

Building PHP for best success with Worker

This is an important part. You can't use just any extension in PHP when you are using worker; anything you load needs to be thread safe. I don't have a list of extensions that will and won't work. We stick with the ones we need to do our core job. Mainly, most pages use the mysql and memcached extensions. I would not do any fancy stuff in a worker based server. Keep a prefork server around for that. Or better yet, do funky memory sucking stuff in a cron job and push that data somewhere your web servers can get to it.

Other benefits like static content

Another big issue you hear about with Apache and PHP is running some other server for serving static content to save resources. Worker allows you to do this without running two servers. Having a prefork Apache/PHP process with 15MB of RAM allocated serve a 10k jpeg image or a CSS file is a waste of resources. With worker, like I wrote above, the memory savings negate this issue. And, from my benchmarks (someone prove me wrong), Apache 2 can keep up with the lighttpds and litespeeds of the world in terms of requests per second for this type of content. This was actually the first place we used the worker MPM. It may still be a good idea to have dedicated apache daemons running just for that content if you have lots of requests for it. That will keep your static content requests from overrunning your dynamic content requests.

Some issues we have seen

Ok, it is not without problems (but, neither was prefork). There are some unknown (meaning undiagnosed by us) things that will occasionally cause CPU spikes on the servers running worker. For example, we took two memcached nodes offline and the servers that were connected to them spiked their CPU. We restarted Apache and all was fine. It was odd. We had another issue where a bug in my PHP code that was calling fsockopen() without a valid host name and with a long timeout would cause a CPU spike that would not seem to let go. So, it does seem that bad PHP code makes the server more sensitive. Your mileage may vary.

As with any new technology, you need to test a lot before you jump in with both feet. Anyone else have experience with worker and want to share?

One last tip

We have adopted a technique that Rasmus Lerdorf had mentioned. We decide how many MaxClients a server can run and we configure that number to always run. We set the min and max settings of the Apache configuration to the same value. Of course, we are running service specific servers. If you only have one or two servers and they run Apache and MySQL and mail and DNS and... etc., you probably don't want to do that. But, then again, you need to make sure MaxClients will not kill your RAM/CPU as well. I see lots of servers that, if MaxClients was actually reached, would be using 20GB of RAM. And, these servers only have 2GB of RAM. So, check those settings. If you can, configure Apache to start up more processes (all of them if you can) rather than a few, and make sure you won't blow out your RAM.
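
In worker terms, that tip looks roughly like this, again with illustrative numbers for the 100-client box from above:


<IfModule mpm_worker_module>
    ServerLimit          4
    ThreadsPerChild     25
    MaxClients         100
    # start all children up front and never reap the spares
    StartServers         4
    MinSpareThreads    100
    MaxSpareThreads    100
</IfModule>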

Managing two data centers

Call it paranoia.  Call it being prepared.  Whatever your stance, we are considering using more than one data center for dealnews.com.  It is not a capacity issue.  We can keep growing our current data center without a problem.  But, stories of power outages, and the power outages we have experienced, have us wanting to explore the idea.

Here is the problem.  No one in our company has experience with this.  And, there do not seem to be any resources on the internet talking about it.  Our problems are not so much with managing the data between the two.  The problem is failover and how to deal with one data center being out.  Here are some of the ideas that have been thrown onto the wall.

Round Robin DNS

This was the first idea.  It seems simple enough.  We have two data centers.  We publish different DNS for each data center and traffic goes to each one.  The problem here is that it is, well, random.
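
For anyone who has not seen it, round robin DNS is nothing more than multiple A records for the same name, something like this hypothetical zone fragment:


; one record per data center
www   IN   A   192.0.2.10
www   IN   A   198.51.100.10


Resolvers rotate through the records, but nothing automatically removes a dead data center's record, which is exactly the failover problem.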

Global Traffic Management

There are devices that "balance" traffic across multiple locations.  But, I am unsure how those deal with outages at one of the locations.  It seems like there is still one point of failure.

BGP Routing

This is the biggest mystery to me.  I know what it is.  I know what it means.  I have no idea how to deploy this type of solution.  I understand that you can "move" your IP addresses with routing changes.  But, that means running routers.  Where are these routers?  Does this happen at some provider?  Is there a provider that handles this?  Does that mean that all of our data centers are with one provider?  I think one more peace-of-mind feature of this is that we would not be tied to just one vendor.  So, if one vendor had major issues or there was some legal trouble (we lived through the dot com boom and bust), we would have security in knowing we had other equipment that was not affected.

Is there something else?  Are we being way paranoid?  Maybe it is not cost effective in the end.  I/we have no idea really.  Anyone out there that has knowledge on this subject?