So, I mentioned in my MySQL Cluster post that I found out that cluster and joins don't get a long too well. There are a couple of tips I have for using joins or replacing them with other techniques that I thought I would share.

Here are some things I keep in mind when using the NDB engine. These may apply to the other engines as well.

  1. Test using the NDB engine. Testing against another engine is not adequate.

  2. Avoid JOIN if you can, test often if you can't

  3. Avoid subqueries always

  4. Select using primary keys when at all possible. Always select using a key.

  5. Be sure to include both your where and order by columns in your keys.


Try your join from several directions.

Sometimes, the way you join can affect the engine. This is more apparent with NDB than MyISAM or InnoDB. Have a look at these two queries:

select pub_articles.article_id
from pub_articles
inner join article_edition_lookup on
pub_articles.article_id=article_edition_lookup.article_id and
pub_articles.publication_id=article_edition_lookup.publication_id and
publication_date='2007-06-07'
where pub_articles.publication_id=2;


select article_id
from article_edition_lookup
inner join pub_articles using (article_id, publication_id)
where publication_id=2 and publication_date='2007-06-07';


They look similar. Same tables. Same basic criteria. In fact the explain output looks nearly identical. But, the first takes 20 seconds on cluster and the second takes .02 seconds. Now, some MySQL internals person may be able to look at this and know why, but to me, I just have to test them to see. (If anyone cares, the primary key on pub_articles is publication_id, article_id and the PK on article_edition_lookup is publication_id, publication_date, article_id).

Try using temporary tables.

If you can't get the join to be fast no matter what, try temporary tables. I have had some success using the HEAP engine to store data and then joining the two tables together. I don't have an example of that from any code I have currently written. I have mostly just used it at the mysql prompt to get data. Here is an example that uses the same tables from above and retrieves the same data using a temporary table.

create temporary table if not exists foo engine=heap
select article_id from article_edition_lookup where publication_id=2 and publication_date='2007-06-08';


select article_id
from pub_articles use key (primary)
inner join foo using (article_id)
where publication_id=2 order by first_publish_time desc;


Together that all happens in about .06 seconds. You need to realize that this does create a table, in memory, on the API node. So, this data will only be good for this one connection. If your temporary table has lots of rows, you may need to add a key to it when you create it. I am not sure I would use this method for a query that is going to be run a lot. Chances are there is a better solution if you have to resort to this technique.

Subqueries

I have not found a way to make subqueries fast. Even simple queries like the following are slow.

select article_id from pub_articles where publication_id=2 and article_id in (select article_id from article_edition_lookup where publication_id=2 and publication_date='2007-06-07' );

That is a primary key lookup on both of those tables. Yet the query takes 30 seconds to run. Just stay away from them altogether. For what it's worth, the same query takes 2 seconds on the InnoDB engine. I don't think subqueries are optimized in 5.0. This may be a known issue. I got used to not having them for so long that I never think to use them now.

Unions

I have not used UNION much with cluster. When I have it has worked well. I would use caution when using it. Be sure to test the queries before putting them in a production environment.