PHP Appalachia Corrections

Just got home finally from PHP Appalachia. I enjoyed meeting all the great people.

I presented about what I learned and how we deal with importing large amounts of CSV data into MySQL. I threw my idea onto the wiki at the last minute, made the slides while everyone ate breakfast and I had planned on researching it all (been a few years since I wrote it), but we had no reliable internet. Some claims I made and their corrections.

I said our largest file is about 1.8 million lines. WRONG. Actually it is about 4.6 million. I was correct however that it does finish importing and indexing in about 5 minutes.
I claimed I LOAD DATA INFILE to MyISAM first and then "insert into ... select from" into an InnoDB table for speed reasons. WRONG. In fact, I do that because I need to merge fields from the file sometimes into one field in the databaes. I could not find a way to do that with LOAD DATA INFILE. As to speed. I can't say either way as I have no solid data. Sounds like a good test. MyISAM probably still wins on a LOAD DATA INFILE into a blank, fresh table based on my experience.
Total rows currently indexed is 7.2 million. I did not make a claim, but I thought I would just mention that. I wanted to include that, but did not have Internet. (Damn you Hughes)

1 comments

David Weingart Says:

Regarding #2, I've recently been working on a project that loads CSV data from various sources into MySQL, and I discovered that starting with version 5 you can use user variables to customize how columns are imported. You are able to discard columns or merge columns on the fly.

Check the manual for LOAD DATA INFILE for more details.

Comments are disabled for this post.