in_array is quite slow

Thu, Jun 5, 2008 02:38 AM
So, we had a cron job hanging for hours.  No idea why.  So, I started debugging.  It all came down to a call to in_array().  See, this job is importing data from a huge XML file into MySQL.  After it is done, we want to compare the data we just added/updated to the data in the table so we can deactivate any data we did not update.  We were using a mod_time field in mysql in the past.  But, that proved to be an issue when we wanted to start skipping rows from the XML that were present but unchanged.  Doing that saved a lot of MySQL writes and sped up the process.

So, anyhow, we have this huge array of ids accumulated during the import.  So, an in clause with 2 million parts would suck.  So, we suck back all the ids in the database that exist and stick that into an array.  We then compared the two arrays by looping one array and using in_array() to check if the value was in the second array.  Here is a pseudo example that shows the idea:

[sourcecode language='php']

foreach($arr1 as $key=>$i){

if(in_array($i, $arr2)){

unset($arr1[$key]);

}
}

[/sourcecode]

So, that was running for hours with about 400k items.  Our data did not contain the value as the key, but it could as the value was unique.  So, I added it.  So, now, the code looks like:

[sourcecode language='php']

foreach($arr1 as $key=>$i){

if(isset($arr2[$i])){

unset($arr1[$key]);

}
}

[/sourcecode]

Yeah, that runs in .8 seconds.  Much better.

So, why were we using in_array to start with if in_array is clearly not the right solution to this problem?  Well, it was basic code evolution.  Originally, these imports would be maybe 100 items.  But, things changed.

FWIW,  I tried array_diff() as well.  It took 25 seconds.  Way better than looping and calling in_array, but still not as quick as a simple isset check.  There was refactoring needed to put the values into the keys of the array.

UPDATE: I updated this post to properly reflect that there is nothing wrong with in_array, but simply that it was not the right solution to this problem.  I wrote this late and did not properly express this.  Thanks to all those people in the comments that helped explain this.
29 comments
Gravatar for Johny

Johny Says:

Well why do you wan´t to compare the data in PHP Anyway,
you said you would like to disable any data you did not update.
Why do it in PHP if you have a Database.
Give the updated entries an update flag or timestamp.
then
UPDATE your_table SET active_flag='0' WHERE update_timestamp < 'last_update_round'
i think it´s the smarter solution.
Best way to generate an Update Timestamp by starting of the cronjob so there is one time for all updates of one run.

Gravatar for Brian Moon

Brian Moon Says:

Read the post again Johny. I covered that. There are say 500k items in the xml file I am importing. Of those, only say 1000 have changes. But, there may be deletions. So, to avoid expensive updates on 499k rows that have not changed, I simply keep track of the unique ids which can live just fine in the index of the array.

Gravatar for ken

ken Says:

@Przemek Sobstel

Yes, isset() can't do a strict comparison, but it will tell you if the item exists, and if it does, you THEN do a strict comparison. However, we are missing the point here- the hashed array was only implemented to utilize the key; the value could be anything (or nothing). I think you might have isset() and array_key_exists() confused, because array_key_exists() will return false if the corresponding value for a key is null; isset() is the construct that verifies a value exists as well.

And come to think of it, I believe this would be one of the few times that I'd implement array_key_exists(), since it should be faster than isset() (because you don't need the value).

Gravatar for dv

dv Says:

uh, from 4 years ago. where have you been?

http://ilia.ws/archives/12-PHP-Optimization-Tricks.html

5) Another common operation in PHP scripts is array searching. This process can be quite slow as regular search mechanism such as in_array() or manuall implementation work by itterating through the entire array. This can be quite a performance hit if you are searching through a large array or need to perform the searches frequently. So what can you do? Well, you can do a trick that relies upon the way that Zend Engine stores array data. Internally arrays are stored inside hash tables when they array element (key) is the key of the hashtables used to find the data and result is the value associated with that key. Since hashtable lookups are quite fast, you can simplify array searching by making the data you intend to search through the key of the array, then searching for the data is as simple as $value = isset($foo[$bar])) ? $foo[$bar] : NULL;. This searching mechanism is way faster then manual array iteration, even though having string keys maybe more memory intensive then using simple numeric keys.

Ex.

$keys = array("apples", "oranges", "mangoes", "tomatoes", "pickles");
if (in_array('mangoes', $keys)) { ... }

vs

$keys = array("apples" => 1, "oranges" => 1, "mangoes" => 1, "tomatoes" => 1, "pickles" => 1);
if (isset($keys['mangoes'])) { ... }

The bottom search mechanism is roughly 3 times faster.

Gravatar for gubble

gubble Says:

Yes, Geo (in previous comment) is right... It is funny and I could say obvious... :-)

Gravatar for Rene L.

Rene L. Says:

Sorted arrays would allow a fast binary search.

Gravatar for Krister Karlström

Krister Karlström Says:

@Geo: He said that they switched to store the value as a key instead... Or more correctly, they store the same number as both key and value...

Gravatar for Brian Moon

Brian Moon Says:

I apologize. It was late and I was blogging half asleep. I should have went into why in_array is slow. I do understand why it is slow. The code was originally designed for maybe 100 items. But, like a lot of things, needs change as time goes by.

I really meant this post as a "don't make the mistake I made post". Not as a dig on in_array.

Gravatar for WAZ (weblog andre zorzo) » php - is_array é

WAZ (weblog andre zorzo) » php - is_array é Says:

[...] o que sugere o Brian Moon em seu blog. Ele relata os problemas de performance que teve com um script que rodava em background [...]

Gravatar for gubble

gubble Says:

@Rene L.

So this becomes very specific case and not cannot be treated as a general comparison between in_array and isset...

Gravatar for Brian Moon

Brian Moon Says:

I believe what it does show is that you should be careful how you use in_array. in_array should clearly not be used on large arrays. You should find another way to solve your problem if you have to work on a large array. There may not be another way. You may just have to wait. I know I will hesitate from now on every time I use in_array and think "could I do this another way that would not be subjected to the performance of in_array.

Gravatar for clintonskakun

clintonskakun Says:

Is_array is almost the only way. Using empty or is_null is better depending on what your using it for.

Gravatar for Glagla Dot Org - Olivier Mansour » Blogmarks

Glagla Dot Org - Olivier Mansour » Blogmarks Says:

[...] in_array is quite slow [...]

Gravatar for g

g Says:

yeah I agree with first post, was quiet obvious.

Gravatar for Carsten Pedersen

Carsten Pedersen Says:

Good with that kind of information sharing. I had similar experiences with array_merge recently: http://www.bitbybit.dk/carsten/blog/?p=203

Gravatar for Sara Golemon

Sara Golemon Says:

Um... sorry, but... duh....

It's basic algorithms 101... A hash table is faster than a linear search. (Not to mention that is_array() is a function and has the dreaded FCO, while isset is a language construct and does not).

A linear search is going to be O(n) complexity, the more elements in the array, the longer it'll take to find what you want.

A hashtable is going to be O(k), where k is a function of the hash table's density and its distribution function. Note that the density of PHP hash tables is typically less than 8, and the distribution function is decently fast and collision avoidant.

What's important to note with those numbers is that for any hash table larger than about 16 items, isset() should already be out-pacing in_array(), so if all else is equal and you can choose between building your arrays as: for(;;) { $data[] = $value; } versus for(;;) { $data[$value] = true; }, you're probably going to be better off with the latter.*

* - All generalizations are just that. Please remember that premature optimization is the root of all evil, offer void on planet Earth and all days ending in the letter 'y'. See terms and conditions for details. Limit one per household, smokers need not apply.

Gravatar for Brian Moon

Brian Moon Says:

Sarah, while I agree with "duh" for someone that understands internals, (you, me, etc) the common PHP developer very well may not. The full discussion in the comments has been good to fully explain to the laymen why in_array is not the right solution for this problem. But, at first glance, someone that does not understands internals may believe that in_array is the best solution to the problem I am describing above. As I said, the code was originally used for much smaller arrays. Also it was using arrays that already existed without proper keys. Some refactoring of other code was needed to make it work with keys. So, it was kind of an evolution that happens every day to people in the real world.

Gravatar for loufoque

loufoque Says:

Lookup in a hash table is O(1). Your k has nothing to do here, it's asymptotic behaviour. O(c) = O(1) for any constant c.

Gravatar for Misunderstanding How in_array Works | Mats Lindh

Misunderstanding How in_array Works | Mats Lindh Says:

[...] Moon has a post about how in_array() really, really sucks. This is not a problem with in_array per se, men failing to recognize the proper way to solve a [...]

Gravatar for Geo

Geo Says:

Am I the only one to actually look at the code???

Am I te only one seeing that those two pieces of code do a completely different job???

The first one checks to see if an array contain a VALUE $i at any key. And it does for 2 mln or so values - so it's 2 mln times. This includes each time comparing all the values in the array to the value of $i.

The second one just checks to see if there is a KEY $i in the array. It checks against the KEY not the VALUE!!! And of course it will be lightning fast as isset() is not a function but is an internall language construct and it is only looking at php's internal var tables/arrays. Besides it is only executed 2 mln times = 2 mln operations, compared to in_array() comparing 2 mln values 2 mln times = 2mln*2mln operations...

Gravatar for wenbert

wenbert Says:

interesting posts. i always thought that in_array() is one of the blessings for using php. i have used it a lot but i never used on 999999 elements.

how did you manage to go around this problem?

Gravatar for Jeff

Jeff Says:

I wouldn't say that in_array sucks at all! You need to take a little look at your algorithm and take responsibility for it, don't rely on functions to do everything for you.

Here is the situation, if you have 400,000 rows in an array as you stated, your algorithm is going to be doing 400,000 iterations in the first foreach loop. Then you make a function call (which in itself is more expensive then your replacement to the [] operator) to in_array. in_array in turn then has to loop through possibly all of the haystack array... so if you had an array of say... 400,000 rows again then in array might have to do 400,000 iterations. This then means that your overall program is going to be doing 400,000*400,000 iterations = 160,000,000,000 iterations. (or count($arr1)*count($arr2)). In essence both your arrays are O(n) so combined you have O(n^2)

Now... the super amazingly fast algorithm... You are still doing the 400,000 iterations for the outer loop, however... you now replaced the 400,000 or so iterations for each outer loop iteration with just one simple lookup... In fact you even removed the function call which probably saved a tiny bit. In essence your use of the [] operator acts as a hash table lookup in this case... which is O(1)... therefore your outer loop is O(n) and the inner is O(1) you get... O(n)... and O(n) is a lot less then O(1)!

Though... none of this really matters since it was probably a poor design choice from the beginning to be creating arrays of that size... it would have been better to come up with a way to let the database handle it.

Gravatar for Dominic

Dominic Says:

According to this http://www.phpbench.com/ tells you about this problem.

Gravatar for Krister Karlström

Krister Karlström Says:

Well, it's quite obvious that in_array() is slow if you have lots of data. Basically it has to loop through all the entries in the array and compare it to the given value. Using an associative array instead where you just check if the key is set or not is a totally different operation, just an index lookup is needed. Because arrays are indexed, it is always fast to get a value from an array if you know the key. If you don't know the key, well - then you would have to go through all the elements...

So we could say, as an conclusion, that you had a bad design from the beginning! :) Keys should always be used if the can be used.

Gravatar for Marcin Kłeczek

Marcin Kłeczek Says:

Who wrote that code? in_array for 400k elements? It must take a time...

Gravatar for Przemek Sobstel

Przemek Sobstel Says:

What isset() does not offer is strict comparision. Also, don't forget that isset() will return false when array element exists and is null. Not mentioning situation when you don't have a key (just like Krister said above). Conclusion is isset() is not 100% substitution for in_array(), so comparing them (in most cases) does not make sense.

Gravatar for Neil Garb

Neil Garb Says:

It's interesting to note that array_key_exists() seems faster than in_array() if the array is flipped. array_flip() adds extra overhead, but if the values are flipped to begin with you might get a slight performance boost.

Gravatar for John Jones

John Jones Says:

Well duh... your original code was moronic.

Comments are disabled for this post.