Cache memory and its effect on list searching

Fri Dec 16 21:20:00 EST 2016

Alternatively...why you should definitely use binary searches:

Python 3.5.2+ (default, Aug 30 2016, 19:08:42) 
[GCC 6.2.0 20160822] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> import timeit
>>> hashes =  [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
>>> sortedhashes = sorted(hashes)
>>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
1.9478233020054176
>>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
18.001392804995703

I thought this was curious behavior.  I created a list of random-looking strings, then made a sorted copy.  I then found that using "in" to see if a string exists in the sorted list took over 9 times as long!

At first, I thought since both lists are the same size, and the 'in' test is a linear search, shouldn't they take the same amount of time?  Even if there was some trickery with branch prediction happening, that would have benefited the sorted list.  Then I remembered how lists work in Python.  The original list is going to be mostly contiguous in memory, making the memory cache quite effective.  When I create the sorted copy, I'm creating a list of references to strings that are all over the place in memory, causing tons of cache misses.

Of course, the best solution was to implement a binary search.  That turned the membership check into a 300-700 microsecond operation.