Cache memory and its effect on list searching

Fri Dec 16 23:44:12 EST 2016

On Friday, December 16, 2016 at 6:27:24 PM UTC-8, Chris Angelico wrote:
> On Sat, Dec 17, 2016 at 1:20 PM,  <sohcahtoa82 at gmail.com> wrote:
> > I thought this was curious behavior.  I created a list of random-looking strings, then made a sorted copy.  I then found that using "in" to see if a string exists in the sorted list took over 9 times as long!
> >
> 
> My numbers replicate yours (though my computer's faster). But my
> conclusion is different:
> 
> Python 3.7.0a0 (default:ecd218c41cd4, Dec 16 2016, 03:08:47)
> [GCC 6.2.0 20161027] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import hashlib
> >>> import timeit
> >>> hashes =  [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
> >>> sortedhashes = sorted(hashes)
> >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
> 0.8167107938788831
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
> 5.029693723190576
> >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
> 0.855183657258749
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
> 5.0585526106879115
> >>> sethashes = set(hashes)
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10)
> 6.13601878285408e-06
> 
> You want containment checks? Use a set :)
> 
> ChrisA

Ah, I forgot about the optimizations of a set.  The only time I ever find myself using set is when I want to remove all the duplicates in a list.  I convert it to a set and then back.

For fun, I ran an experiment:

### BEGIN listsearch.py
import hashlib 
import timeit
import sys

from bisect import bisect_left

def bisect_search(a, x, lo=0, hi=None):   # can't use a to specify default for hi
    # From http://stackoverflow.com/questions/212358/binary-search-bisection-in-python
    hi = hi if hi is not None else len(a) # hi defaults to len(a)
    pos = bisect_left(a,x,lo,hi)          # find insertion position
    return (pos if pos != hi and a[pos] == x else -1) # don't walk off the end

def bin_search(haystack, needle):
    start = 0
    end = len(haystack) - 1
    while True:
        if start > end:
            return False
        middle = (start + end) // 2
        if needle < haystack[middle]:
            end = middle - 1
            continue
        elif needle > haystack[middle]:
            start = middle + 1
            continue
        elif needle == haystack[middle]:
            return True

print('Python version: ', sys.version_info)
print('building hashes...')
hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
print('sorting...')
sortedhashes = sorted(hashes)
print('creating set...')
sethashes = set(hashes)
print('Unsorted list:', timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10))
print('Sorted:', timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10))
print('set:', timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10))
print('binary search:',  timeit.timeit('binsearch(hashes, "x")', globals={'hashes': sortedhashes, 'binsearch': bin_search}, number=10))
print('binary search with bisect:',  timeit.timeit('binsearch(hashes, "x")', globals={'hashes': sortedhashes, 'binsearch': bisect_search}, number=10))
### END listsearch.py

> python3 listsearch.py
Python version:  sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
building hashes...
sorting...
creating set...
Unsorted list: 1.7384763684627569
Sorted: 9.248799958145042
set: 1.4614161294446149e-06
binary search: 0.00010902164328108199
binary search with bisect: 1.7829276782066472e-05

Yup.  set is definitely the way to go!