finding most common elements between thousands of multiple arrays.

Scott David Daniels Scott.Daniels at Acm.Org
Sun Jul 5 20:30:58 EDT 2009


Scott David Daniels wrote:
> ... Here's a heuristic replacement for my previous frequency code:
> I've tried to mark where you could fudge numbers if the run time
> is at all close.

Boy, I cannot let go.  I did a bit of a test checking the cost of
counting the discovered matches, and found that after:
     import timeit
     import numpy
     original = numpy.random.normal(0, 100, (1000, 1000)).astype(int)
     data = original.flatten()
     data.sort()
     part = data[::100]
     t = timeit.Timer('sum(part[:-1]==part[1:])',
                      'from __main__ import part')
     v = timeit.Timer('len(part[part[:-1]==part[1:]])',
                      'from __main__ import part')

I got:
 >>> t.repeat(3, 10)
[0.58319842326318394, 0.57617574300638807, 0.57831819407238072]
 >>> v.repeat(3, 1000)
[0.93933027801040225, 0.93704535073584339, 0.94096260837613954]
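
(The repeat counts differ -- 10 calls per timing for t, 1000 for v --
so normalize to per-call time before comparing:)

     per_call_sum = min(t.repeat(3, 10)) / 10        # ~0.058 s
     per_call_len = min(v.repeat(3, 1000)) / 1000    # ~0.00094 s
     print(per_call_sum / per_call_len)              # roughly 60 here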

So, len(part[mask]) is roughly 60X faster per call!  I checked that both
give the same answer:
 >>> sum(part[:-1]==part[1:])
9393
 >>> len(part[part[:-1]==part[1:]])
9393

That's an awful lot of matches, so I tried again with higher selectivity:
     data = original.flatten()  # no sorting, so runs missing
     part = data[::100]

 >>> t.repeat(3, 10)
[0.58641335700485797, 0.58458854407490435, 0.58872594142576418]
 >>> v.repeat(3, 1000)
[0.27352554584422251, 0.27375686015921019, 0.27433291102624935]

about 200X faster per call

 >>> len(part[part[:-1]==part[1:]])
39
 >>> sum(part[:-1]==part[1:])
39

So my new version of this (compressed) code:
>     ...
>     sampled = data[::stride]
>     matches = sampled[:-1] == sampled[1:]
>     candidates = sum(matches) # count identified matches
>     while candidates > N * 10: # 10 -- heuristic
>         stride *= 2 # heuristic increase
>         sampled = data[::stride]
>         matches = sampled[:-1] == sampled[1:]
>         candidates = sum(matches)
>     while candidates < N * 3: # heuristic slop for long runs
>         stride //= 2 # heuristic decrease
>         sampled = data[::stride]
>         matches = sampled[:-1] == sampled[1:]
>         candidates = sum(matches)
>     former = None
>     past = 0
>     for value in sampled[matches]:
>         ...
is:
       ...
       sampled = data[::stride]
       candidates = sampled[sampled[:-1] == sampled[1:]]
       while len(candidates) > N * 10: # 10 -- heuristic
           stride *= 2 # heuristic increase
           sampled = data[::stride]
           candidates = sampled[sampled[:-1] == sampled[1:]]
       while len(candidates) < N * 3: # heuristic slop for long runs
           stride //= 2 # heuristic decrease
           sampled = data[::stride]
           candidates = sampled[sampled[:-1] == sampled[1:]]
       former = None
       past = 0
       for value in candidates:
           ...
This change is important because we try several strides before
settling on a choice, so the speedup is paid back on every attempt.
It also means we could afford to be pickier when choosing strides
(try more values), since each check is now far cheaper.
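
For reference, here is a self-contained sketch of the whole sampling
approach.  The function name, the stride > 1 guard, and the exact
counting of each candidate at the end are illustrative fill-ins, not
the code elided above; it also indexes the sliced view because newer
numpy rejects boolean masks whose length doesn't match the array:

     import numpy

     def most_common_candidates(data, N, stride=100):
         data = numpy.sort(data, axis=None)      # flatten and sort

         def sample(stride):
             sampled = data[::stride]
             matches = sampled[:-1] == sampled[1:]
             return sampled[1:][matches]         # values with an equal neighbor

         candidates = sample(stride)
         while len(candidates) > N * 10:         # 10 -- heuristic
             stride *= 2                         # heuristic increase
             candidates = sample(stride)
         while len(candidates) < N * 3 and stride > 1:  # slop for long runs
             stride //= 2                        # heuristic decrease
             candidates = sample(stride)

         # Count each distinct candidate exactly in the full sorted data
         # and keep the N most frequent.
         values = numpy.unique(candidates)
         counts = (numpy.searchsorted(data, values, side='right')
                   - numpy.searchsorted(data, values, side='left'))
         order = counts.argsort()[::-1][:N]
         return values[order], counts[order]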

Summary: when dealing with numpy (or any bulk <-> individual value
transitions), try several ways that you think are equivalent and
_measure_.  In the OODB work I did we called this "impedance mismatch,"
and it is likely some boundary transitions are _much_ faster than
others.  The sum case is one of them; the builtin sum pulls the values
back one at a time as numpy booleans rather than plain Python bools,
so the conversions aren't taking a fast path.
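
(Not measured above, but worth trying: letting numpy do the count
itself keeps everything on the numpy side of that boundary:)

     matches = part[:-1] == part[1:]
     count = matches.sum()      # numpy adds the booleans in C, no Python loop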

--Scott David Daniels
Scott.Daniels at Acm.Org


