finding most common elements between thousands of multiple arrays.

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Jul 4 21:39:33 EDT 2009


On Sat, 04 Jul 2009 07:19:48 -0700, Scott David Daniels wrote:

> Actually the next step is to maintain a min-heap as you run down the
> sorted array.  Something like:

Not bad.

I did some tests on it, using the following sample data:

arr = np.array([xrange(i, i+7000) for i in xrange(143)] + 
      [[750]*7000] + [xrange(3*i, 3*i+7000) for i in xrange(142)])


and compared your code against the following simple function:


def count(arr, N):
    D = {}
    for v in arr:
        for x in v:
            D[x] = D.get(x, 0) + 1
    freq = []
    for el, f in D.iteritems():
        freq.append((f, el))
    return sorted(freq, reverse=True)[:N]


As a rough figure, your min-heap code is approximately twice as fast as 
mine.

To the OP: I think you should start profiling your code and find out 
exactly *where* it is slow and concentrate on that. I think that trying a 
heuristic to estimate the most frequent elements by taking a random 
sample is likely to be a mistake -- whatever you're trying to accomplish 
with the frequency counts, the use of such a heuristic will mean that 
you're only approximately accomplishing it.



-- 
Steven



More information about the Python-list mailing list