finding most common elements between thousands of multiple arrays.

Tue Jul 7 17:48:37 EDT 2009

"mclovin" <hanooter at gmail.com> wrote in message 
news:c5332c9b-2348-4194-bfa0-d70c7710765d at x3g2000yqa.googlegroups.com...
> Currently I need to find the most common elements in thousands of
> arrays within one large array (arround 2 million instances with ~70k
> unique elements)
>
> so I set up a dictionary to handle the counting so when I am
> iterating  I up the count on the corrosponding dictionary element. I
> then iterate through the dictionary and find the 25 most common
> elements.
>
> the elements are initially held in a array within an array. so I am am
> just trying to find the most common elements between all the arrays
> contained in one large array.
> my current code looks something like this:
> d = {}
> for arr in my_array:
> -----for i in arr:
> #elements are numpy integers and thus are not accepted as dictionary
> keys
> -----------d[int(i)]=d.get(int(i),0)+1
>
> then I filter things down. but with my algorithm that only takes about
> 1 sec so I dont need to show it here since that isnt the problem.
>
>
> But there has to be something better. I have to do this many many
> times and it seems silly to iterate through 2 million things just to
> get 25. The element IDs are integers and are currently being held in
> numpy arrays in a larger array. this ID is what makes up the key to
> the dictionary.
>
> It currently takes about 5 seconds to accomplish this with my current
> algorithm.
>
> So does anyone know the best solution or algorithm? I think the trick
> lies in matrix intersections but I do not know.

Would the following work for you, or am I missing something?  For a 5Kx5K 
array, this takes about a tenth of a second on my machine.  This code 
doesn't deal with the sub-array issue.

#####################
import numpy
import time

LOWER = 0
UPPER = 1024
SIZE = 5000
NUM_BEST = 4

# sample data
data = numpy.random.randint(LOWER, UPPER, (SIZE,SIZE)).astype(int)

time.clock()
count = numpy.bincount(data.flat)
best = sorted(zip(count, range(len(count))))[-NUM_BEST:]
print 'time=', time.clock()
print best