finding most common elements between thousands of multiple arrays.

Sat Jul 4 05:55:44 EDT 2009

2009/7/4 Andre Engels <andreengels at gmail.com>:
> On Sat, Jul 4, 2009 at 9:33 AM, mclovin<hanooter at gmail.com> wrote:
>> Currently I need to find the most common elements in thousands of
>> arrays within one large array (arround 2 million instances with ~70k
>> unique elements)
>>
>> so I set up a dictionary to handle the counting so when I am
>> iterating  I up the count on the corrosponding dictionary element. I
>> then iterate through the dictionary and find the 25 most common
>> elements.
>>
>> the elements are initially held in a array within an array. so I am am
>> just trying to find the most common elements between all the arrays
>> contained in one large array.
>> my current code looks something like this:
>> d = {}
>> for arr in my_array:
>> -----for i in arr:
>> #elements are numpy integers and thus are not accepted as dictionary
>> keys
>> -----------d[int(i)]=d.get(int(i),0)+1
>>
>> then I filter things down. but with my algorithm that only takes about
>> 1 sec so I dont need to show it here since that isnt the problem.
>>
>>
>> But there has to be something better. I have to do this many many
>> times and it seems silly to iterate through 2 million things just to
>> get 25. The element IDs are integers and are currently being held in
>> numpy arrays in a larger array. this ID is what makes up the key to
>> the dictionary.
>>
>>  It currently takes about 5 seconds to accomplish this with my current
>> algorithm.
>>
>> So does anyone know the best solution or algorithm? I think the trick
>> lies in matrix intersections but I do not know.
>
> There's no better algorithm for the general case. No method of
> checking the matrices using less than 2000000-x look-ups will ensure
> you that there's not a new value with x occurences lurking somewhere.

Try flattening the arrays into a single large array & sorting it. Then
you can just iterate over the large array counting as you go; you only
ever have to insert into the dict once for each value and there's no
lookups in the dict. I don't know numpy, so there's probably a more
efficient way to write this, but this should show what I'm talking
about:

big_arr = sorted(reduce(list.__add__, my_array, []))
counts = {}
last_val = big_arr[0]
count = 0
for val in big_arr:
    if val == last_val:
        count += 1
    else:
        counts[last_val] = count
        count = 0
        last_val = val
counts[last_val] = count    # to get the count for the last value.

If flattening the arrays isn't practical, you may still get some
improvements by sorting them individually and applying the same
principle to each of them:

counts = {}
for arr in my_array:
    sorted_arr = sorted(arr)
    last_val = sorted_arr[0]
    count = 0
    for val in sorted_arr:
        if val == last_val:
            count += 1
        else:
            counts[last_val] = counts.get(last_val, 0) + count
            count = 0
            last_val = val
    counts[last_val] = counts.get(last_val, 0) + count

Hope that helps...

Vil.