finding most common elements between thousands of arrays.

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Jul 4 09:42:06 EDT 2009


On Sat, 04 Jul 2009 10:55:44 +0100, Vilya Harvey wrote:

> 2009/7/4 Andre Engels <andreengels at gmail.com>:
>> On Sat, Jul 4, 2009 at 9:33 AM, mclovin<hanooter at gmail.com> wrote:
>>> Currently I need to find the most common elements in thousands of
>>> arrays within one large array (around 2 million instances with ~70k
>>> unique elements)
...
>> There's no better algorithm for the general case. No method of checking
>> the matrices using fewer than 2,000,000 - x look-ups will guarantee that
>> there isn't a new value with x occurrences lurking somewhere.
> 
> Try flattening the arrays into a single large array & sorting it. Then
> you can just iterate over the large array counting as you go; you only
> ever have to insert into the dict once for each value, and there are no
> lookups in the dict.
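
If I've understood, that amounts to something like this (untested; the 
function name is made up, and the sub-arrays are assumed to be ordinary 
Python lists of hashable, orderable elements):

    from itertools import chain, groupby

    def count_by_sorting(arrays):
        # Flatten the sub-arrays and sort the result (O(N*log N)),
        # then count each run of equal values in a single pass.
        flat = sorted(chain.from_iterable(arrays))
        counts = {}
        for value, run in groupby(flat):
            counts[value] = sum(1 for _ in run)  # one insert per unique value
        return counts

The most common values then fall out of sorting counts.items() by count.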

You're suggesting doing a whole bunch of work copying 2,000,000 pointers 
into a single array, then a whole bunch more work sorting that second 
array (which is O(N*log N) on average), and then finally iterating over 
the second array. Sure, that last pass only needs one dict insertion per 
unique value rather than one per element, but to get to that point you've 
already done more work than just iterating over the array-of-arrays in 
the first place.
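
For comparison, the straightforward pass over the array-of-arrays looks 
like this (again untested, assuming hashable elements):

    from collections import defaultdict

    def count_directly(arrays):
        # One pass, one dict increment per element; no copying, no sorting.
        counts = defaultdict(int)
        for arr in arrays:
            for value in arr:
                counts[value] += 1
        return counts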

Now, if you're really lucky, your strategy can be done in fast C code 
instead of slow Python code, and you might see a speed-up for values of N 
which aren't too big. But I wouldn't put money on it.

Another strategy might be to pre-count elements in each array, as you 
build or modify them. This will marginally slow down each modification 
you make to the array, but the payback will be that finding the frequency 
of any element will be almost instantaneous.
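
Something along these lines, say (a hypothetical sketch; the class and 
method names are invented):

    from collections import defaultdict

    class CountedList(object):
        """List-like container that keeps a running tally of its elements."""

        def __init__(self, iterable=()):
            self._items = []
            self._counts = defaultdict(int)
            for item in iterable:
                self.append(item)

        def append(self, item):
            # Each modification pays for one extra dict update...
            self._items.append(item)
            self._counts[item] += 1

        def remove(self, item):
            self._items.remove(item)
            self._counts[item] -= 1

        def frequency(self, item):
            # ...so frequency queries are a single dict lookup.
            return self._counts.get(item, 0)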


-- 
Steven


