finding most common elements between thousands of multiple arrays.

Vilya Harvey vilya.harvey at gmail.com
Sat Jul 4 13:59:24 EDT 2009


2009/7/4 Steven D'Aprano <steve at remove-this-cybersource.com.au>:
> On Sat, 04 Jul 2009 13:42:06 +0000, Steven D'Aprano wrote:
>
>> On Sat, 04 Jul 2009 10:55:44 +0100, Vilya Harvey wrote:
>>
>>> 2009/7/4 Andre Engels <andreengels at gmail.com>:
>>>> On Sat, Jul 4, 2009 at 9:33 AM, mclovin<hanooter at gmail.com> wrote:
>>>>> Currently I need to find the most common elements in thousands of
>>>>> arrays within one large array (arround 2 million instances with ~70k
>>>>> unique elements)
>> ...
>>>> There's no better algorithm for the general case. No method of
>>>> checking the matrices using less than 2000000-x look-ups will ensure
>>>> you that there's not a new value with x occurences lurking somewhere.
>>>
>>> Try flattening the arrays into a single large array & sorting it. Then
>>> you can just iterate over the large array counting as you go; you only
>>> ever have to insert into the dict once for each value and there's no
>>> lookups in the dict.
>>
>> You're suggesting to do a whole bunch of work copying 2,000,000 pointers
>> into a single array, then a whole bunch of more work sorting that second
>> array (which is O(N*log N) on average), and then finally iterate over
>> the second array. Sure, that last step will on average involve fewer
>> than O(N) steps,
>
> Er what?
>
> Ignore that last comment -- I don't know what I was thinking. You still
> have to iterate over all N elements, sorted or not.
>
>> but to get to that point you've already done more work
>> than just iterating over the array-of-arrays in the first place.
>
> What it does buy you though, as you pointed out, is reducing the number
> of explicit dict lookups and writes. However, dict lookups and writes are
> very fast, fast enough that they're used throughout Python. A line like:
>
> count += 1
>
> actually is a dict lookup and write.

I did some tests, just to be sure, and you're absolutely right: just
creating the flattened list took several hundred (!) times as long as
iterating through all the lists in place. Live and learn...

Vil.



More information about the Python-list mailing list