[Numpy-discussion] Find the N maximum values and corresponding indexes in an array

Wed Dec 2 21:05:02 EST 2009

On Wed, Dec 2, 2009 at 5:52 PM, Anne Archibald
<peridot.faceted at gmail.com> wrote:
> 2009/12/2 Keith Goodman <kwgoodman at gmail.com>:
>> On Wed, Dec 2, 2009 at 5:27 PM, Anne Archibald
>> <peridot.faceted at gmail.com> wrote:
>>> 2009/12/2 Keith Goodman <kwgoodman at gmail.com>:
>>>> On Wed, Dec 2, 2009 at 5:09 PM, Neal Becker <ndbecker2 at gmail.com> wrote:
>>>>> David Warde-Farley wrote:
>>>>>
>>>>>> On 2-Dec-09, at 6:55 PM, Howard Chong wrote:
>>>>>>
>>>>>>> def myFindMaxA(myList):
>>>>>>>    """implement finding maximum value with for loop iteration"""
>>>>>>>    maxIndex=0
>>>>>>>    maxVal=myList[0]
>>>>>>>    for index, item in enumerate(myList):
>>>>>>>        if item[0]>maxVal:
>>>>>>>            maxVal=item[0]
>>>>>>>            maxIndex=index
>>>>>>>    return [maxIndex, maxVal]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> My question is: how can I make the latter version run faster? I
>>>>>>> think the
>>>>>>> answer is that I have to do the iteration in C.
>>>>>>
>>>>>>
>>>>>> def find_biggest_n(myarray, n):
>>>>>> ind = np.argsort(myarray)
>>>>>> return ind[-n:], myarray[ind[-n:]]
>>>>>>
>>>>>> David
>>>>> Not bad, although I wonder whether a partial sort could be faster.
>>>>
>>>> I'm doing a lot of sorting right now. I only need to sort the lowest
>>>> 30% of values in a 1d array (about 250k elements), the rest I don't
>>>> need to sort. How do I do a partial sort?
>>>
>>> Algorithmically, if you're doing a quicksort, you just don't sort one
>>> side of a partition when it's outside the range you want sorted (which
>>> could even be data-dependent, I suppose, as well as
>>> number-of-items-dependent). This is useful for sorting out only the
>>> extreme elements, as well as finding quantiles (and those elements
>>> above/below them). Unfortunately I'm not aware of any implementation,
>>> useful though it might be.
>>
>> Oh, I thought he meant there was a numpy function for partial sorting.
>>
>> What kind of speed up for my problem (sort lower 30% of a 1d array
>> with 250k elements) could I expect if I paid someone to write it in
>> cython? Twice as fast? I'm actually doing x.argsort(), if that
>> matters.
>
> Hard to say exactly, but I'd say at best a factor of two. You can get
> a rough best-case value by doing an argmin followed by an argsort of
> the first 30%.
>
> In reality things will be a bit worse because you won't immediately be
> able to discard the whole rest of the array, and because your argsort
> will be accessing data spread over a larger swath of memory (so worse
> cache performance). But if this doesn't provide a useful speedup, it's
> very unlikely partial sorting will be of any use to you.
>
> Medians and other quantiles, or partitioning tasks, are where it really shines.

Looks delicious:

>> y = np.random.rand(250000)
>> timeit y.argsort()
10 loops, best of 3: 33 ms per loop
>> y3 = np.random.rand(int(250000/3.0))
>> timeit y.argmin(); y3.argsort()
100 loops, best of 3: 10.2 ms per loop

My actual problem is this:

        idx = d.argsort()
        yc = y[idx].cumsum()
        yc = yc[ntops1]
        yc /= ntops

where d is a 1d array of length 250k, ntops is a 30 element 1d array,
and ntop1 = ntops + 1. So I really only need to know the exact order
of 30 elements but I need the right elements, in any order, between
those 30 elements.

So it looks like there is a lot to be gained from a cython
implementation of my specific problem.

Sorry for trying to hijack the thread.