[Numpy-discussion] Find indices of largest elements

Keith Goodman kwgoodman at gmail.com
Wed Apr 14 18:39:20 EDT 2010


On Wed, Apr 14, 2010 at 3:12 PM, Anne Archibald
<peridot.faceted at gmail.com> wrote:
> On 14 April 2010 16:56, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Wed, Apr 14, 2010 at 12:39 PM, Nikolaus Rath <Nikolaus at rath.org> wrote:
>>> Keith Goodman <kwgoodman at gmail.com> writes:
>>>> On Wed, Apr 14, 2010 at 8:49 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>>> On Wed, Apr 14, 2010 at 8:16 AM, Nikolaus Rath <Nikolaus at rath.org> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> How do I best find out the indices of the largest x elements in an
>>>>>> array?
>>>>>>
>>>>>> Example:
>>>>>>
>>>>>> a = [ [1,8,2], [2,1,3] ]
>>>>>> magic_function(a, 2) == [ (0,1), (1,2) ]
>>>>>>
>>>>>> Since the largest 2 elements are at positions (0,1) and (1,2).
>>>>>
>>>>> Here's a quick way to rank the data if there are no ties and no NaNs:
>>>>
>>>> ...or if you need the indices in order:
>>>>
>>>>>> shape = (3,2)
>>>>>> x = np.random.rand(*shape)
>>>>>> x
>>>> array([[ 0.52420123,  0.43231286],
>>>>        [ 0.97995333,  0.87416228],
>>>>        [ 0.71604075,  0.66018382]])
>>>>>> r = x.reshape(-1).argsort().argsort()
>>>
>>> I don't understand why this works. Why do you call argsort() twice?
>>> Doesn't that give you the indices of the sorted indices?
>>
>> It is confusing. Let's look at an example:
>>
>>>> x = np.random.rand(4)
>>>> x
>>   array([ 0.37412289,  0.68248559,  0.12935131,  0.42510212])
>>
>> If we call argsort once we get the index that will sort x:
>>
>>>> idx = x.argsort()
>>>> idx
>>   array([2, 0, 3, 1])
>>>> x[idx]
>>   array([ 0.12935131,  0.37412289,  0.42510212,  0.68248559])
>>
>> Notice that the first element of idx is 2. That's because x[2] is the
>> min of x. But that's not what we want: we want the first element of
>> the result to be the rank of the first element of x. So we need to
>> shuffle idx around so that the ranks line up with x. How do we do
>> that? We argsort it again!
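
A minimal sketch of how the double argsort answers the original
question (the variable names are just illustrative; as noted above,
ties and NaNs need care, though they don't affect the top two here):

import numpy as np

a = np.array([[1, 8, 2],
              [2, 1, 3]])
k = 2

# Rank every element of the flattened array (0 = smallest).
ranks = a.reshape(-1).argsort().argsort()

# Flat positions of the k largest elements.
flat = (ranks >= a.size - k).nonzero()[0]

# Convert flat positions back to (row, col) pairs.
rows, cols = np.unravel_index(flat, a.shape)
# rows -> array([0, 1]), cols -> array([1, 2]), i.e. (0, 1) and (1, 2)
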
>
> Unless I am mistaken, what you are doing here is inverting the
> permutation returned by the first argsort. The second argsort is an n
> log n method, though, and permutations can be inverted in linear time:
>
> ix = np.argsort(X)
> ixinv = np.zeros_like(ix)
> ixinv[ix] = np.arange(len(ix))
>
> This works because if ix is a permutation and ixinv is its inverse,
> A = B[ix]
> is the same as
> A[ixinv] = B
> This also means that you can often do without the actual inverse by
> moving the indexing operation to the other side of the equal sign.
> (Not in the OP's case, though.)

That is very nice. And very fast for large arrays:

>> x = np.random.rand(4)
>> timeit idx = x.argsort().argsort()
1000000 loops, best of 3: 1.45 us per loop
>> timeit idx = x.argsort(); idxinv = np.zeros_like(idx); idxinv[idx] = np.arange(len(idx))
100000 loops, best of 3: 9.52 us per loop

>> x = np.random.rand(1000)
>> timeit idx = x.argsort().argsort()
10000 loops, best of 3: 112 us per loop
>> timeit idx = x.argsort(); idxinv = np.zeros_like(idx); idxinv[idx] = np.arange(len(idx))
10000 loops, best of 3: 82.9 us per loop

>> x = np.random.rand(100000)
>> timeit idx = x.argsort().argsort()
10 loops, best of 3: 20.4 ms per loop
>> timeit idx = x.argsort(); idxinv = np.zeros_like(idx); idxinv[idx] = np.arange(len(idx))
100 loops, best of 3: 13.2 ms per loop
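
The two versions do give the same array, and the A = B[ix] identity
is easy to check directly; a quick sketch (names illustrative):

import numpy as np

x = np.random.rand(6)
ix = np.argsort(x)

# Linear-time inverse of the permutation ix.
ixinv = np.zeros_like(ix)
ixinv[ix] = np.arange(len(ix))

# The inverse permutation is exactly the double argsort (the ranks).
assert (ixinv == x.argsort().argsort()).all()

# With A = x[ix] (the sorted values), indexing with ixinv recovers x.
A = x[ix]
assert (A[ixinv] == x).all()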

> I'll also add that if the OP needs the top m for 1<m<<n, sorting the
> whole input array is not the most efficient algorithm; there are
> priority-queue-based schemes that are asymptotically more efficient,
> but none exists in numpy. Since numpy's sorting is quite fast, I
> personally would just use the sorting.

Partial sorting would find a lot of uses in the numpy community (for
example, in computing the median).
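
For what it's worth, the priority-queue idea is easy to sketch with
the standard library's heapq module (top_m_indices is just an
illustrative name), though, as Anne says, numpy's sort is fast enough
that plain sorting is usually the practical choice:

import heapq
import numpy as np

def top_m_indices(x, m):
    # Keep only the m largest entries of the flattened array on a
    # heap instead of sorting everything.
    x = np.asarray(x)
    flat = heapq.nlargest(m, range(x.size), key=x.reshape(-1).__getitem__)
    # Convert flat positions back to index tuples, largest value first.
    return list(zip(*np.unravel_index(flat, x.shape)))

# e.g. top_m_indices([[1, 8, 2], [2, 1, 3]], 2) gives the positions
# (0, 1) and (1, 2).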


