[Numpy-discussion] [Suggestion] Labelled Array

Allan Haldane allanhaldane at gmail.com
Sat Feb 13 13:01:53 EST 2016


Sorry to reply to myself here, but looking at it with fresh eyes, maybe 
the performance of the naive version isn't too bad. Here's a comparison 
of the naive version vs. a sort-based implementation:

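from numpy import unique, where, split

def split_classes_naive(c, v):
    # one boolean mask over all of c per distinct label
    return [v[c == u] for u in unique(c)]

def split_classes(c, v):
    perm = c.argsort()       # indices that sort the labels
    csrt = c[perm]           # labels in sorted order
    # positions where the sorted label changes, i.e. group boundaries
    div = where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in split(perm, div)]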

 >>> from numpy import arange
 >>> from numpy.random import randint
 >>> c = randint(0,32,size=100000)
 >>> v = arange(100000)
 >>> %timeit split_classes_naive(c,v)
100 loops, best of 3: 8.4 ms per loop
 >>> %timeit split_classes(c,v)
100 loops, best of 3: 4.79 ms per loop
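
One caveat with the sort-based version: argsort's default sort is not 
guaranteed to be stable, so within a class the elements may come back in 
a different order than in the naive version. Below is a minimal sketch of 
a variant that should preserve the original order, assuming 1-D c and v 
of equal length and a NumPy version that accepts kind="stable" (1.15 or 
later, as far as I know). The name split_classes_stable is hypothetical 
and not part of NumPy.

```python
import numpy as np

def split_classes_naive(c, v):
    # reference version: one boolean mask per distinct label
    return [v[c == u] for u in np.unique(c)]

def split_classes_stable(c, v):
    # hypothetical variant: a stable sort keeps ties in original index order
    perm = np.argsort(c, kind="stable")
    csrt = c[perm]
    # indices where the sorted label changes mark the group boundaries
    div = np.flatnonzero(csrt[1:] != csrt[:-1]) + 1
    return [v[idx] for idx in np.split(perm, div)]

rng = np.random.RandomState(0)
c = rng.randint(0, 32, size=1000)
v = np.arange(1000)

groups_naive = split_classes_naive(c, v)
groups_stable = split_classes_stable(c, v)

# the two versions should agree group by group, element by element
same = len(groups_naive) == len(groups_stable) and all(
    np.array_equal(a, b) for a, b in zip(groups_naive, groups_stable))
print(same)
```

The stable sort may cost some speed compared with the default, so the 
timings above need not carry over to this variant.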

In any case, maybe it is useful to Sérgio or others.
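
For the specific case of per-label sums, as in the labelled-array example 
quoted below, splitting may not be needed at all. The following is a 
minimal sketch that swaps in a different technique, np.bincount with 
weights, and assumes the labels are small non-negative integers. The 
variable names are illustrative only.

```python
import numpy as np

# data and labels from the quoted example
data = np.arange(9).reshape(3, 3)
label = np.tile(np.arange(3), (3, 1))   # each row is [0, 1, 2]

# sum the data values that share a label, with no Python-level loop
sums = np.bincount(label.ravel(), weights=data.ravel())
print(sums)
```

Note that bincount returns floating-point sums when weights are given, 
so the result would need a cast if integer output is wanted.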

Allan

On 02/13/2016 12:11 PM, Allan Haldane wrote:
> I've had a pretty similar idea for a new indexing function
> 'split_classes' which would help in your case, which essentially does
>
>      def split_classes(c, v):
>          return [v[c == u] for u in unique(c)]
>
> Your example could be coded as
>
>      >>> [sum(c) for c in split_classes(label, data)]
>      [9, 12, 15]
>
> I feel I've come across the need for such a function often enough that
> it might be generally useful to people as part of numpy. The
> implementation of split_classes above has pretty poor performance
> because it creates many temporary boolean arrays, so my plan for a PR
> was to have a speedy version of it that uses a single pass through v.
> (I often wanted to use this function on large datasets).
>
> If anyone has any comments on the idea (good idea? bad idea?) I'd love
> to hear them.
>
> I have some further notes and examples here:
> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>
> Allan
>
> On 02/12/2016 09:40 AM, Sérgio wrote:
>> Hello,
>>
>> This is my first e-mail, I will try to make the idea simple.
>>
>> Similar to masked array it would be interesting to use a label array to
>> guide operations.
>>
>> Ex.:
>>  >>> x
>> labelled_array(data =
>>   [[0 1 2]
>>   [3 4 5]
>>   [6 7 8]],
>>                          label =
>>   [[0 1 2]
>>   [0 1 2]
>>   [0 1 2]])
>>
>>  >>> sum(x)
>> array([9, 12, 15])
>>
>> The operations would create a new axis for label indexing.
>>
>> You could think of it as a collection of masks, one for each label.
>>
>> I don't know a way to make something like this efficiently without a
>> loop. Just wondering...
>>
>> Sérgio.