[Numpy-discussion] [Suggestion] Labelled Array

Sat Feb 13 13:29:44 EST 2016

On Sat, Feb 13, 2016 at 1:01 PM, Allan Haldane <allanhaldane at gmail.com>
wrote:

> Sorry to reply to myself here, but looking at it with fresh eyes, maybe
> the performance of the naive version isn't too bad. Here's a comparison of
> the naive vs. a better implementation:
>
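> from numpy import arange, split, unique, where
> from numpy.random import randint
>
> def split_classes_naive(c, v):
>     # One boolean mask over c per unique label.
>     return [v[c == u] for u in unique(c)]
>
> def split_classes(c, v):
>     # Sort once, then cut the permutation where the sorted label changes.
>     perm = c.argsort()
>     csrt = c[perm]
>     div = where(csrt[1:] != csrt[:-1])[0] + 1
>     return [v[x] for x in split(perm, div)]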
>
> >>> c = randint(0,32,size=100000)
> >>> v = arange(100000)
> >>> %timeit split_classes_naive(c,v)
> 100 loops, best of 3: 8.4 ms per loop
> >>> %timeit split_classes(c,v)
> 100 loops, best of 3: 4.79 ms per loop
>

The use cases I recently started to target for similar things have 1 million
or more rows and 10000 unique labels.
The second version should be faster for a large number of uniques, I guess,
since the naive version makes a full pass over c for every unique label.
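
A rough way to check this on your own machine is sketched below. The sizes
n and k are illustrative assumptions (smaller than the 1 million rows and
10000 uniques mentioned above, so that the naive version finishes quickly),
and the sorted version uses a stable sort, which the quoted code does not,
so that both versions return elements in the same order within each group.
No timings are claimed here; they will depend on the machine.

```python
import timeit
import numpy as np

def split_classes_naive(c, v):
    # One full boolean mask over c per unique label: roughly O(n * k) work.
    return [v[c == u] for u in np.unique(c)]

def split_classes(c, v):
    # One sort, then split at label boundaries: roughly O(n log n) work.
    perm = c.argsort(kind="stable")
    csrt = c[perm]
    div = np.where(csrt[1:] != csrt[:-1])[0] + 1
    return [v[x] for x in np.split(perm, div)]

rng = np.random.default_rng(0)
n, k = 100_000, 1_000          # illustrative sizes; scale up to taste
c = rng.integers(0, k, size=n)
v = np.arange(n)

# Both versions should return the same groups in the same order.
same = all(np.array_equal(a, b)
           for a, b in zip(split_classes_naive(c, v), split_classes(c, v)))

t_naive = timeit.timeit(lambda: split_classes_naive(c, v), number=1)
t_sorted = timeit.timeit(lambda: split_classes(c, v), number=1)
print(f"naive: {t_naive:.3f} s, sorted: {t_sorted:.3f} s")
```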

Overall numpy is falling far behind pandas in terms of simple groupby
operations. bincount and histogram (IIRC) worked for some cases but are
rather limited.
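
As an illustration of the bincount route, here is a minimal sketch that
reproduces the sums from the original post. It also shows the limits: the
labels have to be non-negative integers, both arrays have to be flattened
to 1-D, the weighted result comes back as float, and only sums and counts
(and things derived from them, like means) are available.

```python
import numpy as np

# Small example matching the original post: data summed per label.
data = np.arange(9).reshape(3, 3)
label = np.tile(np.arange(3), (3, 1))

# bincount needs 1-D, non-negative integer labels, so flatten both arrays.
sums = np.bincount(label.ravel(), weights=data.ravel())
counts = np.bincount(label.ravel())
means = sums / counts

print(sums)    # per-label sums, as floats
print(means)   # per-label means
```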

reduceat (the ufunc method, e.g. np.add.reduceat) looks nice for cases where
it applies.
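
A minimal sketch of how the ufunc reduceat method could be used for a
grouped reduction: sort by label, find where each run of equal labels
starts, and reduce between those start indices. The helper name
group_reduce is hypothetical, and the sketch assumes 1-D, non-empty inputs.

```python
import numpy as np

def group_reduce(labels, values, ufunc=np.add):
    """Reduce `values` within each group of equal `labels` (both 1-D)."""
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    # Start index of each run of equal labels in the sorted order.
    starts = np.flatnonzero(
        np.r_[True, sorted_labels[1:] != sorted_labels[:-1]])
    return sorted_labels[starts], ufunc.reduceat(values[order], starts)

labels = np.array([2, 0, 1, 0, 2, 1, 0])
values = np.array([10, 1, 5, 2, 20, 6, 3])

uniq, sums = group_reduce(labels, values)
_, maxes = group_reduce(labels, values, np.maximum)
```

Unlike bincount, this works with any binary ufunc (np.maximum, np.minimum,
np.multiply, ...), at the cost of one sort.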

In contrast to the full sized labels in the original post, I only know of
applications where the labels are 1-D corresponding to rows or columns.
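
For that 1-D case, a sketch of summing the rows of a 2-D array by a
per-row label could look like the following. The array contents and the
name row_labels are made up for illustration; np.add.at is used as one
possible approach, not as the only or fastest one.

```python
import numpy as np

data = np.arange(12).reshape(4, 3)      # 4 rows, 3 columns
row_labels = np.array([1, 0, 1, 0])     # one label per row

# inverse[i] is the position of row_labels[i] within the sorted uniques.
uniq, inverse = np.unique(row_labels, return_inverse=True)
out = np.zeros((uniq.size, data.shape[1]), dtype=data.dtype)
# Unbuffered in-place add: row i of data is accumulated into out[inverse[i]].
np.add.at(out, inverse, data)

print(uniq)
print(out)
```

Labels along columns can be handled the same way by working on data.T.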

Josef

>
> In any case, maybe it is useful to Sergio or others.
>
> Allan
>
>
> On 02/13/2016 12:11 PM, Allan Haldane wrote:
>
>> I've had a pretty similar idea for a new indexing function
>> 'split_classes' which would help in your case, which essentially does
>>
>>      def split_classes(c, v):
>>          return [v[c == u] for u in unique(c)]
>>
>> Your example could be coded as
>>
>>      >>> [sum(c) for c in split_classes(label, data)]
>>      [9, 12, 15]
>>
>> I feel I've come across the need for such a function often enough that
>> it might be generally useful to people as part of numpy. The
>> implementation of split_classes above has pretty poor performance
>> because it creates many temporary boolean arrays, so my plan for a PR
>> was to have a speedy version of it that uses a single pass through v.
>> (I often wanted to use this function on large datasets).
>>
>> If anyone has any comments on the idea (good idea? bad idea?) I'd love
>> to hear them.
>>
>> I have some further notes and examples here:
>> https://gist.github.com/ahaldane/1e673d2fe6ffe0be4f21
>>
>> Allan
>>
>> On 02/12/2016 09:40 AM, Sérgio wrote:
>>
>>> Hello,
>>>
>>> This is my first e-mail, I will try to make the idea simple.
>>>
>>> Similar to a masked array, it would be interesting to use a label array
>>> to guide operations.
>>>
>>> Ex.:
>>>  >>> x
>>> labelled_array(data =
>>>   [[0 1 2]
>>>    [3 4 5]
>>>    [6 7 8]],
>>>                label =
>>>   [[0 1 2]
>>>    [0 1 2]
>>>    [0 1 2]])
>>>
>>>  >>> sum(x)
>>> array([9, 12, 15])
>>>
>>> The operations would create a new axis for label indexing.
>>>
>>> You could think of it as a collection of masks, one for each label.
>>>
>>> I don't know a way to make something like this efficiently without a
>>> loop. Just wondering...
>>>
>>> Sérgio.
>>>
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion at scipy.org
>>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>