[SciPy-User] Accumulation sum using indirect indexes

josef.pktd at gmail.com
Sat Feb 4 19:01:39 EST 2012


On Sat, Feb 4, 2012 at 2:27 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> On Sat, Feb 4, 2012 at 2:23 PM, Alexander Kalinin
> <alec.kalinin at gmail.com> wrote:
>> I have compared the performance of the "pure numpy" solution with the
>> pandas solution on my task. The "pure numpy" solution is about two times slower.
>>
>> The data shape:
>>     (1062, 6348)
>> Pandas "group by sum" time:
>>     0.16588 seconds
>> Pure numpy "group by sum" time:
>>     0.38979 seconds
>>
>> But it is interesting that the main bottleneck in the numpy solution is the
>> data copying. I have divided the solution into three blocks:
>>
>> # block (a): sort the labels and locate the start of each group
>> s = np.argsort(labels)
>> keys, inv = np.unique(labels, return_inverse=True)
>> i = inv[s]
>> groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
>>
>>
>> # block (b): reorder the data columns so equal labels are adjacent
>> ordered_data = data[:, s]

Can you try numpy.take? Keith and Wes were showing that take is
much faster than advanced (fancy) indexing.
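
For example, a rough sketch (untested; data and s here are stand-ins with
the same shapes as in your block (b)):

    import numpy as np

    data = np.random.rand(1062, 6348)         # same shape as the real data
    s = np.random.permutation(data.shape[1])  # stand-in for the argsort order

    ordered_a = data[:, s]                    # fancy indexing, as in block (b)
    ordered_b = np.take(data, s, axis=1)      # np.take along the same axis
    assert (ordered_a == ordered_b).all()     # same result, often faster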

Josef

>>
>> # block (c): sum each run of adjacent columns in one vectorized call
>> group_sums = np.add.reduceat(ordered_data, groups_at, axis=1)
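>>
>> For reference, np.add.reduceat sums each slice between consecutive
>> boundary indices, so a toy example of what block (c) computes is:
>>
>>     np.add.reduceat(np.array([1, 2, 3, 4]), [0, 2])   # -> array([3, 7])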
>>
>> The timing for the blocks is:
>> block (a):
>>     0.00138 seconds
>>
>> block (b):
>>     0.29285 seconds
>>
>> block (c):
>>     0.08868 seconds
>>
>> The sorting and reduceat procedures are very fast, but the single line
>> "ordered_data = data[:, s]" takes most of the time.
>>
>> This is a bit strange to me: the reduceat() call, where the actual
>> summation happens, is about 3 times faster than the data copying alone.
>>
>> Alexander
>>
>>
>> On Thu, Feb 2, 2012 at 10:16 PM, Warren Weckesser
>> <warren.weckesser at enthought.com> wrote:
>>>
>>>
>>>
>>> On Wed, Feb 1, 2012 at 10:34 AM, Alexander Kalinin
>>> <alec.kalinin at gmail.com> wrote:
>>>>
>>>> Yes, but for large data sets a Python loop is quite slow. I have tried
>>>> pandas groupby.sum() and it works faster.
>>>>
>>>
>>>
>>> Pandas is probably the correct tool to use for this, but it will be nice
>>> when numpy has a native "group-by" capability.
>>>
>>> For what it's worth (had to scratch the itch, so to speak), the attached
>>> script provides a "pure numpy" implementation without a Python loop.  The
>>> output of the script is:
>>>
>>> In [53]: run pseudo_group_by.py
>>> Label   Data
>>>  20    [1 2 3]
>>>  20    [1 2 4]
>>>  10    [3 3 1]
>>>   0    [5 0 0]
>>>  20    [1 9 0]
>>>  10    [2 3 4]
>>>  20    [9 9 1]
>>>
>>> Label  Num.   Sum
>>>   0     1   [5 0 0]
>>>  10     2   [5 6 5]
>>>  20     4   [12 22  8]
>>>
>>>
>>> A drawback of the method is that it will make a reordered copy of the
>>> data.  I haven't compared the performance to pandas.
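>>>
>>> A minimal sketch along these argsort + reduceat lines (the attached
>>> script may differ in details) that reproduces the output above:
>>>
>>>     import numpy as np
>>>
>>>     labels = np.array([20, 20, 10, 0, 20, 10, 20])
>>>     data = np.array([[1, 2, 3], [1, 2, 4], [3, 3, 1], [5, 0, 0],
>>>                      [1, 9, 0], [2, 3, 4], [9, 9, 1]])
>>>
>>>     s = np.argsort(labels)      # reorder rows so equal labels are adjacent
>>>     ordered, sl = data[s], labels[s]
>>>     starts = np.flatnonzero(np.concatenate(([True], sl[1:] != sl[:-1])))
>>>     sums = np.add.reduceat(ordered, starts, axis=0)         # per-group sums
>>>     counts = np.diff(np.concatenate((starts, [len(sl)])))  # group sizes
>>>
>>>     # sl[starts] -> array([ 0, 10, 20])
>>>     # counts     -> array([1, 2, 4])
>>>     # sums       -> array([[ 5,  0,  0],
>>>     #                      [ 5,  6,  5],
>>>     #                      [12, 22,  8]])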
>>>
>>> Warren
>>>
>>>
>>>>
>>>>
>>>> 2012/2/1 Frédéric Bastien <nouiz at nouiz.org>
>>>>>
>>>>> It will be slow, but you can write a Python loop.
>>>>>
>>>>> Fred
>>>>>
>>>>> On Jan 31, 2012 3:34 PM, "Alexander Kalinin" <alec.kalinin at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> I use SciPy in computer graphics applications. My task is to calculate
>>>>>> vertex normals by averaging face normals. In other words, I want to
>>>>>> accumulate vectors that share the same ids. For example:
>>>>>>
>>>>>> ids = numpy.array([0, 1, 1, 2])
>>>>>> n = numpy.array([[0.1, 0.1, 0.1], [0.1, 0.1, 0.1], [0.1, 0.1, 0.1],
>>>>>>                  [0.1, 0.1, 0.1]])
>>>>>>
>>>>>> I need the result:
>>>>>> nv = numpy.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.1, 0.1, 0.1]])
>>>>>>
>>>>>> The simplest code,
>>>>>> nv[ids] += n
>>>>>> does not work; I know about this (with repeated indices, the addition is
>>>>>> applied only once per index). For 1D arrays I use the numpy.bincount(...)
>>>>>> function, but it does not work for 2D arrays.
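>>>>>>
>>>>>> (The 1D usage I mean is along the lines of
>>>>>>
>>>>>>     nv = numpy.bincount(ids, weights=w)   # w: some 1D array of values
>>>>>>
>>>>>> which accumulates w[i] into bin number ids[i]; the weights argument
>>>>>> only accepts 1D arrays.)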
>>>>>>
>>>>>> So, my question: what is the best way to calculate an accumulation sum
>>>>>> for 2D arrays using indirect indexes?
>>>>>>
>>>>>> Sincerely,
>>>>>> Alexander
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
> I should point out that pandas is not very optimized for a large
> number of columns like this. I just created a GitHub issue about it:
>
> https://github.com/wesm/pandas/issues/745
>
> I'll get to it eventually.
>
> - Wes