[SciPy-User] Accumulation sum using indirect indexes

Wes McKinney wesmckinn at gmail.com
Sat Feb 4 14:27:18 EST 2012


On Sat, Feb 4, 2012 at 2:23 PM, Alexander Kalinin
<alec.kalinin at gmail.com> wrote:
> I have compared the performance of the "pure numpy" solution with the pandas
> solution on my task. The "pure numpy" solution is about two times slower.
>
> The data shape:
>     (1062, 6348)
> Pandas "group by sum" time:
>     0.16588 seconds
> Pure numpy "group by sum" time:
>     0.38979 seconds
>
> Interestingly, the main bottleneck in the numpy solution is the
> data copying. I have divided the solution into three blocks:
>
> # block (a):
> s = np.argsort(labels)
> keys, inv = np.unique(labels, return_inverse=True)
> i = inv[s]
> groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
>
> # block (b):
> ordered_data = data[:, s]
>
> # block (c):
> group_sums = np.add.reduceat(ordered_data, groups_at, axis=1)
>
> The timing for the blocks is:
> block (a):
>     0.00138 seconds
>
> block (b):
>     0.29285 seconds
>
> block (c):
>     0.08868 seconds
>
> The sorting and reduceat() steps are very fast. A single line,
> "ordered_data = data[:, s]", takes most of the time.
>
> This seems a bit strange to me. The reduceat() call, where the summation is
> executed, is about 3 times faster than the data copy alone.
>
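For anyone wanting to reproduce the per-block measurement, a minimal timing
sketch follows. The array sizes, the random labels and the use of
time.perf_counter() are assumptions made for illustration, not the data or
the harness used above. Block (a) here finds the group starts from the sorted
labels directly rather than through np.unique(..., return_inverse=True), which
gives the same start indices. Absolute timings will vary by machine and NumPy
version.

```python
import time

import numpy as np

# Hypothetical sizes and random labels, chosen only so the sketch runs quickly.
rng = np.random.default_rng(0)
n_rows, n_cols, n_groups = 200, 3000, 50
data = rng.random((n_rows, n_cols))
labels = rng.integers(0, n_groups, size=n_cols)

t0 = time.perf_counter()
# block (a): sort the labels and find where each group starts
s = np.argsort(labels, kind="stable")
sorted_labels = labels[s]
groups_at = np.flatnonzero(
    np.concatenate(([True], sorted_labels[1:] != sorted_labels[:-1]))
)
t1 = time.perf_counter()
# block (b): reorder the columns; fancy indexing always makes a copy
ordered_data = data[:, s]
t2 = time.perf_counter()
# block (c): sum each run of adjacent columns that share a label
group_sums = np.add.reduceat(ordered_data, groups_at, axis=1)
t3 = time.perf_counter()

print(f"block (a): {t1 - t0:.5f} seconds")
print(f"block (b): {t2 - t1:.5f} seconds")
print(f"block (c): {t3 - t2:.5f} seconds")
```

Timing each block separately, as above, is what isolates the copy in block (b)
from the sort and the reduction.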
> Alexander
>
>
> On Thu, Feb 2, 2012 at 10:16 PM, Warren Weckesser
> <warren.weckesser at enthought.com> wrote:
>>
>>
>>
>> On Wed, Feb 1, 2012 at 10:34 AM, Alexander Kalinin
>> <alec.kalinin at gmail.com> wrote:
>>>
>>> Yes, but for large data sets loops are quite slow. I have tried Pandas
>>> groupby.sum() and it works faster.
>>>
>>
>>
>> Pandas is probably the correct tool to use for this, but it will be nice
>> when numpy has a native "group-by" capability.
>>
>> For what it's worth (had to scratch the itch, so to speak), the attached
>> script provides a "pure numpy" implementation without a python loop.  The
>> output of the script is
>>
>> In [53]: run pseudo_group_by.py
>> Label   Data
>>  20    [1 2 3]
>>  20    [1 2 4]
>>  10    [3 3 1]
>>   0    [5 0 0]
>>  20    [1 9 0]
>>  10    [2 3 4]
>>  20    [9 9 1]
>>
>> Label  Num.   Sum
>>   0     1   [5 0 0]
>>  10     2   [5 6 5]
>>  20     4   [12 22  8]
>>
>>
>> A drawback of the method is that it will make a reordered copy of the
>> data.  I haven't compared the performance to pandas.
>>
>> Warren
>>
>>
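The attached script is not reproduced in the archive. The following is a
minimal sketch of a loop-free group sum along the lines described in this
thread (argsort, then np.add.reduceat); it is not the original
pseudo_group_by.py. The function name group_sum and its return values are
hypothetical, and it assumes a non-empty 1D label array with one label per
row of data. The sample labels and data are the ones printed above.

```python
import numpy as np


def group_sum(labels, data):
    """Sum the rows of `data` that share a label, without a Python loop.

    Hypothetical helper for illustration; assumes `labels` is non-empty.
    """
    labels = np.asarray(labels)
    data = np.asarray(data)
    # Stable sort so that rows with equal labels become adjacent.
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    # Index (in sorted order) of the first row of each group.
    starts = np.flatnonzero(
        np.concatenate(([True], sorted_labels[1:] != sorted_labels[:-1]))
    )
    keys = sorted_labels[starts]
    counts = np.diff(np.concatenate((starts, [len(labels)])))
    # data[order] is the reordered copy mentioned as the drawback above.
    sums = np.add.reduceat(data[order], starts, axis=0)
    return keys, counts, sums


labels = [20, 20, 10, 0, 20, 10, 20]
data = [[1, 2, 3], [1, 2, 4], [3, 3, 1], [5, 0, 0],
        [1, 9, 0], [2, 3, 4], [9, 9, 1]]
keys, counts, sums = group_sum(labels, data)
for key, count, total in zip(keys, counts, sums):
    print(key, count, total)
```

For this input the keys are 0, 10, 20 with counts 1, 2, 4, and the sums match
the table printed by the script above.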
>>>
>>>
>>> 2012/2/1 Frédéric Bastien <nouiz at nouiz.org>
>>>>
>>>> It will be slow, but you can make a python loop.
>>>>
>>>> Fred
>>>>
>>>> On Jan 31, 2012 3:34 PM, "Alexander Kalinin" <alec.kalinin at gmail.com>
>>>> wrote:
>>>>>
>>>>> Hello!
>>>>>
>>>>> I use SciPy in computer graphics applications. My task is to calculate
>>>>> vertex normals by averaging face normals. In other words, I want to
>>>>> accumulate vectors with the same ids. For example,
>>>>>
>>>>> ids = numpy.array([0, 1, 1, 2])
>>>>> n = numpy.array([ [0.1, 0.1, 0.1], [0.1, 0.1, 0.1], [0.1, 0.1, 0.1],
>>>>> [0.1, 0.1, 0.1] ])
>>>>>
>>>>> I need the result:
>>>>> nv = ([ [0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.1, 0.1, 0.1]])
>>>>>
>>>>> The simplest code,
>>>>> nv[ids] += n
>>>>> does not work; I know about this. For 1D arrays I use the
>>>>> numpy.bincount(...) function, but it does not work for 2D arrays.
>>>>>
>>>>> So, my question: what is the best way to calculate an accumulated sum
>>>>> for 2D arrays using indirect indexes?
>>>>>
>>>>> Sincerely,
>>>>> Alexander
>>>>>
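A small sketch of the problem as stated, using the ids and n from the
question. It shows why nv[ids] += n drops the repeated id, and two
alternatives: calling numpy.bincount once per column, and numpy.add.at,
which is available in NumPy 1.8 and later (so newer than this thread).
The variable names nv_wrong, nv_bincount and nv_add_at are made up for
illustration.

```python
import numpy as np

ids = np.array([0, 1, 1, 2])
n = np.full((4, 3), 0.1)
n_out = 3  # number of distinct ids

# Fancy-index "+=" is buffered: id 1 appears twice but is written only once.
nv_wrong = np.zeros((n_out, 3))
nv_wrong[ids] += n

# bincount handles one 1D weight vector at a time, so call it per column.
nv_bincount = np.column_stack(
    [np.bincount(ids, weights=n[:, k], minlength=n_out)
     for k in range(n.shape[1])]
)

# Unbuffered in-place add: repeated ids accumulate.
nv_add_at = np.zeros((n_out, 3))
np.add.at(nv_add_at, ids, n)

print(nv_wrong)
print(nv_bincount)
print(nv_add_at)
```

The per-column bincount still loops in Python, but only over the columns
(three for normals), not over the rows.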
>>>>> _______________________________________________
>>>>> SciPy-User mailing list
>>>>> SciPy-User at scipy.org
>>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>>>

I should point out that pandas is not very optimized for a large
number of columns like this. I just created a github issue about it:

https://github.com/wesm/pandas/issues/745

I'll get to it eventually.

- Wes


