[SciPy-User] Accumulation sum using indirect indexes

Warren Weckesser warren.weckesser at enthought.com
Sat Feb 4 19:28:24 EST 2012


On Sat, Feb 4, 2012 at 6:01 PM, <josef.pktd at gmail.com> wrote:

> On Sat, Feb 4, 2012 at 2:27 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> > On Sat, Feb 4, 2012 at 2:23 PM, Alexander Kalinin
> > <alec.kalinin at gmail.com> wrote:
> >> I have checked the performance of the "pure numpy" solution against the
> >> pandas solution on my task. The "pure numpy" solution is about two times
> >> slower.
> >>
> >> The data shape:
> >>     (1062, 6348)
> >> Pandas "group by sum" time:
> >>     0.16588 seconds
> >> Pure numpy "group by sum" time:
> >>     0.38979 seconds
> >>
> >> But it is interesting that the main bottleneck in the numpy solution is
> >> the data copying. I have divided the solution into three blocks:
> >>
> >> # block (a):
> >>     s = np.argsort(labels)
> >>     keys, inv = np.unique(labels, return_inverse=True)
> >>     i = inv[s]
> >>     groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
> >>
> >> # block (b):
> >>     ordered_data = data[:, s]
>
> Can you try it with numpy.take? Keith and Wes were showing that take is
> much faster than advanced indexing.
>


Good idea.  numpy.take is much faster:

In [35]: data.shape
Out[35]: (1000000, 3)

In [36]: %timeit o = data[s]
10 loops, best of 3: 155 ms per loop

In [37]: %timeit o = take(data, s, axis=0)
10 loops, best of 3: 37.1 ms per loop
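
In Alexander's case the labels run along the second axis, so block (b)
becomes np.take(data, s, axis=1). A minimal sketch (the array shapes come
from his timings; the random contents are just placeholders):

    import numpy as np

    # Placeholder data with the shapes from Alexander's timings.
    rng = np.random.RandomState(0)
    data = rng.rand(1062, 6348)
    labels = rng.randint(0, 50, size=6348)

    s = np.argsort(labels)

    # block (b), using take instead of fancy indexing:
    ordered_data = np.take(data, s, axis=1)

    # Same result as the fancy-indexing version, just faster.
    assert np.array_equal(ordered_data, data[:, s])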


Warren




> Josef
>
> >>
> >> # block (c):
> >>     group_sums = np.add.reduceat(ordered_data, groups_at, axis=1)
> >>
> >> The timings for the blocks are:
> >> block (a):
> >>     0.00138 seconds
> >>
> >> block (b):
> >>     0.29285 seconds
> >>
> >> block (c):
> >>     0.08868 seconds
> >>
> >> The sorting and reduceat procedures are very fast, but the single line
> >> "ordered_data = data[:, s]" takes most of the time.
> >>
> >> This seems a bit strange to me: the reduceat() procedure, where the
> >> summation is actually executed, is about 3 times faster than the data
> >> copying alone.
> >>
> >> Alexander
> >>
> >>
> >> On Thu, Feb 2, 2012 at 10:16 PM, Warren Weckesser
> >> <warren.weckesser at enthought.com> wrote:
> >>>
> >>>
> >>>
> >>> On Wed, Feb 1, 2012 at 10:34 AM, Alexander Kalinin
> >>> <alec.kalinin at gmail.com> wrote:
> >>>>
> >>>> Yes, but for large data sets loops are quite slow. I have tried Pandas
> >>>> groupby.sum() and it works faster.
> >>>>
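
For reference, the pandas approach mentioned here would be along these
lines, using the small example from the original question quoted further
below (a minimal sketch, not the timed benchmark setup):

    import numpy as np
    import pandas as pd

    ids = np.array([0, 1, 1, 2])
    n = np.array([[0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1]])

    # Group the rows by id and sum within each group.
    nv = pd.DataFrame(n).groupby(ids).sum().values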
> >>>
> >>>
> >>> Pandas is probably the correct tool to use for this, but it will be
> >>> nice when numpy has a native "group-by" capability.
> >>>
> >>> For what it's worth (I had to scratch the itch, so to speak), the
> >>> attached script provides a "pure numpy" implementation without a
> >>> Python loop. The output of the script is
> >>>
> >>> In [53]: run pseudo_group_by.py
> >>> Label   Data
> >>>  20    [1 2 3]
> >>>  20    [1 2 4]
> >>>  10    [3 3 1]
> >>>   0    [5 0 0]
> >>>  20    [1 9 0]
> >>>  10    [2 3 4]
> >>>  20    [9 9 1]
> >>>
> >>> Label  Num.   Sum
> >>>   0     1   [5 0 0]
> >>>  10     2   [5 6 5]
> >>>  20     4   [12 22  8]
> >>>
> >>>
> >>> A drawback of the method is that it will make a reordered copy of the
> >>> data.  I haven't compared the performance to pandas.
> >>>
> >>> Warren
> >>>
> >>>
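
The attached script is not reproduced here, but a minimal sketch along the
same lines (blocks (a)-(c) from the timings above, applied along axis 0)
reproduces that output:

    import numpy as np

    labels = np.array([20, 20, 10, 0, 20, 10, 20])
    data = np.array([[1, 2, 3],
                     [1, 2, 4],
                     [3, 3, 1],
                     [5, 0, 0],
                     [1, 9, 0],
                     [2, 3, 4],
                     [9, 9, 1]])

    s = np.argsort(labels)                 # sort order of the labels
    keys, inv = np.unique(labels, return_inverse=True)
    i = inv[s]                             # group index of each sorted row
    # A group starts wherever the group index changes.
    groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]

    ordered_data = data[s]                 # the reordered copy mentioned above
    group_sums = np.add.reduceat(ordered_data, groups_at, axis=0)
    counts = np.bincount(inv)              # number of rows in each group

    print("Label  Num.   Sum")
    for key, num, total in zip(keys, counts, group_sums):
        print("%4d  %4d   %s" % (key, num, total))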
> >>>>
> >>>>
> >>>> 2012/2/1 Frédéric Bastien <nouiz at nouiz.org>
> >>>>>
> >>>>> It will be slow, but you can use a Python loop.
> >>>>>
> >>>>> Fred
> >>>>>
> >>>>> On Jan 31, 2012 3:34 PM, "Alexander Kalinin" <alec.kalinin at gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hello!
> >>>>>>
> >>>>>> I use SciPy in computer graphics applications. My task is to
> >>>>>> calculate vertex normals by averaging face normals. In other words,
> >>>>>> I want to accumulate vectors with the same ids. For example:
> >>>>>>
> >>>>>> ids = numpy.array([0, 1, 1, 2])
> >>>>>> n = numpy.array([[0.1, 0.1, 0.1], [0.1, 0.1, 0.1],
> >>>>>>                  [0.1, 0.1, 0.1], [0.1, 0.1, 0.1]])
> >>>>>>
> >>>>>> I need the result:
> >>>>>> nv = numpy.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [0.1, 0.1, 0.1]])
> >>>>>>
> >>>>>> The simplest code,
> >>>>>> nv[ids] += n
> >>>>>> does not work; I know about this. For 1D arrays I use the
> >>>>>> numpy.bincount(...) function, but this function does not work for
> >>>>>> 2D arrays.
> >>>>>>
> >>>>>> So, my question: what is the best way to calculate an accumulation
> >>>>>> sum for 2D arrays using indirect indexes?
> >>>>>>
> >>>>>> Sincerely,
> >>>>>> Alexander
> >>>>>>
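
Short of a native group-by, one way to extend the bincount trick to the 2D
case is to call bincount once per column, with that column as the weights.
A small sketch (the per-column loop is in Python, so it only pays off when
the number of columns is modest):

    import numpy as np

    ids = np.array([0, 1, 1, 2])
    n = np.array([[0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1]])

    # numpy.bincount is 1D-only, so apply it column by column,
    # using each column as the weights.
    nv = np.column_stack([np.bincount(ids, weights=n[:, j])
                          for j in range(n.shape[1])])

    # nv -> [[0.1, 0.1, 0.1],
    #        [0.2, 0.2, 0.2],
    #        [0.1, 0.1, 0.1]]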
> >
> > I should point out that pandas is not very optimized for a large
> > number of columns like this. I just created a github issue about it:
> >
> > https://github.com/wesm/pandas/issues/745
> >
> > I'll get to it eventually.
> >
> > - Wes