[SciPy-User] Accumulation sum using indirect indexes

Wes McKinney wesmckinn at gmail.com
Tue Feb 7 17:38:43 EST 2012


On Sun, Feb 5, 2012 at 2:17 AM, Alexander Kalinin
<alec.kalinin at gmail.com> wrote:
> Yes, numpy.take() is much faster than "fancy" indexing, and now the "pure
> numpy" solution is two times faster than pandas. Below are the timing results:
>
>
> The data shape:
>      (1062, 6348)
>
> Pandas solution:
>     0.16610 seconds
>
> "Pure numpy" solution:
>     0.08907 seconds
>
> Timing of the "pure numpy" by blocks:
> block (a) (sorting and obtaining groups):
>     0.00134 seconds
> block (b) (copy data to the ordered_data):
>     0.05517 seconds
> block (c) (reduceat):
>     0.02698 seconds
>
> Alexander.
>
>
> On Sun, Feb 5, 2012 at 4:01 AM, <josef.pktd at gmail.com> wrote:
>>
>> On Sat, Feb 4, 2012 at 2:27 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>> > On Sat, Feb 4, 2012 at 2:23 PM, Alexander Kalinin
>> > <alec.kalinin at gmail.com> wrote:
>> >> I have checked the performance of the "pure numpy" solution with pandas
>> >> solution on my task. The "pure numpy" solution is about two times
>> >> slower.
>> >>
>> >> The data shape:
>> >>     (1062, 6348)
>> >> Pandas "group by sum" time:
>> >>     0.16588 seconds
>> >> Pure numpy "group by sum" time:
>> >>     0.38979 seconds
>> >>
>> >> But it is interesting that the main bottleneck in the numpy solution is
>> >> the data copying. I have divided the solution into three blocks:
>> >>
>> >> # block (a):
>> >> s = np.argsort(labels)
>> >>
>> >> keys, inv = np.unique(labels, return_inverse = True)
>> >>
>> >> i = inv[s]
>> >>
>> >> groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
>> >>
>> >>
>> >> # block (b):
>> >> ordered_data = data[:, s]
>>
>> Can you try numpy.take? Keith and Wes were showing that take is
>> much faster than advanced ("fancy") indexing.
>>
>> Josef
>>
>
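
For reference, a minimal sketch of the take vs. fancy-indexing comparison
discussed above (hypothetical array shape, not the data from this thread):

import numpy as np

data = np.random.randn(1062, 6348)        # hypothetical float array
s = np.random.permutation(data.shape[1])  # a shuffled column index

out_fancy = data[:, s]           # copy via fancy (advanced) indexing
out_take = data.take(s, axis=1)  # same copy via take, reported faster above

assert np.array_equal(out_fancy, out_take)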

FWIW I did some refactoring in pandas today and am getting the
following timings now in this use case:

In [12]: df = DataFrame(randn(1062, 6348))

In [13]: labels = np.random.randint(0, 100, size=1062)

In [14]: timeit df.groupby(labels).sum()
10 loops, best of 3: 38.7 ms per loop

comparing with

def numpy_groupby(data, labels, axis=0):
    # sort order that puts equal labels next to each other
    s = np.argsort(labels)
    # keys: sorted unique labels; inv: each label's position in keys
    keys, inv = np.unique(labels, return_inverse=True)
    i = inv.take(s)
    # positions in the sorted order where a new group begins
    groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
    # copy data into group order (take is faster than fancy indexing)
    ordered_data = data.take(s, axis=axis)
    # sum each contiguous block belonging to one group
    group_sums = np.add.reduceat(ordered_data, groups_at, axis=axis)

    return group_sums

In [15]: timeit numpy_groupby(df.values, labels)
10 loops, best of 3: 95.4 ms per loop
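
(A quick sanity check, assuming the df and labels from In [12]/[13]: both
paths order the groups by sorted key, so the results should agree.)

import numpy as np

numpy_sums = numpy_groupby(df.values, labels)
pandas_sums = df.groupby(labels).sum().values

assert np.allclose(numpy_sums, pandas_sums)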

According to line_profiler, the runtime is now dominated by the reduceat:

In [20]: lprun -f numpy_groupby numpy_groupby(df.values, labels)
Timer unit: 1e-06 s

File: pandas/core/groupby.py
Function: numpy_groupby at line 1511
Total time: 0.108126 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1511                                           def numpy_groupby(data, labels):
  1512         1          125    125.0      0.1      s = np.argsort(labels)
  1513         1          720    720.0      0.7      keys, inv = np.unique(labels, return_inverse = True)
  1514         1           13     13.0      0.0      i = inv.take(s)
  1515         1           62     62.0      0.1      groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
  1516         1        28684  28684.0     26.5      ordered_data = data.take(s, axis=0)
  1517         1        78519  78519.0     72.6      group_sums = np.add.reduceat(ordered_data, groups_at, axis=0)
  1518
  1519         1            3      3.0      0.0      return group_sums

The performance of the pure numpy solution will degrade with both the
length of the labels vector and the number of unique labels, because
there are two O(N log N) steps in there (the argsort and the unique).
In this case it matters less because there are so many rows / columns
to aggregate.
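
(A rough scaling sketch of those two steps, with sizes chosen purely for
illustration:)

import numpy as np
from timeit import timeit

for n in (10**4, 10**5, 10**6):
    labels = np.random.randint(0, 100, size=n)
    t_sort = timeit(lambda: np.argsort(labels), number=10)
    t_uniq = timeit(lambda: np.unique(labels, return_inverse=True), number=10)
    print(n, t_sort, t_uniq)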

The performance of the pure NumPy solution is unsurprisingly much
better when the aggregation is across contiguous memory vs. strided
memory access:


In [41]: labels = np.random.randint(0, 100, size=6348)

In [42]: timeit numpy_groupby(df.values, labels, axis=1)
10 loops, best of 3: 47.4 ms per loop

pandas is slower in this case because it's not giving any
consideration to cache locality:

In [50]: timeit df.groupby(labels, axis=1).sum()
10 loops, best of 3: 79.9 ms per loop
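
(A minimal way to see the contiguity effect in isolation: on a C-ordered
array, reduceat along axis=1 walks contiguous memory while axis=0 is
strided. The group boundaries here are arbitrary, for illustration only.)

import numpy as np

data = np.random.randn(1062, 6348)            # C-contiguous by default
row_starts = np.arange(0, data.shape[0], 10)  # group starts along axis=0
col_starts = np.arange(0, data.shape[1], 10)  # group starts along axis=1

strided = np.add.reduceat(data, row_starts, axis=0)     # strided access
contiguous = np.add.reduceat(data, col_starts, axis=1)  # contiguous access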

One can only complain so much about 30 lines of Cython code ;) Good
enough for the time being.

- Wes


