[SciPy-User] Accumulation sum using indirect indexes
Wes McKinney
wesmckinn at gmail.com
Tue Feb 7 17:38:43 EST 2012
On Sun, Feb 5, 2012 at 2:17 AM, Alexander Kalinin
<alec.kalinin at gmail.com> wrote:
> Yes, numpy.take() is much faster than "fancy" indexing, and the "pure
> numpy" solution is now two times faster than pandas. Below are the timing results:
>
>
> The data shape:
> (1062, 6348)
>
> Pandas solution:
> 0.16610 seconds
>
> "Pure numpy" solution:
> 0.08907 seconds
>
> Timing of the "pure numpy" by blocks:
> block (a) (sorting and obtaining groups):
> 0.00134 seconds
> block (b) (copy data to the ordered_data):
> 0.05517 seconds
> block (c) (reduceat):
> 0.02698 seconds
>
> Alexander.
>
>
> On Sun, Feb 5, 2012 at 4:01 AM, <josef.pktd at gmail.com> wrote:
>>
>> On Sat, Feb 4, 2012 at 2:27 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>> > On Sat, Feb 4, 2012 at 2:23 PM, Alexander Kalinin
>> > <alec.kalinin at gmail.com> wrote:
>> >> I have checked the performance of the "pure numpy" solution against
>> >> the pandas solution on my task. The "pure numpy" solution is about
>> >> two times slower.
>> >>
>> >> The data shape:
>> >> (1062, 6348)
>> >> Pandas "group by sum" time:
>> >> 0.16588 seconds
>> >> Pure numpy "group by sum" time:
>> >> 0.38979 seconds
>> >>
>> >> But interestingly, the main bottleneck in the numpy solution is the
>> >> data copying. I have divided the solution into three blocks:
>> >>
>> >> # block (a):
>> >> s = np.argsort(labels)
>> >>
>> >> keys, inv = np.unique(labels, return_inverse = True)
>> >>
>> >> i = inv[s]
>> >>
>> >> groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
>> >>
>> >>
>> >> # block (b):
>> >> ordered_data = data[:, s]
>>
>> Can you try numpy.take? Keith and Wes showed that take is
>> much faster than advanced indexing.
>>
>> Josef
>>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
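[Editorial sketch, not from the original thread: the equivalence behind Josef's suggestion can be checked directly. The array sizes below are illustrative; np.take and advanced indexing return identical arrays for the same index vector, and only their speed differs.]

```python
import numpy as np

# Compare np.take against "fancy" (advanced) indexing for a column
# permutation; both must produce the same reordered array.
data = np.random.randn(200, 300)
s = np.random.permutation(data.shape[1])

fancy = data[:, s]            # advanced ("fancy") indexing
taken = data.take(s, axis=1)  # np.take over the same indices

assert np.array_equal(fancy, taken)
```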
FWIW I did some refactoring in pandas today and am getting the
following timings now in this use case:
In [12]: df = DataFrame(randn(1062, 6348))
In [13]: labels = np.random.randint(0, 100, size=1062)
In [14]: timeit df.groupby(labels).sum()
10 loops, best of 3: 38.7 ms per loop
comparing with
def numpy_groupby(data, labels, axis=0):
    s = np.argsort(labels)
    keys, inv = np.unique(labels, return_inverse=True)
    i = inv.take(s)
    groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
    ordered_data = data.take(s, axis=axis)
    group_sums = np.add.reduceat(ordered_data, groups_at, axis=axis)
    return group_sums
In [15]: timeit numpy_groupby(df.values, labels)
10 loops, best of 3: 95.4 ms per loop
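[Editorial sketch, not from the original thread: a quick sanity check that numpy_groupby matches a naive per-label masked sum. The data sizes are made up for illustration.]

```python
import numpy as np

def numpy_groupby(data, labels, axis=0):
    # sort by label, find group boundaries, then sum each run with reduceat
    s = np.argsort(labels)
    keys, inv = np.unique(labels, return_inverse=True)
    i = inv.take(s)
    groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
    ordered_data = data.take(s, axis=axis)
    return np.add.reduceat(ordered_data, groups_at, axis=axis)

data = np.random.randn(20, 7)
labels = np.random.randint(0, 4, size=20)

# naive reference: boolean-mask sum per unique label, in sorted label order
expected = np.array([data[labels == k].sum(axis=0)
                     for k in np.unique(labels)])

assert np.allclose(numpy_groupby(data, labels), expected)
```

The group rows come out in sorted label order because np.unique returns sorted keys, which is why the masked-sum reference iterates over np.unique(labels).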
according to line_profiler, the runtime is being consumed by the reduceat now
In [20]: lprun -f numpy_groupby numpy_groupby(df.values, labels)
Timer unit: 1e-06 s
File: pandas/core/groupby.py
Function: numpy_groupby at line 1511
Total time: 0.108126 s
Line #      Hits      Time   Per Hit   % Time  Line Contents
==============================================================
  1511                                         def numpy_groupby(data, labels):
  1512         1       125     125.0      0.1      s = np.argsort(labels)
  1513         1       720     720.0      0.7      keys, inv = np.unique(labels, return_inverse=True)
  1514         1        13      13.0      0.0      i = inv.take(s)
  1515         1        62      62.0      0.1      groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
  1516         1     28684   28684.0     26.5      ordered_data = data.take(s, axis=0)
  1517         1     78519   78519.0     72.6      group_sums = np.add.reduceat(ordered_data, groups_at, axis=0)
  1518
  1519         1         3       3.0      0.0      return group_sums
The performance of the pure numpy solution will degrade both with the
length of the labels vector and the number of unique elements (because
there are two O(N log N) steps there). In this case it matters less
because there are so many rows / columns to aggregate.
The performance of the pure NumPy solution is unsurprisingly much
better when the aggregation is across contiguous memory vs. strided
memory access:
In [41]: labels = np.random.randint(0, 100, size=6348)
In [42]: timeit numpy_groupby(df.values, labels, axis=1)
10 loops, best of 3: 47.4 ms per loop
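[Editorial sketch, not from the original thread: the memory-layout effect can be demonstrated directly. Reducing along axis 1 of a C-ordered array walks contiguous memory, while the same reduction on a Fortran-ordered copy is strided; the results agree and only the timing differs.]

```python
import numpy as np

# Same values, two memory layouts: C-contiguous rows vs Fortran-contiguous
# columns.  reduceat along axis 1 is contiguous for the C layout only.
c_data = np.ascontiguousarray(np.random.randn(400, 2000))
f_data = np.asfortranarray(c_data)

groups_at = np.arange(0, c_data.shape[1], 50)  # arbitrary group boundaries

c_sums = np.add.reduceat(c_data, groups_at, axis=1)
f_sums = np.add.reduceat(f_data, groups_at, axis=1)

assert np.allclose(c_sums, f_sums)  # identical results, different speed
```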
pandas is slower in this case because it's not giving any
consideration to cache locality:
In [50]: timeit df.groupby(labels, axis=1).sum()
10 loops, best of 3: 79.9 ms per loop
One can only complain so much about 30 lines of Cython code ;) Good
enough for the time being.
- Wes