[SciPy-User] Accumulation sum using indirect indexes
Wes McKinney
wesmckinn at gmail.com
Tue Feb 7 18:15:04 EST 2012
On Tue, Feb 7, 2012 at 5:38 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> On Sun, Feb 5, 2012 at 2:17 AM, Alexander Kalinin
> <alec.kalinin at gmail.com> wrote:
>> Yes, numpy.take() is much faster than "fancy" indexing, and now the "pure
>> numpy" solution is twice as fast as pandas. Below are the timing results:
>>
>>
>> The data shape:
>> (1062, 6348)
>>
>> Pandas solution:
>> 0.16610 seconds
>>
>> "Pure numpy" solution:
>> 0.08907 seconds
>>
>> Timing of the "pure numpy" by blocks:
>> block (a) (sorting and obtaining groups):
>> 0.00134 seconds
>> block (b) (copy data to the ordered_data):
>> 0.05517 seconds
>> block (c) (reduceat):
>> 0.02698
>>
>> Alexander.
>>
>>
>> On Sun, Feb 5, 2012 at 4:01 AM, <josef.pktd at gmail.com> wrote:
>>>
>>> On Sat, Feb 4, 2012 at 2:27 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>> > On Sat, Feb 4, 2012 at 2:23 PM, Alexander Kalinin
>>> > <alec.kalinin at gmail.com> wrote:
>>> >> I have checked the performance of the "pure numpy" solution with pandas
>>> >> solution on my task. The "pure numpy" solution is about two times
>>> >> slower.
>>> >>
>>> >> The data shape:
>>> >> (1062, 6348)
>>> >> Pandas "group by sum" time:
>>> >> 0.16588 seconds
>>> >> Pure numpy "group by sum" time:
>>> >> 0.38979 seconds
>>> >>
>>> >> But it is interesting that the main bottleneck in the numpy solution is
>>> >> the data copying. I have divided the solution into three blocks:
>>> >>
>>> >> # block (a):
>>> >> s = np.argsort(labels)
>>> >>
>>> >> keys, inv = np.unique(labels, return_inverse = True)
>>> >>
>>> >> i = inv[s]
>>> >>
>>> >> groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
>>> >>
>>> >>
>>> >> # block (b):
>>> >> ordered_data = data[:, s]
>>>
>>> can you try with numpy.take? Keith and Wes were showing that take is
>>> much faster than advanced indexing.
>>>
>>> Josef
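
[A minimal sketch of the comparison Josef suggests, with an illustrative
shape not taken from the thread: selecting columns with np.take versus the
equivalent fancy indexing, and checking that both give identical results:]

```python
import numpy as np

data = np.random.randn(1062, 6348)
s = np.random.permutation(data.shape[1])  # a column reordering

fancy = data[:, s]            # advanced ("fancy") indexing
taken = data.take(s, axis=1)  # the same selection via np.take

# both paths must produce the same array; take is often faster
assert np.array_equal(fancy, taken)
```
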
>>>
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>
> FWIW I did some refactoring in pandas today and am getting the
> following timings now in this use case:
>
> In [12]: df = DataFrame(randn(1062, 6348))
>
> In [13]: labels = np.random.randint(0, 100, size=1062)
>
> In [14]: timeit df.groupby(labels).sum()
> 10 loops, best of 3: 38.7 ms per loop
>
> comparing with
>
> def numpy_groupby(data, labels, axis=0):
>     s = np.argsort(labels)
>     keys, inv = np.unique(labels, return_inverse=True)
>     i = inv.take(s)
>     groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
>     ordered_data = data.take(s, axis=axis)
>     group_sums = np.add.reduceat(ordered_data, groups_at, axis=axis)
>
>     return group_sums
>
> In [15]: timeit numpy_groupby(df.values, labels)
> 10 loops, best of 3: 95.4 ms per loop
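
[As a sanity check, not part of the thread: the sort-and-reduceat helper
quoted above can be verified against a naive per-label loop on a small
array. Shapes here are illustrative:]

```python
import numpy as np

def numpy_groupby(data, labels, axis=0):
    # sort so equal labels become contiguous, then reduce at group starts
    s = np.argsort(labels)
    keys, inv = np.unique(labels, return_inverse=True)
    i = inv.take(s)
    groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
    ordered_data = data.take(s, axis=axis)
    return np.add.reduceat(ordered_data, groups_at, axis=axis)

data = np.random.randn(50, 8)
labels = np.random.randint(0, 5, size=50)

result = numpy_groupby(data, labels)

# naive reference: one masked sum per unique label, in sorted label order
expected = np.vstack([data[labels == k].sum(axis=0)
                      for k in np.unique(labels)])
assert np.allclose(result, expected)
```
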
>
> according to line_profiler, the runtime is being consumed by the reduceat now
>
> In [20]: lprun -f numpy_groupby numpy_groupby(df.values, labels)
> Timer unit: 1e-06 s
>
> File: pandas/core/groupby.py
> Function: numpy_groupby at line 1511
> Total time: 0.108126 s
>
> Line #      Hits         Time  Per Hit   % Time  Line Contents
> ==============================================================
>   1511                                           def numpy_groupby(data, labels):
>   1512         1          125    125.0      0.1      s = np.argsort(labels)
>   1513         1          720    720.0      0.7      keys, inv = np.unique(labels, return_inverse=True)
>   1514         1           13     13.0      0.0      i = inv.take(s)
>   1515         1           62     62.0      0.1      groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
>   1516         1        28684  28684.0     26.5      ordered_data = data.take(s, axis=0)
>   1517         1        78519  78519.0     72.6      group_sums = np.add.reduceat(ordered_data, groups_at, axis=0)
>   1518
>   1519         1            3      3.0      0.0      return group_sums
>
> The performance of the pure numpy solution will degrade with both the
> length of the labels vector and the number of unique elements (because
> there are two O(N log N) steps there). In this case it matters less
> because there are so many rows / columns to aggregate.
>
> The performance of the pure NumPy solution is unsurprisingly much
> better when the aggregation is across contiguous memory vs. strided
> memory access:
>
>
> In [41]: labels = np.random.randint(0, 100, size=6348)
>
> In [42]: timeit numpy_groupby(df.values, labels, axis=1)
> 10 loops, best of 3: 47.4 ms per loop
>
> pandas is slower in this case because it's not giving any
> consideration to cache locality:
>
> In [50]: timeit df.groupby(labels, axis=1).sum()
> 10 loops, best of 3: 79.9 ms per loop
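
[The cache-locality point can be seen directly from the memory layout: for a
C-contiguous (row-major) array, reducing along axis=1 walks contiguous
memory, while reducing along axis=0 jumps a whole row per element. A small
sketch, independent of pandas:]

```python
import numpy as np

a = np.random.randn(1062, 6348)
assert a.flags['C_CONTIGUOUS']  # rows are laid out contiguously

# Strides in bytes for float64: stepping along a row (axis=1) moves
# 8 bytes; stepping down a column (axis=0) skips an entire row.
assert a.strides == (6348 * 8, 8)

# Hence np.add.reduceat along axis=1 streams contiguous memory per row,
# while the same reduction along axis=0 makes strided accesses.
```
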
>
> One can only complain so much about 30 lines of Cython code ;) Good
> enough for the time being.
>
> - Wes
More on this for those interested. These methods start really becoming
different when you aggregate very large 1D arrays. Consider a million
float64s each with a label chosen from 1000 unique labels. You can see
where we start running into problems:
In [9]: data = np.random.randn(1000000, 1)
In [10]: labels = np.random.randint(0, 1000, size=1000000)
In [11]: lprun -f gp.numpy_groupby gp.numpy_groupby(data, labels)
Timer unit: 1e-06 s
File: pandas/core/groupby.py
Function: numpy_groupby at line 1512
Total time: 0.413775 s
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1512                                           def numpy_groupby(data, labels, axis=0):
  1513         1        98867  98867.0     23.9      s = np.argsort(labels)
  1514         1       286792 286792.0     69.3      keys, inv = np.unique(labels, return_inverse=True)
  1515         1         9617   9617.0      2.3      i = inv.take(s)
  1516         1         8059   8059.0      1.9      groups_at = np.where(i != np.concatenate(([-1], i[:-1])))[0]
  1517         1         9365   9365.0      2.3      ordered_data = data.take(s, axis=axis)
  1518         1         1073   1073.0      0.3      group_sums = np.add.reduceat(ordered_data, groups_at, axis=axis)
  1519
  1520         1            2      2.0      0.0      return group_sums
In [12]: timeit gp.numpy_groupby(data, labels)
1 loops, best of 3: 410 ms per loop
whereas the hash-table based approach (a la pandas) looks like:
In [13]: df = DataFrame(data)
In [14]: timeit df.groupby(labels).sum()
10 loops, best of 3: 71.5 ms per loop
with
In [17]: %prun -s cumulative -l 15 for _ in xrange(10): df.groupby(labels).sum()
3002 function calls in 0.771 seconds
Ordered by: cumulative time
List reduced from 109 to 15 due to restriction <15>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.025    0.025    0.771    0.771 <string>:1(<module>)
       10    0.000    0.000    0.744    0.074 groupby.py:327(sum)
       10    0.001    0.000    0.744    0.074 groupby.py:940(_cython_agg_general)
       10    0.000    0.000    0.692    0.069 groupby.py:384(_group_info)
       10    0.000    0.000    0.408    0.041 groupby.py:620(labels)
       10    0.000    0.000    0.408    0.041 groupby.py:641(_make_labels)
       10    0.015    0.002    0.408    0.041 algorithms.py:72(factorize)
       10    0.314    0.031    0.314    0.031 {method 'get_labels' of 'pandas._tseries.Int64HashTable' objects}
       10    0.012    0.001    0.284    0.028 groupby.py:1470(_compress_group_index)
       10    0.178    0.018    0.178    0.018 {method 'get_labels_groupby' of 'pandas._tseries.Int64HashTable' objects}
       90    0.132    0.001    0.132    0.001 {method 'take' of 'numpy.ndarray' objects}
       10    0.000    0.000    0.045    0.004 groupby.py:1432(cython_aggregate)
       10    0.044    0.004    0.044    0.004 {pandas._tseries.group_add}
       30    0.019    0.001    0.019    0.001 {numpy.core.multiarray.putmask}
       20    0.016    0.001    0.016    0.001 {method 'astype' of 'numpy.ndarray' objects}
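[For the single-key, non-negative-integer-label case profiled above, the
sort-free idea can be sketched in pure NumPy with np.bincount. This is a
simplification, not what pandas does: the hash-table path also handles
arbitrary label values, while bincount does not. Sizes are scaled down from
the thread's example:]

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.randn(100000)
labels = rng.randint(0, 100, size=data.size)

# O(N) group sum in a single pass: no argsort, no np.unique.
# np.bincount requires non-negative integer labels.
group_sums = np.bincount(labels, weights=data, minlength=100)

# agrees with a naive per-label masked sum
expected = np.array([data[labels == k].sum() for k in range(100)])
assert np.allclose(group_sums, expected)
```
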
I'm working on getting rid of the "compress" step, which will save
another 30% in this single-key groupby case; unfortunately, with the
way I have the code factored, it's not completely trivial.
(This is all very bleeding-edge pandas, forthcoming in 0.7.0 final.)
- Wes