[Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

Wed Jul 8 18:23:03 EDT 2009

On Wed, Jul 8, 2009 at 4:16 PM, Pauli Virtanen <pav+sp at iki.fi> wrote:

> Hi,
>
> Ticket #1143 points out that Numpy's reduction operations are not
> always cache friendly. I worked a bit on tuning them.
>
>
> Just to tickle some interest, a "pathological" case before optimization:
>
>    In [1]: import numpy as np
>    In [2]: x = np.zeros((80000, 256))
>    In [3]: %timeit x.sum(axis=0)
>    10 loops, best of 3: 850 ms per loop
>
> After optimization:
>
>    In [1]: import numpy as np
>    In [2]: x = np.zeros((80000, 256))
>    In [3]: %timeit x.sum(axis=0)
>    10 loops, best of 3: 78.5 ms per loop
>
> For comparison, a reduction operation on a contiguous array of
> the same size:
>
>    In [4]: x = np.zeros((256, 80000))
>    In [5]: %timeit x.sum(axis=1)
>    10 loops, best of 3: 88.9 ms per loop
>

;)


>
> Funnily enough, it's actually slower than the reduction over the
> axis with the larger stride. The improvement factor depends on
> the CPU and its cache size.
>
>
How do the benchmarks compare with just making contiguous copies? Which is
blocking of a sort, I suppose.
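
For concreteness, here is a rough sketch of the comparison I have in mind. It
is only an illustration, not the code from the ticket: the helper names
(sum_axis0_via_copy, sum_axis0_blocked, time_variants) and the block size of
1024 rows are made up, and the blocked variant assumes a floating-point array
so that the accumulator dtype matches what sum() returns.

```python
import timeit

import numpy as np


def sum_axis0_via_copy(x):
    """Sum over axis 0 by first copying the transpose into contiguous memory."""
    # After the copy, the axis being reduced is the fastest-varying one.
    xt = np.ascontiguousarray(x.T)
    return xt.sum(axis=1)


def sum_axis0_blocked(x, block=1024):
    """Sum over axis 0 one block of rows at a time (hypothetical block size)."""
    # Assumes a float dtype; integer inputs would need a wider accumulator.
    out = np.zeros(x.shape[1], dtype=x.dtype)
    for start in range(0, x.shape[0], block):
        out += x[start:start + block].sum(axis=0)
    return out


def time_variants(shape=(80000, 256), number=3, repeat=3):
    """Print the best per-call time of each variant on a zero array."""
    x = np.zeros(shape)
    variants = [
        ("x.sum(axis=0)", lambda a: a.sum(axis=0)),
        ("contiguous copy", sum_axis0_via_copy),
        ("row blocks", sum_axis0_blocked),
    ]
    for name, fn in variants:
        best = min(timeit.repeat(lambda: fn(x), number=number, repeat=repeat))
        print("%-16s %.1f ms per call" % (name, 1e3 * best / number))
```

Calling time_variants() runs all three on the (80000, 256) array from the
quoted session. The copy variant pays for a second array of the same size, so
I would expect the answer to depend on the CPU and cache size as well.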

Chuck