[Numpy-discussion] 2-d in-place operation performance vs 1-d non in-place

Timothy Hochberg tim.hochberg at ieee.org
Thu Sep 6 11:07:19 EDT 2007


On 9/6/07, George Sakkis <george.sakkis at gmail.com> wrote:
>
> On Sep 5, 12:29 pm, Francesc Altet <fal... at carabos.com> wrote:
> > On Wednesday 05 September 2007, George Sakkis wrote:
> >
> >
> >
> > > I was surprised to see that an in-place modification of a 2-d array
> > > turns out to be slower than the corresponding non-mutating operation
> > > on 1-d arrays, although the latter creates new array objects. Here is
> > > the benchmarking code:
> >
> > > import timeit
> >
> > > for n in 10,100,1000,10000:
> > >    setup = 'from numpy.random import random;' \
> > >            'm=random((%d,2));' \
> > >            'u1=random(%d);' \
> > >            'u2=u1.reshape((u1.size,1))' % (n,n)
> > >    timers = [timeit.Timer(stmt,setup) for stmt in
> > >        # 1-d operations; create new arrays
> > >        'a0 = m[:,0]-u1; a1 = m[:,1]-u1',
> > >        # 2-d in place operation
> > >        'm -= u2'
> > >    ]
> > >    print n, [min(timer.repeat(3,1000)) for timer in timers]
> >
> > > And some results (Python 2.5, WinXP):
> >
> > > 10 [0.010832382327921563, 0.0045706926438974782]
> > > 100 [0.010882668048592767, 0.021704993232380093]
> > > 1000 [0.018272154701226007, 0.19477587235249172]
> > > 10000 [0.073787590322233698, 1.9234369172618306]
> >
> > > So the 2-d in-place modification time grows linearly with the array
> > > size but the 1-d operations are much more efficient, despite
> > > allocating new arrays while doing so. What gives?
> >
> > This seems to be the effect of broadcasting u2.  If you were to use a
> > pre-computed broadcast array, you would get rid of this bottleneck:
> >
> > for n in 10,100,1000,10000:
> >    setup = 'import numpy;' \
> >            'm=numpy.random.random((%d,2));' \
> >            'u1=numpy.random.random(%d);' \
> >            'u2=u1[:, numpy.newaxis];' \
> >            'u3=numpy.array([u1,u1]).transpose()' % (n,n)
> >    timers = [timeit.Timer(stmt,setup) for stmt in
> >        # 1-d operations; create new arrays
> >        'a0 = m[:,0]-u1; a1 = m[:,1]-u1',
> >        # 2-d in place operation (using broadcasting)
> >        'm -= u2',
> >        # 2-d in-place operation (not forcing broadcasting)
> >        'm -= u3'
> >    ]
> >    print n, [min(timer.repeat(3,1000)) for timer in timers]
> >
> > gives in my machine:
> >
> > 10 [0.03213191032409668, 0.012019872665405273, 0.0068600177764892578]
> > 100 [0.033048152923583984, 0.06542205810546875, 0.0076580047607421875]
> > 1000 [0.040294170379638672, 0.59892702102661133, 0.014600992202758789]
> > 10000 [0.32667303085327148, 5.9721651077270508, 0.10261106491088867]
>
> Thank you, indeed broadcasting is the bottleneck here. I had the
> impression that broadcasting was a fast operation,


I'm fairly certain that this is correct.


> i.e. that it doesn't
> require physically allocating the target array of the broadcast, but it
> seems this is not the case.


I don't think it does. However, it does change the way the ufuncs stride over
the array when they do the operation, which I suspect is the root of the
problem. First note that all of these operations are (and should be) linear
in n. I played around with this a little bit and found a couple of things:
first, the axis along which you broadcast matters a lot. That's not
surprising; when you broadcast along the short axis, I believe the inner
loop ends up very short and thus incurs a lot of overhead. The other is that
in-place versus not-in-place makes a big difference. I'm not sure why that
is. Here's the amended benchmark; I use larger values of n so that the
linear behavior of all the examples is clearer.

    import timeit

    for n in 10000,20000,30000,40000,50000:
      setup = """
import numpy
m = numpy.random.random((%d,2))
u1 = numpy.random.random(%d)
u2 = u1[:, numpy.newaxis]
u3 = numpy.array([u1,u1]).transpose()
m2 = numpy.array(m.transpose())
u4 = u1[numpy.newaxis, :]
""" % (n,n)
      timers = [timeit.Timer(stmt,setup) for stmt in
          # 1-d operations; create new arrays
          'a0 = m[:,0]-u1; a1 = m[:,1]-u1',
          # 2-d operation (using broadcasting); no longer in place
          'x = m - u2',
          # 2-d operation (not forcing broadcasting)
          'x = m - u3',
          # transposed using broadcasting
          'x = m2 - u4'
      ]
      print n, [min(timer.repeat(3,100)) for timer in timers]

and the results:

10000 [0.0061748071333088406, 0.091492354475219639, 0.0063329277883082957,
0.0055317086389471415]
20000 [0.01121259824921883, 0.18455949010279238, 0.013805665245163912,
0.010877918841640355]
30000 [0.018362668998434195, 0.27472122853602898, 0.026029285844988426,
0.01910075163184155]
40000 [0.029564092643059592, 0.4027084447883742, 0.12735166061608361,
0.10206883835794844]
50000 [0.059520972637583824, 0.50685380404492797, 0.17739796538386798,
0.13967277964098823]
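
For what it's worth, one can check directly that broadcasting does not
materialize the repeated operand. Here is a minimal sketch (it uses
numpy.broadcast_arrays, which does not appear in the benchmarks above)
showing that the broadcast version of u2 is just a zero-stride view over
the original data:

```python
import numpy as np

n = 5
m = np.random.random((n, 2))
u1 = np.random.random(n)
u2 = u1[:, np.newaxis]            # shape (n, 1)

# Expand both operands to a common shape, the way a ufunc would.
bm, bu = np.broadcast_arrays(m, u2)

print(bu.shape)             # (5, 2) -- same shape as m
print(bu.strides)           # stride along axis 1 is 0: the column is reused
print(bu.base is not None)  # True: bu is a view, no new data was allocated
```

So the cost measured above is in how the ufunc inner loop walks this
zero-stride view, not in allocating a broadcast target.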


-- 
.  __
.   |-\
.
.  tim.hochberg at ieee.org

