[Numpy-discussion] NEP for faster ufuncs

Francesc Alted faltet at pytables.org
Wed Dec 22 15:05:09 EST 2010

On Wednesday 22 December 2010 20:42:54 Mark Wiebe wrote:
> On Wed, Dec 22, 2010 at 11:16 AM, Francesc Alted
> <faltet at pytables.org> wrote:
> > On Wednesday 22 December 2010 19:52:45 Mark Wiebe wrote:
> > > On Wed, Dec 22, 2010 at 10:41 AM, Francesc Alted
> > > <faltet at pytables.org> wrote:
> > > > NumPy version 2.0.0.dev-147f817
> > > 
> > > There's your problem, it looks like the PYTHONPATH isn't seeing
> > > your new build for some reason.  That build is off of this
> > > commit in the NumPy master branch:
> > > 
> > > https://github.com/numpy/numpy/commit/147f817eefd5efa56fa26b03953a51d533cc27ec
> > 
> > Uh, I think I'm a bit lost here.  I've cloned this repo:
> > 
> > $ git clone git://github.com/m-paradox/numpy.git
> > 
> > Is that wrong?
> That's right, it was my mistake to assume that the page for a branch
> on github would give you that branch.  You need the 'new_iterator'
> branch, so after that clone, you should do this:
> $ git checkout origin/new_iterator

Ah, things go well now:

>>> timeit 3*a+b-(a/c)
10 loops, best of 3: 67.7 ms per loop
>>> timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 27.8 ms per loop
>>> timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 42.8 ms per loop

So, yup, I'm seeing the good speedup here too :-)

> > But you need to transport those small chunks from main memory to
> > cache before you can start doing the computation for this piece,
> > right?  What I'm saying is that the bottleneck for evaluating
> > arbitrary expressions (like "3*a+b-(a/c)", i.e. with neither
> > transcendental functions nor broadcasting) is memory bandwidth
> > (and, more specifically, RAM bandwidth).
> In the example expression, I believe the evaluation would go
> something like this.  Assuming the memory allocator keeps giving
> back the same locations to 'luf', all temporary variables will
> already be in cache after the first chunk.
> temp1 = 3 * a          # a is read from main memory
> temp2 = temp1 + b      # b is read from main memory
> temp3 = a / c          # a is already in cache, c is read from main memory
> result = temp2 - temp3 # result is written back to main memory
> So there are 4 reads and writes to chunks from outside of the cache,
> but 12 total reads and writes to chunks, so speeding up the parts
> already in cache would appear to be beneficial.  The benefit will
> get better with more complicated expressions.  I think as long as
> the operation is slower than a memcpy, the RAM bandwidth isn't the
> main bottleneck to be concerned with, but instead produces an upper
> bound on performance.  I'm not sure how to precisely measure that
> overhead, though.
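
The evaluation order described above can be sketched in plain NumPy. This is only an illustration of the chunking idea, not the 'luf' function from the new_iterator branch: the name chunked_eval, the chunk size and the two scratch buffers are assumptions made for this example, and it handles only same-shape float inputs with no broadcasting.

```python
import numpy as np

def chunked_eval(a, b, c, chunk=4096):
    """Evaluate 3*a + b - a/c block by block (hypothetical helper)."""
    shape = a.shape
    # Flatten so that each block is a contiguous slice of the inputs.
    a, b, c = (np.ascontiguousarray(x).ravel() for x in (a, b, c))
    out = np.empty_like(a)
    # Scratch buffers are allocated once, so after the first block they
    # should stay in cache (assuming 'chunk' is small enough).
    t1 = np.empty(chunk, dtype=a.dtype)
    t2 = np.empty(chunk, dtype=a.dtype)
    for start in range(0, a.size, chunk):
        stop = min(start + chunk, a.size)
        n = stop - start
        ac, bc, cc = a[start:stop], b[start:stop], c[start:stop]
        np.multiply(ac, 3, out=t1[:n])     # temp1 = 3 * a      (a from RAM)
        np.add(t1[:n], bc, out=t1[:n])     # temp2 = temp1 + b  (b from RAM)
        np.divide(ac, cc, out=t2[:n])      # temp3 = a / c      (c from RAM)
        # result = temp2 - temp3, written straight into the output array
        np.subtract(t1[:n], t2[:n], out=out[start:stop])
    return out.reshape(shape)

# Usage: compare against the ordinary NumPy expression.
a = np.random.random((50, 50, 50, 10))
b = np.random.random((50, 50, 50, 10))
c = np.random.random((50, 50, 50, 10)) + 1.0   # keep the divisor away from 0
print(np.allclose(chunked_eval(a, b, c), 3*a + b - (a/c)))
```

Each block performs four operations with two reads and one write apiece (12 chunk accesses), of which only the reads of a, b and c and the write of the result need to go outside the cache.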

Well, see the timings for the non-broadcasting case:

>>> a = np.random.random((50,50,50,10))
>>> b = np.random.random((50,50,50,10))
>>> c = np.random.random((50,50,50,10))

>>> timeit 3*a+b-(a/c)
10 loops, best of 3: 31.1 ms per loop
>>> timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 24.5 ms per loop
>>> timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 10.4 ms per loop

However, the above comparison is not fair, as numexpr uses all your
cores by default (2 in the case above).  If we force it to use only one
thread:

>>> ne.set_num_threads(1)
>>> timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 16 ms per loop

which is still faster than luf.  In this case numexpr was not using SSE,
so even if luf does use it, that by itself does not imply better speed.
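
To relate these timings to the memory-bandwidth argument, here is a rough back-of-the-envelope calculation. It assumes that a, b and c are each streamed from RAM once and that the result is written once, and it ignores cache effects and any extra write-allocate traffic, so the figures are lower bounds on the traffic actually generated. The function name is made up for this sketch.

```python
# Sizes from the session above: float64 arrays of shape (50, 50, 50, 10).
n_elements = 50 * 50 * 50 * 10
bytes_per_array = n_elements * 8        # a float64 takes 8 bytes

# Assumption: three input arrays read once, one output array written once.
min_traffic = 4 * bytes_per_array

# Timings in milliseconds, copied from the session above.
timings_ms = {
    "numpy": 31.1,
    "luf": 24.5,
    "numexpr (2 threads)": 10.4,
    "numexpr (1 thread)": 16.0,
}

def effective_bandwidth_gb_s(ms, traffic_bytes=min_traffic):
    """Minimum bytes moved divided by elapsed time, in GB/s (10**9 bytes)."""
    return traffic_bytes / (ms / 1000.0) / 1e9

for name, ms in timings_ms.items():
    print(f"{name:20s} {effective_bandwidth_gb_s(ms):5.2f} GB/s")
```

Under these assumptions the four runs correspond to roughly 1.3 to 3.8 GB/s of effective traffic, which can be compared against the memcpy speed of the machine to judge how close each approach gets to the RAM bandwidth limit.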

Francesc Alted
