[Numpy-discussion] NEP for faster ufuncs

Francesc Alted faltet at pytables.org
Wed Dec 22 15:05:09 EST 2010


On Wednesday 22 December 2010 20:42:54 Mark Wiebe wrote:
> On Wed, Dec 22, 2010 at 11:16 AM, Francesc Alted
> <faltet at pytables.org> wrote:
> > On Wednesday 22 December 2010 19:52:45 Mark Wiebe wrote:
> > > On Wed, Dec 22, 2010 at 10:41 AM, Francesc Alted
> > > <faltet at pytables.org> wrote:
> > > > NumPy version 2.0.0.dev-147f817
> > > 
> > > There's your problem, it looks like the PYTHONPATH isn't seeing
> > > your new build for some reason.  That build is off of this
> > > commit in the NumPy master branch:
> > > 
> > > https://github.com/numpy/numpy/commit/147f817eefd5efa56fa26b03953a51d533cc27ec
> > 
> > Uh, I think I'm a bit lost here.  I've cloned this repo:
> > 
> > $ git clone git://github.com/m-paradox/numpy.git
> > 
> > Is that wrong?
> 
> That's right, it was my mistake to assume that the page for a branch
> on github would give you that branch.  You need the 'new_iterator'
> branch, so after that clone, you should do this:
> 
> $ git checkout origin/new_iterator

Ah, things go well now:

>>> timeit 3*a+b-(a/c)
10 loops, best of 3: 67.7 ms per loop
>>> timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 27.8 ms per loop
>>> timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 42.8 ms per loop

So, yup, I'm seeing the good speedup here too :-)
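For the record, the idea behind luf (as I understand it) is to
evaluate the whole expression over cache-sized blocks, so that the
temporaries created by the intermediate ufunc calls never leave
cache.  Here is a minimal pure-NumPy sketch of that blocking strategy
(my own toy version, not Mark's actual implementation), assuming
equal-shaped, C-contiguous operands and a guessed block size:

import numpy as np

def blocked_eval(expr, a, b, c, blocksize=16*1024):
    # Walk the (flattened) operands in blocks small enough that each
    # temporary produced by expr stays in L1/L2 cache.
    out = np.empty_like(a)
    fa, fb, fc, fo = a.ravel(), b.ravel(), c.ravel(), out.ravel()
    for i in range(0, fa.size, blocksize):
        s = slice(i, i + blocksize)
        fo[s] = expr(fa[s], fb[s], fc[s])
    return out

# same call style as luf above:
# blocked_eval(lambda a, b, c: 3*a + b - (a/c), a, b, c)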

> > But you need to transport those small chunks from main memory to
> > cache before you can start the computation for each piece, right?
> > This is why I'm saying that the bottleneck for evaluating
> > arbitrary expressions (like "3*a+b-(a/c)", i.e. not including
> > transcendental functions or broadcasting) is memory bandwidth
> > (and more particularly RAM bandwidth).
> 
> In the example expression, I believe the evaluation would go
> something like this.  Assuming the memory allocator keeps giving
> back the same locations to 'luf', all temporary variables will
> already be in cache after the first chunk.
> 
> temp1 = 3 * a            # a is read from main memory
> temp2 = temp1 + b        # b is read from main memory
> temp3 = a / c            # a is already in cache, c is read from main memory
> result = temp2 - temp3   # result is written back to main memory
> 
> So there are 4 reads and writes to chunks from outside of the cache,
> but 12 total reads and writes to chunks, so speeding up the parts
> already in cache would appear to be beneficial.  The benefit will
> get better with more complicated expressions.  I think as long as
> the operation is slower than a memcpy, the RAM bandwidth isn't the
> main bottleneck to be concerned with, but instead produces an upper
> bound on performance.  I'm not sure how to precisely measure that
> overhead, though.

Well, see the timings for the non-broadcasting case:

>>> a = np.random.random((50,50,50,10))
>>> b = np.random.random((50,50,50,10))
>>> c = np.random.random((50,50,50,10))

>>> timeit 3*a+b-(a/c)
10 loops, best of 3: 31.1 ms per loop
>>> timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 24.5 ms per loop
>>> timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 10.4 ms per loop

However, the above comparison is not fair, as numexpr uses all of
your cores by default (2 in the case above).  If we force it to use
only one core:

>>> ne.set_num_threads(1)
>>> timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 16 ms per loop

which is still faster than luf.  Note that numexpr was not using SSE
in this case, so even if luf does make use of SSE, that by itself
does not imply better speed.
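To put Mark's memcpy bound in numbers: each operand above holds
50*50*50*10 float64 values, about 10 MB, so the 4 out-of-cache reads
and writes per evaluation move about 40 MB through RAM; at 16 ms that
is roughly 2.5 GB/s, which is in the ballpark of the RAM bandwidth of
current machines.  A rough way to measure the ceiling directly, in
the same session (dst is just a scratch array of my own):

>>> dst = np.empty_like(a)
>>> timeit dst[...] = a

The copy moves half as many bytes as the expression (one read plus
one write), so if the expression takes about twice the copy time, we
are essentially at the memory bandwidth limit.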

-- 
Francesc Alted


