[Numpy-discussion] NEP for faster ufuncs
Francesc Alted
faltet at pytables.org
Wed Dec 22 03:21:39 EST 2010
A Wednesday 22 December 2010 01:53:55 Mark Wiebe escrigué:
> Hello NumPy-ers,
>
> After some performance analysis, I've designed and implemented a new
> iterator designed to speed up ufuncs and allow for easier
> multi-dimensional iteration. The new code is fairly large, but
> works quite well already. If some people could read the NEP and
> give some feedback, that would be great! Here's a link:
>
> https://github.com/m-paradox/numpy/blob/mw_neps/doc/neps/new-iterator-ufunc.rst
>
> I would also love it if someone could try building the code and play
> around with it a bit. The github branch is here:
>
> https://github.com/m-paradox/numpy/tree/new_iterator
>
> To give a taste of the iterator's functionality, below is an example
> from the NEP for how to implement a "Lambda UFunc." With just a few
> lines of code, it's possible to replicate something similar to the
> numexpr library (numexpr still gets a bigger speedup, though). In
> the example expression I chose, execution time went from 138ms to
> 61ms.
>
> Hopefully this is a good Christmas present for NumPy. :)
>
> Cheers,
> Mark
>
> Here is the definition of the ``luf`` function.::
>
>     def luf(lambdaexpr, *args, **kwargs):
>         """Lambda UFunc
>
>         e.g.
>         c = luf(lambda i,j:i+j, a, b, order='K',
>                             casting='safe', buffersize=8192)
>
>         c = np.empty(...)
>         luf(lambda i,j:i+j, a, b, out=c, order='K',
>                             casting='safe', buffersize=8192)
>         """
>
>         nargs = len(args)
>         op = args + (kwargs.get('out', None),)
>         it = np.newiter(op, ['buffered', 'no_inner_iteration'],
>                         [['readonly', 'nbo_aligned']]*nargs +
>                             [['writeonly', 'allocate', 'no_broadcast']],
>                         order=kwargs.get('order', 'K'),
>                         casting=kwargs.get('casting', 'safe'),
>                         buffersize=kwargs.get('buffersize', 0))
>         while not it.finished:
>             it[-1] = lambdaexpr(*it[:-1])
>             it.iternext()
>
>         return it.operands[-1]
>
> Then, by using ``luf`` instead of straight Python expressions, we
> can gain some performance from better cache behavior.::
>
> In [2]: a = np.random.random((50,50,50,10))
> In [3]: b = np.random.random((50,50,1,10))
> In [4]: c = np.random.random((50,50,50,1))
>
> In [5]: timeit 3*a+b-(a/c)
> 1 loops, best of 3: 138 ms per loop
>
> In [6]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
> 10 loops, best of 3: 60.9 ms per loop
>
> In [7]: np.all(3*a+b-(a/c) == luf(lambda a,b,c:3*a+b-(a/c), a, b, c))
> Out[7]: True
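(For readers trying this with a current NumPy: the ``np.newiter`` proposed in this NEP was later merged as ``np.nditer``, and the ``no_inner_iteration`` flag became ``external_loop``. A rough modern sketch of ``luf`` along those lines, not a verbatim port of the branch above, might look like:)

```python
import numpy as np

def luf(lambdaexpr, *args, **kwargs):
    """Sketch of the NEP's Lambda UFunc using np.nditer, the name
    np.newiter took when merged; 'no_inner_iteration' is now
    'external_loop'."""
    nargs = len(args)
    op = args + (kwargs.get('out', None),)
    it = np.nditer(op, ['buffered', 'external_loop'],
                   [['readonly']] * nargs +
                       [['writeonly', 'allocate', 'no_broadcast']],
                   order=kwargs.get('order', 'K'),
                   casting=kwargs.get('casting', 'safe'),
                   buffersize=kwargs.get('buffersize', 0))
    with it:  # context manager ensures write-back of buffered operands
        for chunks in it:
            # With 'external_loop', each operand arrives as a 1-d chunk.
            chunks[-1][...] = lambdaexpr(*chunks[:-1])
        result = it.operands[-1]
    return result
```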
Wow, really nice work! It would be great if that could make it into
NumPy :-) Regarding your comment on numexpr being faster, I'm not sure
(your new_iterator branch does not work for me; it fails with:
AttributeError: 'module' object has no attribute 'newiter'), but my
guess is that your approach is actually faster:
>>> a = np.random.random((50,50,50,10))
>>> b = np.random.random((50,50,1,10))
>>> c = np.random.random((50,50,50,1))
>>> timeit 3*a+b-(a/c)
10 loops, best of 3: 67.5 ms per loop
>>> import numexpr as ne
>>> ne.evaluate("3*a+b-(a/c)")  # warm-up, so the timing excludes compilation
>>> timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 42.8 ms per loop
i.e. numexpr is not able to achieve the 2x speedup mark that you are
getting with ``luf`` (using a Core2 @ 3 GHz here).
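The win in both cases comes mainly from avoiding the full-size temporaries that the plain expression materializes. A small illustrative sketch (shapes shrunk from the benchmark above) of the broadcasting involved:

```python
import numpy as np

# Shrunk versions of the benchmark arrays; b and c each broadcast
# along one axis, so every intermediate expands to the full shape.
a = np.random.random((5, 5, 5, 4))
b = np.random.random((5, 5, 1, 4))
c = np.random.random((5, 5, 5, 1))

# The plain expression 3*a+b-(a/c) allocates a full-size temporary
# for 3*a, for 3*a+b, and for a/c before the final subtraction;
# a fused single pass (luf or numexpr) touches each output element
# once, which is where the cache-behavior win comes from.
full_shape = np.broadcast(a, b, c).shape
assert full_shape == (5, 5, 5, 4)
```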
--
Francesc Alted