[Numpy-discussion] numpy ufuncs and COREPY - any info?

Fri May 22 06:08:21 EDT 2009

On Friday 22 May 2009 11:42:56 Gregor Thalhammer wrote:
> dmitrey wrote:
> > hi all,
> > has anyone already compared an ordinary numpy ufunc with the
> > corresponding one from CorePy? I mean, first of all, the project
> > http://socghop.appspot.com/student_project/show/google/gsoc2009/python/t124024628235
> >
> > It would be interesting to know what the speedup is for (e.g.) vec ** 0.5
> > or (if it's possible - it isn't a pure ufunc) numpy.dot(Matrix, vec). Or
> > any other example.
>
> I have no experience with the mentioned CorePy, but recently I was
> playing around with accelerated ufuncs using Intel's Math Kernel Library
> (MKL). These improvements are now part of the numexpr package
> http://code.google.com/p/numexpr/
> Some remarks on possible speed improvements on recent Intel x86 processors.
> 1) basic arithmetic ufuncs (add, sub, mul, ...) in standard numpy are
> fast (SSE is used) and speed is limited by memory bandwidth.
> 2) the speed of many transcendental functions (exp, sin, cos, pow, ...)
> can be improved by _roughly_ a factor of five (single core) by using the
> MKL. Most of the improvements stem from using faster algorithms with a
> vectorized implementation. Note: the speed improvement depends on a
> _lot_ of other circumstances.
> 3) Improving performance by using multiple cores is much more difficult.
> A significant speedup is possible only for sufficiently large (>1e5)
> arrays. Where a speed gain is possible, the MKL uses several cores.
> Some experimentation showed that, by adding a few OpenMP constructs, you
> could get a similar speedup with numpy.
> 4) numpy.dot uses optimized implementations.
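
Points 1) and 2) above can be checked roughly by timing a basic arithmetic ufunc against a transcendental one on the same data. The sketch below uses only NumPy and the standard library; the array size, the repeat count, and the helper name `time_per_element` are illustrative choices, not something taken from the thread, and the numbers it prints depend heavily on the hardware and on how NumPy was built.

```python
import timeit

import numpy as np


def time_per_element(func, *args, repeat=5):
    """Best-of-`repeat` wall time of func(*args), in ns per element of args[0]."""
    best = min(timeit.repeat(lambda: func(*args), number=1, repeat=repeat))
    return best / args[0].size * 1e9


n = 1_000_000  # illustrative size, well above the ~1e5 mentioned in point 3)
x = np.random.rand(n)
y = np.random.rand(n)

# Basic arithmetic: little work per element, so memory traffic tends to dominate.
t_add = time_per_element(np.add, x, y)
# Transcendental function: noticeably more computation per element.
t_exp = time_per_element(np.exp, x)

print(f"add: {t_add:.2f} ns/element")
print(f"exp: {t_exp:.2f} ns/element")
```

If `exp` costs several times more per element than `add`, that gap is the room a faster vectorized implementation has to work with.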

Good points, Gregor.  However, I wouldn't say that improving performance by
using multiple cores is *that* difficult, but rather that multiple cores can
only be used efficiently *when* memory bandwidth is not the limiting factor.
An example of this is the computation of transcendental functions, where, even
using vectorized implementations, the computation is still CPU-bound
in many cases.  And you have seen very good speed-ups yourself for
these cases with your implementation of numexpr/MKL :)
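
As a minimal sketch of the CPU-bound case, a transcendental ufunc can be spread over several threads by splitting the array into chunks, since NumPy releases the GIL inside its ufunc loops for ordinary numeric dtypes. This is not how numexpr or the MKL do it; the helper name `threaded_ufunc` and the default of four threads are assumptions made for illustration, and whether it is actually faster depends on the array size and the machine.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def threaded_ufunc(ufunc, x, n_threads=4):
    """Apply a unary ufunc to 1-D array x, one contiguous chunk per thread."""
    out = np.empty_like(x)
    # array_split returns views, so writing into each output chunk fills `out`.
    chunks = zip(np.array_split(x, n_threads), np.array_split(out, n_threads))
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(ufunc, xi, out=oi) for xi, oi in chunks]
        for f in futures:
            f.result()  # re-raise any exception from a worker thread
    return out


x = np.random.rand(2_000_000)
result = threaded_ufunc(np.exp, x)
print(np.allclose(result, np.exp(x)))
```

The same wrapper applied to `np.negative` or another cheap ufunc would be expected to gain little, because the threads then compete for memory bandwidth rather than for CPU time.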

Cheers,

-- 
Francesc Alted