[Numpy-discussion] GPU implementation?

Brian Granger ellisonbg.net at gmail.com
Thu May 31 22:36:01 EDT 2007


This is very much worth pursuing.  I have been working on things
related to this on and off at my day job.  I can't say specifically
what I have been doing, but I can make some general comments:

* It is very easy to wrap the different parts of CUDA using ctypes and
call it from NumPy.
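
A concrete sketch of that ctypes pattern (using libc's memcpy as a
stand-in so it runs without a GPU; for real use you would CDLL
libcudart/libcublas and declare cudaMemcpy/cublasSgemm the same way):

```python
import ctypes
import ctypes.util
import numpy as np

# Load a shared library and declare one function's signature.  For CUDA
# you would load libcudart instead; memcpy stands in here so the sketch
# runs anywhere.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.memcpy.restype = ctypes.c_void_p
libc.memcpy.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t]

# numpy arrays expose their buffer address via .ctypes.data, which is
# exactly what you would pass as a host pointer in cudaMemcpy-style calls.
src = np.arange(8, dtype=np.float32)
dst = np.empty_like(src)
libc.memcpy(dst.ctypes.data, src.ctypes.data, src.nbytes)
```

The same declare-signature-then-call pattern covers the whole CUDA
runtime and cuBLAS APIs; only the library name and argtypes change.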

* Compared to a recent fast Intel CPU, the speedups we see are
consistent with what the NVIDIA literature reports: 10-30x is common,
and in some cases we have seen up to 170x.

* Certain parts of numpy will be very easy to accelerate: things
covered by BLAS, FFTs, ufuncs, and random variates - but each of
these will have very different speedups.
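
In numpy terms the candidates are easy to point at (illustrative only -
this is plain numpy on the CPU, just grouped by which GPU library each
call would map to):

```python
import numpy as np

a = np.random.rand(128, 128).astype(np.float32)
b = np.random.rand(128, 128).astype(np.float32)

c = np.dot(a, b)                     # BLAS (sgemm)    -> cuBLAS candidate
f = np.fft.fft(a, axis=0)            # FFT             -> cuFFT candidate
s = np.add(a, b)                     # ufunc           -> elementwise GPU kernel
r = np.random.standard_normal(1024)  # random variates -> device RNG
```

Each of these is either embarrassingly parallel or already backed by a
tuned library, which is why the expected speedups differ so much from
one category to the next.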

* LAPACK will be tough, extremely tough in some cases.  The main issue
is that various algorithms in LAPACK rely on different levels of BLAS
(1, 2, or 3).  The algorithms in LAPACK that primarily use level 1 BLAS
functions (vector operations), like LU-decomp, are probably not worth
porting to the GPU - at least not using the BLAS that NVIDIA provides.
On the other hand, the algorithms that use more of the level 2 and 3
BLAS functions are probably worth looking at.
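
The distinction between the BLAS levels comes down to arithmetic
intensity: level 3 does O(n^3) flops on O(n^2) data, so the cost of
moving data to the GPU can be amortized, while level 1 moves roughly a
byte per flop.  A quick sketch in numpy terms:

```python
import numpy as np

n = 128
x = np.ones(n)
y = np.ones(n)
A = np.eye(n)
B = np.eye(n)

# Level 1 (axpy): vector-vector, O(n) flops on O(n) data.
# Little arithmetic per byte moved - a poor fit for the GPU.
y = 2.0 * x + y

# Level 2 (gemv): matrix-vector, O(n^2) flops on O(n^2) data.
v = np.dot(A, x)

# Level 3 (gemm): matrix-matrix, O(n^3) flops on O(n^2) data.
# High flop-to-byte ratio - this is where the GPU pays off.
C = np.dot(A, B)
```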

* NVIDIA made a design decision in its implementation of cuBLAS and
cuFFT that is somewhat detrimental for certain algorithms.  In their
implementation, the BLAS and FFT routines can _only_ be called from
the CPU, not from code running on the GPU.  Thus if you have an
algorithm that makes many calls to cuBLAS/cuFFT, you pay a large
overhead in having to keep the main flow of the algorithm on the CPU.
It is not uncommon for this overhead to completely erode any speedup
you may have gotten on the GPU.
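
A back-of-the-envelope model makes the erosion concrete (the numbers
below are assumed for illustration, not measurements):

```python
# Suppose one BLAS call takes 1 ms on the CPU and the GPU kernel itself
# is 20x faster, but every call pays 0.5 ms of launch/transfer overhead
# because control has to return to the CPU between calls.
cpu_time_per_call = 1.0e-3   # seconds (assumed)
gpu_kernel_speedup = 20.0    # raw kernel speedup (assumed)
overhead_per_call = 0.5e-3   # per-call launch/transfer overhead (assumed)

gpu_time_per_call = cpu_time_per_call / gpu_kernel_speedup + overhead_per_call
effective_speedup = cpu_time_per_call / gpu_time_per_call
# The 20x kernel collapses to well under 2x end to end.
```

With these numbers the overhead term dominates; an algorithm that makes
thousands of small cuBLAS/cuFFT calls can easily end up no faster than
the CPU version.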

* For many BLAS calls, cuBLAS won't be much faster than a
well-optimized CPU BLAS such as ATLAS or Goto.

Brian


On 5/31/07, Martin Ünsal <martinunsal at gmail.com> wrote:
> I was wondering if anyone has thought about accelerating NumPy with a
> GPU. For example nVidia's CUDA SDK provides a feasible way to offload
> vector math onto the very fast SIMD processors available on the GPU.
> Currently GPUs primarily support single precision floats and are not
> IEEE compliant, but still could be useful for some applications.
>
> If there turns out to be a significant speedup over using the CPU, this
> could be a very accessible way to do scientific and numerical
> computation using GPUs, much easier than coding directly to the GPU APIs.
>
> Martin
>
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>


