[Numpy-discussion] MKL + CPU, GPU + cuBLAS comparison

regikeyz . regi.public at gmail.com
Tue Nov 26 05:47:42 EST 2013


Hi guys,
Could you please unsubscribe me from these emails?
I can't find the link at the bottom.
Thank you



On 26 November 2013 10:42, Jerome Kieffer <Jerome.Kieffer at esrf.fr> wrote:

> On Tue, 26 Nov 2013 01:02:40 -0800
> "Dinesh Vadhia" <dineshbvadhia at hotmail.com> wrote:
>
> > Probably a loaded question but is there a significant performance
> difference between using MKL (or OpenBLAS) on multi-core cpu's and cuBLAS
> on gpu's.  Does anyone have recent experience or link to an independent
> benchmark?
> >
>
> Using Numpy (Xeon 5520 2.2GHz):
>
> In [1]: import numpy
> In [2]: shape = (450,450,450)
> In [3]: start=numpy.random.random(shape).astype("complex128")
> In [4]: %timeit result = numpy.fft.fftn(start)
> 1 loops, best of 3: 10.2 s per loop
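
[Editor's note: for readers who want to reproduce the NumPy baseline outside IPython, here is a minimal self-contained sketch using the stdlib `timeit` module. A smaller cube is used so it finishes quickly; the 450³ shape above takes seconds per transform.]

```python
import timeit
import numpy

# Smaller cube than the 450**3 in the original session, so this runs fast
shape = (64, 64, 64)
start = numpy.random.random(shape).astype("complex128")

# Time the 3-D FFT the way %timeit does, averaged over a few runs
t = timeit.timeit(lambda: numpy.fft.fftn(start), number=10) / 10
result = numpy.fft.fftn(start)
print(f"numpy.fft.fftn on {shape}: {t * 1e3:.2f} ms per loop")
```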
>
> Using FFTW (8 threads, 2x quad-core):
>
> In [5]: import fftw3
> In [7]: result = numpy.empty_like(start)
> In [8]: fft = fftw3.Plan(start, result, direction='forward',
> flags=['measure'], nthreads=8)
> In [9]: %timeit fft()
> 1 loops, best of 3: 887 ms per loop
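
[Editor's note: the `fftw3` Python bindings used above are old and no longer maintained. A hedged modern equivalent of the multithreaded run is `scipy.fft`, whose backend accepts a `workers` argument analogous to `nthreads=8`; this is a sketch of the same idea, not the original `fftw3` API.]

```python
import numpy
from scipy import fft as sp_fft

shape = (64, 64, 64)
start = numpy.random.random(shape).astype("complex128")

# workers=-1 uses all available cores, analogous to nthreads=8 above
result = sp_fft.fftn(start, workers=-1)

# Cross-check against the single-threaded NumPy transform
reference = numpy.fft.fftn(start)
print("matches numpy:", numpy.allclose(result, reference))
```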
>
> Using CuFFT (GeForce Titan):
> 1) with 2 transfers:
> In [10]: import pycuda,pycuda.gpuarray as gpuarray,scikits.cuda.fft as
> cu_fft,pycuda.autoinit
> In [11]: cuplan = cu_fft.Plan(start.shape, numpy.complex128,
> numpy.complex128)
> In [12]: d_result = gpuarray.empty(start.shape, start.dtype)
> In [13]: d_start = gpuarray.empty(start.shape, start.dtype)
> In [14]: def cuda_fft(start):
>    ....:     d_start.set(start)
>    ....:     cu_fft.fft(d_start, d_result, cuplan)
>    ....:     return d_result.get()
>    ....:
> In [15]: %timeit cuda_fft(start)
> 1 loops, best of 3: 1.7 s per loop
>
> 2) with 1 transfer:
> In [18]: def cuda_fft_2():
>     cu_fft.fft(d_start, d_result, cuplan)
>     return d_result.get()
>    ....:
> In [20]: %timeit cuda_fft_2()
> 1 loops, best of 3: 1.05 s per loop
>
> 3) Without transfer:
> In [22]: def cuda_fft_3():
>     cu_fft.fft(d_start, d_result, cuplan)
>     pycuda.autoinit.context.synchronize()
>    ....:
>
> In [23]: %timeit cuda_fft_3()
> 1 loops, best of 3: 202 ms per loop
>
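
[Editor's note: the three GPU timings above separate transfer cost from compute cost by hand. A hedged modern sketch of the same experiment uses CuPy, whose `cupy.fft.fftn` dispatches to cuFFT; CuPy and a GPU are assumptions here, so the snippet falls back to NumPy and stays runnable anywhere.]

```python
import numpy

try:
    import cupy as xp   # GPU path: cupy.fft.fftn wraps cuFFT
    on_gpu = True
except ImportError:
    xp = numpy          # CPU fallback so the sketch runs without a GPU
    on_gpu = False

shape = (64, 64, 64)
start = numpy.random.random(shape).astype("complex128")

d_start = xp.asarray(start)       # host -> device transfer (no-op on CPU)
d_result = xp.fft.fftn(d_start)   # the FFT itself, on-device if available
result = xp.asnumpy(d_result) if on_gpu else d_result  # device -> host

print("GPU used:", on_gpu, "| shape:", result.shape)
```

As in the measurements above, timing `d_start.set(...)` / `.get()` separately from the transform is what isolates the transfer cost.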
> Conclusion:
> A GeForce Titan (1000 €) can be ~4x faster than a pair of Xeon 5520s
> (2x 250 €) if your data are already on the GPU.
> Note: plan calculation is much faster on the GPU than on the CPU.
> --
> Jérôme Kieffer
> tel +33 476 882 445
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

