[Numpy-discussion] MKL + CPU, GPU + cuBLAS comparison

Jerome Kieffer Jerome.Kieffer at esrf.fr
Tue Nov 26 05:42:05 EST 2013


On Tue, 26 Nov 2013 01:02:40 -0800
"Dinesh Vadhia" <dineshbvadhia at hotmail.com> wrote:

> Probably a loaded question but is there a significant performance difference between using MKL (or OpenBLAS) on multi-core cpu's and cuBLAS on gpu's.  Does anyone have recent experience or link to an independent benchmark?
> 

Using NumPy (Xeon 5520, 2.2 GHz):

In [1]: import numpy
In [2]: shape = (450,450,450)
In [3]: start=numpy.random.random(shape).astype("complex128")
In [4]: %timeit result = numpy.fft.fftn(start)
1 loops, best of 3: 10.2 s per loop

Using FFTW (8 threads, 2x quad-core):

In [5]: import fftw3
In [7]: result = numpy.empty_like(start)
In [8]: fft = fftw3.Plan(start, result, direction='forward', flags=['measure'], nthreads=8)
In [9]: %timeit fft()
1 loops, best of 3: 887 ms per loop

Using CuFFT (GeForce Titan):
1) with 2 transfers:
In [10]: import pycuda,pycuda.gpuarray as gpuarray,scikits.cuda.fft as cu_fft,pycuda.autoinit
In [11]: cuplan = cu_fft.Plan(start.shape, numpy.complex128, numpy.complex128)
In [12]: d_result = gpuarray.empty(start.shape, start.dtype)
In [13]: d_start = gpuarray.empty(start.shape, start.dtype)
In [14]: def cuda_fft(start):
   ....:     d_start.set(start)
   ....:     cu_fft.fft(d_start, d_result, cuplan)
   ....:     return d_result.get()
   ....: 
In [15]: %timeit cuda_fft(start)
1 loops, best of 3: 1.7 s per loop

2) with 1 transfer (input already on the GPU):
In [18]: def cuda_fft_2():
   ....:     cu_fft.fft(d_start, d_result, cuplan)
   ....:     return d_result.get()
   ....: 
In [20]: %timeit cuda_fft_2()
1 loops, best of 3: 1.05 s per loop

3) without transfer (input and output stay on the GPU):
In [22]: def cuda_fft_3():
   ....:     cu_fft.fft(d_start, d_result, cuplan)
   ....:     pycuda.autoinit.context.synchronize()
   ....: 

In [23]: %timeit cuda_fft_3()
1 loops, best of 3: 202 ms per loop
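For what it's worth, a back-of-envelope check (my own sketch, derived only from the timings above, not an extra measurement) suggests the differences between the three GPU timings are consistent with PCIe transfer time for the ~1.5 GB array:

```python
# Estimate how much of the GPU timings is host<->device transfer.
shape = (450, 450, 450)
nbytes = shape[0] * shape[1] * shape[2] * 16  # complex128 = 16 bytes/element
gb = nbytes / 1e9

# Timings taken from the benchmarks above (seconds)
t_both = 1.7    # host->device copy + FFT + device->host copy
t_out = 1.05    # FFT + device->host copy
t_fft = 0.202   # FFT only

print("array size: %.2f GB" % gb)
print("implied host->device bandwidth: %.1f GB/s" % (gb / (t_both - t_out)))
print("implied device->host bandwidth: %.1f GB/s" % (gb / (t_out - t_fft)))
```

Both implied bandwidths land in the ~2 GB/s range, plausible for pageable-memory transfers over PCIe, so the 1.7 s and 1.05 s figures look transfer-bound rather than compute-bound.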

Conclusion:
A GeForce Titan (1000€) can be 4x faster than a pair of Xeon 5520s (2x 250€), provided your data are already on the GPU.
Note: plan calculation is much faster on the GPU than on the CPU.
-- 
Jérôme Kieffer
tel +33 476 882 445
