[Numpy-discussion] NumPy speed tests by NASA

Tue Feb 22 19:21:04 EST 2011

On 23.02.2011 00:19, Gökhan Sever wrote:
>
> I am guessing ATLAS is thread-aware, since with N=15000 each of the 
> four cores runs at 100%. Probably an MKL build doesn't bring much 
> speed advantage in this computation. Any thoughts?
>

There are still things like optimal cache use, SIMD extensions, etc. to 
consider. Some of MKL is hand-tweaked assembly and is, e.g., very fast 
on Intel Core i-series processors.
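
One way to compare BLAS builds on a given machine is to time a large 
matrix product and convert the result to GFLOP/s. The following is only 
a minimal sketch: the function name time_matmul and the matrix sizes are 
my own choices for illustration, and it assumes that np.dot on 2-D 
float64 arrays is dispatched to the BLAS that NumPy was built against.

```python
import time

import numpy as np


def time_matmul(n, repeats=3):
    """Time an n x n double-precision matrix product.

    Returns (best_seconds, gflops), where gflops uses the usual
    2*n**3 operation count for a dense matrix multiplication.
    """
    rng = np.random.RandomState(0)
    a = rng.rand(n, n)
    b = rng.rand(n, n)

    np.dot(a, b)  # warm-up run, so library start-up cost is not timed

    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.dot(a, b)
        best = min(best, time.perf_counter() - t0)

    flops = 2.0 * n ** 3
    return best, flops / best / 1e9


if __name__ == "__main__":
    # Sizes are kept small here; raise them to see threading effects.
    for n in (256, 512, 1024):
        seconds, gflops = time_matmul(n)
        print("N=%5d  %.4f s  %.1f GFLOP/s" % (n, seconds, gflops))
```

Running the same script against NumPy builds linked with different BLAS 
libraries gives a rough comparison. The numbers depend heavily on the 
CPU, the thread count and the matrix size.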

Other BLAS implementations to consider are ACML, GotoBLAS2, ACML-GPU, 
and CUBLAS.
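
Before comparing libraries it helps to know which BLAS and LAPACK a 
given NumPy build is actually linked against. np.show_config() prints 
the build configuration; the small wrapper below (blas_config_text is a 
name I made up for illustration) just captures that output as a string. 
What the text contains depends on the NumPy version and on how it was 
built, so treat this as a sketch.

```python
import contextlib
import io

import numpy as np


def blas_config_text():
    """Return the text that np.show_config() prints to stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()


if __name__ == "__main__":
    # Look for entries naming the BLAS/LAPACK libraries in the output.
    print(blas_config_text())
```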

GotoBLAS2 is currently the fastest BLAS implementation on x64 CPUs. It 
can e.g. be linked with the reference implementation of LAPACK. GotoBLAS 
is open source and is very easy to build ("just type make").

ACML is probably better than MKL on AMD processors, but not as good as 
MKL on Intel processors. It is currently free of charge (an MKL license 
costs $399).

The recent ACML-GPU library can move matrix multiplication (DGEMM and 
friends) to the GPU if there is an ATI (AMD) chip available, and the 
matrices are sufficiently large. The ATI GPU can also be programmed with 
OpenCL, but ACML-GPU just looks like an ordinary BLAS and LAPACK 
implementation (in addition to FFTs and PRNGs), so no special 
programming is needed.
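
A rough way to see why the matrices must be "sufficiently large" is to 
compare the arithmetic in a DGEMM with the data that has to cross the 
bus to the GPU. The sketch below is back-of-the-envelope arithmetic 
only, not ACML-GPU's actual heuristic (which I have not seen). It 
assumes square N x N double-precision matrices, 2*N**3 floating-point 
operations, and that A and B are copied to the GPU once and C is copied 
back once. The function name is hypothetical.

```python
def dgemm_flops_per_byte(n):
    """Floating-point operations per byte transferred for C = A * B.

    Assumptions (illustrative only): square n x n matrices of 8-byte
    doubles, 2*n**3 operations, and three matrices crossing the bus
    once each (A and B in, C out).
    """
    flops = 2.0 * n ** 3
    bytes_moved = 3 * n * n * 8
    return flops / bytes_moved  # simplifies to n / 12


if __name__ == "__main__":
    for n in (120, 1200, 15000):
        print("N=%6d  %.1f flops per byte moved" % (n, dgemm_flops_per_byte(n)))
```

Under these assumptions the ratio grows linearly with N (it is N/12), 
so the transfer cost is amortized only for large matrices. For small 
ones, the copy to and from the GPU can dominate.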

If one has an nVidia GPU, there is the CUBLAS library which implements 
BLAS, but not LAPACK. It has Fortran bindings and can probably be used 
with a reference LAPACK.

Sturla