[Numpy-discussion] testing with amd libm/acml
Francesc Alted
francesc at continuum.io
Thu Nov 8 05:22:08 EST 2012
On 11/8/12 12:35 AM, Chris Barker wrote:
> On Wed, Nov 7, 2012 at 11:41 AM, Neal Becker <ndbecker2 at gmail.com> wrote:
>> Would you expect numexpr without MKL to give a significant boost?
> It can, depending on the use case:
> -- It can remove a lot of unnecessary temporary creation.
> -- IIUC, it works on blocks of data at a time, and thus can keep
> things in cache more when working with large data sets.
Well, the temporaries are still created, but by working on small
blocks at a time, these temporaries fit in the CPU cache, avoiding
copies out to main memory. I like to call this the 'blocking
technique', as explained in slide 26 (and following) of:
https://python.g-node.org/wiki/_media/starving_cpu/starving-cpu.pdf
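The idea above can be sketched in pure NumPy. This is a minimal illustration of the blocking technique, not numexpr's actual implementation; the expression, the helper name and the block size are made up for the example:

```python
import numpy as np

def blocked_eval(a, b, block_size=4096):
    # Evaluate 2*a + 3*b block by block: each temporary array
    # (2*a[s], 3*b[s]) is only block_size elements long, so it can
    # stay in the CPU cache instead of spilling to main memory.
    out = np.empty_like(a)
    for start in range(0, len(a), block_size):
        s = slice(start, start + block_size)
        out[s] = 2 * a[s] + 3 * b[s]
    return out

a = np.linspace(0.0, 1.0, 100_000)
b = np.linspace(1.0, 2.0, 100_000)
assert np.allclose(blocked_eval(a, b), 2 * a + 3 * b)
```

numexpr does essentially this internally (with a compiled virtual machine evaluating the whole expression per block), which is why it avoids the full-size temporaries that `2*a + 3*b` creates in plain NumPy.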
A better technique is to reduce the block size to the minimal unit
(one element), so temporaries are stored in CPU registers instead of
small blocks in cache, avoiding copies even in *cache*. Numba
(https://github.com/numba/numba) follows this approach, which is close
to optimal, as can be seen in slide 37 of the lecture above.
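In loop form, the one-element "block" looks like the sketch below. Run as plain Python it is slow; the point is that a JIT compiler such as Numba turns the loop body into native code where each per-element temporary lives in a register. The function name and expression are invented for illustration:

```python
import numpy as np

def expr_elementwise(a, b):
    # One-element blocks: once compiled (e.g. by decorating with
    # @numba.njit, assuming Numba is installed -- not done here),
    # the temporaries 2*a[i] and 3*b[i] exist only in CPU registers,
    # so no intermediate array ever touches cache or main memory.
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = 2.0 * a[i] + 3.0 * b[i]
    return out

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(expr_elementwise(a, b))  # [14. 19. 24.]
```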
> -- It can (optionally) use multiple threads for easy parallelization.
No, it is not optional: by default numexpr uses the *total* number of
cores detected in the system; if you want fewer, you need to call the
set_num_threads(nthreads) function. But agreed, sometimes using too
many threads can be counterproductive.
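A short sketch of capping the thread count, using only the standard library so it runs without numexpr installed; the halving heuristic is just an illustration, not numexpr's policy:

```python
import os

# numexpr detects and uses every core by default; cap it explicitly
# when that oversubscribes (e.g. for memory-bound expressions).
ncores = os.cpu_count() or 1
nthreads = max(1, ncores // 2)  # illustrative heuristic only
# With numexpr installed, you would apply it via:
#   numexpr.set_num_threads(nthreads)
print(nthreads)
```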
--
Francesc Alted