[Numpy-discussion] testing with amd libm/acml

Thu Nov 8 05:22:08 EST 2012

On 11/8/12 12:35 AM, Chris Barker wrote:
> On Wed, Nov 7, 2012 at 11:41 AM, Neal Becker <ndbecker2 at gmail.com> wrote:
>> Would you expect numexpr without MKL to give a significant boost?
> It can, depending on the use case:
>   -- It can remove a lot of unnecessary temporary creation.
>   -- IIUC, it works on blocks of data at a time, and thus can keep
> things in cache more when working with large data sets.

Well, the temporaries are still created, but because numexpr works on 
small blocks at a time, they fit in the CPU cache and never have to be 
written out to main memory.  I call this the 'blocking technique'; it 
is explained from slide 26 onwards in:

https://python.g-node.org/wiki/_media/starving_cpu/starving-cpu.pdf
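
To make the idea concrete, here is a minimal sketch of blocking written 
in plain NumPy.  It is not how numexpr is implemented internally; the 
function name blocked_eval, the expression 2*a + 3*b and the block size 
of 4096 elements are all assumptions chosen for illustration, and the 
inputs are assumed to be 1-D arrays of the same length:

```python
import numpy as np

def blocked_eval(a, b, block_size=4096):
    """Evaluate 2*a + 3*b one block at a time (illustrative sketch only)."""
    out = np.empty_like(a)
    for start in range(0, a.size, block_size):
        stop = start + block_size  # slicing past the end is safe in NumPy
        # Each temporary holds at most block_size elements, so it is
        # small enough to stay in cache on typical CPUs.
        tmp = 2 * a[start:stop]
        tmp += 3 * b[start:stop]
        out[start:stop] = tmp
    return out

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)
result = blocked_eval(a, b)
```

The plain expression 2*a + 3*b would instead allocate temporaries as 
large as the operands themselves.  The best block size depends on the 
cache sizes of the machine, so 4096 is only a starting point.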

An even better technique is to shrink the block to its minimum size 
(1 element), so that temporaries are held in CPU registers instead of 
in small blocks in cache, avoiding copies even to *cache*.  Numba 
(https://github.com/numba/numba) follows this approach, which is close 
to optimal, as can be seen in slide 37 of the lecture above.
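
As a rough sketch of what a block size of 1 looks like, the loop below 
evaluates an expression one element at a time with scalar temporaries.  
The function name elementwise_eval and the expression 2*a + 3*b are 
assumptions made for illustration.  In pure Python this loop is slow; 
the point is only the shape of the code that a JIT compiler such as 
Numba could turn into machine code, where the scalars would be expected 
to live in registers.  The Numba decorator is deliberately left out so 
that the sketch runs with NumPy alone:

```python
import numpy as np

def elementwise_eval(a, b):
    """Evaluate 2*a + 3*b one element at a time (illustrative sketch only)."""
    out = np.empty_like(a)
    for i in range(a.size):
        # Scalar temporaries: no intermediate arrays are allocated.
        # Once compiled, values like these can be kept in registers.
        t1 = 2.0 * a[i]
        t2 = 3.0 * b[i]
        out[i] = t1 + t2
    return out

a = np.arange(1000, dtype=np.float64)
b = np.ones_like(a)
result = elementwise_eval(a, b)
```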

>    -- It can (optionally) use multiple threads for easy parallelization.

No, multithreading is the default: numexpr uses the *total* number of 
cores detected in the system; if you want fewer, you will need to call 
the set_num_threads(nthreads) function.  But agreed, using too many 
threads can sometimes be counterproductive.
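
For instance, limiting numexpr to two threads could look like the 
sketch below.  The arrays, the expression and the choice of 2 threads 
are assumptions for illustration only; the default thread count may 
also differ between numexpr releases, so check the version you have:

```python
import numexpr as ne
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)

# Number of cores numexpr detects on this machine.
ncores = ne.detect_number_of_cores()

# Ask for 2 threads; the call returns the previous setting.
previous = ne.set_num_threads(2)
result = ne.evaluate("2*a + 3*b", local_dict={"a": a, "b": b})

# Restore the earlier thread count afterwards.
ne.set_num_threads(previous)
```

Timing the same expression with a few different thread counts is the 
easiest way to find out where adding threads stops helping.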

-- 
Francesc Alted