[Numpy-discussion] IDL vs Python parallel computing

Julian Taylor jtaylor.debian at googlemail.com
Thu May 8 04:10:09 EDT 2014


On 08.05.2014 02:48, Frédéric Bastien wrote:
> Just a quick question/possibility.
> 
> What about just parallelizing ufuncs with only one input that is C or
> Fortran contiguous, like the trigonometric functions? Is there a fast
> path in the ufunc mechanism when the input is Fortran/C contiguous? If
> that is the case, it would be relatively easy to add an OpenMP pragma to
> parallelize that loop, conditional on a minimum number of elements.

OpenMP is problematic, as GNU OpenMP deadlocks on fork (multiprocessing).

I think if we do consider adding support, using
multiprocessing.pool.ThreadPool could be a good option.

But it is also not difficult for the user to just write a wrapper
function like this:

import numpy as np

def parallel_trig(x, func, pool, nthreads):
    x = x.reshape(nthreads, -1)  # assuming x is 1d and its size divides evenly by nthreads
    return np.concatenate(pool.map(func, x))  # use functools.partial to bind the out argument
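For example (the thread count, array size, and size cutoff here are just
illustrative; the cutoff follows the ~200k element threshold mentioned below):

    from multiprocessing.pool import ThreadPool
    import numpy as np

    pool = ThreadPool(4)
    x = np.random.rand(4000000)
    # only thread large arrays; small ones are faster single-threaded
    y = parallel_trig(x, np.sin, pool, 4) if x.size > 200000 else np.sin(x)
    assert np.allclose(y, np.sin(x))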

> 
> Anyway, I won't do it. I'm just outlining what I think is the easiest
> case to implement (depending on NumPy internals that I don't know well
> enough) and, I think, the most frequent one (so possibly a quick fix for
> someone with knowledge of that code).
> 
> In Theano, we found on a few CPUs that addition needs a minimum of 200k
> elements for the parallelization of elemwise to be useful. We use that
> number by default for all operations to keep it simple; it is user
> configurable. This guarantees that on the current generation of hardware
> the threading doesn't slow things down. I think that is the more
> important point: don't show users a slowdown by default with a new version.
> 
> Fred
> 
> 
> 
> 
> On Wed, May 7, 2014 at 2:27 PM, Julian Taylor
> <jtaylor.debian at googlemail.com> wrote:
> 
>     On 07.05.2014 20:11, Sturla Molden wrote:
>     > On 03/05/14 23:56, Siegfried Gonzi wrote:
>     >
>     > A more technical answer is that NumPy's internals do not play very
>     > nicely with multithreading. For example, the array iterators used in
>     > ufuncs store internal state. Multithreading would imply excessive
>     > contention for this state, as well as induce false sharing of the
>     > iterator object. Therefore, a multithreaded NumPy would have
>     > performance problems due to synchronization as well as hierarchical
>     > memory collisions. Adding multithreading support to the current NumPy
>     > core would just degrade the performance. NumPy will not be able to use
>     > multithreading efficiently unless we redesign the iterators in the
>     > NumPy core. That is a massive undertaking which probably means
>     > rewriting most of NumPy's core C code. A better strategy would be to
>     > monkey-patch some of the more common ufuncs with multithreaded versions.
> 
> 
>     I wouldn't say that the iterator is a problem: the important iterator
>     functions are thread-safe, and there is support for multithreaded
>     iteration using NpyIter_Copy, so no data is shared between threads.
> 
>     I'd say the main issue is that there simply aren't many functions worth
>     parallelizing in numpy. Most of the commonly used stuff is already
>     memory-bandwidth bound with only one or two threads.
>     The only things I can think of that would profit are sorting/partitioning
>     and the special functions like sqrt, exp, log, etc.
> 
>     Generic efficient parallelization would require merging operations to
>     improve the FLOPS/loads ratio. E.g. numexpr and theano are able to do
>     so and thus also have builtin support for multithreading.
> 
>     That being said, you can use Python threads with numpy, as (especially
>     in 1.9) most expensive functions release the GIL. But unless you are
>     doing very flop-intensive stuff you will probably have to manually block
>     your operations to the last-level cache size if you want to scale beyond
>     one or two threads.
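To illustrate the blocking mentioned in that last paragraph, a rough sketch
(the block size and the particular sqrt/exp chain are just placeholders):

    import numpy as np

    def blocked_ufunc_chain(x, out, block=256*1024):
        # chain several flop-heavy ufuncs on one cache-sized block at a time
        # so the intermediate result stays in the last level cache
        for i in range(0, x.size, block):
            tmp = np.sqrt(x[i:i+block])
            np.exp(tmp, out=tmp)
            out[i:i+block] = tmp

Each block (or a group of them) can then be handed to a thread pool like the
one above; whether it scales still depends on how flop-heavy the chained
operations are.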



