[Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication

Bruce Southey bsouthey at gmail.com
Thu Jun 16 14:54:20 EDT 2011


On 06/16/2011 11:44 AM, Christopher Barker wrote:
> NOTE: I'm only taking part in this discussion because it's interesting 
> and I hope to learn something. I do hope the OP chimes back in to 
> clarify his needs, but in the meantime...
>
>
>
> Bruce Southey wrote:
>> Remember that is what the OP wanted to do, not me.
>
> Actually, I don't think that's what the OP wanted -- we have a
> conflict between the need for concrete examples and the desire to
> find a generic solution. I think this is what the OP wants:
>
>
> How to best multiprocess a _generic_ operation that needs to be 
> performed on a lot of arrays. Something like:
>
>
> output = []
> for a in a_bunch_of_arrays:
>   output.append( a_function(a) )
>
>
> More specifically, a_function() is an inner product, *defined by the 
> user*.
>
> So there is no way to optimize the inner product itself (that will be 
> up to the user), nor any way to generally convert the bunch_of_arrays 
> to a single array with a single higher-dimensional operation.
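>
> (To make that concrete -- a minimal sketch, with a stand-in
> a_function, of how that loop maps onto multiprocessing:)
>
>     import numpy
>     import multiprocessing
>
>     def a_function(a):
>         # stand-in only; the real operation is user-defined
>         return numpy.sum(a * a)
>
>     if __name__ == '__main__':
>         a_bunch_of_arrays = [numpy.ones((10, 10)) for _ in range(100)]
>         pool = multiprocessing.Pool(processes=2)
>         output = pool.map(a_function, a_bunch_of_arrays)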
>
> In testing his approach, the OP used a numpy multiply and a simple
> loop-through-the-elements multiply, and found that, with his
> multiprocessing calls, the simple loop was a fair bit faster with two
> processors, but the numpy one was slower with two processors. Of
> course, the looping method was much, much slower than the numpy one
> in any case.
>
> So Sturla's comments are probably right on:
>
> Sturla Molden wrote:
>
>> "innerProductList = pool.map(myutil.numpy_inner_product, arrayList)"
>>
>> 1.  Here we potentially have a case of false sharing and/or mutex 
>> contention, as the work is too fine grained.  pool.map does not do 
>> any load balancing. If pool.map is to scale nicely, each work item 
>> must take a substantial amount of time. I suspect this is the main 
>> issue.
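>
> (Worth noting on point 1: pool.map takes an optional chunksize
> argument, so the work can be handed out in coarser pieces -- a sketch
> using the OP's names; whether it rescues these timings is an open
> question:)
>
>     # hand each worker 25 arrays at a time instead of one at a time
>     innerProductList = pool.map(myutil.numpy_inner_product,
>                                 arrayList, chunksize=25)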
>
>> 2. There is also the question of when the process pool is spawned. 
>> Though I haven't checked, I suspect it happens prior to calling 
>> pool.map. But if it does not, this is a factor as well, particularly 
>> on Windows (less so on Linux and Apple).
>
> It didn't work well on my Mac, so it's either not an issue, or not 
> Windows-specific, anyway.
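>
> A quick way to rule it out, though -- a sketch reusing the OP's names
> -- is to create the pool before the timed region, so the timing
> covers only the map itself:
>
>     pool = multiprocessing.Pool(processes=2)   # spawn cost paid here
>     t0 = time.time()
>     results = pool.map(myutil.numpy_inner_product, arrayList)
>     elapsed = time.time() - t0                 # excludes pool startup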
>
>> 3.  "arrayList" is serialised by pickling, which has a significant 
>> overhead.  It's not shared memory either, as the OP's code implies, 
>> but the main thing is the slowness of cPickle.
>
> I'll bet this is a big issue, and one I'm curious about how to 
> address. I have another problem where I need to multi-process, and I'd 
> love to know a way to pass data to the other process and back 
> *without* going through pickle. Maybe memmapped files?
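>
> (One possibility, as an untested sketch: back the data with
> numpy.memmap, so each worker re-opens the same file and only the
> filename gets pickled:)
>
>     import numpy
>
>     # parent writes the data to a file-backed array, once
>     fp = numpy.memmap('data.dat', dtype='float64', mode='w+',
>                       shape=(10, 10))
>     fp[:] = 1.0
>     fp.flush()
>
>     # each worker re-opens the same file read-only
>     def worker(fname):
>         a = numpy.memmap(fname, dtype='float64', mode='r',
>                          shape=(10, 10))
>         return numpy.sum(a * a)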
>
>> "IPs = N.array(innerProductList)"
>>
>> 4.  numpy.array is a very slow function. The benchmark should 
>> preferably not include this overhead.
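>
> (That is, time only the pool.map call and convert afterwards -- a
> sketch with the OP's names:)
>
>     t0 = time.time()
>     innerProductList = pool.map(myutil.numpy_inner_product, arrayList)
>     elapsed = time.time() - t0   # N.array() kept out of the timing
>     IPs = N.array(innerProductList)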
>
> I re-ran, moving that out of the timing loop, and, indeed, it helped a 
> lot, but it still takes longer with the multi-processing.
>
> I suspect that the overhead of pickling, etc. is overwhelming the 
> operation itself. That and the load balancing issue that I don't 
> understand!
>
> To test this, I did a little experiment -- creating a "fake" 
> operation, one that simply returns an element from the input array -- 
> so it should take next to no time, and we can time the overhead of the 
> pickling, etc:
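>
> The fake operation was essentially this (a sketch, not the exact
> code):
>
>     def fake_inner_product(a):
>         return a.flat[0]   # no real work; isolates pickling/dispatch cost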
>
> $ python shared_mem.py
>
> Using 2 processes
> No shared memory, numpy array multiplication took 0.124427080154 seconds
> Shared memory, numpy array multiplication took 0.586215019226 seconds
>
> No shared memory, fake array multiplication took 0.000391006469727 
> seconds
> Shared memory, fake array multiplication took 0.54935503006 seconds
>
> No shared memory, my array multiplication took 23.5055780411 seconds
> Shared memory, my array multiplication took 13.0932741165 seconds
>
> Bingo!
>
> The overhead of the multi-processing takes about 0.54 seconds, which 
> explains the slowdown for the numpy method.
>
> Not so mysterious after all.
>
> Bruce Southey wrote:
>
>> But if everything is *single-threaded* and thread-safe, then you just 
>> create a function and use Anne's very useful handythread.py 
>> (http://www.scipy.org/Cookbook/Multithreading).
>
> This may be worth a try -- though the GIL could well get in the way.
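>
> For what it's worth, numpy releases the GIL inside many of its
> compiled loops, so plain threads can pay off even without
> multiprocessing. A bare-bones stdlib sketch (not handythread itself),
> splitting the list across two threads:
>
>     import threading
>
>     def worker(arrays, out, start):
>         # each thread fills its own slice of the output list
>         for i, a in enumerate(arrays):
>             out[start + i] = numpy.sum(a * a)
>
>     out = [None] * len(arrayList)
>     half = len(arrayList) // 2
>     t1 = threading.Thread(target=worker, args=(arrayList[:half], out, 0))
>     t2 = threading.Thread(target=worker, args=(arrayList[half:], out, half))
>     t1.start(); t2.start(); t1.join(); t2.join()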
>
>> By the way, if the arrays are sufficiently small, there is a lot of 
>> overhead involved such that there is more time in communication than 
>> computation.
>
> yup -- clearly the case here. I wonder if it's just array size though 
> -- won't cPickle time scale with array size? So it may not be size 
> per se, but rather how much computation you need for a given size array.
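>
> That's easy enough to measure directly -- a sketch, sticking with the
> cPickle of Python 2 as in the thread:
>
>     import time, cPickle
>     import numpy
>
>     for n in (10, 100, 1000):
>         a = numpy.ones((n, n))
>         t0 = time.time()
>         s = cPickle.dumps(a, 2)   # binary pickle of the whole array
>         print n, time.time() - t0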
>
> -Chris
>
> [I've enclosed the OP's slightly altered code]
>
Please see:
http://mail.scipy.org/pipermail/numpy-discussion/2011-June/056766.html
"I'm doing element wise multiplication, basically innerProduct = 
numpy.sum(array1*array2) where array1 and array2 are, in general, 
multidimensional."
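
So, concretely, the per-array work is something like this (my
paraphrase of that description, with the two arrays packed into a
single argument so that pool.map can pass them):

     import numpy

     def inner_product(pair):
         array1, array2 = pair
         return numpy.sum(array1 * array2)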

Thanks for the code; I had forgotten that it was sent.

I think there is something weird about these timings (probably because 
they use time rather than timeit): the shared-memory timings should not 
be constant across the number of processes. The numpy multiplication 
approach shows rather constant times regardless of process count, but 
np.inner() clearly varies with the number of processes used. So I think 
that numpy may be using multiple threads here.
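
One way to test that (assuming numpy is linked against a threaded
BLAS; which variable applies depends on the BLAS) is to pin the thread
count when re-running, in the same style as the run above:

$ OMP_NUM_THREADS=1 python shared_mem.py    # OpenMP-based BLAS
$ MKL_NUM_THREADS=1 python shared_mem.py    # if numpy links against MKL

If the numpy timings change, threading inside the BLAS would be the
explanation.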

It is far more evident with large arrays:
     arraySize = (3000,200)
     numArrays = 50

Using 1 processes
No shared memory, numpy array multiplication took 0.279149055481 seconds
Shared memory, numpy array multiplication took 1.87239384651 seconds
No shared memory, inner array multiplication took 14.9514381886 seconds
Shared memory, inner array multiplication took 17.0087819099 seconds

Using 4 processes
No shared memory, numpy array multiplication took 0.279071807861 seconds
Shared memory, numpy array multiplication took 1.48242783546 seconds
No shared memory, inner array multiplication took 15.1401138306 seconds
Shared memory, inner array multiplication took 5.2479391098 seconds

Using 8 processes
No shared memory, numpy array multiplication took 0.281194925308 seconds
Shared memory, numpy array multiplication took 1.44942212105 seconds
No shared memory, inner array multiplication took 15.3794519901 seconds
Shared memory, inner array multiplication took 3.51714301109 seconds


Bruce




More information about the NumPy-Discussion mailing list