[Numpy-discussion] parallel numpy (by Brian Granger) - any info?

eric jones eric at enthought.com
Mon Jan 7 19:02:25 EST 2008


Robert Kern wrote:
> dmitrey wrote:
>   
>> The only thing I'm really interested in for now is why even the 
>> simplest matrix operations are not yet implemented to run in 
>> parallel in numpy (for multi-CPU computers, like my AMD Athlon X2). 
>> First of all this concerns matrix multiplication and division, 
>> either elementwise or matrix-wise (i.e. like A\B, A*B, dot(A,B)).
>>     
>
> Eric Jones has made an attempt.
>
>    http://svn.scipy.org/svn/numpy/branches/multicore/
>
> Unfortunately, the overhead of starting the threads and acquiring/releasing 
> thread locks wipes out most of the performance gains until you get fairly large 
> arrays. It is possible that this comes from the particular implementation, 
> rather than being intrinsic to the problem.
>
>   
Yes, the problem in this implementation is that it uses pthreads for 
synchronization instead of spin locks with a work-pool implementation 
tailored to numpy.  The thread synchronization overhead is horrible 
(300,000-400,000 clock cycles) and swamps anything other than very 
large arrays.  I have played with spin-lock based solutions that cut 
this to, on average, 3,000-4,000 cycles.  With that, arrays of 1e3-1e4 
elements can start seeing parallelization benefits.  However, this 
code hasn't passed the mad-scientist tinkering stage...  I haven't 
touched it in at least 6 months, and I doubt I'll get back to it very 
soon (go Brian!).  It did look promising for up to 4 processors on 
some operations (sin, cos, etc.) and worth-it-but-less-beneficial on 
simple operations (+, -, *, etc.).  Coupled with something like 
weave.blitz or numexpr that can (or could) compile multiple binary 
operations into a single kernel, expressions with multiple simple 
operations would scale very well.
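
For the curious, the locking idea was along these lines (just a 
sketch using GCC's __sync atomic builtins, which is an assumption 
about the toolchain; the real code also needs backoff and work-pool 
shutdown handling):

    /* Minimal spin lock built on GCC's atomic builtins.  A waiting
       worker spins in user space instead of sleeping in the kernel,
       so handing off work costs thousands of cycles rather than the
       hundreds of thousands a pthread mutex/condition wait can. */
    typedef volatile int spinlock_t;

    static void spin_lock(spinlock_t *lock)
    {
        /* Atomically write 1 and get the old value back; loop until
           the old value was 0, i.e. until we acquired the lock. */
        while (__sync_lock_test_and_set(lock, 1)) {
            /* busy-wait; real code would pause/backoff here */
        }
    }

    static void spin_unlock(spinlock_t *lock)
    {
        __sync_lock_release(lock);   /* atomically store 0 */
    }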

My tinkering was aimed at a framework that would allow you to write 
little computational kernels in a prescribed way, and then let a numpy 
load-balancer automatically split the work up between worker threads 
that execute these little kernels.  Ideally, this load-balancer would 
be pluggable.  The inner loop for numpy's universal functions is 
probably very close to, or exactly, the interface for these little 
kernels.  Also, it would be nice to couple this with weave so that 
kernels written with weave could execute in parallel without user 
effort.  (This is all like the "map" part of a map-reduce 
architecture...  The reduce part also needs to fit into the 
architecture to generically handle things like sum, etc.)
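
To make the kernel idea concrete, here is roughly the shape I have in 
mind (a sketch only: numpy's real ufunc inner loop also takes a 
dimensions array and a void* of extra data, and the npy_intp typedef 
is faked here just to keep the snippet standalone):

    /* Stand-in for numpy's npy_intp, to keep the sketch
       self-contained. */
    typedef long npy_intp;

    /* Kernel interface, modeled on numpy's ufunc inner loop:
       args[] holds the data pointers (inputs then output),
       steps[] the byte strides, and n the element count. */
    typedef void (*kernel_t)(char **args, npy_intp n,
                             const npy_intp *steps);

    /* Example kernel: elementwise double addition, out = a + b. */
    static void add_double(char **args, npy_intp n,
                           const npy_intp *steps)
    {
        char *a = args[0], *b = args[1], *out = args[2];
        npy_intp i;
        for (i = 0; i < n; i++) {
            *(double *)out = *(double *)a + *(double *)b;
            a += steps[0];  b += steps[1];  out += steps[2];
        }
    }

The load-balancer's job is then just to hand each worker thread a 
(kernel, pointers, count) triple.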

Getting it to work with all flavors of numpy arrays (contiguous, 
non-contiguous, buffered, etc.) is quite a chore, but contiguous 
arrays (and perhaps some non-contiguous ones) offer some relatively 
low-hanging fruit: for those, splitting the work is little more than 
pointer arithmetic, as the sketch below shows.  Here's to hoping 
Brian's project bears fruit.

I haven't thought about matrix ops much, so I don't know if they would 
fit this (minimally described) architecture.  I am sure that they 
would scale well.


eric
