[Numpy-discussion] parallel numpy (by Brian Granger) - any info?
eric jones
eric at enthought.com
Mon Jan 7 19:02:25 EST 2008
Robert Kern wrote:
> dmitrey wrote:
>
>> The only thing I'm very interested in for now - why are the
>> simplest matrix operations not implemented to be parallel in numpy
>> yet (for several-CPU computers, like my AMD Athlon X2)? First of all
>> it's related to matrix multiplication and division, either pointwise or
>> matrix (i.e. like A\B, A*B, dot(A,B)).
>>
>
> Eric Jones has made an attempt.
>
> http://svn.scipy.org/svn/numpy/branches/multicore/
>
> Unfortunately, the overhead of starting the threads and acquiring/releasing
> thread locks wipes out most of the performance gains until you get fairly large
> arrays. It is possible that this comes from the particular implementation,
> rather than being intrinsic to the problem.
>
>
Yes, the problem in this implementation is that it uses pthreads for
synchronization instead of spin locks with a work-pool implementation
tailored to numpy. The thread synchronization overhead is horrible
(300,000-400,000 clock cycles) and swamps anything other than very large
arrays. I have played with spin-lock based solutions that cut this to,
on average, 3000-4000 cycles. With that, arrays of 1e3-1e4 elements can
start seeing parallelization benefits. However, this code hasn't passed
the mad-scientist tinkering stage... I haven't touched it in at least 6
months, and I doubt I'll get back to it very soon (go Brian!). It did
look promising for up to 4 processors on some operations (sin, cos,
etc.) and worth-it-but-less-beneficial on simple operations (+, -, *,
etc.). Coupled with something like weave.blitz or numexpr that can (or
could) compile multiple binary operations into a single kernel,
expressions with multiple simple operations would scale very well.
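As an illustration of the single-kernel idea, numexpr (mentioned above) evaluates a whole expression in one pass over the data rather than materializing a temporary array for each intermediate operation. A minimal sketch, assuming numexpr is installed:

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# numexpr compiles the whole expression into a single kernel, so the
# data is streamed through once instead of allocating temporaries for
# 2*b, a + 2*b, and sin(a) as plain numpy would.
result = ne.evaluate("a + 2*b - sin(a)")

expected = a + 2 * b - np.sin(a)
assert np.allclose(result, expected)
```

Because the per-element work grows with the number of fused operations while the synchronization cost stays fixed, fused kernels are exactly the case where thread-level parallelism pays off soonest.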
My tinkering was aimed at a framework that would allow you to write
little computational kernels in a prescribed way, and then let a numpy
load-balancer automatically split the work up between worker threads
that execute these little kernels. Ideally, this load-balancer would be
pluggable. The inner loop for numpy's universal functions is probably
very close to, or exactly, the interface for these little kernels. Also,
it would be nice to couple this with weave so that kernels written with
weave could execute in parallel without user effort. (This is all like the
"map" part of a map-reduce architecture... The reduce part also needs to
fit into the architecture to generically handle things like sum, etc.)
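The map/reduce shape described here can be sketched in a few lines of modern Python (using the standard-library thread pool rather than the C-level spin-lock work pool discussed above; `parallel_reduce` and its parameters are illustrative names, not part of any numpy API):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_reduce(kernel, reducer, arr, n_workers=4):
    """Map `kernel` over contiguous chunks of `arr` in worker threads
    (the "map" step), then combine per-chunk results with `reducer`
    (the "reduce" step)."""
    # For a 1-D contiguous array these chunks are views, not copies.
    chunks = np.array_split(arr, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(kernel, chunks))
    return reducer(partials)

a = np.arange(1_000_000, dtype=np.float64)
total = parallel_reduce(np.sum, sum, a)
assert total == a.sum()
```

The reducer is what generically handles operations like sum, min, or max: the same chunking machinery serves both element-wise kernels (where results are written in place) and reductions (where partial results must be combined).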
Getting it to work with all flavors of numpy arrays (contiguous,
non-contiguous, buffered, etc.) is quite a chore, but the contiguous
arrays (and perhaps some non-contiguous) offer some relatively low
hanging fruit. Here's to hoping Brian's project bears fruit.
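A small example of why contiguous arrays are the easy case: slicing a contiguous array yields views that worker threads can read and write in place, whereas other access patterns force copies whose results would have to be gathered back together.

```python
import numpy as np

a = np.zeros(8)

half = a[:4]             # slicing a contiguous array gives a view...
assert half.base is a    # ...so a worker thread can operate in place

picked = a[[0, 1, 2, 3]]     # fancy indexing copies the data,
assert picked.base is not a  # so results must be merged back afterward
```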
I haven't thought about matrix ops much, so I don't know if they would
fit this (minimally) described architecture. I am sure that they would
scale well.
eric