[SciPy-User] leastsq and multiprocessing

Sturla Molden sturla.molden at gmail.com
Thu May 29 12:17:30 EDT 2014


I think this is fundamentally the wrong approach to a parallel leastsq. We
should replace the MINPACK supplied QR-solver with one based on LAPACK.
Then MKL, Accelerate or OpenBLAS will take care of the parallel processing.
This is often the dominant part of the computation, so parallelizing only
the Python callbacks will not be very scalable. If you really care about a
parallel leastsq, this is where you should put your effort.
The computational complexity here is O(N**3), compared to O(N) for the
callbacks. The bigger the problem, the more the QR part will dominate. 
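
For illustration, here is a minimal sketch of the LAPACK-backed path as seen
from Python (this uses scipy.linalg.lstsq, which calls LAPACK's *gelsd; it is
not the MINPACK replacement itself, and the sizes and data are made up):

    import numpy as np
    from scipy.linalg import lstsq

    # Made-up problem sizes: m residuals, n parameters.
    m, n = 20000, 500
    J = np.random.rand(m, n)   # stand-in for the Jacobian
    r = np.random.rand(m)      # stand-in for the residual vector

    # Solve the linearized subproblem min ||J @ step + r|| in LAPACK.
    # With a threaded BLAS/LAPACK (MKL, Accelerate, OpenBLAS) this
    # O(m * n**2) solve runs in parallel with no effort on the
    # Python side.
    step, _, rank, sv = lstsq(J, -r)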

As for the callback functions that produce the residuals and the Jacobian,
the easiest solution would be a prange in Cython or Numba, or Python
threads that release the GIL. I would not use multiprocessing without shared
memory here, because otherwise the IPC overhead will be too big. The
functions that compute the residuals and Jacobian are called repeatedly.
The major IPC overhead is multiprocessing's internal use of pickle to
serialize the ndarrays, not the communication over the pipes. I would
instead just copy data to and from shared memory. You can find a shared
memory system that works with multiprocessing on 
https://github.com/sturlamolden/sharedmem-numpy
Note that it does not remove the pickle overhead, so you should reuse the
shared memory arrays in the Python callbacks. This way the IPC overhead
will be reduced to a memcpy.
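
As a sketch of the threads/prange option, here is a hypothetical residual
callback parallelized with Numba's prange; the exponential model is made up,
but the pattern (threads, no pickling, no IPC) is the point:

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def residuals(params, x, y):
        # Hypothetical model: y ~ a * exp(-b * x). The loop body runs
        # in parallel across threads; nothing is pickled or piped.
        a = params[0]
        b = params[1]
        out = np.empty_like(y)
        for i in prange(x.shape[0]):
            out[i] = y[i] - a * np.exp(-b * x[i])
        return out

Such a function can be passed directly as func to scipy.optimize.leastsq,
e.g. leastsq(residuals, p0, args=(x, y)).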

Sturla


Frédéric Parrenin <parrenin at ujf-grenoble.fr> wrote:
> Actually, the parallel leastsq code is very unstable on both debian 7 and
> ubuntu 13.10.
> Sometimes it works, sometimes it freezes my computer.
> 
> I would be glad if anybody could explain to me the origin of this problem.
> 
> Best regards,
> 
> Frédéric Parrenin
> 
> 2014-05-23 8:52 GMT+02:00 Frédéric Parrenin <parrenin at ujf-grenoble.fr>:
> 
>> Answering my own question:
>> Actually, the same code, run on debian 7 instead of ubuntu 13.10, does not
>> slow down my computer. So this may be an ubuntu-specific problem.
>> 
>> As for the gain, my program runs in 545 s on one core and in 123 s using
>> 10 cores, a speedup of about 4.4x. So relative to ideal 10x scaling there
>> is roughly a factor-of-2 efficiency loss in this case, which is not too bad.
>> 
>> Best regards,
>> 
>> Frédéric Parrenin
>> 
>> 
>> 
>> 
>> 2014-05-23 4:23 GMT+02:00 Matt Newville <newville at cars.uchicago.edu>:
>> 
>>> Hi Frederic,
>>> 
>>> On Thu, May 22, 2014 at 10:20 AM, Frédéric Parrenin <
>>> parrenin at ujf-grenoble.fr> wrote:
>>> 
>>>> Dear all,
>>>> 
>>>> Coming back to an old thread...
>>>> 
>>>> I tried Jeremy's method since it is the easiest to implement.
>>>> Below is the Dfun function I provided to leastsq.
>>>> In my experiment, I used a pool of 6 since I have 8 cores in my PC.
>>>> 
>>>> However, the computer becomes extremely slow, almost unusable, during
>>>> the experiment.
>>>> Do you know why this happens?
>>>> 
>>>> Best regards,
>>>> 
>>>> Frédéric
>>>> 
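
(The Dfun code referred to above is not reproduced in the archive; what
follows is only a hypothetical sketch of the kind of pool-based
forward-difference Jacobian being described, with made-up names and a
pool of 6:)

    import numpy as np
    from multiprocessing import Pool

    def _column(args):
        # One forward-difference column of the Jacobian. Each call
        # ships params and data through pickle -- exactly the IPC
        # overhead discussed in this thread.
        func, params, i, eps, data, base = args
        p = np.array(params, dtype=float)
        p[i] += eps
        return (func(p, *data) - base) / eps

    def make_dfun(func, data, eps=1.0e-8, processes=6):
        pool = Pool(processes)
        def dfun(params, *args):
            base = func(np.asarray(params, dtype=float), *data)
            cols = pool.map(_column,
                            [(func, params, i, eps, data, base)
                             for i in range(len(params))])
            # leastsq's default col_deriv=0 expects shape (m, n).
            return np.array(cols).T
        return dfun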
>>> 
>>> Yes, my observation, based on the code at
>>>
>>> https://github.com/newville/scipy/commit/3d0ac1da3bcd1d34a1bec8226ea0284f04fcb5dc
>>>
>>> was that there was about a 10x performance hit, similar to your
>>> observations.
>>> 
>>> This approach assumes that the cost of setting up multiple processes is
>>> small compared to the execution time of the objective function itself.  It
>>> also assumes that having a Jacobian function in Python (as compared to
>>> Fortran) is a small performance hit.  Again, this is more likely to be true
>>> for a time-consuming objective function, and almost certainly not true for
>>> any small test case.
>>> 
>>> I could be persuaded that this approach is still a reasonable idea, but
>>> (at least if implemented in pure Python) all the evidence is that it is
>>> much slower.  Using Cython may help, but I have not tried this.
>>> 
>>> Any multiprocessing approach that includes calling the objective function
>>> from different processes is going to be limited by the "picklability"
>>> issue. To me, this is a fairly significant limitation.  I've been led to
>>> believe that the Mystic framework may have worked around this problem, but
>>> I don't know the details.
>>> 
>>> Others have suggested that doing the QR factorization with
>>> multiprocessing would be the better approach.  This seems worth trying,
>>> but, in my experience, the bulk of the time is actually spent in the
>>> objective function.
>>> 
>>> --Matt Newville
>>> 
>>> 
>> 
> 