client-server parallelised number crunching

Dan Stromberg drsalists at gmail.com
Tue Apr 26 16:31:02 EDT 2011


On Tue, Apr 26, 2011 at 12:55 PM, Hans Georg Schaathun
<georg at schaathun.net> wrote:
> I wonder if anyone has any experience with this ...
>
> I am trying to set up a simple client-server system to do some number
> crunching, using a simple ad hoc protocol over TCP/IP.  I use
> two Queue objects on the server side to manage the input and the output
> of the client process.  A basic system running seemingly fine on a single
> quad-core box was surprisingly simple to set up, and it seems to give
> me a reasonable speed-up of a factor of around 3-3.5 using four client
> processes in addition to the master process.  (If anyone wants more
> details, please ask.)
>
> Now, I would like to use remote hosts as well, more precisely, student
> lab boxen which are rather unreliable.  From experience I'd expect to
> lose roughly 4-5 jobs in 100 CPU hours on average.  Thus I need some
> way of detecting lost connections and requeuing unfinished tasks,
> avoiding any serious delays in this detection.  What is the best way to
> do this in python?
>
> It is, of course, possible for the master thread upon processing the
> results, to requeue the tasks for any missing results, but it seems
> to me to be a cleaner solution if I could detect disconnects and
> requeue the tasks from the networking threads.  Is that possible
> using python sockets?
>
> Somebody will probably ask why I am not using one of the multiprocessing
> libraries.  I have tried at least two, and got trapped by the overhead
> of passing complex pickled objects across.  Doing it myself has at least
> helped me clarify what can be parallelised effectively.  Now,
> understanding the parallelisable subproblems better, I could try again,
> if I could trust these libraries to handle lost clients robustly,
> but I don't know whether I can.

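You should probably assign a unique identifier to each piece of work
and implement two timeouts: one on your socket, using select, poll or
similar, so that a silent or dropped connection is noticed quickly, and
one per piece of work, keyed on its identifier, so that anything not
finished in time can be requeued.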
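
A rough sketch of the per-task half of that idea, in Python 3.  The
TaskTracker class, its method names and the timeout value are all made
up for illustration; they are not from the original poster's code.
Each task gets an id and a deadline when it is handed to a worker, and
anything past its deadline goes back on the queue.

```python
import itertools
import queue
import threading
import time


class TaskTracker:
    """Hypothetical helper: unique task ids plus a per-task deadline."""

    def __init__(self, task_timeout, clock=time.monotonic):
        self.task_timeout = task_timeout  # seconds allowed per task (assumed)
        self.clock = clock                # injectable so it can be tested
        self.pending = queue.Queue()      # tasks waiting for a worker
        self.in_flight = {}               # task_id -> (payload, deadline)
        self._ids = itertools.count(1)
        self._lock = threading.Lock()     # networking threads share this state

    def submit(self, payload):
        """Queue a new piece of work and return its unique id."""
        with self._lock:
            task_id = next(self._ids)
        self.pending.put((task_id, payload))
        return task_id

    def checkout(self):
        """Take the next task for a worker and start its deadline."""
        task_id, payload = self.pending.get_nowait()
        with self._lock:
            deadline = self.clock() + self.task_timeout
            self.in_flight[task_id] = (payload, deadline)
        return task_id, payload

    def complete(self, task_id):
        """Record a result; False means it was late or a duplicate."""
        with self._lock:
            return self.in_flight.pop(task_id, None) is not None

    def requeue_expired(self):
        """Put every overdue task back on the queue under its old id."""
        with self._lock:
            now = self.clock()
            expired = [tid for tid, (_, deadline) in self.in_flight.items()
                       if deadline <= now]
            for tid in expired:
                payload, _ = self.in_flight.pop(tid)
                self.pending.put((tid, payload))
        return expired
```

The master (or a small housekeeping thread) would call
requeue_expired() every so often.  A networking thread that notices a
dead connection could requeue that worker's task right away instead of
waiting for the deadline.  Because a task keeps its id when requeued, a
result that turns up late from a slow worker can be recognised and
dropped.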

http://gengnosis.blogspot.com/2007/01/level-triggered-and-edge-triggered.html
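
For the socket-side timeout, something along these lines might do,
again in Python 3.  WorkerLost and recv_with_timeout are names invented
for this sketch, and it assumes one blocking socket per networking
thread.  select() waits until the socket is readable or the timeout
runs out; an empty read after that means the peer closed the
connection.

```python
import select
import socket


class WorkerLost(Exception):
    """Hypothetical error: the worker disconnected or went silent."""


def recv_with_timeout(sock, timeout, bufsize=4096):
    """Return the next chunk of data from sock, or raise WorkerLost."""
    # Wait until the socket is readable or the timeout expires.
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        raise WorkerLost("no data within %.1f s" % timeout)
    try:
        data = sock.recv(bufsize)
    except OSError as exc:  # e.g. connection reset by peer
        raise WorkerLost(str(exc)) from exc
    if not data:  # empty read: the peer closed the connection
        raise WorkerLost("connection closed by peer")
    return data
```

The networking thread would catch WorkerLost, put the task it had
handed to that worker back on the input queue, and close the socket.
A machine that simply vanishes from the network may never send a reset,
which is why the timeout is needed in addition to the empty-read check.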


