[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

Greg Brockman report at bugs.python.org
Tue Jul 13 02:43:48 CEST 2010


Greg Brockman <gdb at ksplice.com> added the comment:

> For processes disappearing (if that can at all happen), we could solve
> that by storing the jobs a process has accepted (started working on),
> so if a worker process is lost, we can mark them as failed too.
Sure, this would be reasonable behavior.  I had considered it, but decided it was a larger change than I wanted to make without consulting the devs.
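
To make that bookkeeping idea concrete, here is a minimal standalone sketch (the names run_job/accepted/results are just illustrative, not multiprocessing API): the parent records which job each worker has accepted, and if the worker disappears it synthesizes a failure result instead of waiting forever.
"""
#!/usr/bin/env python
import multiprocessing
import os

def run_job(accepted, results, job_id, func, arg):
    accepted[os.getpid()] = job_id    # "I have started working on job_id"
    results[job_id] = func(arg)       # never reached if the worker dies here
    del accepted[os.getpid()]         # finished normally

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    accepted = manager.dict()         # worker pid -> job id currently in flight
    results = manager.dict()          # job id -> result (or failure marker)

    # os._exit(1) stands in for a worker being killed partway through a job.
    w = multiprocessing.Process(target=run_job,
                                args=(accepted, results, 1, os._exit, 1))
    w.start()
    w.join()

    if w.exitcode != 0:
        # The worker is gone: fail whatever it had accepted but not finished,
        # rather than waiting forever for a result that will never arrive.
        for pid, job_id in accepted.copy().items():
            results[job_id] = 'lost worker %d (exitcode %r)' % (pid, w.exitcode)

    print(results.copy())
"""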

> I was already working on this issue last week actually, and I managed
> to do that in a way that works well enough (at least for me):
If I'm reading this right, you catch the exception upon pickling the result (at which point you already have the job/i information; totally reasonable).  I'm worried about the case where unpickling the task fails (namely, at the "task = get()" line of the "worker" method).  Try running the following:
"""
#!/usr/bin/env python
import multiprocessing
p = multiprocessing.Pool(1)
def foo(x):
  pass
p.apply(foo, [1])
"""
And if "task = get()" fails, then the worker doesn't know what the relevant job/i values are.

Anyway, I guess the question forming in my mind is: what sorts of errors do we want to handle, and how do we want to handle them?  My answer is that I'd like to handle all possible errors with some behavior that is not "hang forever".  This includes child processes dying from signals or os._exit, unpickling errors, etc.
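
(For the "worker dies mid-task" case, a reproducer in the same spirit as the one above -- the die/p names are just illustrative; on the affected versions, the apply() call never returns:)
"""
#!/usr/bin/env python
import multiprocessing
import os

def die(x):
    os._exit(1)   # stands in for the worker being killed by a signal

if __name__ == '__main__':
    p = multiprocessing.Pool(1)
    p.apply(die, [0])   # on the affected versions, this never returns
"""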

I believe my patch provides this functionality.  By adding the extra mechanism that you've written/proposed, we can improve the error handling in specific recoverable cases (which probably constitute the vast majority of real-world cases).

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue9205>
_______________________________________

