[issue38084] multiprocessing cannot recover from crashed worker

Davin Potts report at bugs.python.org
Tue Sep 10 11:38:20 EDT 2019


Davin Potts <python at discontinuity.net> added the comment:

Sharing for the sake of documenting a few things going on in this particular example:
* When a PoolWorker process exits in this way (os._exit(anything)), the PoolWorker never gets the chance to send a signal of failure (normally sent via the outqueue) to the MainProcess.
* In the current logic of the MainProcess, Pool._maintain_pool() detects the termination of that PoolWorker process and starts a new PoolWorker process to replace it, maintaining the desired size of Pool.
* The infinite hang observed in this example comes from the original p.map() call waiting, with no timeout, for a result to appear on the outqueue.  Since the crashed PoolWorker never reported back, that result never arrives.  The wait is performed in MapResult.get(), which does expose a timeout parameter, though that parameter cannot be controlled through Pool.map().  It is not at all a correct, general solution, but exposing control of this timeout and setting it to 1.0 seconds permits Steve's repro code snippet to run to completion (no infinite hang; it raises a multiprocessing.context.TimeoutError).
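
To make the behavior above concrete, here is a minimal sketch.  It is not Steve's repro and not the change described in the last bullet: instead of modifying Pool.map(), it uses the public Pool.map_async() call and passes a timeout to get() on the returned result object.  The names crash_on_one and map_with_timeout are hypothetical, and the sketch assumes the "fork" start method (POSIX only).

```python
import multiprocessing as mp
import os


def crash_on_one(x):
    # Hypothetical stand-in for the crashing task: os._exit() ends the
    # worker immediately, so nothing is ever sent back on the outqueue.
    if x == 1:
        os._exit(1)
    return x * x


def map_with_timeout(timeout):
    # Assumption: "fork" start method (POSIX only), chosen so that the
    # sketch runs without an `if __name__ == "__main__"` guard.
    ctx = mp.get_context("fork")
    with ctx.Pool(2) as pool:
        # map_async() returns a result object whose get() takes a timeout,
        # unlike Pool.map(), which waits without one.
        pending = pool.map_async(crash_on_one, range(4))
        try:
            return pending.get(timeout=timeout)
        except mp.TimeoutError:
            # The pool replaced the dead worker, but the lost task's
            # result never arrives, so the wait times out.
            return None


result = map_with_timeout(1.0)
print(result)
```

As in the comment above, this only avoids the infinite wait; the task that was running in the crashed worker is still lost, so it is a way to observe the problem rather than a general fix.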

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue38084>
_______________________________________

