[issue38630] subprocess.Popen.send_signal() should poll the process

Mon Dec 9 16:18:18 EST 2019

Nathaniel Smith <njs at pobox.com> added the comment:

>  Thread B thinks the process is still running, so it calls waitid+WNOHANG on a stale PID, with unpredictable results.

I'm pretty sure you mean WNOWAIT, right? The actual reaping step (which might use WNOHANG) is already protected by a lock, but there's a danger that a thread might try to perform the blocking-until-exit step (which uses WNOWAIT) on a stale pid.

Good catch!
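
To make the WNOWAIT/WNOHANG distinction concrete, here is a minimal sketch, assuming a Linux-like platform where os.fork() and os.waitid() are available. The function name and the exit code 7 are made up for illustration; this is not the code in subprocess.py.

```python
import os

def wait_then_reap():
    """Block for exit with WNOWAIT, then reap separately with WNOHANG."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with an arbitrary code

    # Blocking-until-exit step: WNOWAIT leaves the child as a zombie,
    # so the pid stays valid and cannot be recycled yet.
    info = os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # Reaping step: the child is already dead, so WNOHANG collects it
    # without blocking. After this the pid may be reused by the OS.
    reaped_pid, status = os.waitpid(pid, os.WNOHANG)
    assert reaped_pid == pid

    # Any further wait on this pid is now a wait on a stale pid.
    try:
        os.waitpid(pid, os.WNOHANG)
        stale_error = False
    except ChildProcessError:
        stale_error = True

    return info.si_status, os.WEXITSTATUS(status), stale_error
```

In this single-threaded sketch the stale wait fails cleanly with ChildProcessError. The danger in the threaded case is that the pid may already belong to some other child by the time the stale wait happens.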

I think this can be fixed by using a second lock around all of .wait(). But this is a bit tricky because we also need to account for concurrent .poll() calls. So something like:

def wait(self):
    # Don't take the big lock unless the process is actually still running
    # This isn't just an optimization; it's critical for poll() correctness
    if self.poll() is not None:
        return self.returncode

    with self._blocking_wait_lock:
        with self._returncode_lock:
            revalidate pid licenses

        block_until_child_exits()

        with self._returncode_lock:
            reap the child and mark it as reaped

def poll(self):
    # if someone is already blocked waiting for the child, then it definitely
    # hasn't exited yet, so we don't need to call waitpid ourselves.
    # This isn't just an optimization; it's critical for correctness.
    if not self._blocking_wait_lock.acquire(blocking=False):
        return None
    try:
        with self._returncode_lock:
            do the actual poll + returncode updating
    finally:
        self._blocking_wait_lock.release()

Notes:

If there's already a wait() running, and someone calls poll(), then we have to make sure the poll() can't reap the process out from under the wait(). To fix that, poll() skips the waitpid attempt entirely if a wait() is already running.

But, this could introduce its own race condition: if a process has exited but we haven't noticed yet, and then someone calls wait() and poll() simultaneously, then the wait() might take the lock, then poll() notices the lock is taken, and concludes that the process can't have exited yet. Of course wait() will immediately reap the process and drop the lock, but by this point poll() has already returned the wrong information.

The problem is that poll() needs to be able to treat "the blocking wait lock is taken" as implying "the process hasn't exited yet". So to make that implication true, we add a check at the top of wait().

Of course if a process exits while wait() is running, and someone calls poll() in that brief interval between it exiting and wait() noticing, then poll() could again fail to report it exiting. But I think this is fine: it's ok if poll() is a fraction of a second out of date; that race condition is inherent in its API. The problem would be if poll() fails to notice a process that exited a while ago and is just sitting around waiting to be reaped.

... Ugh but there is still one more race condition here. I think this fixes all the cases involving send_signal, wait, and poll interactions with each other, BUT we broke the case of two threads calling poll() at the same time. One thread will take the blocking_wait_lock, then the other will take this as evidence that someone is blocked in wait() and exit early, which isn't appropriate.

I think we can patch up this last race condition by adding yet one more lock: a simple

with self._poll_lock:

around the whole body of poll(), so that only one thread can call it at a time.
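
Putting the three locks together, a runnable sketch might look like the following. Everything here is an assumption for illustration: FakeChild is a hypothetical stand-in for the OS child (a threading.Event instead of a real pid), Proc is not subprocess.Popen, and the "revalidate" step is reduced to re-checking returncode. It shows only the lock ordering, not a real implementation.

```python
import threading

class FakeChild:
    """Hypothetical stand-in for an OS child process."""
    def __init__(self):
        self._exited = threading.Event()
        self.reap_count = 0

    def exit(self):
        self._exited.set()

    def block_until_exit(self):
        # Plays the role of waitid(..., WEXITED | WNOWAIT): blocks, doesn't reap.
        self._exited.wait()

    def try_reap(self):
        # Plays the role of waitpid(pid, WNOHANG): None if still running.
        if not self._exited.is_set():
            return None
        self.reap_count += 1
        return 0

class Proc:
    def __init__(self, child):
        self._child = child
        self.returncode = None
        self._poll_lock = threading.Lock()
        self._blocking_wait_lock = threading.Lock()
        self._returncode_lock = threading.Lock()

    def _reap_locked(self):
        # Caller must hold _returncode_lock; reaps at most once.
        if self.returncode is None:
            self.returncode = self._child.try_reap()
        return self.returncode

    def poll(self):
        # Only one poller at a time, so a held _blocking_wait_lock can
        # only mean "someone is in wait()", never "someone else is polling".
        with self._poll_lock:
            if not self._blocking_wait_lock.acquire(blocking=False):
                return None
            try:
                with self._returncode_lock:
                    return self._reap_locked()
            finally:
                self._blocking_wait_lock.release()

    def wait(self):
        # Needed for poll() correctness: never hold the blocking lock
        # for a child that exited before wait() was called.
        if self.poll() is not None:
            return self.returncode
        with self._blocking_wait_lock:
            with self._returncode_lock:
                # Another wait() may have reaped while we queued for the lock.
                if self.returncode is not None:
                    return self.returncode
            self._child.block_until_exit()
            with self._returncode_lock:
                return self._reap_locked()
```

Locks are always taken in the order poll lock, blocking-wait lock, returncode lock, so the sketch shouldn't deadlock. With several threads in wait() at once, the child is still reaped exactly once.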

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue38630>
_______________________________________