[ python-Bugs-1183780 ] Popen4 wait() fails sporadically with threads

SourceForge.net noreply at sourceforge.net
Wed Jul 11 14:46:38 CEST 2007


Bugs item #1183780, was opened at 2005-04-15 16:27
Message generated for change (Comment added) made by gjb1002
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1183780&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Closed
Resolution: Accepted
Priority: 5
Private: No
Submitted By: Taale Skogan (tskogan)
Assigned to: Neal Norwitz (nnorwitz)
Summary: Popen4 wait() fails sporadically with threads

Initial Comment:
Calling wait() on a popen2.Popen4 object fails 
intermittently with the error

Traceback (most recent call last):
  ...
  File "/usr/local/lib/python2.3/popen2.py", line 90, in wait
    pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

when using threads. 

The problem seems to be a race condition when a thread 
calls wait() on a popen2.Popen4 object. This also apllies 
to Popen3 objects. 

The constructor of Popen4. calls _cleanup() which calls 
poll() which calls the system call waitpid() for all acitve 
child processes. If another thread calls poll() before the 
current thread calls wait() on it's child process and the 
child process has terminated, the child process is no 
longer waitable and the second call to wait() fails.

Code to replicate this behavoir is attached in popen_bug.
py.

Solution: Popen4 and Popen3 should be threadsafe.

Related modules: A seemingly related error occurs with 
Popen from the new subprocess module. Use the -s 
option in the popen_bug.py script to test this. 

Tested on Linux RedHat Enterprise 3 for Python 2.3.3, 
Python 2.3.5 and Python 2.4.1 and Solaris for Python 2.
4.1. The error did not occur on a RedHat 7.3 machine 
with Python 2.3.5. See the attached file popen_bug.py for 
details on the platforms.



----------------------------------------------------------------------

Comment By: Geoffrey Bache (gjb1002)
Date: 2007-07-11 14:46

Message:
Logged In: YES 
user_id=769182
Originator: NO

>From the original description:

"Related modules: A seemingly related error occurs with
Popen from the new subprocess module. Use the -s
option in the popen_bug.py script to test this."

I can be confident it is the same problem as I have used the exact 
same script to reproduce it, and got the exact same stack trace.

As I added, I have also reproduced the "popen" problem using Python 2.4.4
once, though
this is much harder to reproduce. So I claim this is not in fact fixed in
all cases.

I can report separately if you really want, but it seems odd when the
symptoms and test case
are the same.


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2007-07-10 23:50

Message:
Logged In: YES 
user_id=21627
Originator: NO

If you are seeing a bug in subprocess, please report it separately. This
one has been fixed.

Please don't assume that it is the "same" problem as the one reported
here, unless you have a working patch that proves that it is indeed a
similar problem.

----------------------------------------------------------------------

Comment By: Geoffrey Bache (gjb1002)
Date: 2007-07-10 18:00

Message:
Logged In: YES 
user_id=769182
Originator: NO

As an additional note, I have also reproduced the popen problem using
Python 2.4.4, though only once.
It seems to occur more frequently in subprocess.py.


----------------------------------------------------------------------

Comment By: Geoffrey Bache (gjb1002)
Date: 2007-07-10 17:38

Message:
Logged In: YES 
user_id=769182
Originator: NO

Did this get fixed in subprocess.py? The patches all seem to be for
popen2.

I have been observing similar problems in subprocess.py, so I downloaded
the test script and ran it with the -s option. It didn't work out of the
box, I had to pass "shell=True" to the subprocess.Popen before it did
anything at all. Which also made me wonder if the subprocess variant of
this problem got forgotten.

Having done that, I have observed failures using Python 2.4.4 on Red Hat
EL3 Linux, and also using Python 2.4.3 on Red Hat EL4 linux. Most of the
time it works, sometimes it hangs forever, and sometimes we get something
that look like this:

Started 20 threads
Exception in thread Thread-19:
Traceback (most recent call last):
  File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "popen_bug.py", line 53, in run
    pipe.wait()
  File "/usr/lib/python2.4/subprocess.py", line 1007, in wait
    pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

P.S. Googling for "[Errno 10] No child processes" suggests others have
this problem. There have been long discussions on the Zope list as to why
some people on Linux get exceptions that look like this, for example.



----------------------------------------------------------------------

Comment By: cheops (atila-cheops)
Date: 2006-04-10 16:55

Message:
Logged In: YES 
user_id=1276121

see patch # 1467770 for subprocess.py library module

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-03-24 09:16

Message:
Logged In: YES 
user_id=21627

Committed as 43286. I also added .cmd to Popen4.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2006-03-24 08:56

Message:
Logged In: YES 
user_id=33168

It makes sense to remove from _active on ECHILD.

I wondered the same thing about waitpid(), but left it as it
was.  I don't believe it's possible for waitpid to return
any pid other than what you ask for unless the O/S is very,
very broken.

This patch is fine with me, feel free to check it in.  BTW,
nice comments and precondition checks in the test.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-03-24 08:40

Message:
Logged In: YES 
user_id=21627

This looks all fine.

As a further issue, I think _cleanup should also clean
processes which already have been waited on. So if waitpid
gives ECHILD (in _cleanup), I think the object should get
removed from _active - otherwise, it would stay there
forever. Of course, care is then need to avoid __del__
adding it back to _active.

Putting these all together, I propose v3 of the patch.

Another aspect that puzzles me is the repeated test that
waitpid() really returns the pid you asked for. How could it
not? If it fails, you get an os.error.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2006-03-24 06:17

Message:
Logged In: YES 
user_id=33168

I agree with your comment about setting self.sts to 0.  That
was the problem I alluded to on python-dev.

Although I dislike __del__, this does seem like an
appropriate place to do the modification of _active.

Note that currently all os.error's are swallowed in poll().
 I'm not sure if that was the best idea, but that's the
current interface.  wait() does *not* catch any exceptions.

I wasn't really sure what to do about threads.  The threads
could always manage their calls into a popen object like you
propose (don't try to handle simultaneous calls to poll and
wait).  Another question I had was if popen should be
deprecated in favor of subprocess?

I've attached a patch which I think implements your
suggestion.  It also seems to fix all the problems.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2006-03-24 01:37

Message:
Logged In: YES 
user_id=21627

I don't understand why you are setting self.sts to 0 if wait
fails: most likely, there was a simultaneous call to .poll,
which should have set self.sts to the real return value. So
we should return that instead.

I think the whole issue can be avoid if we use resurrection:
If __del__ would put unwaited objects into _active, rather
than __init__, it would not happen that _cleanup polls a pid
which a thread still intends to wait for. In fact, it would
be sufficient to only put the pid into _active (avoiding the
need for resurrection).

If then a thread calls poll explicitly, and another calls
wait, they deserve to lose (with ECHILD). I would claim the
same error exists if part of the application calls
os.wait[3,4](), and another calls .wait later - they also
deserve the exception.

With that approach, I don't think further thread
synchronization would be needed.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2006-03-23 09:41

Message:
Logged In: YES 
user_id=33168

The attached patch fixes the problem for me.  It also
addresses another issue where wait could be called from
outside the popen2 module.  I'm not sure this is the best
solution.  I'm not sure there really is a good solution. 
Perhaps it's best to allow an exception to be raised?

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1183780&group_id=5470


More information about the Python-bugs-list mailing list