[Python-Dev] Stable buildbots

David Bolen db3l.net at gmail.com
Sun Jan 30 21:50:36 CET 2011


Paul Moore <p.f.moore at gmail.com> writes:

> Presumably, you're inserting a pskill command somewhere into the
> actual build process. I don't know much about buildbot, but I thought
> that was controlled by the master and/or the Python build scripts,
> neither of which I can change.
>
> If I want to add a pskill command just after a build/test has run
> (i.e., about where kill_python runs at the moment) how do I do that?

I haven't been able to - as you say, there's no good way to hook into
the build process in real time, since any changes have to live outside
the build tree or they'll get zapped on the next checkout.  I suppose
you could try to monitor the output of the build slave's log file as
it's written, but then you risk killing a process belonging to the
next step if you miss something or react too slowly.  And I've had
cases (after long periods of continuous runtime) where the build slave
log stops being updated even while the slave itself is running fine.
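
(For the record, that approach would look something like the sketch
below.  It's purely hypothetical - the slave directory and the log
text to match are guesses about the buildbot slave's logging, and it
still carries exactly the timing risks described above.)

          - - - - - - - - - - - - - - - - - - - - - - - - -
#!/bin/sh
#
# HYPOTHETICAL sketch only: follow the slave's twistd.log and kill any
# leftover python_d processes whenever a step appears to finish.  The
# log path and the "command finished" marker are assumptions, not
# verified against actual buildbot slave output.

SLAVEDIR="$HOME/buildarea"      # assumed slave base directory

tail -F "$SLAVEDIR/twistd.log" | while read line; do
    case "$line" in
        *"command finished"*) pskill python_d ;;
    esac
done
          - - - - - - - - - - - - - - - - - - - - - - - - -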

Anyway, in the absence of changes to the build tree, I finally gave up
and now run an external script (see below) that whacks any python_d
process it finds that has been running for more than 2 hours (an
arbitrary choice).  I considered trying to dig deeper and identify
processes with no logical test parent (closer to what the build's own
kill_python does), but decided it was too much effort for the minimal
extra gain.  So it's not terribly different from your once-a-day
pskill, though as you say, if you arbitrarily kill all python_d
processes at a given point in time, you risk interrupting an active
test.

So the AutoIt script covers pop-ups, and the script below cleans up
hung processes.  On the subject of pop-ups: I'm not sure, but if you
find your service isn't showing them, try enabling the "Allow service
to interact with the desktop" option in the service definition.  In my
experience, though, when a service can't perform a UI interaction, the
interaction just fails outright, so I wouldn't expect the process to
get stuck in that case.
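
(For reference, you can toggle that option from a command line with
something like the following - "buildslave" is just a placeholder for
your actual service name, and Windows only accepts the interact type
for services running as LocalSystem:)

          - - - - - - - - - - - - - - - - - - - - - - - - -
# Hypothetical example; substitute your real service name for
# "buildslave".  Note the space after each "=", which sc requires,
# and that the service must run as LocalSystem.
sc config buildslave type= own type= interact
net stop buildslave && net start buildslave
          - - - - - - - - - - - - - - - - - - - - - - - - -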

Anyway, in my case the kill script itself is Cygwin/bash based, but
uses the Sysinternals PsTools utilities; conceptually it just kills
(via pskill) any python_d process that pslist reports as having been
running for 2 or more hours of wall-clock time:

          - - - - - - - - - - - - - - - - - - - - - - - - -
#!/bin/sh
#
# kill_python.sh
#
# Quick 'n dirty script to watch for python_d processes that exceed a few
# hours of runtime, then kill them on the assumption that they're hung
#

PROC="python_d"
TIMEOUT="2"

while true; do

    echo "`date` Checking..."

    # pslist's last column is elapsed time (hours:minutes:seconds); split
    # it on ":" and compare the hours field against the timeout, printing
    # the PID (column 2) of anything that's been running too long.
    PIDS=`pslist 2>&1 | grep "^$PROC" | awk -v TIMEOUT=$TIMEOUT '{split($NF,fields,":"); if (int(fields[1]) >= int(TIMEOUT)) {print $2}}'`

    if [ "$PIDS" ]; then
	echo ===== `date`
	for pid in $PIDS; do
	    pslist $pid 2>&1 | grep "^$PROC"
	    pskill $pid
	done
	echo =====
    fi

    sleep 300           # check again in five minutes
done
          - - - - - - - - - - - - - - - - - - - - - - - - -
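
(One way to keep it running is simply to launch it in the background
from a Cygwin shell, e.g. - the log name here is arbitrary:)

          nohup ./kill_python.sh >> kill_python.log 2>&1 &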

It's a kludge, but as you say, for us to impose this on the build
slave side it has to live outside of the build tree.  I've been
running it for about a month now and it seems to be doing the job.  I
run a similar script on OSX (my Tiger slave also sometimes sees stuck
processes, though they just burn CPU rather than interfere with
tests), but there I can identify stranded python_d processes by the
fact that they end up owned by init once their parent exits, so the
script can react more quickly.
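
(The heart of that check is conceptually something like the sketch
below - not the exact script, and the ps field handling is from
memory:)

          - - - - - - - - - - - - - - - - - - - - - - - - -
# Sketch of the OSX logic: a python_d whose parent is init (PPID 1)
# has been orphaned by its test run, so it can be killed right away
# rather than waiting out a multi-hour timeout.
ps -axo pid=,ppid=,comm= | awk '$2 == 1 && $3 ~ /python_d/ {print $1}' |
while read pid; do
    kill "$pid"
done
          - - - - - - - - - - - - - - - - - - - - - - - - -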

I'm pretty sure the best long-term fix is to move the kill processing
into the clean script (as per issue 9973) rather than where it
currently sits in the build script, but so far the idea doesn't seem
to have attracted the interest of anyone who can actually commit such
a change.  (See also the December continuation of this thread -
http://www.mail-archive.com/python-dev@python.org/msg54389.html)

I had also created issue 10641 when I thought I'd found a problem
with kill_python itself, but that turned out to be incorrect; in
subsequent tests the kill_python in the build tree always worked.  So
the core issue seems to be that it fails to run at all, not that it
runs and doesn't work.

For now, though, these two external "monitors" seem to have helped
contain the number of manual operations I have to do on my two Windows
slaves.  (Recently I've begun seeing two new sorts of pop-ups under
Windows 7, but both are memory-related, so I think I just need to give
my VM a little more memory.)

-- David


