[Python-Dev] Green buildbot failure.

Mon Aug 12 00:49:45 CEST 2013

Victor Stinner <victor.stinner at gmail.com> writes:

> test.regrtest uses faulthandler.dump_traceback_later() to stop the
> test after a timeout if --timeout command line option is used.

The slave doesn't actually control the test parameters, which come
from build/Tools/buildbot/test.bat (which runs build/PCBuild/rt.bat)
plus anything sent from the master.  But no, it doesn't look like that
flow is currently using --timeout, so the main timeout in place is the
one enforced by the buildbot slave itself (currently 3900s, measured
against output activity from the process under test).
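
As a rough illustration of what an output-activity timeout means, here
is a minimal sketch in Python.  It is not buildbot's code: the function
name run_with_idle_timeout and the polling interval are made up for the
example.  The child is killed only if it stays silent for longer than
the limit, so a slow test that keeps printing is left alone.

```python
import subprocess
import sys
import threading
import time


def run_with_idle_timeout(cmd, idle_timeout):
    """Run cmd and kill it if it prints nothing for idle_timeout seconds.

    Returns (returncode, timed_out, output_lines).  Hypothetical helper,
    not part of buildbot.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    last_activity = [time.monotonic()]
    lines = []

    def reader():
        # Every line of output resets the idle clock.
        for line in proc.stdout:
            last_activity[0] = time.monotonic()
            lines.append(line)

    reader_thread = threading.Thread(target=reader, daemon=True)
    reader_thread.start()

    timed_out = False
    while proc.poll() is None:
        if time.monotonic() - last_activity[0] > idle_timeout:
            timed_out = True
            proc.kill()
            break
        time.sleep(0.05)

    proc.wait()
    reader_thread.join(timeout=5)
    return proc.returncode, timed_out, lines


if __name__ == "__main__":
    # A child that never prints is killed after about one second.
    print(run_with_idle_timeout(
        [sys.executable, "-c", "import time; time.sleep(60)"], 1.0))
```

Note that a timeout like this only ends the process the slave started
directly; anything that process spawned is outside its reach.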

Windows buildbots also have an additional "kill" path where the build
scripts build and execute a separate kill_python_d executable (in
PCBuild) to kill off any python_d process.  It does have some
sequencing issues (it runs during the build stage rather than clean)
but no matter where it is used, being part of the build sequence risks
it being skipped if the master/slave connection breaks mid-test.

For some additional background, see email threads:

http://mail.python.org/pipermail/python-dev/2010-November/105585.html
http://mail.python.org/pipermail/python-dev/2010-December/106510.html
http://mail.python.org/pipermail/python-dev/2011-January/107776.html

Anyway, the termination in this particular case is completely separate
from buildbot processing.  It's a small script that runs constantly in
the background, combining pslist/pskill from Sysinternals (as pskill
proved always able to kill the processes) to look for old python_d
processes and kill them.
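
The selection logic of such a watchdog can be sketched in Python as
follows.  This is not the actual script: ProcInfo, select_stale,
watchdog_pass and the two-hour age threshold are all assumptions made
for the example, and process listing and killing are passed in as
callables (standing in for pslist and pskill) rather than implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcInfo:
    """One row of a process listing (hypothetical shape)."""
    pid: int
    image_name: str
    elapsed_seconds: float


def select_stale(procs, image_name="python_d.exe",
                 max_age_seconds=2 * 3600):
    """Return the PIDs of matching processes older than the threshold.

    The two-hour default is an assumed value, chosen only to be longer
    than the 3900s buildbot timeout.
    """
    wanted = image_name.lower()
    return [p.pid for p in procs
            if p.image_name.lower() == wanted
            and p.elapsed_seconds > max_age_seconds]


def watchdog_pass(list_processes, kill_process, **kwargs):
    """Run one sweep: list processes, kill the stale ones, report them."""
    killed = []
    for pid in select_stale(list_processes(), **kwargs):
        kill_process(pid)
        killed.append(pid)
    return killed


if __name__ == "__main__":
    fake_listing = [
        ProcInfo(100, "python_d.exe", 120.0),     # current test run
        ProcInfo(200, "python_d.exe", 90000.0),   # left over from earlier
        ProcInfo(300, "explorer.exe", 500000.0),  # unrelated process
    ]
    print(watchdog_pass(lambda: fake_listing, lambda pid: None))
```

A real version would call watchdog_pass in a loop with a sleep between
sweeps.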

My Windows buildbots have three additional layers of termination
handling (beyond the standard buildbot timeout and kill_python in the
test itself):

  1. Modification to buildbot slave code to prevent Windows process and
     file dialogs.
  2. AutoIt script in the background to acknowledge C RTL dialogs that
     the prior step doesn't block.  (There have been past discussions
     about having Python itself disable RTL dialogs in test builds.)
  3. The external watchdog script as a fail-safe.

The first two cases will definitely be recognized as test failures:
although the dialogs are suppressed/acknowledged, the triggering code
still receives a failure result.
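
For the first layer, one way a parent process can keep Windows from
raising such dialogs is the Win32 SetErrorMode call, whose setting is
inherited by child processes.  The sketch below shows that technique
only; it is an assumption that the slave modification works along these
lines, and the helper name suppress_error_dialogs is invented.  It does
not cover C runtime assertion dialogs, which are controlled separately.

```python
import ctypes
import sys

# Win32 error-mode flags (values from the Windows SDK headers).
SEM_FAILCRITICALERRORS = 0x0001   # no critical-error (e.g. missing disk) box
SEM_NOGPFAULTERRORBOX = 0x0002    # no crash dialog on unhandled faults
SEM_NOOPENFILEERRORBOX = 0x8000   # no file-not-found box


def suppress_error_dialogs():
    """Turn off Windows error dialogs for this process and its children.

    Returns the previous error mode, or None when not running on
    Windows.  Hypothetical helper for illustration.
    """
    if sys.platform != "win32":
        return None
    flags = (SEM_FAILCRITICALERRORS
             | SEM_NOGPFAULTERRORBOX
             | SEM_NOOPENFILEERRORBOX)
    kernel32 = ctypes.windll.kernel32
    # SetErrorMode returns the old mode; call again to keep old flags too.
    previous = kernel32.SetErrorMode(flags)
    kernel32.SetErrorMode(previous | flags)
    return previous


if __name__ == "__main__":
    print(suppress_error_dialogs())
```

With the dialogs off, the failing call returns an error to the caller
instead of blocking on user input, which is why the test then shows up
as a failure rather than a hang.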

The purpose of the watchdog script was to handle cases where the
normal termination processing (buildbot or python itself) simply
didn't seem to work.  The buildbot slave/master thought the test had
ended or aborted, and so started new tests, but a process remained
stuck in memory from the prior test.  The frequency of occurrence
varied over time, but during some periods it was a major pain in the
neck, adversely affecting buildbot stability.

Not sure if faulthandler's approach to process termination would have
more luck, or if it would even run if, for example, the process was
stuck in the RTL or at the Win32 layer.
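
For reference, the faulthandler mechanism in question looks roughly like
this when used directly.  The sketch runs the "hung" code in a child
process so the demonstration itself survives; the one-second timeout,
the sleep standing in for a stuck test, and the name run_hung_child are
all made up for the example.  It only shows the ordinary case of Python
code that is blocked, and says nothing about the RTL or Win32 cases.

```python
import subprocess
import sys

# Child program: arm the watchdog, then block as a stand-in for a hang.
CHILD = r"""
import faulthandler
import time

# After 1 second, dump all thread tracebacks to stderr and exit.
faulthandler.dump_traceback_later(1, exit=True)
time.sleep(60)
"""


def run_hung_child():
    """Run the child and return (returncode, stderr_text)."""
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          stderr=subprocess.PIPE, timeout=30)
    return proc.returncode, proc.stderr.decode(errors="replace")


if __name__ == "__main__":
    returncode, stderr_text = run_hung_child()
    print("exit status:", returncode)
    print(stderr_text)
```

The timer is implemented as a watchdog thread inside the same process,
so it depends on that process still being able to run the thread and
exit.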

I'd certainly be willing to retire the watchdog scripts (as long as I
don't just end up firefighting stuck processes again), but I suspect
the first challenge would be to figure out how to simulate an
appropriately stuck process that would have required the watchdog
script previously, given that it was never really obvious why they
were hung.

-- David