[Python-Dev] Most 3.x buildbots are green again, please don't break them and watch them!

Wed Apr 13 07:40:44 EDT 2016

Hi,

Over the last few months, most 3.x buildbots were failing randomly.
Some of them were always failing. I spent some time fixing almost all
Windows and Linux buildbots. There were a lot of different issues.

So please try not to break the buildbots again, and remember to watch
them from time to time:

  http://buildbot.python.org/all/waterfall?category=3.x.stable&category=3.x.unstable

Over the next few weeks, I will try to backport some fixes to Python
3.5 (if needed) to make these buildbots more stable too.

Python 2.7 buildbots are also in a sad state (e.g. test_marshal
segfaults on Windows, see issue #25264). But it's not easy to get a
Windows machine with the right compiler to work on Python 2.7.

--

Maybe it's time to move more 3.x buildbots to the "stable" category?
http://buildbot.python.org/all/waterfall?category=3.x.stable

By the way, I don't understand why "AMD64 OpenIndiana 3.x" is
considered stable: it has been failing with multiple issues for many
months and nobody is working on these failures. I suggest moving this
buildbot back to the unstable category.

--

We have many offline buildbots. What's the status of these buildbots?
Should we expect that they come back soon?

Or would it be possible to hide them? That would make it easier to
check the status of all buildbots.

--

Failing buildbots:

- AMD64 FreeBSD CURRENT 3.x: http://bugs.python.org/issue26566 -- I
installed a fresh FreeBSD CURRENT in a VM and I'm unable to reproduce
the failures. Maybe the buildbot slave is outdated and FreeBSD must be
upgraded?

- AMD64 OpenIndiana 3.x, x86 OpenIndiana 3.x: test_socket failures on
sendfile. Sorry, but I'm not really interested in this OS.

- PPC64 AIX 3.x: failing tests: test_httplib, test_httpservers,
test_socket, test_distutils, test_asyncio, (...); random timeout
failure in test_eintr, etc. I don't have access to AIX and I'm not
interested in acquiring an AIX license, nor in installing it. I'm not
sure that it's useful to have an AIX buildbot when no core developer
has access to AIX and nobody is working on AIX failures. Maybe IBM
wants to help us to support AIX? (Provide manpower, access to AIX
servers, or something like that.)

- x86 OpenBSD 3.x: 5 tests failed: test_crypt, test_socket, test_ssl,
test_strptime, test_time. This OS needs some love ;-)

- the 4 ICC buildbots are failing with stack overflow, segfault, etc.
Again, I'm not sure that these buildbots are useful since it looks
like we don't support this compiler yet. Or does it help to work on
supporting this compiler? Who is working on ICC support?

--

FYI I also made some enhancements to regrtest (our test runner for the
test suite), mostly to debug failures:

- display the duration of tests taking longer than 30 seconds
- new timestamp prefix, used to debug buildbot hangs
- when parallel tests are interrupted, display progress while waiting
for completion
- add a timeout to the main process when using -jN: it should help to
debug buildbot hangs
- "Run tests in parallel using 3 child processes" or "Run tests
sequentially" message which helps to understand how tests are running.
Beware of the -j1 trap: it has no effect, tests are still run
sequentially. By the way, I proposed to really use subprocesses when
-j1 is used: http://bugs.python.org/issue25285
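
To illustrate the first two items of the list above, here is a minimal
sketch of a timestamp prefix combined with a duration that is only shown
for slow tests. The function name, the constant name and the exact
output format are assumptions made for this example; this is not
regrtest's actual code.

```python
import datetime

# Hypothetical name for the threshold: durations are only reported for
# tests taking longer than this many seconds.
SLOW_TEST_THRESHOLD = 30.0


def format_progress(test_name, test_duration, elapsed_since_start):
    """Return one progress line: '<h:mm:ss> <test name>[ (<N> sec)]'.

    elapsed_since_start: seconds since the whole run started (prefix).
    test_duration: seconds taken by this single test file.
    """
    # Truncate to whole seconds so the prefix looks like 0:01:05.
    prefix = datetime.timedelta(seconds=int(elapsed_since_start))
    line = "%s %s" % (prefix, test_name)
    if test_duration > SLOW_TEST_THRESHOLD:
        line += " (%.0f sec)" % test_duration
    return line


print(format_progress("test_os", 2.5, 65))
print(format_progress("test_asyncio", 42.0, 107))
```

With such a prefix, the last line printed before a hang gives a rough
idea of when the run stopped making progress.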

The default timeout changed from 1 hour to 15 min: it's the maximum
duration allowed for a single test file (e.g. test_os.py). On my Linux box,
running the whole test suite in parallel (10 child processes for my 4
CPU cores with hyperthreading) with Python compiled in debug mode
(slow) takes 4 min 37 sec.

Tell me if the default timeout is too low. It can be configured per
buildbot if needed (TESTTIMEOUT env var).
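
As a rough sketch of what a per-test-file timeout can look like, the
code below arms a watchdog with faulthandler before running a test and
disarms it afterwards. Reading TESTTIMEOUT directly from os.environ, the
helper names and the use of faulthandler here are assumptions made for
illustration; how the buildbot scripts actually pass the value to
regrtest is not shown.

```python
import faulthandler
import os

DEFAULT_TIMEOUT = 15 * 60  # 15 minutes, in seconds


def get_timeout(environ=None):
    """Return the timeout in seconds: TESTTIMEOUT if set, else the default."""
    if environ is None:
        environ = os.environ
    value = environ.get("TESTTIMEOUT")
    if not value:
        return DEFAULT_TIMEOUT
    return float(value)


def run_with_watchdog(func, timeout):
    """Run func() under a watchdog (hypothetical helper).

    If func() takes longer than `timeout` seconds, the traceback of all
    threads is dumped to stderr and the process exits.
    """
    faulthandler.dump_traceback_later(timeout, exit=True)
    try:
        return func()
    finally:
        # The test finished in time (or raised): disarm the watchdog.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    print("timeout: %.0f sec" % get_timeout())
    run_with_watchdog(lambda: None, get_timeout())
```

The traceback dump is what makes a timeout useful for debugging: it
shows where the test was stuck instead of just killing it.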

--

By the way, I'm always surprised by the huge difference in the time
needed to run a build on the different slaves: from a few minutes to
more than 3 hours. The fastest Windows slave takes 28 minutes (it runs
tests in parallel using 4 child processes), whereas the 3 others (which
run tests sequentially) take between 2 hours and more than 3 hours! Why
does running tests on Windows take so long?

Maybe we should make sure that no buildbot runs tests sequentially,
because it creates a lot of annoying side effects (even if sometimes
it helps to find tricky bugs, sometimes bugs restricted to the tests
themselves) and because a lot of tests simply wait a few seconds. So
running multiple tests in parallel doesn't burn your CPU, it's just
faster. IMHO the risk of random timeout failures is low compared to
the speedup.
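
The argument about tests that mostly wait can be sketched with a toy
example: fake "tests" that only sleep are run first one after another,
then with several workers. Everything here is hypothetical, and threads
stand in for the child processes that regrtest uses with -jN, simply to
keep the example short.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_test(duration):
    """Stand-in for a test that mostly waits (sleeping uses almost no CPU)."""
    time.sleep(duration)
    return duration


def run_sequentially(durations):
    """Run the fake tests one after another; return the elapsed time."""
    start = time.monotonic()
    for duration in durations:
        fake_test(duration)
    return time.monotonic() - start


def run_in_parallel(durations, workers):
    """Run the fake tests with `workers` threads; return the elapsed time."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_test, durations))
    return time.monotonic() - start


if __name__ == "__main__":
    durations = [0.2] * 8
    print("sequential: %.1f sec" % run_sequentially(durations))
    print("4 workers:  %.1f sec" % run_in_parallel(durations, 4))
```

The total sleep time is the same in both cases; only the wall-clock time
shrinks, which is the effect described above.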

--

The most interesting bug was a deadlock in locale.setlocale() on
Windows 7: the bug made the buildbot hang "sometimes" (randomly).
Jeremy Kloth identified the bug, and Steve Dower pointed out that it's
already fixed in Visual Studio 2015 Update 1: so please update VS if
you haven't done so yet. Steve added a post-build test to check if the
ucrtbase/ucrtbased DLL has the known bug.
=> http://bugs.python.org/issue26624

Victor