[Python-buildbots] Fleet health

Thu Jun 11 07:45:45 CEST 2015

On 11/06/2015 3:08 PM, Zachary Ware wrote:
> Hi all,
> 
> The health of our buildbot fleet is frankly a bit depressing at the
> moment.  Of the 44 buildslaves [1], 25 (!) are currently down.
> Several of the ones that are up are routinely failing some step, which
> may or may not be the fault of the slave itself.
> 
> I just wanted to touch base with everybody and ask that you give your
> slaves a quick once-over to make sure they're working properly, or
> give an update on why they may be down and when (or if) they can be
> expected to be back up.  In cases where a slave is down for an
> extended period of time, I'd like to clean up the waterfall view by
> temporarily removing those builders (and in cases where a slave is
> down for good, I'd like to clean up the list of slaves as well).
> 
> If there's anything that can be done to help on the master side, let me know!
> 
> Thanks,
> 

Thanks for setting this up Zach.

I've kept my slaves (koobs-*) updated since they were brought online:
they run the latest buildbot release and track the latest branch
versions (read: the future next release) of FreeBSD. There have been no
issues doing so.

I think there are a few things that can be done to improve the situation,
both in the short and longer term:

* Progressively update all buildbots to the latest buildbot version.
This allows any new features/configurations to be used, with less risk
of incompatible changes.

* Recreate the 'stable' builders list to account for buildslave fleet
changes since it was last modified.

* Use the (new) 'stable' buildbot list to block releases (if it's not
being done now, or not being observed), forcing failing tests to be
fixed. "All Green or No-Go". This is critical.

OR,

* Remove the distinction between stable/unstable builders, remove
unconnected / long-time-flaky slaves. The definition of flaky should be
that the slave is broken, not the builds on the slave.

* Block releases if !All-Green
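
As a rough illustration of the "All Green or No-Go" rule, here is a
minimal sketch. It assumes the latest result of each builder is
available as a plain string; the function names and result values are
hypothetical and not part of buildbot's API.

```python
def release_blockers(latest_results):
    """Return the sorted names of builders whose latest result is not green.

    latest_results maps a builder name to a result string. The strings
    used here ("success", "failure", "offline") are illustrative only.
    """
    return sorted(
        builder
        for builder, result in latest_results.items()
        if result != "success"
    )


def may_release(latest_results):
    """Allow a release only if every builder reported success.

    An empty result set is treated as No-Go: no evidence is not green.
    """
    return bool(latest_results) and not release_blockers(latest_results)
```

With no stable/unstable distinction, a disconnected slave blocks a
release just like a failing build, which is the point of removing
long-dead slaves first.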

I'd go for first prize in this regard and remove as many distinctions as
possible differentiating buildslaves. It's not surprising that certain
(many?) buildbots are disregarded as unimportant and ignored.

tldr: All buildbots should either be critical to release engineering and
quality assurance, or be removed. We as buildbot providers should
be held accountable for our part in that. It is up to Python (Core) to
set the standard for what the expectation is.

Additionally:

Right now each os-arch combination is a standalone bot/config and highly
static in nature.

The biggest gain I can see to be had is to evolve the master build
configuration to:

* Segment build configurations on the master into classes, to gain greater
coverage of under-tested components and new build types. Some examples are:

  * --shared builds vs non-shared builds
  * using system ffi, vs not (this might even help de-vendor libffi!)
  * compiler: gcc vs clang (FreeBSD has both on 9.x)
  * Architecture builders (x86_64, x86-32, mips, arm, blah)

Python would benefit by:

 * Allowing each buildslave to be used in multiple build classes
 * Gaining greater coverage of build-related infrastructure (notoriously
problematic)
 * Allowing a 'build class' oriented view of build results, rather than
just by OS.

Once a new builder class is created, it is then just a matter of adding
in the buildslaves that support that buildtype or features.
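
To make the build class idea above a bit more concrete, here is a
minimal sketch in plain Python (master configs are Python, but this
deliberately uses no buildbot calls). The slave names, feature tags and
function names are all hypothetical; a real master.cfg would feed the
resulting lists into its builder definitions.

```python
# Hypothetical feature tags advertised by each slave (illustrative only).
SLAVE_FEATURES = {
    "example-freebsd-9-amd64": {"x86_64", "gcc", "clang", "shared", "system-ffi"},
    "example-freebsd-10-i386": {"x86-32", "clang", "shared"},
    "example-linux-arm": {"arm", "gcc"},
}

# Each build class names the features a slave needs in order to run it.
BUILD_CLASSES = {
    "shared": {"shared"},
    "system-ffi": {"system-ffi"},
    "gcc": {"gcc"},
    "clang": {"clang"},
    "arm": {"arm"},
}


def slaves_for_class(required, slave_features):
    """Return the sorted names of slaves that have every required feature."""
    return sorted(
        name
        for name, features in slave_features.items()
        if required <= features  # subset test: all requirements are met
    )


def assign_builders(build_classes, slave_features):
    """Map each build class to the list of slaves able to run it."""
    return {
        build_class: slaves_for_class(required, slave_features)
        for build_class, required in build_classes.items()
    }


if __name__ == "__main__":
    for build_class, slaves in assign_builders(BUILD_CLASSES, SLAVE_FEATURES).items():
        print(build_class, "->", ", ".join(slaves) or "(no slaves)")
```

Adding a slave to a class is then a one-line change to its feature
set, and one slave can serve several classes at once.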

I'm on IRC (koobs @ #python-dev freenode) if anyone wants to chat
further about these and other ideas.

--
Regards,

Kubilay
FreeBSD/Python