From zachary.ware+pydev at gmail.com  Thu Jun 11 07:08:32 2015
From: zachary.ware+pydev at gmail.com (Zachary Ware)
Date: Thu, 11 Jun 2015 00:08:32 -0500
Subject: [Python-buildbots] Fleet health
Message-ID:

Hi all,

The health of our buildbot fleet is frankly a bit depressing at the
moment.  Of the 44 buildslaves [1], 25 (!) are currently down.
Several of the ones that are up are routinely failing some step, which
may or may not be the fault of the slave itself.

I just wanted to touch base with everybody and ask that you give your
slaves a quick once-over to make sure they're working properly, or
give an update on why they may be down and when (or if) they can be
expected to be back up.  In cases where a slave is down for an
extended period of time, I'd like to clean up the waterfall view by
temporarily removing those builders (and in cases where a slave is
down for good, I'd like to clean up the list of slaves as well).

If there's anything that can be done to help on the master side, let me know!

Thanks,
--
Zach

[1] http://buildbot.python.org/all/buildslaves


From rosuav at gmail.com  Thu Jun 11 07:17:01 2015
From: rosuav at gmail.com (Chris Angelico)
Date: Thu, 11 Jun 2015 15:17:01 +1000
Subject: [Python-buildbots] Fleet health
In-Reply-To:
References:
Message-ID:

On Thu, Jun 11, 2015 at 3:08 PM, Zachary Ware wrote:
> Hi all,
>
> The health of our buildbot fleet is frankly a bit depressing at the
> moment.  Of the 44 buildslaves [1], 25 (!) are currently down.
> Several of the ones that are up are routinely failing some step, which
> may or may not be the fault of the slave itself.

Hi Zach! Thanks for setting up this list. If nothing else, it's the
obvious place to ask questions like this...

In terms of monitoring our slaves, is there any easy way to say "show
me all the ones on this hardware"? Currently, I have a bookmarked page
that looks like this:

http://buildbot.python.org/all/waterfall?builder=AMD64+Debian+root+2.7&builder=AMD64+Debian+root+3.3&builder=AMD64+Debian+root+3.4&builder=AMD64+Debian+root+3.x&builder=AMD64+Debian+root+custom&reload=none

(And I think that's out of date now, since there would be a 3.5
buildbot as well as 3.x.)

There's this page:

http://buildbot.python.org/all/buildslaves/angelico-debian-amd64

but I don't know of a simple way to ask for the waterfall view of all
of those slaves.

ChrisA
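One low-tech way to keep that bookmark from going stale is to
regenerate it from the builder names instead of editing the URL by
hand.  A minimal sketch (Python 3; the branch list is illustrative and
will drift as branches come and go, e.g. 3.5 replacing 3.3):

    # Regenerate the waterfall bookmark for one slave's builders.
    # Sketch only: adjust the branch list to match the live builders.
    from urllib.parse import urlencode

    BASE = "http://buildbot.python.org/all/waterfall"
    branches = ["2.7", "3.4", "3.5", "3.x", "custom"]
    builders = ["AMD64 Debian root " + branch for branch in branches]

    # urlencode's default quote_plus encoding produces the same "+"
    # separators seen in the hand-made bookmark above.
    query = urlencode([("builder", b) for b in builders]
                      + [("reload", "none")])
    print(BASE + "?" + query)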
From koobs at FreeBSD.org  Thu Jun 11 07:45:45 2015
From: koobs at FreeBSD.org (Kubilay Kocak)
Date: Thu, 11 Jun 2015 15:45:45 +1000
Subject: [Python-buildbots] Fleet health
In-Reply-To:
References:
Message-ID: <55792089.3060409@FreeBSD.org>

On 11/06/2015 3:08 PM, Zachary Ware wrote:
> Hi all,
>
> The health of our buildbot fleet is frankly a bit depressing at the
> moment.  Of the 44 buildslaves [1], 25 (!) are currently down.
> Several of the ones that are up are routinely failing some step, which
> may or may not be the fault of the slave itself.
>
> I just wanted to touch base with everybody and ask that you give your
> slaves a quick once-over to make sure they're working properly, or
> give an update on why they may be down and when (or if) they can be
> expected to be back up.  In cases where a slave is down for an
> extended period of time, I'd like to clean up the waterfall view by
> temporarily removing those builders (and in cases where a slave is
> down for good, I'd like to clean up the list of slaves as well).
>
> If there's anything that can be done to help on the master side, let me know!
>
> Thanks,

Thanks for setting this up Zach.

I've kept my slaves (koobs-*) updated since they were brought online,
including running the latest buildbot releases (currently latest) and
updating to the latest branch versions (read: what will become the
next release) of FreeBSD. There have been no issues doing so.

I think there are a few things that can be done to improve the
situation, both in the short and longer term:

* Progressively update all buildbots to the latest buildbot version.
This allows any new features/configurations to be used, with less risk
of incompatible changes.

* Recreate the 'stable' builders list to account for buildslave fleet
changes since it was last modified.

* Use the (new) 'stable' buildbot list to block releases (if it's not
being done now, or not being observed), forcing failing tests to be
fixed. "All Green or No-Go". This is critical.

OR,

* Remove the distinction between stable/unstable builders, and remove
unconnected / long-time-flaky slaves. The definition of flaky should
be that the slave is broken, not the builds on the slave.

* Block releases if !All-Green

I'd go for first prize in this regard and remove as many distinctions
differentiating buildslaves as possible. It's not surprising that
certain (many?) buildbots are disregarded as unimportant and ignored.

tl;dr: All buildbots should either be critical to release engineering
and quality assurance, or not, and removed. We as buildbot providers
should be held accountable for our part in that. It is up to Python
(Core) to set the standard for what the expectation is.

Additionally:

Right now each os-arch combination is a standalone bot/config and
highly static in nature.

The biggest gain I can see to be had is to evolve the master build
configuration to:

* Segment/class build configurations on the master to gain greater
coverage of under-tested components and new build types. Some
examples are:

  * --shared builds vs non-shared builds
  * using system ffi vs not (this might even help de-vendor libffi!)
  * compiler: gcc vs clang (FreeBSD has both on 9.x)
  * Architecture builders (x86_64, x86-32, mips, arm, blah)

Python would benefit by:

* Allowing each buildslave to be used in multiple build classes
* Greater coverage in build-related infrastructure (notoriously
problematic)
* Allowing a 'build class' oriented view of build results, rather
than just by OS.

Once a new builder class is created, it is then just a matter of
adding in the buildslaves that support that build type or feature
set.

I'm on IRC (koobs @ #python-dev, freenode) if anyone wants to chat
further about these and other ideas.

--
Regards,

Kubilay
FreeBSD/Python
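To make the builder-class idea concrete, here is a rough sketch of how
a 0.8-era master.cfg could generate builders from a slave-capability
matrix.  Everything in it is hypothetical: the capability map, the
branch list, and the empty factory are placeholders, not the real
master configuration.

    # Hypothetical "builder classes" for a buildbot 0.8 master.cfg.
    from buildbot.config import BuilderConfig
    from buildbot.process.factory import BuildFactory

    SLAVE_CLASSES = {  # build classes each slave could take part in
        "koobs-freebsd9": ["gcc", "clang", "shared"],
        "angelico-debian-amd64": ["shared", "non-shared", "system-ffi"],
    }
    BRANCHES = ["2.7", "3.4", "3.5", "3.x"]

    def make_factory(branch, build_class):
        # Placeholder: a real factory would add checkout, configure
        # (with class-specific flags such as --enable-shared or
        # --with-system-ffi), compile, and test steps.
        return BuildFactory()

    builders = []
    classes = sorted({c for caps in SLAVE_CLASSES.values() for c in caps})
    for build_class in classes:
        capable = sorted(name for name, caps in SLAVE_CLASSES.items()
                         if build_class in caps)
        for branch in BRANCHES:
            builders.append(BuilderConfig(
                name="%s %s" % (build_class, branch),
                slavenames=capable,
                factory=make_factory(branch, build_class)))
    # in the real config: c['builders'] = builders

A builder then exists per (class, branch) pair, and adding a slave to
a class is a one-line change to the capability map.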
From rdmurray at bitdance.com  Thu Jun 11 14:52:38 2015
From: rdmurray at bitdance.com (R. David Murray)
Date: Thu, 11 Jun 2015 08:52:38 -0400
Subject: [Python-buildbots] Fleet health
In-Reply-To: <55792089.3060409@FreeBSD.org>
References: <55792089.3060409@FreeBSD.org>
Message-ID: <20150611125243.74CD8B180E9@webabinitio.net>

On Thu, 11 Jun 2015 15:45:45 +1000, Kubilay Kocak wrote:
> I've kept my slaves (koobs-*) updated since they were brought online,
> including running the latest buildbot releases (currently latest) and
> updating to the latest branch versions (read: what will become the
> next release) of FreeBSD. There have been no issues doing so.

Yes, thank you for your great work with your slaves.

> * Progressively update all buildbots to the latest buildbot version.
> This allows any new features/configurations to be used, with less risk
> of incompatible changes.

We should at least be coordinating on what minimum version is running
on the stable set... on the other hand, I don't think we've upgraded
the master in a while :)

> * Recreate the 'stable' builders list to account for buildslave fleet
> changes since it was last modified.
>
> * Use the (new) 'stable' buildbot list to block releases (if it's not
> being done now, or not being observed), forcing failing tests to be
> fixed. "All Green or No-Go". This is critical.

The release managers do pay attention to the stable set.  A failing
test doesn't necessarily block an alpha, beta, or even an early stage
rc, depending on the nature of the failing test (the release manager
uses their judgement).

There are a number of reasons to try to improve the state of the
stable fleet, and this will require a multi-pronged effort, not the
least of which is improvements in the flaky tests.

> OR,
>
> * Remove the distinction between stable/unstable builders, and remove
> unconnected / long-time-flaky slaves. The definition of flaky should
> be that the slave is broken, not the builds on the slave.

I think the stable/unstable split is important.  A buildbot can be
unstable for two reasons: the buildbot itself is flaky, as you say, or
the tests are failing because the platform (or whatever other factor
the slave was set up to test) is not completely supported yet.  Having
buildbots for the latter category is important.  They shouldn't block
releases, but they should be available to facilitate working on making
Python work better.

The snakebite hosts, for example, were in the latter category
initially, though it is questionable whether anyone other than Trent
was interested in working on getting the tests to pass :).
Unfortunately snakebite is a lower priority for Trent now, and
hardware issues have taken a number (most?) of them offline, and they
should probably be deleted or at least commented out, depending on
what Trent plans to do with them in the future.

For flaky buildbots in your sense, we should have a conversation with
the owner.  The goal should be to either get it to be non-flaky, or
delete it.  Of course, almost all of this is volunteer work, so the
timeframes over which this happens may be a bit longer than would be
ideal :)

> * Block releases if !All-Green

As I said above, this is the goal, but it is always the release
managers' call.

> I'd go for first prize in this regard and remove as many distinctions
> differentiating buildslaves as possible. It's not surprising that
> certain (many?) buildbots are disregarded as unimportant and ignored.

Slaves that are in the unstable set *should* be ignored in general,
except by those people interested in working on making them stable.

> tl;dr: All buildbots should either be critical to release engineering
> and quality assurance, or not, and removed. We as buildbot providers
> should be held accountable for our part in that. It is up to Python
> (Core) to set the standard for what the expectation is.

As noted above, there is also the category of "being worked on", which
is not critical to release engineering, but is the pathway to taking a
buildslave from "not working yet" to being part of the stable set.  If
no progress is being made over an extended period, though, we should
indeed probably do a delete.  Such a bot can be re-added when someone
or ones show up with a renewed interest in whatever the project was :)

> Additionally:
>
> Right now each os-arch combination is a standalone bot/config and
> highly static in nature.
>
> The biggest gain I can see to be had is to evolve the master build
> configuration to:
>
> * Segment/class build configurations on the master to gain greater
> coverage of under-tested components and new build types. Some
> examples are:
>
>   * --shared builds vs non-shared builds
>   * using system ffi vs not (this might even help de-vendor libffi!)
>   * compiler: gcc vs clang (FreeBSD has both on 9.x)
>   * Architecture builders (x86_64, x86-32, mips, arm, blah)
>
> Python would benefit by:
>
> * Allowing each buildslave to be used in multiple build classes
> * Greater coverage in build-related infrastructure (notoriously
> problematic)
> * Allowing a 'build class' oriented view of build results, rather
> than just by OS.
>
> Once a new builder class is created, it is then just a matter of
> adding in the buildslaves that support that build type or feature
> set.

This is an interesting idea.  The big disadvantage is that right now
each buildslave runs one build job per modified release.  Under the
above scenario they would need to run multiple builds per modified
release, which results in a small combinatoric explosion.  Not all
slave machines are up to that task.  So that would be another
consideration as to whether to include a particular machine in more
than one column of the matrix.

However, we should clean up what we've got before we venture into
that area, I think.

--David
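To put rough numbers on that combinatoric point (44 slaves and the
active branch count come from this thread; the three build classes are
hypothetical):

    # Back-of-the-envelope job counts if every slave ran every class.
    slaves = 44        # fleet size from the original post
    build_classes = 3  # e.g. shared, non-shared, system-ffi

    per_change_today = slaves * 1               # one job per slave
    per_change_matrix = slaves * build_classes  # one per build class
    print(per_change_today, "->", per_change_matrix)  # 44 -> 132

Even a modest class count triples the load on a slave that supports
every class, which is exactly the "not all slave machines are up to
that task" concern.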
From zachary.ware+pydev at gmail.com  Thu Jun 11 18:52:38 2015
From: zachary.ware+pydev at gmail.com (Zachary Ware)
Date: Thu, 11 Jun 2015 11:52:38 -0500
Subject: [Python-buildbots] Fleet health
In-Reply-To:
References:
Message-ID:

On Thu, Jun 11, 2015 at 12:17 AM, Chris Angelico wrote:
> In terms of monitoring our slaves, is there any easy way to say "show
> me all the ones on this hardware"? Currently, I have a bookmarked page
> that looks like this:
>
> http://buildbot.python.org/all/waterfall?builder=AMD64+Debian+root+2.7&builder=AMD64+Debian+root+3.3&builder=AMD64+Debian+root+3.4&builder=AMD64+Debian+root+3.x&builder=AMD64+Debian+root+custom&reload=none
>
> (And I think that's out of date now, since there would be a 3.5
> buildbot as well as 3.x.)

And no longer a 3.3 :)

> There's this page:
>
> http://buildbot.python.org/all/buildslaves/angelico-debian-amd64
>
> but I don't know of a simple way to ask for the waterfall view of all
> of those slaves.

I'm not sure if there is one; your bookmark may be the best you can
get for that.

--
Zach


From stefan at bytereef.org  Thu Jun 11 21:08:44 2015
From: stefan at bytereef.org (s.krah)
Date: Thu, 11 Jun 2015 19:08:44 +0000
Subject: [Python-buildbots] Fleet health
In-Reply-To:
References:
Message-ID: <14de403eb15.cccdd79e362413.1339482046637837841@bytereef.org>

Hi,

thanks for taking care of this!  My buildbots had been ultra-stable
for several years until last year a certain group of people decided
to attack individual infrastructure contributions very publicly.

As a consequence my bots can be deleted.

Stefan Krah


From zachary.ware+pydev at gmail.com  Thu Jun 11 21:20:59 2015
From: zachary.ware+pydev at gmail.com (Zachary Ware)
Date: Thu, 11 Jun 2015 14:20:59 -0500
Subject: [Python-buildbots] Fleet health
In-Reply-To: <55792089.3060409@FreeBSD.org>
References: <55792089.3060409@FreeBSD.org>
Message-ID:

On Thu, Jun 11, 2015 at 12:45 AM, Kubilay Kocak wrote:
> Thanks for setting this up Zach.
You're welcome :)

> I've kept my slaves (koobs-*) updated since they were brought online,
> including running the latest buildbot releases (currently latest) and
> updating to the latest branch versions (read: what will become the
> next release) of FreeBSD. There have been no issues doing so.

And thank you for that.

> I think there are a few things that can be done to improve the
> situation, both in the short and longer term:
>
> * Progressively update all buildbots to the latest buildbot version.
> This allows any new features/configurations to be used, with less risk
> of incompatible changes.

I'm for that.  I'm also for updating the master, but that's going to
take some extra work (we have our own patches to the master).  It'll
take a while to get everything updated, though.

> * Recreate the 'stable' builders list to account for buildslave fleet
> changes since it was last modified.

I also agree with this.  I'll go through and make up a list at some
point, which we can then discuss here (or if anybody else wants to
make up such a list before I have a chance, please do :)).

> Additionally:
>
> Right now each os-arch combination is a standalone bot/config and
> highly static in nature.
>
> The biggest gain I can see to be had is to evolve the master build
> configuration to:
>
> * Segment/class build configurations on the master to gain greater
> coverage of under-tested components and new build types. Some
> examples are:
>
>   * --shared builds vs non-shared builds
>   * using system ffi vs not (this might even help de-vendor libffi!)
>   * compiler: gcc vs clang (FreeBSD has both on 9.x)
>   * Architecture builders (x86_64, x86-32, mips, arm, blah)
>
> Python would benefit by:
>
> * Allowing each buildslave to be used in multiple build classes
> * Greater coverage in build-related infrastructure (notoriously
> problematic)
> * Allowing a 'build class' oriented view of build results, rather
> than just by OS.
>
> Once a new builder class is created, it is then just a matter of
> adding in the buildslaves that support that build type or feature
> set.

This sounds interesting, and I'd like to hear more about how exactly
you would set this up.  However, I agree with David that we need to
be sure not to overload the slaves with less guts, and also that we
should hold off on this kind of change until the other points
mentioned above are addressed.

--
Zach


From zachary.ware+pydev at gmail.com  Thu Jun 11 21:34:24 2015
From: zachary.ware+pydev at gmail.com (Zachary Ware)
Date: Thu, 11 Jun 2015 14:34:24 -0500
Subject: [Python-buildbots] Fleet health
In-Reply-To: <20150611070803.GA3730@mail.codigo23.net>
References: <20150611070803.GA3730@mail.codigo23.net>
Message-ID:

On Thu, Jun 11, 2015 at 2:08 AM, Francisco de Borja Lopez Rio wrote:
> I'm maintaining the i386 OpenBSD buildbot:
>
> http://buildbot.python.org/all/buildslaves/borja-openbsd-x86
>
> It has been down for the past few weeks because Mercurial is not able
> to clone the cpython repository anymore on that box. I keep that
> system running the latest snapshot of the system, so the tests could
> be run on the latest changes to things like libressl, and it seems at
> one point something got broken with Mercurial. More info in this
> thread in the ports at openbsd mailing list:
>
> http://marc.info/?t=143204993200003&r=1&w=2
>
> I'm still trying to find out what happens there; as soon as I can fix
> that, the buildslave will be back online.

This may be a red herring, but when did your problems start in
relation to http://hg.python.org automatically redirecting to
https://hg.python.org?  And though it probably makes no difference, I
did just update the master last week to use https://h.p.o instead of
making hg handle the redirect.

> Very nice idea. IIRC I've mentioned on #python-devel on freenode why
> this one was down, but maybe that is not the best place for such
> notifications.

Hence this list :)

> One last question, will we use this list to discuss stuff regarding
> the buildbots/slaves?

Yes, that should be fine.  If things get too high traffic (and of no
use to anybody but the slave owner and whoever is adjusting the
master), things can always be taken off-list.

> i.e., in this openbsd slave I still see builders called "openbsd 5.5
> 3.x" and such, while that is not exactly true; my system does not run
> a fixed version of openbsd, but the latest version available every
> few weeks. Dunno if this would be the place to mention that kind of
> stuff or not (or if they matter really).

I'll try to fix that name.  What would be the most accurate (succinct)
name for it?  Just "x86 OpenBSD 3.x" (for example), or "x86 Latest
OpenBSD"?

--
Zach
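On the red-herring question, one cheap diagnostic is to exercise the
slave's TLS stack against hg.python.org directly.  A sketch, assuming
Python 3.4+ for ssl.create_default_context(); since Mercurial runs
under Python 2, this only checks the system's LibreSSL/OpenSSL, not
hg's exact code path:

    # TLS smoke test against hg.python.org, run on the slave itself.
    # Sketch only: success here does not prove hg's own stack works.
    import socket
    import ssl

    ctx = ssl.create_default_context()
    with socket.create_connection(("hg.python.org", 443),
                                  timeout=30) as sock:
        with ctx.wrap_socket(sock,
                             server_hostname="hg.python.org") as tls:
            print("handshake ok:", tls.cipher())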
From zachary.ware+pydev at gmail.com  Thu Jun 11 21:37:53 2015
From: zachary.ware+pydev at gmail.com (Zachary Ware)
Date: Thu, 11 Jun 2015 14:37:53 -0500
Subject: [Python-buildbots] Fleet health
In-Reply-To: <14de403eb15.cccdd79e362413.1339482046637837841@bytereef.org>
References: <14de403eb15.cccdd79e362413.1339482046637837841@bytereef.org>
Message-ID:

On Thu, Jun 11, 2015 at 2:08 PM, s.krah wrote:
> Hi,
>
> thanks for taking care of this! My buildbots had been ultra-stable
> for several years until last year a certain group of people decided
> to attack individual infrastructure contributions very publicly.
>
> As a consequence my bots can be deleted.

I am sorry to hear that.

--
Zach


From ncoghlan at gmail.com  Fri Jun 12 07:00:57 2015
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Fri, 12 Jun 2015 15:00:57 +1000
Subject: [Python-buildbots] Fleet health
In-Reply-To:
References:
Message-ID:

On 11 June 2015 at 15:08, Zachary Ware wrote:
> Hi all,
>
> The health of our buildbot fleet is frankly a bit depressing at the
> moment.  Of the 44 buildslaves [1], 25 (!) are currently down.
> Several of the ones that are up are routinely failing some step, which
> may or may not be the fault of the slave itself.

I'd suggest dropping my current RHEL buildbot for the time being, and
I'll look at setting up a better maintained replacement later (perhaps
as part of the Fedora or CentOS QA infrastructure).

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


From zachary.ware+pydev at gmail.com  Fri Jun 12 08:12:43 2015
From: zachary.ware+pydev at gmail.com (Zachary Ware)
Date: Fri, 12 Jun 2015 01:12:43 -0500
Subject: [Python-buildbots] Fleet health
In-Reply-To:
References:
Message-ID:

On Fri, Jun 12, 2015 at 12:00 AM, Nick Coghlan wrote:
> On 11 June 2015 at 15:08, Zachary Ware wrote:
>> Hi all,
>>
>> The health of our buildbot fleet is frankly a bit depressing at the
>> moment.  Of the 44 buildslaves [1], 25 (!) are currently down.
>> Several of the ones that are up are routinely failing some step, which
>> may or may not be the fault of the slave itself.
>
> I'd suggest dropping my current RHEL buildbot for the time being, and
> I'll look at setting up a better maintained replacement later (perhaps
> as part of the Fedora or CentOS QA infrastructure).

Ok, I've removed your RHEL slave.
Thanks for letting me know!

--
Zach


From dje.gcc at gmail.com  Fri Jun 12 15:46:01 2015
From: dje.gcc at gmail.com (David Edelsohn)
Date: Fri, 12 Jun 2015 09:46:01 -0400
Subject: [Python-buildbots] Fleet health
In-Reply-To:
References:
Message-ID:

On Thu, Jun 11, 2015 at 1:08 AM, Zachary Ware wrote:
> Hi all,
>
> The health of our buildbot fleet is frankly a bit depressing at the
> moment.  Of the 44 buildslaves [1], 25 (!) are currently down.
> Several of the ones that are up are routinely failing some step, which
> may or may not be the fault of the slave itself.
>
> I just wanted to touch base with everybody and ask that you give your
> slaves a quick once-over to make sure they're working properly, or
> give an update on why they may be down and when (or if) they can be
> expected to be back up.  In cases where a slave is down for an
> extended period of time, I'd like to clean up the waterfall view by
> temporarily removing those builders (and in cases where a slave is
> down for good, I'd like to clean up the list of slaves as well).
>
> If there's anything that can be done to help on the master side, let
> me know!

Hi, Zach

Thanks for setting up this discussion list.

Internally, IBM has been discussing how to improve its participation
in Open Source Software CI testing to ensure that IBM POWER and IBM
System z have better coverage.  IBM may be able to help with hosting
an open build service of diverse systems.  We don't have a complete
solution, but maybe we can use the Python Buildbot Fleet concept and
this group to develop a solution.

Thanks, David


From ncoghlan at gmail.com  Sat Jun 13 04:41:17 2015
From: ncoghlan at gmail.com (Nick Coghlan)
Date: Sat, 13 Jun 2015 12:41:17 +1000
Subject: [Python-buildbots] Fleet health
In-Reply-To:
References:
Message-ID:

On 12 June 2015 at 23:46, David Edelsohn wrote:
> Internally, IBM has been discussing how to improve its participation
> in Open Source Software CI testing to ensure that IBM POWER and IBM
> System z have better coverage. IBM may be able to help with hosting
> an open build service of diverse systems. We don't have a complete
> solution, but maybe we can use the Python Buildbot Fleet concept and
> this group to develop a solution.

That would be very handy, as there's a patch to add AF_IUCV support in
http://bugs.python.org/issue23830, which we don't currently have a way
to test upstream.

Neale suggested on that issue that the Linux Foundation might also be
able to help out with s390x access, but having multiple test systems
for any given architecture would be a good thing.

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
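For whoever ends up with such hardware attached, upstream testing
would presumably look like a self-skipping test in this style.  This
is hypothetical: it assumes the issue 23830 patch exposes the constant
as socket.AF_IUCV, and creating the socket still needs a kernel with
IUCV support:

    # Hypothetical self-skipping smoke test for AF_IUCV (issue 23830).
    import socket
    import unittest

    @unittest.skipUnless(hasattr(socket, "AF_IUCV"),
                         "socket.AF_IUCV not available in this build")
    class IUCVSmokeTest(unittest.TestCase):
        def test_socket_creation(self):
            # Even on s390x this can fail without z/VM IUCV support.
            with socket.socket(socket.AF_IUCV, socket.SOCK_STREAM) as s:
                self.assertEqual(s.family, socket.AF_IUCV)

    if __name__ == "__main__":
        unittest.main()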
From dje.gcc at gmail.com  Sat Jun 13 15:18:23 2015
From: dje.gcc at gmail.com (David Edelsohn)
Date: Sat, 13 Jun 2015 09:18:23 -0400
Subject: [Python-buildbots] Fleet health
In-Reply-To:
References:
Message-ID:

On Fri, Jun 12, 2015 at 10:41 PM, Nick Coghlan wrote:
> On 12 June 2015 at 23:46, David Edelsohn wrote:
>> Internally, IBM has been discussing how to improve its participation
>> in Open Source Software CI testing to ensure that IBM POWER and IBM
>> System z have better coverage. IBM may be able to help with hosting
>> an open build service of diverse systems. We don't have a complete
>> solution, but maybe we can use the Python Buildbot Fleet concept and
>> this group to develop a solution.
>
> That would be very handy, as there's a patch to add AF_IUCV support in
> http://bugs.python.org/issue23830, which we don't currently have a way
> to test upstream.
>
> Neale suggested on that issue that the Linux Foundation might also be
> able to help out with s390x access, but having multiple test systems
> for any given architecture would be a good thing.

I am already running CPython buildbots on two separate zSeries Linux
systems, two PPC64 Linux systems, and one PPC64 AIX system.

Thanks, David