Merge ~wgrant/launchpad:buildd-manager-nicer-retries into launchpad:master
Status: | Merged |
---|---|
Approved by: | William Grant |
Approved revision: | 272d66a896f028186c79cf1474eab1505f6febfe |
Merge reported by: | Otto Co-Pilot |
Merged at revision: | not available |
Proposed branch: | ~wgrant/launchpad:buildd-manager-nicer-retries |
Merge into: | launchpad:master |
Prerequisite: | ~wgrant/launchpad:buildd-manager-failure-metrics |
Diff against target: |
312 lines (+155/-6) 2 files modified
lib/lp/buildmaster/manager.py (+40/-0) lib/lp/buildmaster/tests/test_manager.py (+115/-6) |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Colin Watson (community) | Approve | ||
Review via email: mp+454694@code.launchpad.net |
Commit message
Cope more gracefully with intermittent builder glitches
Description of the change
Previously, buildd-manager immediately counted every single scan
failure against the builder and job. This meant that three glitches --
say, network timeouts -- over the course of a job would result in the
build being requeued. A builder's failure count is reset on successful
dispatch, but a job's deliberately isn't, since we want to fail builds
that are repeatedly killing builders. This meant that a single network
glitch in the second attempt at a build would cause it to be failed.

This added layer of failure counting substantially reduces the
likelihood of those two scenarios, by requiring five consecutive
unsuccessful scans before a single failure is counted against a builder
or job. This means that brief network interruptions, or indeed temporary
insanity on buildd-manager's part, should no longer cause builds to be
requeued or failed at all.
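
The extra layer of counting described above can be sketched roughly as follows. This is a minimal illustration, not the code from lib/lp/buildmaster/manager.py: the names `ScanFailureTracker`, `FailureCounts`, `record_scan`, and `SCAN_FAILURE_THRESHOLD` are hypothetical, and resetting the consecutive counter after escalating is an assumption rather than something the description states.

```python
from dataclasses import dataclass

# Taken from the description: five consecutive unsuccessful scans are
# required before one failure is counted. The constant name is hypothetical.
SCAN_FAILURE_THRESHOLD = 5


@dataclass
class FailureCounts:
    """Illustrative stand-in for the existing builder and job failure counts."""

    builder_failures: int = 0
    job_failures: int = 0


class ScanFailureTracker:
    """Hypothetical per-builder tracker of consecutive unsuccessful scans."""

    def __init__(self, threshold=SCAN_FAILURE_THRESHOLD):
        self.threshold = threshold
        self.consecutive_scan_failures = 0

    def record_scan(self, succeeded, counts):
        """Record one scan; return True if a failure was escalated."""
        if succeeded:
            # Any successful scan breaks the run of consecutive failures.
            self.consecutive_scan_failures = 0
            return False
        self.consecutive_scan_failures += 1
        if self.consecutive_scan_failures < self.threshold:
            # Treat this as a transient glitch: count nothing yet.
            return False
        # Threshold reached: count a single failure against builder and job.
        # Assumption: the run counter restarts after escalation.
        self.consecutive_scan_failures = 0
        counts.builder_failures += 1
        counts.job_failures += 1
        return True
```

Under these assumptions, four failed scans followed by one successful scan leave both counts at zero, which is the "brief network interruption" case the description is aiming at.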

The only significant downside of this change is that recovery from
legitimate failures will now take a few minutes longer. But that's much
less of a concern with the very large build farm we have nowadays.

Very nice. Thanks!