Merge lp:~julian-edwards/launchpad/builder-reset-fail-bug-563353 into lp:launchpad
Status: | Merged |
---|---|
Approved by: | Jonathan Lange |
Approved revision: | no longer in the source branch. |
Merged at revision: | not available |
Proposed branch: | lp:~julian-edwards/launchpad/builder-reset-fail-bug-563353 |
Merge into: | lp:launchpad |
Diff against target: |
428 lines (+198/-36) 3 files modified
lib/lp/buildmaster/manager.py (+33/-17) lib/lp/buildmaster/tests/test_manager.py (+96/-19) lib/lp/services/twistedsupport/processmonitor.py (+69/-0) |
To merge this branch: | bzr merge lp:~julian-edwards/launchpad/builder-reset-fail-bug-563353 |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Jonathan Lange (community) | Approve | ||
Review via email: mp+23772@code.launchpad.net |
Description of the change
Dear reviewer, if you're not familiar with twisted this branch is going to
make your head spin. Sorry. If you are familiar, it will still make your
head spin, because it's Soyuz, and it's one of the most complicated parts of
Soyuz. Sorry again.
= Summary =
Disable virtual builders that fail to reset
== Proposed fix ==
As per bug 563353, sometimes a PPA builder will fail to reset because the
reset command to its host fails. Right now, this is ignored and we blindly
try again and again and the builder remains idle because the dispatcher
ignores it on each scan once the reset fails.
What we need to do is to fail the builder and insert the results of the reset
command into the builder.failnotes column so that the buildd-admins can see
that the builder failed, and why.
== Pre-implementation notes ==
Chatted at length to jml about this since he's a twisted expert and I'm not!
Basically the buildd-manager is using the wrong way to time out a process as
we don't get to see its stdout/stderr.
The old way of using run_process_
in the Deferred callback.
jml explained to me how to set up a protocol.
TimeoutMixin so that we can capture a process's output and fire the timer in
the future if it didn't complete fast enough.
== Implementation details ==
I've implemented a new class "ProcessWithTim
external process while catching its stdout, stderr and return code, and
killing it if it overruns the timeout specified.
This could potentially move to lp.services in the future and gain its own unit
tests, but that is out of scope for this branch since it's mostly geared to
what the buildd-manager is doing.
Other changes:
* resumeSlave() changed to use the new process spawning. It also takes an
optional twisted.
be moved around at will, which makes testing timeouts much faster!
* ResetDispatchResult now fails the builder (this is the class that
encapsulates the actions on builder reset failures) and pokes the failnotes on
the builder based on the stdout/stderr from the process.
== Tests ==
Lots of test updated in test_manager - some of them had some nasty bugs!
* generally, I've changed "d.addCallback" or "d.addErrBack" calls to
d.addBoth. The problem with the former ones is that it allows the code to do
the wrong thing and the tests don't notice if a failure callback is generated
when it's only expecting success, or vice-versa.
* config.pop moved to addCleanup() so that it's guaranteed to clean up.
Config changing is one of those nasty gotchas that's persistent across tests.
bin/test -cvv test_manager
== Demo and Q/A ==
Dogfood will get a beating when it's up.
= Launchpad lint =
I'm completely ignoring all this crack - pyflakes is returning total rubbish.
Checking for conflicts. and issues in doctests and templates.
Running jslint, xmllint, pyflakes, and pylint.
Using normal rules.
Linting changed files:
lib/lp/
lib/lp/
== Pyflakes notices ==
lib/lp/
418: local variable 'candidate' is assigned to but never used
== Pylint notices ==
lib/lp/
8: [F0401] Unable to import 'transaction'
18: [F0401] Unable to import 'zope.component'
19: [F0401] Unable to import 'zope.security.
21: [F0401] Unable to import 'canonical.
22: [F0401] Unable to import 'canonical.config'
23: [F0401] Unable to import 'canonical.
24: [F0401] Unable to import 'canonical.
25: [F0401] Unable to import 'canonical.
27: [F0401] Unable to import 'lp.buildmaster
28: [F0401] Unable to import 'lp.buildmaster
29: [F0401] Unable to import 'lp.buildmaster
30: [F0401] Unable to import 'lp.buildmaster
33: [F0401] Unable to import 'lp.buildmaster
34: [F0401] Unable to import 'lp.registry.
35: [F0401] Unable to import 'lp.soyuz.
36: [F0401] Unable to import 'lp.soyuz.
37: [F0401] Unable to import 'lp.soyuz.
518: [F0401, TestBuilddManag
'lp.services.
lib/lp/
20: [F0401] Unable to import 'transaction'
29: [F0401] Unable to import 'zope.component'
31: [F0401] Unable to import 'canonical.config'
32: [F0401] Unable to import 'canonical.
33: [F0401] Unable to import 'canonical.
34: [F0401] Unable to import 'lp.buildmaster
235: [F0401, FailDispatchRes
'lp.buildmaster
255: [F0401, ResetDispatchRe
'lp.buildmaster
391: [F0401, BuilddManager.scan] Unable to import
'lp.buildmaster
Hey Julian,
My head didn't spin while doing this review!
It's great to see the tests becoming faster *and* simpler while at the same
time the code is becoming more correct.
I have a bunch of comments, mostly shallow.
Thanks!
jml
=== modified file 'lib/lp/ buildmaster/ manager. py'
* Move ProcessWithTimeout into lp.services. twistedsupport. It's generic er.manager.
enough to be re-used, and no one will ever find it in
lp.buildmast
* Properly speaking, a class this big ought to have unit tests. I won't
insist on it -- follow your heart.
* 'deferred', 'outBuf', 'errBuf' and 'processTransport' should all begin
with underscores if they are not used outside of the class. If they are
used, they should be documented with :ivar deferred: etc as part of the
class docstring.
* 'outBuf', 'errBuf' and 'processTransport' should be formatted with
underscores instead of spaces.
* The line 'self.outReceived = self.outBuf.write' and its err twin will
probably be a little hard to follow for someone who isn't familiar with
ProcessProtocol. Could you please add a comment explaining what's going on?
e.g.
# outReceived and errReceived are callback methods on ProcessProtocol.
# All we want to do when we receive stuff from stdout or stderr is store
# it for later.
* ProcessWithTime out.spawnProces s might as well take *args and **kwargs and spawnProcess.
pass them on to reactor.
* In ProcessWithTime out.timeoutConn ection, I'm nearly certain that method
does nothing to record a TimeoutError -- is the docstring correct?
* You can get rid of a lot of the if/else business in checkResume by
not using 'addBoth' to add it as a callback to the Deferred.
* In the following line:
+ self.logger.error( '%s resume failure: %s' % (slave, error_text))
The space after "error(" should be deleted for PEP 8 glory.
=== modified file 'lib/lp/ buildmaster/ tests/test_ manager. py'
* The "from twisted. internet. error import" line can be joined with the next
* I don't think you need to do self.assertIsIn stance( failure. value, tuple).
You don't actually care if it's a tuple, you just care if it will unpack
to the right values.
* I don't grok the 'builderok' change at the end of the test file -- what's
going on there?
* Possibly related, but I don't see anything here that tests the new failBuilder( self.info) " command that you've added.
"builder.
Hope this helps,
jml