On Monday 18 October 2010 13:21:45 you wrote: > Review: Needs Fixing > Hey Julian, > > The new manager is far, far more readable than before. The hard work has > paid off. \o/ > I mention a lot of small things in the comments below. I'd really > appreciate it if you could address them all, since now is the best > opportunity we'll have in a while to make this code comprehensible to > others. I shall endeavour to do so. > > # Copyright 2009 Canonical Ltd. This software is licensed under the > > # GNU Affero General Public License version 3 (see the file LICENSE). > > > > """Soyuz buildd slave manager logic.""" > > > > __metaclass__ = type > > > > __all__ = [ > > > > 'BuilddManager', > > 'BUILDD_MANAGER_LOG_NAME', > > 'buildd_success_result_map', > > ] > > > > import logging > > > > import transaction > > from twisted.application import service > > from twisted.internet import ( > > > > defer, > > reactor, > > ) > > > > from twisted.internet.task import LoopingCall > > from twisted.python import log > > from zope.component import getUtility > > > > from lp.buildmaster.enums import BuildStatus > > from lp.buildmaster.interfaces.buildfarmjobbehavior import ( > > > > BuildBehaviorMismatch, > > ) > > > > from lp.buildmaster.interfaces.builder import ( > > > > BuildDaemonError, > > BuildSlaveFailure, > > CannotBuild, > > CannotResumeHost, > > ) > > > > BUILDD_MANAGER_LOG_NAME = "slave-scanner" > > > > > > buildd_success_result_map = { > > > > 'ensurepresent': True, > > 'build': 'BuilderStatus.BUILDING', > > } > > You can delete this now. Yay. (Don't forget the __all__ too). Right! > > > class SlaveScanner: > > """A manager for a single builder.""" > > > > SCAN_INTERVAL = 5 > > Can you please add a comment explaining what this means, what unit it's in, > and hinting at why 5 is a good number for it? Done. > > > def startCycle(self): > > """Scan the builder and dispatch to it or deal with failures.""" > > self.loop = LoopingCall(self._startCycle) > > self.stopping_deferred = self.loop.start(self.SCAN_INTERVAL) > > return self.stopping_deferred > > > > def _startCycle(self): > > # Same as _startCycle but the next cycle is not scheduled. This > > # is so tests can initiate a single scan. > > This comment is obsolete. Also, there's probably a better name than > "_startCycle", since this is pretty much doing the scan. Perhaps > 'oneCycle'. I named it singleCycle() > > > def _scanFailed(self, failure): > > # Trap known exceptions and print a message without a > > # stack trace in that case, or if we don't know about it, > > # include the trace. > > This comment is also obsolete. Although the traceback/no-traceback logic > is still here, it's hardly the point of the method. Spruced up as discussed in IRC. > > > # Paranoia. > > transaction.abort() > > Can you please explain in the comment exactly what you are being paranoid > about? Yup, done. > > > error_message = failure.getErrorMessage() > > if failure.check( > > > > BuildSlaveFailure, CannotBuild, BuildBehaviorMismatch, > > CannotResumeHost, BuildDaemonError): > > self.logger.info("Scanning failed with: %s" % error_message) > > > > else: > > self.logger.info("Scanning failed with: %s\n%s" % > > > > (failure.getErrorMessage(), failure.getTraceback())) > > > > builder = get_builder(self.builder_name) > > Shouldn't this line be in the try: block below? After all, if the database > disappears... I don't think it matters but I've moved it anyway as it looks better if nothing else :) > > > # Decide if we need to terminate the job or fail the > > # builder. > > > > try: > > builder.gotFailure() > > builder.getCurrentBuildFarmJob().gotFailure() > > self.logger.info( > > > > "builder failure count: %s, job failure count: %s" % ( > > > > builder.failure_count, > > builder.getCurrentBuildFarmJob().failure_count)) > > > > assessFailureCounts(builder, failure.getErrorMessage()) > > transaction.commit() > > > > except: > > # Catastrophic code failure! Not much we can do. > > self.logger.error( > > > > "Miserable failure when trying to examine failure > > counts:\n", exc_info=True) > > > > transaction.abort() > > > > def scan(self): > > """Probe the builder and update/dispatch/collect as appropriate. > > > > The whole method is wrapped in a transaction, but we do partial > > commits to avoid holding locks on tables. > > > > :return: A `BuilderSlave` if we dispatched a job to it, or None. > > Can you please change this to document that it returns a Deferred. Done. > > > """ > > # We need to re-fetch the builder object on each cycle as the > > # Storm store is invalidated over transaction boundaries. > > This method is complicated enough that I think it would benefit from a > prose summary of what it does. > > For example: > > If the builder is OK [XXX - I don't know what this actually means - jml], > then update its status [XXX - this seems redundant, didn't we just check > that it was OK? - jml]. If we think there's an active build on the > builder, then check to see if it's done (builder.updateBuild). If it > is done, then check that it's available and not in manual mode, then > dispatch the job to the build. > > Well, that's my best guess. I think it's only worth describing the happy > case, since the unusual cases will be easy enough to follow in the code. I've written something - let me know if you think it's easy enough to follow. It does require some knowledge of how the farm works and I don't think code comments are the best place for explaining that. > > > self.builder = get_builder(self.builder_name) > > > > if self.builder.builderok: > > d = self.builder.updateStatus(self.logger) > > > > else: > > d = defer.succeed(None) > > > > def got_update_status(ignored): > This would be better named 'status_updated', since you haven't actually got > anything from anywhere. Ok. > > > transaction.commit() > > Could you please add a comment explaining why you commit here. Ok. > > > # See if we think there's an active build on the builder. > > buildqueue = self.builder.getBuildQueue() > > > > # Scan the slave and get the logtail, or collect the build if > > # it's ready. Yes, "updateBuild" is a bad name. > > > > if buildqueue is not None: > > return self.builder.updateBuild(buildqueue) > > > > def got_available(available): > > if not available: > > job = self.builder.currentjob > > > > if job is not None and not self.builder.builderok: > > self.logger.info( > > > > "%s was made unavailable, resetting attached " > > "job" % self.builder.name) > > > > job.reset() > > transaction.commit() > > > > return > > > > # See if there is a job we can dispatch to the builder slave. > > > > d = self.builder.findAndStartJob() > > > > def job_started(candidate): > > if self.builder.currentjob is not None: > > # After a successful dispatch we can reset the > > # failure_count. > > self.builder.resetFailureCount() > > transaction.commit() > > return self.builder.slave > > > > else: > > return None > > > > return d.addCallback(job_started) > > Since got_available is called after 'got_update_build', it would make this > function read more clearly to move it below got_update_build. Ok. > > > def got_update_build(ignored): > build_updated would be a better name, since you don't really "got" anything > here. Ok. > > > transaction.commit() > > Can you please add a comment explaining why you commit here? Ok. > > > # If the builder is in manual mode, don't dispatch anything. > > > > if self.builder.manual: > > self.logger.debug( > > > > '%s is in manual mode, not dispatching.' % > > self.builder.name) > > > > return > > > > # If the builder is marked unavailable, don't dispatch > > anything. # Additionaly, because builders can be removed > > from the pool at # any time, we need to see if we think > > there was a build running # on it before it was marked > > unavailable. In this case we reset # the build thusly > > forcing it to get re-dispatched to another # builder. > > > > return self.builder.isAvailable().addCallback(got_available) > > > > d.addCallback(got_update_status) > > d.addCallback(got_update_build) > > return d > > All in all, that's vastly more readable than the version in trunk. > Congratulations. I can't take all the credit, give yourself some! But anyway.... HELL YEAH! > > class BuilddManager(service.Service): > > """Main Buildd Manager service class.""" > > > > def __init__(self, clock=None): > > self.builder_slaves = [] > > self.logger = self._setupLogger() > > Given that _setupLogger changes global state, it's better to put it in > startService. Nocando (which is a bit near Katmandu) - the NewBuildersScanner needs it there. (More below on this section of code) > > > self.new_builders_scanner = NewBuildersScanner( > > > > manager=self, clock=clock) > > > > def _setupLogger(self): > > """Setup a 'slave-scanner' logger that redirects to twisted. > > FWIW, "Setup" is a noun. "Set up" is a verb. Setup is not a word, it's a typo. > > > It is going to be used locally and within the thread running > > the scan() method. > > There's no threading here, is there? No - it's a docstring copied from the old code. Fixed, anyway. > > > def startService(self): > > """Service entry point, called when the application starts.""" > > > > # Get a list of builders and set up scanners on each one. > > > > # Avoiding circular imports. > > from lp.buildmaster.interfaces.builder import IBuilderSet > > builder_set = getUtility(IBuilderSet) > > builders = [builder.name for builder in builder_set] > > self.addScanForBuilders(builders) > > self.new_builders_scanner.scheduleScan() > > If you change the constructor of NewBuildersScanner to set current_builders > to [], rather than doing a query, then you could delete everything but the > last line of this method without sacrificing either correctness or > performance. As discussed on IRC I'll defer for now this since there's a weird bug to do with builder detection causing dispatch failures. > > # Events will now fire in the SlaveScanner objects to scan each > > # builder. > > > > def stopService(self): > > """Callback for when we need to shut down.""" > > # All the SlaveScanner objects need to be halted gracefully. > > deferreds = [slave.stopping_deferred for slave in > > self.builder_slaves] > > deferreds.append(self.new_builders_scanner.stopping_deferred) > > > > self.new_builders_scanner.loop.stop() > > > > for slave in self.builder_slaves: > > slave.loop.stop() > > There's an asymmetry here. Given that 'slave' and 'new_builders' have > startCycle methods, they should probably have stopCycle methods too, rather > than having their insides poked. Ok fixed. > > class TestSlaveScannerScan(TrialTestCase): > > """Tests `SlaveScanner.scan` method. > > > > This method uses the old framework for scanning and dispatching > > builds. """ > > layer = LaunchpadZopelessLayer > > > > def setUp(self): > > """Setup TwistedLayer, TrialTestCase and BuilddSlaveTest. > > > > Also adjust the sampledata in a way a build can be dispatched to > > 'bob' builder. > > """ > > from lp.soyuz.tests.test_publishing import SoyuzTestPublisher > > TwistedLayer.testSetUp() > > Why not use TwistedLaunchpadZopelessLayer? Why not indeed. > > > def _resetBuilder(self, builder): > > """Reset the given builder and its job.""" > > > > builder.builderok = True > > job = builder.currentjob > > > > if job is not None: > > job.reset() > > > > transaction.commit() > > Why isn't this a method on builder? There's no reason for it to be there, this is a testing-only procedure, since we'd never do this on a real builder. > > > def _assertFailureCounting(self, builder_count, job_count, > > > > expected_builder_count, expected_job_count): > > # If scan() fails with an exception, failure_counts should be > > # incremented. What we do with the results of the failure > > # counts is tested below separately, this test just makes sure > > that # scan() is setting the counts. > > > > def failing_scan(): > > return defer.fail(Exception("fake exception")) > > > > scanner = self._getScanner() > > scanner.scan = failing_scan > > scanner.scheduleNextScanCycle = FakeMethod() > > Why assign a FakeMethod here? There's no scanner.scheduleNextScanCycle any > more. (This is one reason I prefer state verification to behaviour > verification). I missed that one, I'll remove it, thanks. > > > def test_fail_to_resume_slave_resets_job(self): > > # If an attempt to resume and dispatch a slave fails, it should > > # reset the job via job.reset() > > > > # Make a slave with a failing resume() method. > > slave = OkSlave() > > slave.resume = lambda: deferLater( > > > > reactor, 0, defer.fail, Failure(('out', 'err', 1))) > > Why is deferLater being used here? I've explained this before :-) I don't recall the exact details. I just remember getting thoroughly pissed off in Prague for 2 whole days trying to work out why the test was failing. I narrowed it down to defer.fail firing immediately rather than later. > > > class TestBuilddManager(TrialTestCase): > > layer = LaunchpadZopelessLayer > > > > def _stub_out_scheduleNextScanCycle(self): > > # stub out the code that adds a callLater, so that later tests > > # don't get surprises. > > self.patch(SlaveScanner, 'startCycle', FakeMethod()) > > Rather than stubbing out startCycle, you could instead call > manager.stopService() in the tearDown. I tried that but got a million* failures. I'd like to fix this but I don't have the energy to do it now :( > > > class TestNewBuilders(TrialTestCase): > > """Test detecting of new builders.""" > > > > layer = LaunchpadZopelessLayer > > This needs to use TwistedLaunchpadZopelessLayer, otherwise it will screw up > signal handlers. Fixed. What's the difference anyway? > > > def _getScanner(self, manager=None, clock=None): > > return NewBuildersScanner(manager=manager, clock=clock) > > > > def test_init_stores_existing_builders(self): > > # Make sure that NewBuildersScanner initialises itself properly > > # by storing a list of existing builders. > > all_builders = [builder.name for builder in > > getUtility(IBuilderSet)] builder_scanner = self._getScanner() > > self.assertEqual(all_builders, builder_scanner.current_builders) > > As mentioned above, if you change the behaviour of the constructor to not > do a query, then you can delete duplicated code. > > > def test_scheduleScan(self): > > # Test that scheduleScan calls the "scan" method. > > clock = task.Clock() > > builder_scanner = self._getScanner(clock=clock) > > builder_scanner.scan = FakeMethod() > > builder_scanner.scheduleScan() > > > > advance = NewBuildersScanner.SCAN_INTERVAL + 1 > > clock.advance(advance) > > self.assertNotEqual( > > > > 0, builder_scanner.scan.call_count, > > "scheduleScan did not schedule anything") > > The mocking in this test doesn't help much. It's not that much harder to > test that if new builders are added to the database and the clock is > advanced then those builders are registered with the manager. That's also > the behaviour you are actually interested in. The behaviour is tested in test_scan; this test makes sure that test_scan is actually called. I don't understand your point. :( > > > def test_scan(self): > > # See if scan detects new builders. > > > > # stub out the addScanForBuilders and scheduleScan methods since > > # they use callLater; we only want to assert that they get > > # called. > > This comment is slightly incorrect, since it doesn't stub out scheduleScan. Yes it does :) > Also, callLater is only called very indirectly by addScanForBuilders. I've removed the whole comment since it was of little value. > > > def fake_checkForNewBuilders(): > > return "new_builders" > > > > def fake_addScanForBuilders(new_builders): > > self.assertEqual("new_builders", new_builders) > > > > clock = task.Clock() > > builder_scanner = self._getScanner(BuilddManager(), clock=clock) > > builder_scanner.checkForNewBuilders = fake_checkForNewBuilders > > builder_scanner.manager.addScanForBuilders = > > fake_addScanForBuilders builder_scanner.scheduleScan = > > FakeMethod() > > > > builder_scanner.scan() > > advance = NewBuildersScanner.SCAN_INTERVAL + 1 > > clock.advance(advance) > > This never actually checks that checkForNewBuilders or addScanForBuilders > is called. > > There's missing test coverage here: > * There are no tests that demonstrate that the slave scanners' > cycles are started when addScanForBuilders is called. > * There are no tests that demonstrate that new builders found in > scan() will no longer be new in the next run of scan() It was brand new code that you previously reviewed in the last "multi-scan" branch. I appreciate your thoroughness, but I've got zero energy to fix this code right now, sorry. :( > * assessFailureCounts is completely untested The counts are very thoroughly tested in test_manager, but there's no test for actions based on the counts. I intend to change the behaviour anyway so I'll sort that out at the same time. > * stopService has no tests. We've already discussed that this is hard / impossible. I'd be very happy if you could help though. > I guess there's also a general theme of using fewer FakeMethods. > FakeMethod is a useful testing tool, but that every time we use it we > incur a risk, because we are changing the behaviour of the System Under > Test in order to test its behaviour. Better to test the output, rather > than method calls, since the output is where the value is. > > I look forward to your reply. Stuff that's left to do in this branch: 1. Change assessFailureCounts() to cope with lack of a job, more builder leniency, and add a test. 2. Fix the remaining places that catch errors, to send them up to _scanFailed() instead. Finally - sorry about some of the negativity above, I'm coming down with something and feeling a bit crap which has made my brain turn to jelly. I'm also getting a bit fed up with this branch now. Thanks for helping this far. J