On Monday 18 October 2010 13:21:45 you wrote:
> Review: Needs Fixing
> Hey Julian,
> 
> The new manager is far, far more readable than before.  The hard work has
> paid off.

\o/

> I mention a lot of small things in the comments below.  I'd really
> appreciate it if you could address them all, since now is the best
> opportunity we'll have in a while to make this code comprehensible to
> others.

I shall endeavour to do so.

> > # Copyright 2009 Canonical Ltd.  This software is licensed under the
> > # GNU Affero General Public License version 3 (see the file LICENSE).
> > 
> > """Soyuz buildd slave manager logic."""
> > 
> > __metaclass__ = type
> > 
> > __all__ = [
> > 
> >     'BuilddManager',
> >     'BUILDD_MANAGER_LOG_NAME',
> >     'buildd_success_result_map',
> >     ]
> > 
> > import logging
> > 
> > import transaction
> > from twisted.application import service
> > from twisted.internet import (
> > 
> >     defer,
> >     reactor,
> >     )
> > 
> > from twisted.internet.task import LoopingCall
> > from twisted.python import log
> > from zope.component import getUtility
> > 
> > from lp.buildmaster.enums import BuildStatus
> > from lp.buildmaster.interfaces.buildfarmjobbehavior import (
> > 
> >     BuildBehaviorMismatch,
> >     )
> > 
> > from lp.buildmaster.interfaces.builder import (
> > 
> >     BuildDaemonError,
> >     BuildSlaveFailure,
> >     CannotBuild,
> >     CannotResumeHost,
> >     )
> > 
> > BUILDD_MANAGER_LOG_NAME = "slave-scanner"
> > 
> > 
> > buildd_success_result_map = {
> > 
> >     'ensurepresent': True,
> >     'build': 'BuilderStatus.BUILDING',
> >     }
> 
> You can delete this now. Yay. (Don't forget the __all__ too).

Right!

> 
> > class SlaveScanner:
> >     """A manager for a single builder."""
> >     
> >     SCAN_INTERVAL = 5
> 
> Can you please add a comment explaining what this means, what unit it's in,
> and hinting at why 5 is a good number for it?

Done.

> 
> >     def startCycle(self):
> >         """Scan the builder and dispatch to it or deal with failures."""
> >         self.loop = LoopingCall(self._startCycle)
> >         self.stopping_deferred = self.loop.start(self.SCAN_INTERVAL)
> >         return self.stopping_deferred
> >     
> >     def _startCycle(self):
> >         # Same as _startCycle but the next cycle is not scheduled.  This
> >         # is so tests can initiate a single scan.
> 
> This comment is obsolete.  Also, there's probably a better name than
> "_startCycle", since this is pretty much doing the scan.  Perhaps
> 'oneCycle'.

I named it singleCycle()

> 
> >     def _scanFailed(self, failure):
> >         # Trap known exceptions and print a message without a
> >         # stack trace in that case, or if we don't know about it,
> >         # include the trace.
> 
> This comment is also obsolete.  Although the traceback/no-traceback logic
> is still here, it's hardly the point of the method.

Spruced up as discussed in IRC.

> 
> >         # Paranoia.
> >         transaction.abort()
> 
> Can you please explain in the comment exactly what you are being paranoid
> about?

Yup, done.

> 
> >         error_message = failure.getErrorMessage()
> >         if failure.check(
> >         
> >             BuildSlaveFailure, CannotBuild, BuildBehaviorMismatch,
> >             CannotResumeHost, BuildDaemonError):
> >             self.logger.info("Scanning failed with: %s" % error_message)
> >         
> >         else:
> >             self.logger.info("Scanning failed with: %s\n%s" %
> >             
> >                 (failure.getErrorMessage(), failure.getTraceback()))
> >         
> >         builder = get_builder(self.builder_name)
> 
> Shouldn't this line be in the try: block below?  After all, if the database
> disappears...

I don't think it matters but I've moved it anyway as it looks better if 
nothing else :)

> 
> >         # Decide if we need to terminate the job or fail the
> >         # builder.
> >         
> >         try:
> >             builder.gotFailure()
> >             builder.getCurrentBuildFarmJob().gotFailure()
> >             self.logger.info(
> >             
> >                 "builder failure count: %s, job failure count: %s" % (
> >                 
> >                     builder.failure_count,
> >                     builder.getCurrentBuildFarmJob().failure_count))
> >             
> >             assessFailureCounts(builder, failure.getErrorMessage())
> >             transaction.commit()
> >         
> >         except:
> >             # Catastrophic code failure! Not much we can do.
> >             self.logger.error(
> >             
> >                 "Miserable failure when trying to examine failure
> >                 counts:\n", exc_info=True)
> >             
> >             transaction.abort()
> >     
> >     def scan(self):
> >         """Probe the builder and update/dispatch/collect as appropriate.
> >         
> >         The whole method is wrapped in a transaction, but we do partial
> >         commits to avoid holding locks on tables.
> >         
> >         :return: A `BuilderSlave` if we dispatched a job to it, or None.
> 
> Can you please change this to document that it returns a Deferred.

Done.

> 
> >         """
> >         # We need to re-fetch the builder object on each cycle as the
> >         # Storm store is invalidated over transaction boundaries.
> 
> This method is complicated enough that I think it would benefit from a
> prose summary of what it does.
> 
> For example:
> 
>   If the builder is OK [XXX - I don't know what this actually means - jml],
>   then update its status [XXX - this seems redundant, didn't we just check
>   that it was OK? - jml].  If we think there's an active build on the
>   builder, then check to see if it's done (builder.updateBuild).  If it
>   is done, then check that it's available and not in manual mode, then
>   dispatch the job to the build.
> 
> Well, that's my best guess.  I think it's only worth describing the happy
> case, since the unusual cases will be easy enough to follow in the code.

I've written something - let me know if you think it's easy enough to follow.  
It does require some knowledge of how the farm works and I don't think code 
comments are the best place for explaining that.

> 
> >         self.builder = get_builder(self.builder_name)
> >         
> >         if self.builder.builderok:
> >             d = self.builder.updateStatus(self.logger)
> >         
> >         else:
> >             d = defer.succeed(None)
> >         
> >         def got_update_status(ignored):
> This would be better named 'status_updated', since you haven't actually got
> anything from anywhere.

Ok.

> 
> >             transaction.commit()
> 
> Could you please add a comment explaining why you commit here.

Ok.

> 
> >             # See if we think there's an active build on the builder.
> >             buildqueue = self.builder.getBuildQueue()
> >             
> >             # Scan the slave and get the logtail, or collect the build if
> >             # it's ready.  Yes, "updateBuild" is a bad name.
> >             
> >             if buildqueue is not None:
> >                 return self.builder.updateBuild(buildqueue)
> >         
> >         def got_available(available):
> >             if not available:
> >                 job = self.builder.currentjob
> >                 
> >                 if job is not None and not self.builder.builderok:
> >                     self.logger.info(
> >                     
> >                         "%s was made unavailable, resetting attached "
> >                         "job" % self.builder.name)
> >                     
> >                     job.reset()
> >                     transaction.commit()
> >                 
> >                 return
> >             
> >             # See if there is a job we can dispatch to the builder slave.
> >             
> >             d = self.builder.findAndStartJob()
> >             
> >             def job_started(candidate):
> >                 if self.builder.currentjob is not None:
> >                     # After a successful dispatch we can reset the
> >                     # failure_count.
> >                     self.builder.resetFailureCount()
> >                     transaction.commit()
> >                     return self.builder.slave
> >                 
> >                 else:
> >                     return None
> >             
> >             return d.addCallback(job_started)
> 
> Since got_available is called after 'got_update_build', it would make this
> function read more clearly to move it below got_update_build.

Ok.

> 
> >         def got_update_build(ignored):
> build_updated would be a better name, since you don't really "got" anything
> here.

Ok.

> 
> >             transaction.commit()
> 
> Can you please add a comment explaining why you commit here?

Ok.

> 
> >             # If the builder is in manual mode, don't dispatch anything.
> >             
> >             if self.builder.manual:
> >                 self.logger.debug(
> >                 
> >                     '%s is in manual mode, not dispatching.' %
> >                     self.builder.name)
> >                 
> >                 return
> >             
> >             # If the builder is marked unavailable, don't dispatch
> >             anything. # Additionaly, because builders can be removed
> >             from the pool at # any time, we need to see if we think
> >             there was a build running # on it before it was marked
> >             unavailable. In this case we reset # the build thusly
> >             forcing it to get re-dispatched to another # builder.
> >             
> >             return self.builder.isAvailable().addCallback(got_available)
> >         
> >         d.addCallback(got_update_status)
> >         d.addCallback(got_update_build)
> >         return d
> 
> All in all, that's vastly more readable than the version in trunk.
> Congratulations.

I can't take all the credit, give yourself some!

But anyway.... HELL YEAH!

> > class BuilddManager(service.Service):
> >     """Main Buildd Manager service class."""
> >     
> >     def __init__(self, clock=None):
> >         self.builder_slaves = []
> >         self.logger = self._setupLogger()
> 
> Given that _setupLogger changes global state, it's better to put it in
> startService.

Nocando (which is a bit near Katmandu) - the NewBuildersScanner needs it 
there.  (More below on this section of code)

> 
> >         self.new_builders_scanner = NewBuildersScanner(
> >         
> >             manager=self, clock=clock)
> >     
> >     def _setupLogger(self):
> >         """Setup a 'slave-scanner' logger that redirects to twisted.
> 
> FWIW, "Setup" is a noun. "Set up" is a verb.

Setup is not a word, it's a typo.

> 
> >         It is going to be used locally and within the thread running
> >         the scan() method.
> 
> There's no threading here, is there?

No - it's a docstring copied from the old code.  Fixed, anyway.

> 
> >     def startService(self):
> >         """Service entry point, called when the application starts."""
> >         
> >         # Get a list of builders and set up scanners on each one.
> >         
> >         # Avoiding circular imports.
> >         from lp.buildmaster.interfaces.builder import IBuilderSet
> >         builder_set = getUtility(IBuilderSet)
> >         builders = [builder.name for builder in builder_set]
> >         self.addScanForBuilders(builders)
> >         self.new_builders_scanner.scheduleScan()
> 
> If you change the constructor of NewBuildersScanner to set current_builders
> to [], rather than doing a query, then you could delete everything but the
> last line of this method without sacrificing either correctness or
> performance.

As discussed on IRC I'll defer for now this since there's a weird bug to do 
with builder detection causing dispatch failures.

> >         # Events will now fire in the SlaveScanner objects to scan each
> >         # builder.
> >     
> >     def stopService(self):
> >         """Callback for when we need to shut down."""
> >         # All the SlaveScanner objects need to be halted gracefully.
> >         deferreds = [slave.stopping_deferred for slave in
> >         self.builder_slaves]
> >         deferreds.append(self.new_builders_scanner.stopping_deferred)
> >         
> >         self.new_builders_scanner.loop.stop()
> >         
> >         for slave in self.builder_slaves:
> >             slave.loop.stop()
> 
> There's an asymmetry here.  Given that 'slave' and 'new_builders' have
> startCycle methods, they should probably have stopCycle methods too, rather
> than having their insides poked.

Ok fixed.

> > class TestSlaveScannerScan(TrialTestCase):
> >     """Tests `SlaveScanner.scan` method.
> >     
> >     This method uses the old framework for scanning and dispatching
> >     builds. """
> >     layer = LaunchpadZopelessLayer
> >     
> >     def setUp(self):
> >         """Setup TwistedLayer, TrialTestCase and BuilddSlaveTest.
> >         
> >         Also adjust the sampledata in a way a build can be dispatched to
> >         'bob' builder.
> >         """
> >         from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
> >         TwistedLayer.testSetUp()
> 
> Why not use TwistedLaunchpadZopelessLayer?

Why not indeed.

> 
> >     def _resetBuilder(self, builder):
> >         """Reset the given builder and its job."""
> >         
> >         builder.builderok = True
> >         job = builder.currentjob
> >         
> >         if job is not None:
> >             job.reset()
> >         
> >         transaction.commit()
> 
> Why isn't this a method on builder?

There's no reason for it to be there, this is a testing-only procedure, since 
we'd never do this on a real builder.

> 
> >     def _assertFailureCounting(self, builder_count, job_count,
> >     
> >                                expected_builder_count, 
expected_job_count):
> >         # If scan() fails with an exception, failure_counts should be
> >         # incremented.  What we do with the results of the failure
> >         # counts is tested below separately, this test just makes sure
> >         that # scan() is setting the counts.
> >         
> >         def failing_scan():
> >             return defer.fail(Exception("fake exception"))
> >         
> >         scanner = self._getScanner()
> >         scanner.scan = failing_scan
> >         scanner.scheduleNextScanCycle = FakeMethod()
> 
> Why assign a FakeMethod here?  There's no scanner.scheduleNextScanCycle any
> more.  (This is one reason I prefer state verification to behaviour
> verification).

I missed that one, I'll remove it, thanks.

> 
> >     def test_fail_to_resume_slave_resets_job(self):
> >         # If an attempt to resume and dispatch a slave fails, it should
> >         # reset the job via job.reset()
> >         
> >         # Make a slave with a failing resume() method.
> >         slave = OkSlave()
> >         slave.resume = lambda: deferLater(
> >         
> >             reactor, 0, defer.fail, Failure(('out', 'err', 1)))
> 
> Why is deferLater being used here?

I've explained this before :-)

I don't recall the exact details.  I just remember getting thoroughly pissed 
off in Prague for 2 whole days trying to work out why the test was failing.  I 
narrowed it down to defer.fail firing immediately rather than later.

> 
> > class TestBuilddManager(TrialTestCase):
> >     layer = LaunchpadZopelessLayer
> >     
> >     def _stub_out_scheduleNextScanCycle(self):
> >         # stub out the code that adds a callLater, so that later tests
> >         # don't get surprises.
> >         self.patch(SlaveScanner, 'startCycle', FakeMethod())
> 
> Rather than stubbing out startCycle, you could instead call
> manager.stopService() in the tearDown.

I tried that but got a million* failures.  I'd like to fix this but I don't 
have the energy to do it now :(

> 
> > class TestNewBuilders(TrialTestCase):
> >     """Test detecting of new builders."""
> >     
> >     layer = LaunchpadZopelessLayer
> 
> This needs to use TwistedLaunchpadZopelessLayer, otherwise it will screw up
> signal handlers.

Fixed.

What's the difference anyway?

> 
> >     def _getScanner(self, manager=None, clock=None):
> >         return NewBuildersScanner(manager=manager, clock=clock)
> >     
> >     def test_init_stores_existing_builders(self):
> >         # Make sure that NewBuildersScanner initialises itself properly
> >         # by storing a list of existing builders.
> >         all_builders = [builder.name for builder in
> >         getUtility(IBuilderSet)] builder_scanner = self._getScanner()
> >         self.assertEqual(all_builders, builder_scanner.current_builders)
> 
> As mentioned above, if you change the behaviour of the constructor to not
> do a query, then you can delete duplicated code.
> 
> >     def test_scheduleScan(self):
> >         # Test that scheduleScan calls the "scan" method.
> >         clock = task.Clock()
> >         builder_scanner = self._getScanner(clock=clock)
> >         builder_scanner.scan = FakeMethod()
> >         builder_scanner.scheduleScan()
> >         
> >         advance = NewBuildersScanner.SCAN_INTERVAL + 1
> >         clock.advance(advance)
> >         self.assertNotEqual(
> >         
> >             0, builder_scanner.scan.call_count,
> >             "scheduleScan did not schedule anything")
> 
> The mocking in this test doesn't help much.  It's not that much harder to
> test that if new builders are added to the database and the clock is
> advanced then those builders are registered with the manager.  That's also
> the behaviour you are actually interested in.

The behaviour is tested in test_scan; this test makes sure that test_scan is 
actually called.  I don't understand your point. :(

> 
> >     def test_scan(self):
> >         # See if scan detects new builders.
> >         
> >         # stub out the addScanForBuilders and scheduleScan methods since
> >         # they use callLater; we only want to assert that they get
> >         # called.
> 
> This comment is slightly incorrect, since it doesn't stub out scheduleScan.

Yes it does :)

> Also, callLater is only called very indirectly by addScanForBuilders.

I've removed the whole comment since it was of little value.

> 
> >         def fake_checkForNewBuilders():
> >             return "new_builders"
> >         
> >         def fake_addScanForBuilders(new_builders):
> >             self.assertEqual("new_builders", new_builders)
> >         
> >         clock = task.Clock()
> >         builder_scanner = self._getScanner(BuilddManager(), clock=clock)
> >         builder_scanner.checkForNewBuilders = fake_checkForNewBuilders
> >         builder_scanner.manager.addScanForBuilders =
> >         fake_addScanForBuilders builder_scanner.scheduleScan =
> >         FakeMethod()
> >         
> >         builder_scanner.scan()
> >         advance = NewBuildersScanner.SCAN_INTERVAL + 1
> >         clock.advance(advance)
> 
> This never actually checks that checkForNewBuilders or addScanForBuilders
> is called.
> 
> There's missing test coverage here:
>   * There are no tests that demonstrate that the slave scanners'
>     cycles are started when addScanForBuilders is called.
>   * There are no tests that demonstrate that new builders found in
>     scan() will no longer be new in the next run of scan()

It was brand new code that you previously reviewed in the last "multi-scan" 
branch.  I appreciate your thoroughness, but I've got zero energy to fix this 
code right now, sorry. :(

>   * assessFailureCounts is completely untested

The counts are very thoroughly tested in test_manager, but there's no test for 
actions based on the counts.  I intend to change the behaviour anyway so I'll 
sort that out at the same time.

>   * stopService has no tests.

We've already discussed that this is hard / impossible.  I'd be very happy if 
you could help though.

> I guess there's also a general theme of using fewer FakeMethods. 
> FakeMethod is a useful testing tool, but that every time we use it we
> incur a risk, because we are changing the behaviour of the System Under
> Test in order to test its behaviour.  Better to test the output, rather
> than method calls, since the output is where the value is.
> 
> I look forward to your reply.


Stuff that's left to do in this branch:
 1. Change assessFailureCounts() to cope with lack of a job, more builder 
leniency, and add a test.
 2. Fix the remaining places that catch errors, to send them up to 
_scanFailed() instead.

Finally - sorry about some of the negativity above, I'm coming down with 
something and feeling a bit crap which has made my brain turn to jelly.  I'm 
also getting a bit fed up with this branch now.

Thanks for helping this far.
J