Hey Julian,

This looks like a big step toward build farm reliability -- thank you.

Some of my comments are asking for further explanation, a few make suggestions
for moving some code around, and the rest are the normal stylistic gotchas.

Out of curiosity, did you discuss this change in detail with anyone before
landing?

cheers,
jml

> === modified file 'lib/lp/buildmaster/configure.zcml'
> --- lib/lp/buildmaster/configure.zcml	2010-06-15 14:34:50 +0000
> +++ lib/lp/buildmaster/configure.zcml	2010-08-18 17:01:23 +0000
> @@ -50,6 +50,9 @@
>          <require
>              permission="launchpad.Edit"
>              set_attributes="status"/>
> +        <require
> +            permission="launchpad.Admin"
> +            set_attributes="failure_count"/>
>      </class>
>      <securedutility
>          component="lp.buildmaster.model.buildfarmjob.BuildFarmJob"
>

I guess this means that whatever increments this runs with admin-level
privileges.  Is that really such a good idea?


> === modified file 'lib/lp/buildmaster/manager.py'
> --- lib/lp/buildmaster/manager.py	2010-08-02 16:00:50 +0000
> +++ lib/lp/buildmaster/manager.py	2010-08-18 17:01:23 +0000
...
> @@ -157,6 +158,52 @@
>          if job is not None:
>              job.reset()
>  
> +    def _getBuilder(self):
> +        # Helper to return the builder given the slave for this request.
> +        # Avoiding circular imports.
> +        from lp.buildmaster.interfaces.builder import IBuilderSet
> +        return getUtility(IBuilderSet)[self.slave.name]
> +

I notice that there's very similar code below.

I'd suggest making a top-level function like this:

  def get_builder(slave_name):
      # Helper to return the builder given the slave for this request.
      # Avoiding circular imports.
      from lp.buildmaster.interfaces.builder import IBuilderSet
      return getUtility(IBuilderSet)[self.slave.name]

And using that here instead, as well as in SlaveScanner.

> +    def assessFailureCounts(self, builder=None):
> +        """View builder/job failure_count and work out which needs to die.
> +

There's spurious whitespace on this line.

> +        :return: True if we disabled something, False if we did not.

This seems like an odd thing to return. It doesn't seem to be used in the
code. What's it for?

> +        """
> +        # Avoiding circular imports.

Not avoiding circular imports here any more :)

> +        if builder is None:
> +            builder = self._getBuilder()
> +        build_job = builder.currentjob.specific_job.build
> +
> +        if builder.failure_count == build_job.failure_count:
> +            # This is either the first failure for this job on this
> +            # builder, or by some chance the job was re-dispatched to
> +            # the same builder.  This make it impossible to determine

"makes"

> +            # whether the job or the builder is at fault, so don't fail
> +            # either.  We reset the builder and job to try again.

It's unclear to me why the two counts being equal could imply that this is the
first failure for the job -- surely a count of 1 would signify that?

Perhaps the comment should say something like:
  # If the failure count for the builder is the same as the
  # failure count for the job being built, then we cannot
  # tell whether the job or the builder is at fault. The best
  # we can do is try them both again, and hope that the job
  # runs against a different builder.

...
> @@ -243,8 +282,21 @@
>              error = Failure()
>              self.logger.info("Scanning failed with: %s\n%s" %
>                  (error.getErrorMessage(), error.getTraceback()))
> -            # We should probably detect continuous failures here and mark
> -            # the builder down.
> +
> +            # Avoid circular import.
> +            from lp.buildmaster.interfaces.builder import IBuilderSet
> +            builder = getUtility(IBuilderSet)[self.builder_name]
> +

The get_builder helper would be useful here.

> +            # Decide if we need to terminate the job or fail the
> +            # builder.
> +            self._incrementFailureCounts(builder)
> +            self.logger.info(
> +                "builder failure count: %s, job failure count: %s" % ( 
> +                    builder.failure_count,
> +                    builder.currentjob.specific_job.build.failure_count))
> +            BaseDispatchResult(slave=None).assessFailureCounts(builder)

This line makes me wonder why assessFailureCounts is on BaseDispatchResult at
all.  Perhaps it should be a method on Builder that takes an optional 'info'
parameter?

The above code would then read:

  builder.gotFailure()
  self.logger.info(...)
  builder.assessFailureCounts()

Also, perhaps there should be a current_build property on Builder.

  @property
  def current_build(self):
      return self.currentjob.specific_job.build

Since the buildd-manager seems to do builder.currentjob.specific_job.build an
awful lot.

> @@ -440,6 +495,11 @@
>          self.logger.error('%s resume failure: %s' % (slave, error_text))
>          return self.reset_result(slave, error_text)
>  
> +    def _incrementFailureCounts(self, builder):
> +        # Avoid circular import.

I don't think you are avoiding a circular import here.

> +        builder.failure_count += 1
> +        builder.currentjob.specific_job.build.failure_count += 1
> +

Perhaps this should be a method on Builder. e.g.

  class Builder:

     # ...

     def gotFailure(self):
         self.failure_count += 1
         self.currentjob.specific_job.build.failure_count += 1

Likewise, there could also be a gotSuccess() that resets to 0.

> @@ -447,6 +507,9 @@
>          `FailDispatchResult`, if it was a communication failure, simply
>          reset the slave by returning a `ResetDispatchResult`.
>          """
> +        from lp.buildmaster.interfaces.builder import IBuilderSet
> +        builder = getUtility(IBuilderSet)[slave.name]
> +

As mentioned above: builder = get_builder(slave.name)

> === modified file 'lib/lp/buildmaster/tests/test_builder.py'
> --- lib/lp/buildmaster/tests/test_builder.py	2010-08-12 17:33:08 +0000
> +++ lib/lp/buildmaster/tests/test_builder.py	2010-08-18 17:01:23 +0000
...
> @@ -32,7 +34,17 @@
>  class TestBuilder(TestCaseWithFactory):
>      """Basic unit tests for `Builder`."""
>  
> -    layer = LaunchpadZopelessLayer
> +    layer = DatabaseFunctionalLayer
> +
> +    def test_providesInterface(self):
> +        # Builder provides IBuilder
> +        builder = self.factory.makeBuilder()
> +        self.assertProvides(builder, IBuilder)
> +
> +    def test_default_values(self):
> +        builder = self.factory.makeBuilder()
> +        flush_database_updates()
> +        self.assertEqual(0, builder.failure_count)
>

Why is flush_database_updates() necessary? Could you please put the answer in
a comment.


> === modified file 'lib/lp/buildmaster/tests/test_manager.py'
> --- lib/lp/buildmaster/tests/test_manager.py	2010-08-06 10:48:49 +0000
> +++ lib/lp/buildmaster/tests/test_manager.py	2010-08-18 17:01:23 +0000
...
> @@ -360,9 +360,17 @@
>              self.assertFalse(result.processed)
>          return d.addCallback(check_result)
>  
> -    def testCheckDispatch(self):
> -        """`SlaveScanner.checkDispatch` is chained after dispatch requests.
> -
> +    def _setUpSlaveAndBuilder(self):
> +        # Helper function to set up a builder and its recording slave.
> +        slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
> +        bob_builder = getUtility(IBuilderSet)[slave.name]
> +        return slave, bob_builder
> +

There's already a slave called 'bob' in the sampledata? Could you please add a
constant to lp.testing.sampledata like:

  BUILD_SLAVE = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')

or

  STANDARD_BUILD_SLAVE_NAME = 'bob'

or whatever is most appropriate.

Also, it might be a good idea to have this method guarantee that the failure
counts are zero.

> +    def test_scan_assesses_failure_exceptions(self):
> +        # If scan() fails with an exception, failure_counts should be
> +        # incremented and tested.
> +        def fake_scan():
> +            raise Exception("fake exception")
> +        manager = self._getManager()
> +        manager.scan = fake_scan

Consider calling the function "failing_scan" rather than "fake_scan".

> +        manager.scheduleNextScanCycle = FakeMethod()
> +        self.patch(BaseDispatchResult, 'assessFailureCounts', FakeMethod())

Why are you stubbing out assessFailureCounts here?

> +    def assertBuilderIsClean(self, builder):
> +        # Check that the builder is ready for a new build.
> +        self.assertTrue(builder.builderok)
> +        self.assertTrue(builder.failnotes is None)
> +        self.assertTrue(builder.currentjob is None)
>

Please use assertIs(None, builder.failnotes) instead of assertTrue. Same for
currentjob.  It gets you more helpful error messages when something breaks.

> @@ -809,32 +890,82 @@
>          result = ResetDispatchResult(slave)
>          result()
>  
> -        self.assertJobIsClean(job_id)
> +        buildqueue = getUtility(IBuildQueueSet).get(buildqueue_id)
> +        self.assertBuildqueueIsClean(buildqueue)
>  
>          # XXX Julian
>          # Disabled test until bug 586362 is fixed.
>          #self.assertFalse(builder.builderok)
> -        self.assertEqual(None, builder.currentjob)
> +        self.assertBuilderIsClean(builder)
>  
>      def testFailDispatchResult(self):
> -        """`FailDispatchResult` excludes the builder from pool.
> -
> -        It marks the build as failed (builderok=False) and clean any
> -        existing jobs.
> -        """
> +        # Test that `FailDispatchResult` calls assessFailureCounts().

This comment would be more helpful if it had:

  # `FailDispatchResult` calls `assessFailureCounts()` so that ...

I don't really know what the "so that" is.

>          builder, job_id = self._getBuilder('bob')
>  
>          # Setup a interaction to satisfy 'write_transaction' decorator.
>          login(ANONYMOUS)
>          slave = RecordingSlave(builder.name, builder.url, builder.vm_host)
>          result = FailDispatchResult(slave, 'does not work!')
> +        result.assessFailureCounts = FakeMethod()

Why are you stubbing it out here?

> +        self.assertEqual(0, result.assessFailureCounts.call_count)

This assertion seems a little redundant.


> === modified file 'lib/lp/soyuz/configure.zcml'
> --- lib/lp/soyuz/configure.zcml	2010-08-16 21:34:11 +0000
> +++ lib/lp/soyuz/configure.zcml	2010-08-18 17:01:23 +0000
> @@ -511,6 +511,15 @@
>              permission="launchpad.Edit"
>              set_attributes="log date_finished date_started builder
>                              status dependencies upload_log"/>
> +
> +        <!-- XXX bigjools 2010-07-27 bug=570939
> +             This should not be required once the old BuildFarmJob stuff is
> +             removed when the Translation Template jobs and Recipe jobs
> +             use the new infrastructure -->
> +        <require
> +            permission="launchpad.Admin"
> +            set_attributes="failure_count"/>
> +

Thanks for including the comment.  It would be good to say why it is required
now.

jml