Merge into db-devel : buildd-failure-counting : Code : Launchpad itself

Reviewer	Review Type	Date Requested	Status
Jonathan Lange (community)		2010-08-02	Approve on 2010-08-24
Stuart Bishop (community)	db	2010-08-02	Approve on 2010-08-03
Robert Collins	db	2010-08-02	Pending
Review via email: mp+31556@code.launchpad.net

Stuart Bishop (stub) wrote on 2010-08-03:

Fine. patch-2207-78-0.sql.

review: Approve (db)

Jonathan Lange (jml) wrote on 2010-08-19:

Hey Julian,

This looks like a big step toward build farm reliability -- thank you.

Some of my comments are asking for further explanation, a few make suggestions
for moving some code around, and the rest are the normal stylistic gotchas.

Out of curiosity, did you discuss this change in detail with anyone before
landing?

cheers,
jml

> === modified file 'lib/lp/buildmaster/configure.zcml'
> --- lib/lp/buildmaster/configure.zcml	2010-06-15 14:34:50 +0000
> +++ lib/lp/buildmaster/configure.zcml	2010-08-18 17:01:23 +0000
> @@ -50,6 +50,9 @@
>          <require
>              permission="launchpad.Edit"
>              set_attributes="status"/>
> +        <require
> +            permission="launchpad.Admin"
> +            set_attributes="failure_count"/>
>      </class>
>      <securedutility
>          component="lp.buildmaster.model.buildfarmjob.BuildFarmJob"
>

I guess this means that whatever increments this runs with admin-level
privileges.  Is that really such a good idea?

> === modified file 'lib/lp/buildmaster/manager.py'
> --- lib/lp/buildmaster/manager.py	2010-08-02 16:00:50 +0000
> +++ lib/lp/buildmaster/manager.py	2010-08-18 17:01:23 +0000
...
> @@ -157,6 +158,52 @@
>          if job is not None:
>              job.reset()
>  
> +    def _getBuilder(self):
> +        # Helper to return the builder given the slave for this request.
> +        # Avoiding circular imports.
> +        from lp.buildmaster.interfaces.builder import IBuilderSet
> +        return getUtility(IBuilderSet)[self.slave.name]
> +

I notice that there's very similar code below.

I'd suggest making a top-level function like this:

  def get_builder(slave_name):
      # Helper to return the builder given the slave for this request.
      # Avoiding circular imports.
      from lp.buildmaster.interfaces.builder import IBuilderSet
      return getUtility(IBuilderSet)[slave_name]

And using that here instead, as well as in SlaveScanner.

> +    def assessFailureCounts(self, builder=None):
> +        """View builder/job failure_count and work out which needs to die.
> +

There's spurious whitespace on this line.

> +        :return: True if we disabled something, False if we did not.

This seems like an odd thing to return. It doesn't seem to be used in the
code. What's it for?

> +        """
> +        # Avoiding circular imports.

Not avoiding circular imports here any more :)

> +        if builder is None:
> +            builder = self._getBuilder()
> +        build_job = builder.currentjob.specific_job.build
> +
> +        if builder.failure_count == build_job.failure_count:
> +            # This is either the first failure for this job on this
> +            # builder, or by some chance the job was re-dispatched to
> +            # the same builder.  This make it impossible to determine

"makes"

> +            # whether the job or the builder is at fault, so don't fail
> +            # either.  We reset the builder and job to try again.

It's unclear to me why the two counts being equal could imply that this is the
first failure for the job -- surely a count of 1 would signify that?

Perhaps the comment should say something like:
  # If the failure count for the builder is the same as the
  # failure count for the job being built, then we cannot
  # tell whether the job or the builder is at fault. The best
  # we can do is try them both again, and hope that the job
  # runs against a different builder.
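
A minimal sketch of the comparison that comment describes. Only the equal-counts case is stated in the diff above; the rule that the higher count marks the party at fault is an assumption made for illustration, and `Counted` and `assess_failure_counts` are hypothetical names rather than Launchpad code.

```python
class Counted(object):
    """Hypothetical stand-in for a builder or a build; it only holds a count."""

    def __init__(self, failure_count=0):
        self.failure_count = failure_count


def assess_failure_counts(builder, build):
    """Return 'builder' or 'job' to name the suspect, or None to retry both."""
    if builder.failure_count == build.failure_count:
        # Equal counts carry no information about which side is at fault.
        return None
    if builder.failure_count > build.failure_count:
        # Assumption: the builder has also failed while running other jobs.
        return 'builder'
    # Assumption: the job has also failed on other builders.
    return 'job'
```

Returning a value instead of mutating either object keeps the decision easy to test in isolation.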

...
> @@ -243,8 +282,21 @@
>              error = Failure()
>              self.logger.info("Scanning failed with: %s\n%s" %
>                  (error.getErrorMessage(), error.getTraceback()))
> -            # We should probably detect continuous failures here and mark
> -            # the builder down.
> +
> +            # Avoid circular import.
> +            from lp.buildmaster.interfaces.builder import IBuilderSet
> +            builder = getUtility(IBuilderSet)[self.builder_name]
> +

The get_builder helper would be useful here.

> +            # Decide if we need to terminate the job or fail the
> +            # builder.
> +            self._incrementFailureCounts(builder)
> +            self.logger.info(
> +                "builder failure count: %s, job failure count: %s" % ( 
> +                    builder.failure_count,
> +                    builder.currentjob.specific_job.build.failure_count))
> +            BaseDispatchResult(slave=None).assessFailureCounts(builder)

This line makes me wonder why assessFailureCounts is on BaseDispatchResult at
all.  Perhaps it should be a method on Builder that takes an optional 'info'
parameter?

The above code would then read:

  builder.gotFailure()
  self.logger.info(...)
  builder.assessFailureCounts()

Also, perhaps there should be a current_build property on Builder.

  @property
  def current_build(self):
      return self.currentjob.specific_job.build

Since the buildd-manager seems to do builder.currentjob.specific_job.build an
awful lot.

> @@ -440,6 +495,11 @@
>          self.logger.error('%s resume failure: %s' % (slave, error_text))
>          return self.reset_result(slave, error_text)
>  
> +    def _incrementFailureCounts(self, builder):
> +        # Avoid circular import.

I don't think you are avoiding a circular import here.

> +        builder.failure_count += 1
> +        builder.currentjob.specific_job.build.failure_count += 1
> +

Perhaps this should be a method on Builder. e.g.

  class Builder:

     # ...

     def gotFailure(self):
         self.failure_count += 1
         self.currentjob.specific_job.build.failure_count += 1

Likewise, there could also be a gotSuccess() that resets to 0.
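
A rough sketch of that pair of methods, using stand-in classes rather than the real model. The `current_build` attribute replaces the `currentjob.specific_job.build` chain for brevity, and having gotSuccess() reset only the builder's count is an assumption; the suggestion above does not say which counts get reset.

```python
class Build(object):
    """Hypothetical stand-in for the build attached to the current job."""

    def __init__(self):
        self.failure_count = 0


class Builder(object):
    """Hypothetical stand-in for the Builder model class."""

    def __init__(self, current_build=None):
        self.failure_count = 0
        self.current_build = current_build

    def gotFailure(self):
        # A failure counts against both the builder and the build it ran.
        self.failure_count += 1
        if self.current_build is not None:
            self.current_build.failure_count += 1

    def gotSuccess(self):
        # Assumption: a success clears the builder's record only.
        self.failure_count = 0
```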

> @@ -447,6 +507,9 @@
>          `FailDispatchResult`, if it was a communication failure, simply
>          reset the slave by returning a `ResetDispatchResult`.
>          """
> +        from lp.buildmaster.interfaces.builder import IBuilderSet
> +        builder = getUtility(IBuilderSet)[slave.name]
> +

As mentioned above: builder = get_builder(slave.name)

> === modified file 'lib/lp/buildmaster/tests/test_builder.py'
> --- lib/lp/buildmaster/tests/test_builder.py	2010-08-12 17:33:08 +0000
> +++ lib/lp/buildmaster/tests/test_builder.py	2010-08-18 17:01:23 +0000
...
> @@ -32,7 +34,17 @@
>  class TestBuilder(TestCaseWithFactory):
>      """Basic unit tests for `Builder`."""
>  
> -    layer = LaunchpadZopelessLayer
> +    layer = DatabaseFunctionalLayer
> +
> +    def test_providesInterface(self):
> +        # Builder provides IBuilder
> +        builder = self.factory.makeBuilder()
> +        self.assertProvides(builder, IBuilder)
> +
> +    def test_default_values(self):
> +        builder = self.factory.makeBuilder()
> +        flush_database_updates()
> +        self.assertEqual(0, builder.failure_count)
>

Why is flush_database_updates() necessary? Could you please put the answer in
a comment.

> === modified file 'lib/lp/buildmaster/tests/test_manager.py'
> --- lib/lp/buildmaster/tests/test_manager.py	2010-08-06 10:48:49 +0000
> +++ lib/lp/buildmaster/tests/test_manager.py	2010-08-18 17:01:23 +0000
...
> @@ -360,9 +360,17 @@
>              self.assertFalse(result.processed)
>          return d.addCallback(check_result)
>  
> -    def testCheckDispatch(self):
> -        """`SlaveScanner.checkDispatch` is chained after dispatch requests.
> -
> +    def _setUpSlaveAndBuilder(self):
> +        # Helper function to set up a builder and its recording slave.
> +        slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
> +        bob_builder = getUtility(IBuilderSet)[slave.name]
> +        return slave, bob_builder
> +

There's already a slave called 'bob' in the sampledata? Could you please add a
constant to lp.testing.sampledata like:

BUILD_SLAVE = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')

or

STANDARD_BUILD_SLAVE_NAME = 'bob'

or whatever is most appropriate.

Also, it might be a good idea to have this method guarantee that the failure
counts are zero.

> +    def test_scan_assesses_failure_exceptions(self):
> +        # If scan() fails with an exception, failure_counts should be
> +        # incremented and tested.
> +        def fake_scan():
> +            raise Exception("fake exception")
> +        manager = self._getManager()
> +        manager.scan = fake_scan

Consider calling the function "failing_scan" rather than "fake_scan".

> +        manager.scheduleNextScanCycle = FakeMethod()
> +        self.patch(BaseDispatchResult, 'assessFailureCounts', FakeMethod())

Why are you stubbing out assessFailureCounts here?

> +    def assertBuilderIsClean(self, builder):
> +        # Check that the builder is ready for a new build.
> +        self.assertTrue(builder.builderok)
> +        self.assertTrue(builder.failnotes is None)
> +        self.assertTrue(builder.currentjob is None)
>

Please use assertIs(None, builder.failnotes) instead of assertTrue. Same for
currentjob.  It gets you more helpful error messages when something breaks.
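
A small sketch of the difference, using plain unittest rather than the Launchpad test base classes. `Demo`, `failure_message` and the 'disk full' value are made up for the example.

```python
import unittest


class Demo(unittest.TestCase):
    """Throwaway case, used only to get at the assertion methods."""

    def runTest(self):
        pass


def failure_message(check):
    """Run a failing assertion and return the message it produces."""
    try:
        check()
    except AssertionError as error:
        return str(error)
    return None


case = Demo()
failnotes = 'disk full'  # an unexpected value the assertion should report

# assertTrue only sees the already-evaluated boolean.
vague = failure_message(lambda: case.assertTrue(failnotes is None))
# assertIs sees both operands, so its message can name the offending value.
helpful = failure_message(lambda: case.assertIs(None, failnotes))
```

Printing `vague` and `helpful` side by side shows that only the second mentions 'disk full'.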

> @@ -809,32 +890,82 @@
>          result = ResetDispatchResult(slave)
>          result()
>  
> -        self.assertJobIsClean(job_id)
> +        buildqueue = getUtility(IBuildQueueSet).get(buildqueue_id)
> +        self.assertBuildqueueIsClean(buildqueue)
>  
>          # XXX Julian
>          # Disabled test until bug 586362 is fixed.
>          #self.assertFalse(builder.builderok)
> -        self.assertEqual(None, builder.currentjob)
> +        self.assertBuilderIsClean(builder)
>  
>      def testFailDispatchResult(self):
> -        """`FailDispatchResult` excludes the builder from pool.
> -
> -        It marks the build as failed (builderok=False) and clean any
> -        existing jobs.
> -        """
> +        # Test that `FailDispatchResult` calls assessFailureCounts().

This comment would be more helpful if it had:

# `FailDispatchResult` calls `assessFailureCounts()` so that ...

I don't really know what the "so that" is.

>          builder, job_id = self._getBuilder('bob')
>  
>          # Setup a interaction to satisfy 'write_transaction' decorator.
>          login(ANONYMOUS)
>          slave = RecordingSlave(builder.name, builder.url, builder.vm_host)
>          result = FailDispatchResult(slave, 'does not work!')
> +        result.assessFailureCounts = FakeMethod()

Why are you stubbing it out here?

> +        self.assertEqual(0, result.assessFailureCounts.call_count)

This assertion seems a little redundant.

> === modified file 'lib/lp/soyuz/configure.zcml'
> --- lib/lp/soyuz/configure.zcml	2010-08-16 21:34:11 +0000
> +++ lib/lp/soyuz/configure.zcml	2010-08-18 17:01:23 +0000
> @@ -511,6 +511,15 @@
>              permission="launchpad.Edit"
>              set_attributes="log date_finished date_started builder
>                              status dependencies upload_log"/>
> +
> +        
> +        <require
> +            permission="launchpad.Admin"
> +            set_attributes="failure_count"/>
> +

Thanks for including the comment.  It would be good to say why it is required
now.

jml

review: Needs Fixing

Julian Edwards (julian-edwards) wrote on 2010-08-19:

On Thursday 19 August 2010 13:37:40 you wrote:
> Review: Needs Fixing
> Hey Julian,
> 
> This looks like a big step toward build farm reliability -- thank you.
> 
> Some of my comments are asking for further explanation, a few make
> suggestions for moving some code around, and the rest are the normal
> stylistic gotchas.
> 
> Out of curiosity, did you discuss this change in detail with anyone before
> landing?

It's been so long since I started it I honestly can't remember.  I have vague 
recollections of mentioning failure counts to various people.

> > === modified file 'lib/lp/buildmaster/configure.zcml'
> > --- lib/lp/buildmaster/configure.zcml	2010-06-15 14:34:50 +0000
> > +++ lib/lp/buildmaster/configure.zcml	2010-08-18 17:01:23 +0000
> > @@ -50,6 +50,9 @@
> > 
> >          <require
> >          
> >              permission="launchpad.Edit"
> >              set_attributes="status"/>
> > 
> > +        <require
> > +            permission="launchpad.Admin"
> > +            set_attributes="failure_count"/>
> > 
> >      </class>
> >      <securedutility
> >      
> >          component="lp.buildmaster.model.buildfarmjob.BuildFarmJob"
> 
> I guess this means that whatever increments this runs with admin-level
> privileges.  Is that really such a good idea?

For now yes, because it's only ever changed in "zopeless" scripts.  That is, 
Zope requires this to work, but doesn't actually apply any level of security to 
it.  :/

I made it admin to discourage changing it in appservers.  There are other 
script-only-changing declarations like this, but they use zope.Public, with 
warnings in the zcml about exposing it externally!  I prefer the extra 
protection up front...

> > === modified file 'lib/lp/buildmaster/manager.py'
> > --- lib/lp/buildmaster/manager.py	2010-08-02 16:00:50 +0000
> > +++ lib/lp/buildmaster/manager.py	2010-08-18 17:01:23 +0000
> 
> ...
> 
> > @@ -157,6 +158,52 @@
> > 
> >          if job is not None:
> >              job.reset()
> > 
> > +    def _getBuilder(self):
> > +        # Helper to return the builder given the slave for this request.
> > +        # Avoiding circular imports.
> > +        from lp.buildmaster.interfaces.builder import IBuilderSet
> > +        return getUtility(IBuilderSet)[self.slave.name]
> > +
> 
> I notice that there's very similar code below.
> 
> I'd suggest making a top-level function like this:
> 
>   def get_builder(slave_name):
>       # Helper to return the builder given the slave for this request.
>       # Avoiding circular imports.
>       from lp.buildmaster.interfaces.builder import IBuilderSet
>       return getUtility(IBuilderSet)[slave_name]
> 
> And using that here instead, as well as in SlaveScanner.

Meh, I meant to do that and forgot, thanks.

> 
> > +    def assessFailureCounts(self, builder=None):
> > +        """View builder/job failure_count and work out which needs to die.
> > +
> 
> There's spurious whitespace on this line.

Grar, my vim setup has gone weird on me, it always used to tell me about 
trailing spaces.

> 
> > +        :return: True if we disabled something, False if we did not.
> 
> This seems like an odd thing to return. It doesn't seem to be used in the
> code. What's it for?

For the tests.

> 
> > +        """
> > +        # Avoiding circular imports.
> 
> Not avoiding circular imports here any more :)

I suck at cleaning up after myself :)

> 
> > +        if builder is None:
> > +            builder = self._getBuilder()
> > +        build_job = builder.currentjob.specific_job.build
> > +
> > +        if builder.failure_count == build_job.failure_count:
> > +            # This is either the first failure for this job on this
> > +            # builder, or by some chance the job was re-dispatched to
> > +            # the same builder.  This make it impossible to determine
> 
> "makes"
> 
> > +            # whether the job or the builder is at fault, so don't fail
> > +            # either.  We reset the builder and job to try again.
> 
> It's unclear to me why the two counts being equal could imply that this is
> the first failure for the job -- surely a count of 1 would signify that?

My intention was that it would be read as "1 and 1" but since you failed to 
understand that then I suck and I need to re-write it.

> 
> Perhaps the comment should say something like:
>   # If the failure count for the builder is the same as the
>   # failure count for the job being built, then we cannot
>   # tell whether the job or the builder is at fault. The best
>   # we can do is try them both again, and hope that the job
>   # runs against a different builder.
>

That'll do.

> ...
> 
> > @@ -243,8 +282,21 @@
> > 
> >              error = Failure()
> >              self.logger.info("Scanning failed with: %s\n%s" %
> >              
> >                  (error.getErrorMessage(), error.getTraceback()))
> > 
> > -            # We should probably detect continuous failures here and mark
> > -            # the builder down.
> > +
> > +            # Avoid circular import.
> > +            from lp.buildmaster.interfaces.builder import IBuilderSet
> > +            builder = getUtility(IBuilderSet)[self.builder_name]
> > +
> 
> The get_builder helper would be useful here.

Fixed.

> 
> > +            # Decide if we need to terminate the job or fail the
> > +            # builder.
> > +            self._incrementFailureCounts(builder)
> > +            self.logger.info(
> > +                "builder failure count: %s, job failure count: %s" % (
> > +                    builder.failure_count,
> > +                    builder.currentjob.specific_job.build.failure_count))
> > +            BaseDispatchResult(slave=None).assessFailureCounts(builder)
> 
> This line makes me wonder why assessFailureCounts is on BaseDispatchResult
> at all.  Perhaps it should be a method on Builder that takes an optional
> 'info' parameter?
> 
> The above code would then read:
> 
>   builder.gotFailure()
>   self.logger.info(...)
>   builder.assessFailureCounts()

I don't think we should continue to bloat model classes with code that is only 
used in one place.  This is a buildd-manager-specific function.

Anyway, bugger, I had intended to factor it out to a standalone function as I 
also realised that that line is ridiculous and again forgot :/

> 
> Also, perhaps there should be a current_build property on Builder.
> 
>   @property
>   def current_build(self):
>       return self.currentjob.specific_job.build

I don't really like that either I'm afraid!  It's masking three complicated 
queries here and dangerous to have as a property as it lulls you into a false 
sense of it really being a cheap operation.

I don't mind making it a regular method though, getCurrentBuildFarmJob()
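
A minimal sketch of the method form, assuming the lookup really does cost three queries as described above. `QueryCounter` and the query labels are hypothetical; nothing here touches the real Storm store.

```python
class QueryCounter(object):
    """Hypothetical stand-in for the database; it only counts queries."""

    def __init__(self):
        self.count = 0

    def run(self, description):
        self.count += 1
        return description


class Builder(object):
    """Hypothetical stand-in for the Builder model class."""

    def __init__(self, db):
        self._db = db

    def getCurrentBuildFarmJob(self):
        # An explicit method call tells the reader that work happens here.
        self._db.run('builder -> currentjob')
        self._db.run('currentjob -> specific_job')
        return self._db.run('specific_job -> build')


db = QueryCounter()
builder = Builder(db)
# Calling twice costs six queries; a caller who wants to avoid that keeps
# the first result in a local variable.
builder.getCurrentBuildFarmJob()
builder.getCurrentBuildFarmJob()
```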

> Since the buildd-manager seems to do builder.currentjob.specific_job.build
> an awful lot.

A few :)

> 
> > @@ -440,6 +495,11 @@
> > 
> >          self.logger.error('%s resume failure: %s' % (slave, error_text))
> >          return self.reset_result(slave, error_text)
> > 
> > +    def _incrementFailureCounts(self, builder):
> > +        # Avoid circular import.
> 
> I don't think you are avoiding a circular import here.

Jeez :/

> 
> > +        builder.failure_count += 1
> > +        builder.currentjob.specific_job.build.failure_count += 1
> > +
> 
> Perhaps this should be a method on Builder. e.g.
> 
>   class Builder:
> 
>      # ...
> 
>      def gotFailure(self):
>          self.failure_count += 1
>          self.currentjob.specific_job.build.failure_count += 1
> 
> Likewise, there could also be a gotSuccess() that resets to 0.

For the same reasons as before, I don't think this belongs on IBuilder.  Not 
to mention it's not just manipulating builder properties, it's changing the 
job's properties.

> 
> > @@ -447,6 +507,9 @@
> > 
> >          `FailDispatchResult`, if it was a communication failure, simply
> >          reset the slave by returning a `ResetDispatchResult`.
> >          """
> > 
> > +        from lp.buildmaster.interfaces.builder import IBuilderSet
> > +        builder = getUtility(IBuilderSet)[slave.name]
> > +
> 
> As mentioned above: builder = get_builder(slave.name)

Yup, fixed.

> 
> > === modified file 'lib/lp/buildmaster/tests/test_builder.py'
> > --- lib/lp/buildmaster/tests/test_builder.py	2010-08-12 17:33:08 +0000
> > +++ lib/lp/buildmaster/tests/test_builder.py	2010-08-18 17:01:23 +0000
> 
> ...
> 
> > @@ -32,7 +34,17 @@
> > 
> >  class TestBuilder(TestCaseWithFactory):
> >      """Basic unit tests for `Builder`."""
> > 
> > -    layer = LaunchpadZopelessLayer
> > +    layer = DatabaseFunctionalLayer
> > +
> > +    def test_providesInterface(self):
> > +        # Builder provides IBuilder
> > +        builder = self.factory.makeBuilder()
> > +        self.assertProvides(builder, IBuilder)
> > +
> > +    def test_default_values(self):
> > +        builder = self.factory.makeBuilder()
> > +        flush_database_updates()
> > +        self.assertEqual(0, builder.failure_count)
> 
> Why is flush_database_updates() necessary? Could you please put the answer
> in a comment.

It makes sure the Storm cache gets the values that the database initialises 
for new objects.

> > === modified file 'lib/lp/buildmaster/tests/test_manager.py'
> > --- lib/lp/buildmaster/tests/test_manager.py	2010-08-06 10:48:49 +0000
> > +++ lib/lp/buildmaster/tests/test_manager.py	2010-08-18 17:01:23 +0000
> 
> ...
> 
> > @@ -360,9 +360,17 @@
> > 
> >              self.assertFalse(result.processed)
> >          
> >          return d.addCallback(check_result)
> > 
> > -    def testCheckDispatch(self):
> > -        """`SlaveScanner.checkDispatch` is chained after dispatch requests.
> > -
> > +    def _setUpSlaveAndBuilder(self):
> > +        # Helper function to set up a builder and its recording slave.
> > +        slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
> > +        bob_builder = getUtility(IBuilderSet)[slave.name]
> > +        return slave, bob_builder
> > +
> 
> There's already a slave called 'bob' in the sampledata?

There's a builder called 'bob'.  There are no slaves.

There's also another called 'frog'.

> Could you please
> add a constant to lp.testing.sampledata like:
> 
>   BUILD_SLAVE = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
> 
> or
> 
>   STANDARD_BUILD_SLAVE_NAME = 'bob'
> 
> or whatever is most appropriate.
> 
> Also, it might be a good idea to have this method guarantee that the
> failure counts are zero.

I considered that, but the methods that call this one set the counts to 
varying values.

> 
> > +    def test_scan_assesses_failure_exceptions(self):
> > +        # If scan() fails with an exception, failure_counts should be
> > +        # incremented and tested.
> > +        def fake_scan():
> > +            raise Exception("fake exception")
> > +        manager = self._getManager()
> > +        manager.scan = fake_scan
> 
> Consider calling the function "failing_scan" rather than "fake_scan".

Right, done.

> 
> > +        manager.scheduleNextScanCycle = FakeMethod()
> > +        self.patch(BaseDispatchResult, 'assessFailureCounts', FakeMethod())
> 
> Why are you stubbing out assessFailureCounts here?

Because at the bottom of the method you'll see that it's checking the 
call_count.

> 
> > +    def assertBuilderIsClean(self, builder):
> > +        # Check that the builder is ready for a new build.
> > +        self.assertTrue(builder.builderok)
> > +        self.assertTrue(builder.failnotes is None)
> > +        self.assertTrue(builder.currentjob is None)
> 
> Please use assertIs(None, builder.failnotes) instead of assertTrue. Same
> for currentjob.  It gets you more helpful error messages when something
> breaks.

Agh, my bad.  I knew that.

> 
> > @@ -809,32 +890,82 @@
> > 
> >          result = ResetDispatchResult(slave)
> >          result()
> > 
> > -        self.assertJobIsClean(job_id)
> > +        buildqueue = getUtility(IBuildQueueSet).get(buildqueue_id)
> > +        self.assertBuildqueueIsClean(buildqueue)
> > 
> >          # XXX Julian
> >          # Disabled test until bug 586362 is fixed.
> >          #self.assertFalse(builder.builderok)
> > 
> > -        self.assertEqual(None, builder.currentjob)
> > +        self.assertBuilderIsClean(builder)
> > 
> >      def testFailDispatchResult(self):
> > -        """`FailDispatchResult` excludes the builder from pool.
> > -
> > -        It marks the build as failed (builderok=False) and clean any
> > -        existing jobs.
> > -        """
> > +        # Test that `FailDispatchResult` calls assessFailureCounts().
> 
> This comment would be more helpful if it had:
> 
>   # `FailDispatchResult` calls `assessFailureCounts()` so that ...
> 
> I don't really know what the "so that" is.

Done.

> 
> >          builder, job_id = self._getBuilder('bob')
> >          
> >          # Setup a interaction to satisfy 'write_transaction' decorator.
> >          login(ANONYMOUS)
> >          slave = RecordingSlave(builder.name, builder.url, builder.vm_host)
> >          result = FailDispatchResult(slave, 'does not work!')
> > 
> > +        result.assessFailureCounts = FakeMethod()
> 
> Why are you stubbing it out here?

So that I can see if it got called.

> 
> > +        self.assertEqual(0, result.assessFailureCounts.call_count)
> 
> This assertion seems a little redundant.

Well, it's fairly standard testing practice, no?  How else do I know that 
calling result() has changed the count?
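
A sketch of the before-and-after pattern under discussion. The `FakeMethod` here is a minimal stand-in written for the example, not the Launchpad helper of the same name, and `Result` is a hypothetical substitute for `FailDispatchResult`.

```python
class FakeMethod(object):
    """Minimal stand-in that only records how often it was called."""

    def __init__(self):
        self.call_count = 0

    def __call__(self, *args, **kwargs):
        self.call_count += 1


class Result(object):
    """Hypothetical dispatch result whose call delegates to the assessment."""

    def assessFailureCounts(self):
        raise RuntimeError("the real method should not run in this test")

    def __call__(self):
        self.assessFailureCounts()


result = Result()
# The instance attribute shadows the class method, so the stub gets called.
result.assessFailureCounts = FakeMethod()

# Baseline check: nothing has called the stub yet.
assert result.assessFailureCounts.call_count == 0
result()
# The only call between the two checks is result(), so it made the change.
assert result.assessFailureCounts.call_count == 1
```

With a freshly constructed stub the baseline check is cheap insurance; it matters more when the fake is shared or created in a fixture.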

> 
> > === modified file 'lib/lp/soyuz/configure.zcml'
> > --- lib/lp/soyuz/configure.zcml	2010-08-16 21:34:11 +0000
> > +++ lib/lp/soyuz/configure.zcml	2010-08-18 17:01:23 +0000
> > @@ -511,6 +511,15 @@
> > 
> >              permission="launchpad.Edit"
> >              set_attributes="log date_finished date_started builder
> >              
> >                              status dependencies upload_log"/>
> > 
> > +
> > +        
> > +        <require
> > +            permission="launchpad.Admin"
> > +            set_attributes="failure_count"/>
> > +
> 
> Thanks for including the comment.  It would be good to say why it is
> required now.

That's what the bug reference is for ;)  But I changed it anyway.

Phew, big diff attached.  Thanks!

partial.diff

Stuart Bishop (stub) wrote on 2010-08-20:

Allocated new DB patch number patch-2208-04-0.sql

Jonathan Lange (jml) wrote on 2010-08-20:

On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
<julian.edwards@canonical.com> wrote:
> On Thursday 19 August 2010 13:37:40 you wrote:
>> Review: Needs Fixing
>> Hey Julian,
>>
>> This looks like a big step toward build farm reliability -- thank you.
>>

Thanks for the changes. There are a few questions below, but I reckon
this'll be the last round.

>> Some of my comments are asking for further explanation, a few make
>> suggestions for moving some code around, and the rest are the normal
>> stylistic gotchas.
>>
>> Out of curiosity, did you discuss this change in detail with anyone before
>> landing?
>
> It's been so long since I started it I honestly can't remember.  I have vague
> recollections of mentioning failure counts to various people.
>

No worries. I ask because we have failure counting mechanisms in other
places in Launchpad (e.g. the branch puller) and it might have been
interesting to share patterns, variable names and the like.

>> > === modified file 'lib/lp/buildmaster/configure.zcml'
>> > --- lib/lp/buildmaster/configure.zcml       2010-06-15 14:34:50 +0000
>> > +++ lib/lp/buildmaster/configure.zcml       2010-08-18 17:01:23 +0000
>> > @@ -50,6 +50,9 @@
>> >
>> >          <require
>> >
>> >              permission="launchpad.Edit"
>> >              set_attributes="status"/>
>> >
>> > +        <require
>> > +            permission="launchpad.Admin"
>> > +            set_attributes="failure_count"/>
>> >
>> >      </class>
>> >      <securedutility
>> >
>> >          component="lp.buildmaster.model.buildfarmjob.BuildFarmJob"
>>
>> I guess this means that whatever increments this runs with admin-level
>> privileges.  Is that really such a good idea?
>
> For now yes, because it's only ever changed in "zopeless" scripts.  That is,
> Zope requires this to work, but doesn't actually apply any level of security to
> it.  :/
>
> I made it admin to discourage changing it in appservers.  There are other
> script-only-changing declarations like this, but zope.Public, with warnings in
> the zcml about exposing it externally!  I prefer the extra protection up
> front...
>

Makes sense. I mean, a better permission would be some kind of
"Forbidden" permission, I guess.

>> > === modified file 'lib/lp/buildmaster/manager.py'
>> > --- lib/lp/buildmaster/manager.py   2010-08-02 16:00:50 +0000
>> > +++ lib/lp/buildmaster/manager.py   2010-08-18 17:01:23 +0000
...
>> > +        :return: True if we disabled something, False if we did not.
>>
>> This seems like an odd thing to return. It doesn't seem to be used in the
>> code. What's it for?
>
> For the tests.
>

Yeah, but what about the tests? Or rather, why do the tests care about
something that the code doesn't?

>> > +            # Decide if we need to terminate the job or fail the
>> > +            # builder.
>> > +            self._incrementFailureCounts(builder)
>> > +            self.logger.info(
>> > +                "builder failure count: %s, job failure count: %s" % (
>> > +                    builder.failure_count,
>> > +                    builder.currentjob.specific_job.build.failure_count))
>> > +            BaseDispatchResult(slave=None).assessFailureCounts(builder)
>>
>> This line makes me wonder why assessFailureCounts is on BaseDispatchResult
>> at all.  Perhaps it should be a method on Builder that takes an optional
>> 'info' parameter?
>>
>> The above code would then read:
>>
>>   builder.gotFailure()
>>   self.logger.info(...)
>>   builder.assessFailureCounts()
>
> I don't think we should continue to bloat model classes with code that is only
> used in one place.  This is a buildd-manager-specific function.
>
> Anyway, bugger, I had intended to factor it out to a standalone function as I
> also realised that that line is ridiculous and again forgot :/
>

A standalone function would be a definite improvement, and fine by me.
I personally don't think it's bloat to say "a builder knows how to
handle builder failures", but I'm not going to push it.

>>
>> Also, perhaps there should be a current_build property on Builder.
>>
>>   @property
>>   def current_build(self):
>>       return self.currentjob.specific_job.build
>
> I don't really like that either I'm afraid!  It's masking three complicated
> queries here and dangerous to have as a property as it lulls you into a false
> sense of it really being a cheap operation.
>

I didn't know they were queries!

> I don't mind making it a regular method though, getCurrentBuildFarmJob()
>

That would be great. Out of curiosity, why such a long name? Do
builders ever have jobs that aren't on build farms?

>> > +        builder.failure_count += 1
>> > +        builder.currentjob.specific_job.build.failure_count += 1
>> > +
>>
>> Perhaps this should be a method on Builder. e.g.
>>
>>   class Builder:
>>
>>      # ...
>>
>>      def gotFailure(self):
>>          self.failure_count += 1
>>          self.currentjob.specific_job.build.failure_count += 1
>>
>> Likewise, there could also be a gotSuccess() that resets to 0.
>
> For the same reasons as before, I don't think this belongs on IBuilder.  Not
> to mention it's not just manipulating builder properties, it's changing the
> job's properties.
>

In answer to the second point:
  def gotFailure(self):
    self.failure_count += 1
    self.getCurrentBuild().gotFailure()

And as for bloat, I don't see why adding extra columns is not bloat.

>> > === modified file 'lib/lp/buildmaster/tests/test_manager.py'
>> > --- lib/lp/buildmaster/tests/test_manager.py        2010-08-06 10:48:49 +0000
>> > +++ lib/lp/buildmaster/tests/test_manager.py        2010-08-18 17:01:23 +0000
>>
>> ...
>>
>> > @@ -360,9 +360,17 @@
>> >
>> >              self.assertFalse(result.processed)
>> >
>> >          return d.addCallback(check_result)
>> >
>> > -    def testCheckDispatch(self):
>> > -        """`SlaveScanner.checkDispatch` is chained after dispatch requests.
>> > -
>> > +    def _setUpSlaveAndBuilder(self):
>> > +        # Helper function to set up a builder and its recording slave.
>> > +        slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
>> > +        bob_builder = getUtility(IBuilderSet)[slave.name]
>> > +        return slave, bob_builder
>> > +
>>
>> There's already a slave called 'bob' in the sampledata?
>
> There's a builder called 'bob'.  There are no slaves.
>
> There's also another called 'frog'.
>

OK. As long as there are meaningfully named constants, I'm happy. When
reading the test in the first place, one can't tell whether that
hostname and port number is significant or not.

>> Also, it might be a good idea to have this method guarantee that the
>> failure counts are zero.
>
> I considered that, but the methods that call this one set the counts to
> varying values.
>

I see that, but I do think it would be clearer, and less surprising
for anyone who adds more tests in future.

>> > +        manager.scheduleNextScanCycle = FakeMethod()
>> > +        self.patch(BaseDispatchResult, 'assessFailureCounts', FakeMethod())
>>
>> Why are you stubbing out assessFailureCounts here?
>
> Because at the bottom of the method you'll see that it's checking the
> call_count.
>

Yeah, but why are you checking that it's called? Just as easy and more
robust to check that the state is what you expect.

>>
>> >          builder, job_id = self._getBuilder('bob')
>> >
>> >          # Setup a interaction to satisfy 'write_transaction' decorator.
>> >          login(ANONYMOUS)
>> >          slave = RecordingSlave(
>> >              builder.name, builder.url, builder.vm_host)
>> >          result = FailDispatchResult(slave, 'does not work!')
>> >
>> > +        result.assessFailureCounts = FakeMethod()
>>
>> Why are you stubbing it out here?
>
> So that I can see if it got called.
>

As above, why do you care?

>>
>> > +        self.assertEqual(0, result.assessFailureCounts.call_count)
>>
>> This assertion seems a little redundant.
>
> Well, it's fairly standard testing practice, no?  How else do I know that
> calling result() has changed the count?
>

By checking to see that the count is different.
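
To illustrate what a state-based check could look like, here is a minimal sketch. `FakeBuilder` and `FakeFailResult` are hypothetical stand-ins invented for this example, not the real Launchpad classes; the only assumption is that calling the result ends up incrementing the builder's failure_count.

```python
# Hypothetical stand-ins; not Launchpad's real Builder or result classes.
class FakeBuilder:
    def __init__(self):
        self.failure_count = 0


class FakeFailResult:
    """Illustrative result object that bumps the builder's count."""

    def __init__(self, builder):
        self.builder = builder

    def __call__(self):
        self.builder.failure_count += 1


builder = FakeBuilder()
count_before = builder.failure_count
FakeFailResult(builder)()
# Assert on the observable state rather than on which helper was called.
assert builder.failure_count == count_before + 1
```

The test then keeps passing if the helper is renamed or inlined, as long as the resulting state is the same.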

> === modified file 'lib/lp/buildmaster/manager.py'
> --- lib/lp/buildmaster/manager.py	2010-08-18 16:58:18 +0000
> +++ lib/lp/buildmaster/manager.py	2010-08-19 15:51:41 +0000
> @@ -141,10 +141,55 @@
>          return d
>
>
> +def get_builder(name):
> +    """Helper to return the builder given the slave for this request."""
> +    # Avoiding circular imports.
> +    from lp.buildmaster.interfaces.builder import IBuilderSet
> +    return getUtility(IBuilderSet)[name]
> +
> +
> +def assessFailureCounts(builder, fail_notes):
> +    """View builder/job failure_count and work out which needs to die.
> +
> +    :return: True if we disabled something, False if we did not.
> +    """
> +    current_job = builder.currentjob
> +    build_job = current_job.specific_job.build
> +

Or indeed, build_job = builder.getCurrentBuildFarmJob()

> === modified file 'lib/lp/buildmaster/tests/test_manager.py'
> --- lib/lp/buildmaster/tests/test_manager.py	2010-08-18 16:58:18 +0000
> +++ lib/lp/buildmaster/tests/test_manager.py	2010-08-19 16:06:10 +0000
> @@ -7,7 +7,6 @@
>  import signal
>  import time
>  import transaction
> -import unittest
>
>  from twisted.internet import defer, reactor, task
>  from twisted.internet.error import ConnectionClosed
> @@ -38,6 +37,7 @@
>  from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
>  from lp.testing.factory import LaunchpadObjectFactory
>  from lp.testing.fakemethod import FakeMethod
> +from lp.testing.sampledata import BOB_THE_BUILDER_NAME

There's got to be a better name. What are the interesting properties of the
builder that make it worth using?

> @@ -362,7 +363,8 @@
>
>      def _setUpSlaveAndBuilder(self):
>          # Helper function to set up a builder and its recording slave.
> -        slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
> +        slave = RecordingSlave(
> +            BOB_THE_BUILDER_NAME, 'http://foo.buildd:8221/', 'foo.host')
>          bob_builder = getUtility(IBuilderSet)[slave.name]
>          return slave, bob_builder
>

As mentioned in my reply, are these URLs or domain names special in any way?
If not, why the port number?

jml

Julian Edwards (julian-edwards) wrote on 2010-08-20:

On Friday 20 August 2010 11:43:40 you wrote:
> On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
> 
> <julian.edwards@canonical.com> wrote:
> > On Thursday 19 August 2010 13:37:40 you wrote:
> >> Review: Needs Fixing
> >> Hey Julian,
> >> 
> >> This looks like a big step toward build farm reliability -- thank you.
> 
> Thanks for the changes. There are a few questions below, but I reckon
> this'll be the last round.

I'm not so sure :)

> 
> >> Some of my comments are asking for further explanation, a few make
> >> suggestions for moving some code around, and the rest are the normal
> >> stylistic gotchas.
> >> 
> >> Out of curiosity, did you discuss this change in detail with anyone
> >> before landing?
> > 
> > It's been so long since I started it I honestly can't remember.  I have
> > vague recollections of mentioning failure counts to various people.
> 
> No worries. I ask because we have failure counting mechanisms in other
> places in Launchpad (e.g. the branch puller) and it might have been
> interesting to share patterns, variable names and the like.

I had no idea about that, that's a shame :(

> >> > === modified file 'lib/lp/buildmaster/manager.py'
> >> > --- lib/lp/buildmaster/manager.py   2010-08-02 16:00:50 +0000
> >> > +++ lib/lp/buildmaster/manager.py   2010-08-18 17:01:23 +0000
> 
> ...
> 
> >> > +        :return: True if we disabled something, False if we did not.
> >> 
> >> This seems like an odd thing to return. It doesn't seem to be used in
> >> the code. What's it for?
> > 
> > For the tests.
> 
> Yeah, but what about the tests? Or rather, why do the tests care about
> something that the code doesn't?

Well lots of our tests return stuff that the code doesn't need, like Deferreds 
for example.  But fine, I've got rid of it.

> >> > +            # Decide if we need to terminate the job or fail the
> >> > +            # builder.
> >> > +            self._incrementFailureCounts(builder)
> >> > +            self.logger.info(
> >> > +                "builder failure count: %s, job failure count: %s" % (
> >> > +                    builder.failure_count,
> >> > +                    builder.currentjob.specific_job.build.failure_count))
> >> > +            BaseDispatchResult(slave=None).assessFailureCounts(builder)
> >> 
> >> This line makes me wonder why assessFailureCounts is on
> >> BaseDispatchResult at all.  Perhaps it should be a method on Builder
> >> that takes an optional 'info' parameter?
> >> 
> >> The above code would then read:
> >> 
> >>   builder.gotFailure()
> >>   self.logger.info(...)
> >>   builder.assessFailureCounts()
> > 
> > I don't think we should continue to bloat model classes with code that is
> > only used in one place.  This is a buildd-manager-specific function.
> > 
> > Anyway, bugger, I had intended to factor it out to a standalone function
> > as I also realised that that line is ridiculous and again forgot :/
> 
> A standalone function would be a definite improvement, and fine by me.
> I personally don't think it's bloat to say "a builder knows how to
> handle builder failures", but I'm not going to push it.

By the same argument, anything that deals with object X should be a method on 
that object X.

The difference here is that it's not dealing with builder failures 
exclusively, it's dealing with a combination of builder and job failures, and 
I would personally find it odd that a builder was trying to modify job 
properties.

Generally, I think the boundaries of responsibility in our code are poor and 
should be less like this.

Did I convince you yet?  Probably not :)

> >> Also, perhaps there should be a current_build property on Builder.
> >> 
> >>   @property
> >>   def current_build(self):
> >>       return self.currentjob.specific_job.build
> > 
> > I don't really like that either I'm afraid!  It's masking three
> > complicated queries here and dangerous to have as a property as it lulls
> > you into a false sense of it really being a cheap operation.
> 
> I didn't know they were queries!
> 
> > I don't mind making it a regular method though, getCurrentBuildFarmJob()
> 
> That would be great. Out of curiosity, why such a long name? Do
> builders ever have jobs that aren't on build farms?

BuildFarmJob is the model object name.  There are other *Job tables so this 
makes it specific which one you're getting.

> 
> >> > +        builder.failure_count += 1
> >> > +        builder.currentjob.specific_job.build.failure_count += 1
> >> > +
> >> 
> >> Perhaps this should be a method on Builder. e.g.
> >> 
> >>   class Builder:
> >> 
> >>      # ...
> >> 
> >>      def gotFailure(self):
> >>          self.failure_count += 1
> >>          self.currentjob.specific_job.build.failure_count += 1
> >> 
> >> Likewise, there could also be a gotSuccess() that resets to 0.
> > 
> > For the same reasons as before, I don't think this belongs on IBuilder.
> >  Not to mention it's not just manipulating builder properties, it's
> > changing the job's properties.
> 
> In answer to the second point:
>   def gotFailure(self):
>     self.failure_count += 1
>     self.getCurrentBuild().gotFailure()

I still don't like it :( It's mixing responsibilities with what needs to 
happen on failures, which is something that only the build manager should be 
thinking about.

I'd prefer to have a gotFailure() on each of Builder and BuildFarmJob and have 
the build manager call each in turn.

Let me know if you think that is acceptable and I'll do that change (it's not 
in the attached diff).
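
As a rough sketch of that split (the class bodies are hypothetical and heavily simplified, and `record_failure` is a name made up for this example, none of it is in the attached diff):

```python
# Hypothetical minimal classes; the real Builder and BuildFarmJob are
# database-backed model objects with many more attributes.
class Builder:
    def __init__(self):
        self.failure_count = 0

    def gotFailure(self):
        self.failure_count += 1


class BuildFarmJob:
    def __init__(self):
        self.failure_count = 0

    def gotFailure(self):
        self.failure_count += 1


def record_failure(builder, build_farm_job):
    """Manager-side helper (hypothetical name): bump both counters.

    Each object only touches its own state; deciding what a failure
    means stays with the caller, i.e. the build manager.
    """
    builder.gotFailure()
    build_farm_job.gotFailure()


builder = Builder()
job = BuildFarmJob()
record_failure(builder, job)
```

Neither model object reaches into the other here, which is the point of having the manager call each in turn.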

> 
> And as for bloat, I don't see why adding extra columns is not bloat.

Sorry, I don't understand what you mean by that.

> >> > === modified file 'lib/lp/buildmaster/tests/test_manager.py'
> >> > --- lib/lp/buildmaster/tests/test_manager.py        2010-08-06 10:48:49 +0000
> >> > +++ lib/lp/buildmaster/tests/test_manager.py        2010-08-18 17:01:23 +0000
> >> 
> >> ...
> >> 
> >> > @@ -360,9 +360,17 @@
> >> > 
> >> >              self.assertFalse(result.processed)
> >> > 
> >> >          return d.addCallback(check_result)
> >> > 
> >> > -    def testCheckDispatch(self):
> >> > -        """`SlaveScanner.checkDispatch` is chained after dispatch requests.
> >> > -
> >> > +    def _setUpSlaveAndBuilder(self):
> >> > +        # Helper function to set up a builder and its recording slave.
> >> > +        slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
> >> > +        bob_builder = getUtility(IBuilderSet)[slave.name]
> >> > +        return slave, bob_builder
> >> > +
> >> 
> >> There's already a slave called 'bob' in the sampledata?
> > 
> > There's a builder called 'bob'.  There are no slaves.
> > 
> > There's also another called 'frog'.
> 
> OK. As long as there are meaningfully named constants, I'm happy. When
> reading the test in the first place, one can't tell whether that
> hostname and port number is significant or not.

OK, I'll make them obvious via intent-revealing variable names.

> 
> >> Also, it might be a good idea to have this method guarantee that the
> >> failure counts are zero.
> > 
> > I considered that, but the methods that call this one set the counts to
> > varying values.
> 
> I see that, but I do think it would be clearer, and less surprising
> for anyone who adds more tests in future.

I've re-jigged it so that the counts are passed to the setup method.

> >> > +        manager.scheduleNextScanCycle = FakeMethod()
> >> > +        self.patch(BaseDispatchResult, 'assessFailureCounts', FakeMethod())
> >> 
> >> Why are you stubbing out assessFailureCounts here?
> > 
> > Because at the bottom of the method you'll see that it's checking the
> > call_count.
> 
> Yeah, but why are you checking that it's called? Just as easy and more
> robust to check that the state is what you expect.

I think it's less robust like that.  The expected state after calling that 
function is checked elsewhere.  Now, these tests can just check that it's 
called at all and be confident that it will DTRT.  This means that if we tweak 
the method later, it will only require one test to be fixed instead of fixing 
all the tests that duplicate the same check.  (this is something that 
particularly pisses me off with Soyuz tests and I am slowly fixing them all)

If you think this is wrong, I am open as to why you think so.
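
To make the idea concrete, here is a self-contained sketch of the "check it was called" style. `CallCounter` is a tiny stand-in written for this example (it only mimics the call_count attribute used in the tests above), and `DispatchResult` is hypothetical, not the real BaseDispatchResult.

```python
class CallCounter:
    """Tiny stand-in for a call-recording stub such as FakeMethod."""

    def __init__(self, result=None):
        self.result = result
        self.call_count = 0

    def __call__(self, *args, **kwargs):
        self.call_count += 1
        return self.result


class DispatchResult:
    """Hypothetical result object, not the real BaseDispatchResult."""

    def assessFailureCounts(self):
        raise AssertionError("real assessment should be stubbed out here")

    def __call__(self):
        self.assessFailureCounts()


result = DispatchResult()
# Replace the method on this instance so only the delegation is tested.
result.assessFailureCounts = CallCounter()
assert result.assessFailureCounts.call_count == 0
result()
assert result.assessFailureCounts.call_count == 1
```

The behaviour of the assessment itself would then be covered once, in its own tests, rather than re-checked by every caller's test.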

> >> >          builder, job_id = self._getBuilder('bob')
> >> > 
> >> >          # Setup a interaction to satisfy 'write_transaction' decorator.
> >> >          login(ANONYMOUS)
> >> >          slave = RecordingSlave(
> >> >              builder.name, builder.url, builder.vm_host)
> >> >          result = FailDispatchResult(slave, 'does not work!')
> >> > 
> >> > +        result.assessFailureCounts = FakeMethod()
> >> 
> >> Why are you stubbing it out here?
> > 
> > So that I can see if it got called.
> 
> As above, why do you care?

See above.

> 
> >> > +        self.assertEqual(0, result.assessFailureCounts.call_count)
> >> 
> >> This assertion seems a little redundant.
> > 
> > Well, it's fairly standard testing practice, no?  How else do I know that
> > calling result() has changed the count?
> 
> By checking to see that the count is different.

See above.

> 
> > === modified file 'lib/lp/buildmaster/manager.py'
> > --- lib/lp/buildmaster/manager.py	2010-08-18 16:58:18 +0000
> > +++ lib/lp/buildmaster/manager.py	2010-08-19 15:51:41 +0000
> > @@ -141,10 +141,55 @@
> > 
> >          return d
> > 
> > +def get_builder(name):
> > +    """Helper to return the builder given the slave for this request."""
> > +    # Avoiding circular imports.
> > +    from lp.buildmaster.interfaces.builder import IBuilderSet
> > +    return getUtility(IBuilderSet)[name]
> > +
> > +
> > +def assessFailureCounts(builder, fail_notes):
> > +    """View builder/job failure_count and work out which needs to die.
> > +
> > +    :return: True if we disabled something, False if we did not.
> > +    """
> > +    current_job = builder.currentjob
> > +    build_job = current_job.specific_job.build
> > +
> 
> Or indeed, build_job = builder.getCurrentBuildFarmJob()

That would incur duplicated queries! (via getting currentjob twice)  It's why 
I've split it up as I did.

I think performance should trump things here but perhaps you can think of a 
better way?
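
One possible shape, purely as an assumption for discussion: let the method accept a job the caller has already fetched. `QueryCountingBuilder` and the optional `current_job` argument are hypothetical, and the counter only stands in for a database query.

```python
from types import SimpleNamespace


class QueryCountingBuilder:
    """Hypothetical builder whose `currentjob` lookup is counted."""

    def __init__(self, job):
        self._job = job
        self.currentjob_lookups = 0

    @property
    def currentjob(self):
        # Stands in for a database query.
        self.currentjob_lookups += 1
        return self._job

    def getCurrentBuildFarmJob(self, current_job=None):
        # Optional argument lets callers reuse a job they already fetched.
        if current_job is None:
            current_job = self.currentjob
        return current_job.specific_job.build


build = SimpleNamespace(failure_count=0)
job = SimpleNamespace(specific_job=SimpleNamespace(build=build))
builder = QueryCountingBuilder(job)

current_job = builder.currentjob
build_job = builder.getCurrentBuildFarmJob(current_job)
assert builder.currentjob_lookups == 1
```

Callers that do not already hold the job can still call the method with no argument, at the cost of one lookup.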

> > === modified file 'lib/lp/buildmaster/tests/test_manager.py'
> > --- lib/lp/buildmaster/tests/test_manager.py	2010-08-18 16:58:18 +0000
> > +++ lib/lp/buildmaster/tests/test_manager.py	2010-08-19 16:06:10 +0000
> > @@ -7,7 +7,6 @@
> > 
> >  import signal
> >  import time
> >  import transaction
> > 
> > -import unittest
> > 
> >  from twisted.internet import defer, reactor, task
> >  from twisted.internet.error import ConnectionClosed
> > 
> > @@ -38,6 +37,7 @@
> > 
> >  from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
> >  from lp.testing.factory import LaunchpadObjectFactory
> >  from lp.testing.fakemethod import FakeMethod
> > 
> > +from lp.testing.sampledata import BOB_THE_BUILDER_NAME
> 
> There's got to be a better name. What are the interesting properties of the
> builder that make it worth using?

None that I know of, there's just two boring old builders in the sample data.

> 
> > @@ -362,7 +363,8 @@
> > 
> >      def _setUpSlaveAndBuilder(self):
> >          # Helper function to set up a builder and its recording slave.
> > 
> > -        slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
> > +        slave = RecordingSlave(
> > +            BOB_THE_BUILDER_NAME, 'http://foo.buildd:8221/', 'foo.host')
> > 
> >          bob_builder = getUtility(IBuilderSet)[slave.name]
> >          return slave, bob_builder
> 
> As mentioned in my reply, are these URLs or domain names special in any
> way? If not, why the port number?

I've given them intent-revealing names as I mentioned above.

So, partial diff attached again!  Let me know what you think about my 
questions in reply to your questions :)

Cheers
j

partial.diff

Jonathan Lange (jml) wrote on 2010-08-23:

On Fri, Aug 20, 2010 at 1:16 PM, Julian Edwards
<julian.edwards@canonical.com> wrote:
> On Friday 20 August 2010 11:43:40 you wrote:
>> On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
>> <julian.edwards@canonical.com> wrote:
>> > On Thursday 19 August 2010 13:37:40 you wrote:

...
>> >> > === modified file 'lib/lp/buildmaster/manager.py'
>> >> > --- lib/lp/buildmaster/manager.py   2010-08-02 16:00:50 +0000
>> >> > +++ lib/lp/buildmaster/manager.py   2010-08-18 17:01:23 +0000
>>
>> ...
>>
>> >> > +        :return: True if we disabled something, False if we did not.
>> >>
>> >> This seems like an odd thing to return. It doesn't seem to be used in
>> >> the code. What's it for?
>> >
>> > For the tests.
>>
>> Yeah, but what about the tests? Or rather, why do the tests care about
>> something that the code doesn't?
>
> Well lots of our tests return stuff that the code doesn't need, like Deferreds
> for example.  But fine, I've got rid of it.
>

Thanks.

>> >> > +            # Decide if we need to terminate the job or fail the
>> >> > +            # builder.
>> >> > +            self._incrementFailureCounts(builder)
>> >> > +            self.logger.info(
>> >> > +                "builder failure count: %s, job failure count: %s" % (
>> >> > +                    builder.failure_count,
>> >> > +                    builder.currentjob.specific_job.build.failure_count))
>> >> > +            BaseDispatchResult(slave=None).assessFailureCounts(builder)
>> >>
>> >> This line makes me wonder why assessFailureCounts is on
>> >> BaseDispatchResult at all.  Perhaps it should be a method on Builder
>> >> that takes an optional 'info' parameter?
>> >>
>> >> The above code would then read:
>> >>
>> >>   builder.gotFailure()
>> >>   self.logger.info(...)
>> >>   builder.assessFailureCounts()
>> >
>> > I don't think we should continue to bloat model classes with code that is
>> > only used in one place.  This is a buildd-manager-specific function.
>> >
>> > Anyway, bugger, I had intended to factor it out to a standalone function
>> > as I also realised that that line is ridiculous and again forgot :/
>>
>> A standalone function would be a definite improvement, and fine by me.
>> I personally don't think it's bloat to say "a builder knows how to
>> handle builder failures", but I'm not going to push it.
>
> By the same argument, anything that deals with object X should be a method on
> that object X.
>

I would argue that most things that manipulate state on object X
should be methods of object X. I would not say it's a universal rule.

> The difference here is that it's not dealing with builder failures
> exclusively, it's dealing with a combination of builder and job failures, and
> I would personally find it odd that a builder was trying to modify job
> properties.
>
> Generally, I think the boundaries of responsibility in our code are poor and
> should be less like this.
>
> Did I convince you yet?  Probably not :)
>

I think you have, in this case, particularly given the concessions
below to have methods on Job (however it's spelled) and Builder that
do the failure incrementing. That seems like a clean division of
responsibility.

>>
>> >> > +        builder.failure_count += 1
>> >> > +        builder.currentjob.specific_job.build.failure_count += 1
>> >> > +
>> >>
>> >> Perhaps this should be a method on Builder. e.g.
>> >>
>> >>   class Builder:
>> >>
>> >>      # ...
>> >>
>> >>      def gotFailure(self):
>> >>          self.failure_count += 1
>> >>          self.currentjob.specific_job.build.failure_count += 1
>> >>
>> >> Likewise, there could also be a gotSuccess() that resets to 0.
>> >
>> > For the same reasons as before, I don't think this belongs on IBuilder.
>> >  Not to mention it's not just manipulating builder properties, it's
>> > changing the job's properties.
>>
>> In answer to the second point:
>>   def gotFailure(self):
>>     self.failure_count += 1
>>     self.getCurrentBuild().gotFailure()
>
> I still don't like it :( It's mixing responsibilities with what needs to
> happen on failures, which is something that only the build manager should be
> thinking about.
>
> I'd prefer to have a gotFailure() on each of Builder and BuildFarmJob and have
> the build manager call each in turn.
>
> Let me know if you think that is acceptable and I'll do that change (it's not
> in the attached diff).
>

I think that's a great idea.
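
A minimal sketch of the split agreed above, assuming stand-in classes rather than the real Launchpad models. `Builder` and `BuildFarmJob` here are simplified placeholders, and `increment_failure_counts` is a hypothetical name for the build manager's helper; only the `gotFailure()` and `getCurrentBuildFarmJob()` method names come from the thread.

```python
class BuildFarmJob:
    """Stand-in for the real model; it only tracks failure_count."""

    def __init__(self):
        self.failure_count = 0

    def gotFailure(self):
        # The job only ever touches its own counter.
        self.failure_count += 1


class Builder:
    """Stand-in for the real model; it holds at most one current job."""

    def __init__(self, current_job=None):
        self.failure_count = 0
        self._current_job = current_job

    def getCurrentBuildFarmJob(self):
        return self._current_job

    def gotFailure(self):
        # The builder never reaches into the job's state.
        self.failure_count += 1


def increment_failure_counts(builder):
    """Hypothetical build-manager helper: notify each object in turn."""
    builder.gotFailure()
    builder.getCurrentBuildFarmJob().gotFailure()
```

With this shape, the decision about what a failure means stays in the build manager, while each model object only knows how to update its own counter.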

>>
>> And as for bloat, I don't see why adding extra columns is not bloat.
>
> Sorry, I don't understand what you mean by that.
>

Never mind. Some other time :)

>> >> > === modified file 'lib/lp/buildmaster/tests/test_manager.py'
>> >> > --- lib/lp/buildmaster/tests/test_manager.py        2010-08-06 10:48:49 +0000
>> >> > +++ lib/lp/buildmaster/tests/test_manager.py        2010-08-18 17:01:23 +0000
>> >>
>> >> ...
>> >>
>> >> > @@ -360,9 +360,17 @@
>> >> >
>> >> >              self.assertFalse(result.processed)
>> >> >
>> >> >          return d.addCallback(check_result)
>> >> >
>> >> > -    def testCheckDispatch(self):
>> >> > -        """`SlaveScanner.checkDispatch` is chained after dispatch requests.
>> >> > -
>> >> > +    def _setUpSlaveAndBuilder(self):
>> >> > +        # Helper function to set up a builder and its recording slave.
>> >> > +        slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
>> >> > +        bob_builder = getUtility(IBuilderSet)[slave.name]
>> >> > +        return slave, bob_builder
>> >> > +
>> >>
>> >> There's already a slave called 'bob' in the sampledata?
>> >
>> > There's a builder called 'bob'.  There are no slaves.
>> >
>> > There's also another called 'frog'.
>>
>> OK. As long as there are meaningfully named constants, I'm happy. When
>> reading the test in the first place, one can't tell whether that
>> hostname and port number is significant or not.
>
> OK, I'll make them obvious via intent-revealing variable names.
>

Sweet, thanks.

>>
>> >> Also, it might be a good idea to have this method guarantee that the
>> >> failure counts are zero.
>> >
>> > I considered that, but the methods that call this one set the counts to
>> > varying values.
>>
>> I see that, but I do think it would be clearer, and less surprising
>> for anyone who adds more tests in future.
>
> I've re-jigged it so that the counts are passed to the setup method.
>

Ooh. Nice.

>> >> > +        manager.scheduleNextScanCycle = FakeMethod()
>> >> > +        self.patch(BaseDispatchResult, 'assessFailureCounts', FakeMethod())
>> >>
>> >> Why are you stubbing out assessFailureCounts here?
>> >
>> > Because at the bottom of the method you'll see that it's checking the
>> > call_count.
>>
>> Yeah, but why are you checking that it's called? Just as easy and more
>> robust to check that the state is what you expect.
>
> I think it's less robust like that.  The expected state after calling that
> function is checked elsewhere.  Now, these tests can just check that it's
> called at all and be confident that it will DTRT.  This means that if we tweak
> the method later, it will only require one test to be fixed instead of fixing
> all the tests that duplicate the same check.  (this is something that
> particularly pisses me off with Soyuz tests and I am slowly fixing them all)
>
> If you think this is wrong, I am open as to why you think so.
>

Re robustness, I think you lose one kind of robustness and gain another.

"Was this method called?" tests are robust against behaviour change.
If you change what assessFailureCounts is supposed to do, then you
won't have to change these tests, only the assessFailureCounts ones.
(Some people call this "Mockist")

Tests that check state are robust against implementation change. If
you change the details of how you've implemented the code (as you've
done during this review process), you won't have to change any tests.
(Some people call this "Classic")

As a rule, I lean heavily toward the latter, since I change
implementation details much more frequently than behaviour. I use
mocks in situations where there's a very clearly defined, fairly
stable interface and when using some other kind of double would be
prohibitively difficult.

See http://martinfowler.com/articles/mocksArentStubs.html for a good
discussion on the whole topic.

As far as this diff goes, I think using a mock is less preferable but
still sound, and I'm happy to see it merged.
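
To make the two styles concrete, here is a small sketch under stated assumptions. It uses the standard library's `unittest.mock` in place of Launchpad's `FakeMethod`, and `Builder`, `assess_failure_counts`, `handle_scan_failure` and the threshold of three failures are all hypothetical stand-ins, not the code under review.

```python
from unittest import mock


class Builder:
    """Hypothetical stand-in with just a failure counter."""

    def __init__(self):
        self.failure_count = 0


def assess_failure_counts(builder):
    """Hypothetical policy: report True once three failures are seen."""
    return builder.failure_count >= 3


def handle_scan_failure(builder, assess=assess_failure_counts):
    """Code under test: count the failure, then delegate the decision."""
    builder.failure_count += 1
    return assess(builder)


def mockist_check():
    # Interaction test: only asserts that the collaborator was called.
    # It keeps passing if the policy inside assess_failure_counts changes.
    fake_assess = mock.Mock(return_value=False)
    builder = Builder()
    handle_scan_failure(builder, assess=fake_assess)
    fake_assess.assert_called_once_with(builder)


def classic_check():
    # State test: asserts the observable outcome, not how it was reached.
    # It keeps passing if the delegation is restructured.
    builder = Builder()
    builder.failure_count = 2
    assert handle_scan_failure(builder) is True
    assert builder.failure_count == 3
```

The first check would need no change if the threshold moved; the second would need no change if `handle_scan_failure` stopped delegating and made the decision inline.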

>> > === modified file 'lib/lp/buildmaster/manager.py'
>> > --- lib/lp/buildmaster/manager.py   2010-08-18 16:58:18 +0000
>> > +++ lib/lp/buildmaster/manager.py   2010-08-19 15:51:41 +0000
>> > @@ -141,10 +141,55 @@
>> >
>> >          return d
>> >
>> > +def get_builder(name):
>> > +    """Helper to return the builder given the slave for this request."""
>> > +    # Avoiding circular imports.
>> > +    from lp.buildmaster.interfaces.builder import IBuilderSet
>> > +    return getUtility(IBuilderSet)[name]
>> > +
>> > +
>> > +def assessFailureCounts(builder, fail_notes):
>> > +    """View builder/job failure_count and work out which needs to die.
>> > +
>> > +    :return: True if we disabled something, False if we did not.
>> > +    """
>> > +    current_job = builder.currentjob
>> > +    build_job = current_job.specific_job.build
>> > +
>>
>> Or indeed, build_job = builder.getCurrentBuildFarmJob()
>
> That would incur duplicated queries! (via getting currentjob twice)  It's why
> I've split it up as I did.
>
> I think performance should trump things here but perhaps you can think of a
> better way?
>

It seems to me that Builder.currentjob ought to be replaced with a
method asap, since it's doing a query, and naive readers might well
miss that fact.

If it's doing dupe queries and you aren't going to change currentjob
to a method in this branch, then please add a comment saying something
like:

# builder.currentjob runs a query. Don't run it twice.
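
A rough illustration of why the comment matters, assuming a stand-in `Builder` whose property counts its own accesses; `query_count`, `read_twice` and `read_once` are hypothetical names, and the dictionary stands in for whatever the real query returns.

```python
class Builder:
    """Hypothetical stand-in whose property performs a 'query'."""

    def __init__(self):
        self.query_count = 0

    @property
    def currentjob(self):
        # Each attribute access stands in for one database round trip.
        self.query_count += 1
        return {"build": "some-build"}


def read_twice(builder):
    # Looks like two cheap attribute reads, but runs two queries.
    return builder.currentjob, builder.currentjob["build"]


def read_once(builder):
    # builder.currentjob runs a query. Don't run it twice.
    current_job = builder.currentjob
    return current_job, current_job["build"]
```

Nothing at the call site distinguishes the property from a plain attribute, which is the argument for turning it into a method.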

cheers,
jml

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2010-08-24:

#

On Monday 23 August 2010 18:34:43 Jonathan Lange wrote:
> I would argue that most things that manipulate state on object X
> should be methods of object X. I would not say it's a universal rule.

If we boil it down to the basic manipulation required, I completely agree.

> I think you have, in this case, particularly given the concessions
> below to have methods on Job (however it's spelled) and Builder that
> do the failure incrementing. That seems like a clean division of
> responsibility.

Woot! :)

> > I'd prefer to have a gotFailure() on each of Builder and BuildFarmJob and
> > have the build manager call each in turn.
> >
> > Let me know if you think that is acceptable and I'll do that change (it's
> > not in the attached diff).
>
> I think that's a great idea.

Ok it's done.

> Re robustness, I think you lose one kind of robustness and gain another.

I guess there are different ways of looking at it...

> "Was this method called?" tests are robust against behaviour change.
> If you change what assessFailureCounts is supposed to do, then you
> won't have to change these tests, only the assessFailureCounts ones.
> (Some people call this "Mockist")
>
> Tests that check state are robust against implementation change. If
> you change the details of how you've implemented the code (as you've
> done during this review process), you won't have to change any tests.
> (Some people call this "Classic")
>
> As a rule, I lean heavily toward the latter, since I change
> implementation details much more frequently than behaviour. I use
> mocks in situations where there's a very clearly defined, fairly
> stable interface and when using some other kind of double would be
> prohibitively difficult.
>
> See http://martinfowler.com/articles/mocksArentStubs.html for a good
> discussion on the whole topic.
>
> As far as this diff goes, I think using a mock is less preferable but
> still sound, and I'm happy to see it merged.

I think I'm in total agreement with the sentiment above. In my case here I am
trying to abstract away from the "how failures are counted" as much as
possible since I foresee some changes in how we make it work. I suspect that
this branch will not be the final word on the matter :)

> It seems to me that Builder.currentjob ought to be replaced with a
> method asap, since it's doing a query, and naive readers might well
> miss that fact.

Agreed, I've filed a bug though as after grepping around, it will be a big
diff and I don't want to pollute this branch.

> If it's doing dupe queries and you aren't going to change currentjob
> to a method in this branch, then please add a comment saying something
> like:
>
> # builder.currentjob runs a query. Don't run it twice.

Done!

See partial diff.

Thanks for sticking with me :)

partial.diff

Revision history for this message

Jonathan Lange (jml) wrote on 2010-08-24:

#

On Tue, Aug 24, 2010 at 11:34 AM, Julian Edwards
<email address hidden> wrote:
> On Monday 23 August 2010 18:34:43 Jonathan Lange wrote:
>> I would argue that most things that manipulate state on object X
>> should be methods of object X. I would not say it's a universal rule.
>
> If we boil it down to the basic manipulation required, I completely agree.
>
>> I think you have, in this case, particularly given the concessions
>> below to have methods on Job (however it's spelled) and Builder that
>> do the failure incrementing. That seems like a clean division of
>> responsibility.
>
> Woot! :)
...
> See partial diff.
>
>

Rockin. I'd spell resetFailureCount() as gotSuccessfulDispatch(), but
ok to land as is.

jml

Revision history for this message

Jonathan Lange (jml) on 2010-08-24:

#

review: Approve

=== modified file 'lib/lp/buildmaster/interfaces/builder.py'
--- lib/lp/buildmaster/interfaces/builder.py 2010-08-20 13:47:47 +0000
+++ lib/lp/buildmaster/interfaces/builder.py 2010-08-24 09:55:13 +0000
@@ -154,6 +154,12 @@
         title=u"The current behavior of the builder for the current job.",
         required=False)

+    def gotFailure():
+        """Increment failure_count on the builder."""
+
+    def resetFailureCount():
+        """Set the failure_count back to zero."""
+
     def checkSlaveAlive():
         """Check that the buildd slave is alive.


=== modified file 'lib/lp/buildmaster/interfaces/buildfarmjob.py'
--- lib/lp/buildmaster/interfaces/buildfarmjob.py 2010-07-23 20:27:27 +0000
+++ lib/lp/buildmaster/interfaces/buildfarmjob.py 2010-08-24 10:01:21 +0000
@@ -270,6 +270,9 @@
         returned.
         """

+    def gotFailure():
+        """Increment the failure_count for this job."""
+
     title = exported(TextLine(title=_("Title"), required=False))

     was_built = Attribute("Whether or not modified by the builddfarm.")

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py 2010-08-20 13:42:19 +0000
+++ lib/lp/buildmaster/manager.py 2010-08-24 10:27:37 +0000
@@ -150,6 +150,8 @@

 def assessFailureCounts(builder, fail_notes):
     """View builder/job failure_count and work out which needs to die. """
+    # builder.currentjob hides a complicated query, don't run it twice.
+    # See bug 623281.
     current_job = builder.currentjob
     build_job = current_job.specific_job.build

@@ -361,7 +363,7 @@
         if self.builder.currentjob is not None:
             # After a successful dispatch we can reset the
             # failure_count.
-            self.builder.failure_count = 0
+            self.builder.resetFailureCount()
             transaction.commit()
             return slave

@@ -493,8 +495,8 @@
         return self.reset_result(slave, error_text)

     def _incrementFailureCounts(self, builder):
-        builder.failure_count += 1
-        builder.getCurrentBuildFarmJob().failure_count += 1
+        builder.gotFailure()
+        builder.getCurrentBuildFarmJob().gotFailure()

     def checkDispatch(self, response, method, slave):
         """Verify the results of a slave xmlrpc call.

=== modified file 'lib/lp/buildmaster/model/builder.py'
--- lib/lp/buildmaster/model/builder.py 2010-08-20 13:47:47 +0000
+++ lib/lp/buildmaster/model/builder.py 2010-08-24 10:15:31 +0000
@@ -294,6 +294,14 @@
     current_build_behavior = property(
         _getCurrentBuildBehavior, _setCurrentBuildBehavior)

+    def gotFailure(self):
+        """See `IBuilder`."""
+        self.failure_count += 1
+
+    def resetFailureCount(self):
+        """See `IBuilder`."""
+        self.failure_count = 0
+
     def checkSlaveAlive(self):
         """See IBuilder."""
         if self.slave.echo("Test")[0] != "Test":
@@ -311,6 +319,8 @@
         """See IBuilder."""
         return self.slave.clean()

+    # XXX 2010-08-24 Julian bug=623281
+    # This should not be a property! It's masking a complicated query.
     @property
     def currentjob(self):
         """See IBuilder"""

=== modified file 'lib/lp/buildmaster/model/buildfarmjob.py'
--- lib/lp/buildmaster/model/buildfarmjob.py 2010-08-17 15:04:47 +0000
+++ lib/lp/buildmaster/model/buildfarmjob.py 2010-08-24 10:00:34 +0000
@@ -344,6 +344,10 @@

         return build_without_outer_proxy

+    def gotFailure(self):
+        """See `IBuildFarmJob`."""
+        self.failure_count += 1
+

 class BuildFarmJobDerived:
     implements(IBuildFarmJob)

Merge lp:~julian-edwards/launchpad/buildd-failure-counting into lp:launchpad/db-devel

Preview Diff

 === modified file 'lib/lp/buildmaster/interfaces/builder.py'
 --- lib/lp/buildmaster/interfaces/builder.py	2010-08-17 15:04:47 +0000
 +++ lib/lp/buildmaster/interfaces/builder.py	2010-08-19 14:43:04 +0000
@@ -281,6 +281,9 @@
          :return: A BuildQueue, or None.
          """
++    def getCurrentBuildFarmJob():
++        """Return a `BuildFarmJob` for this builder."""
++
  class IBuilderSet(Interface):
      """Collections of builders.
 === modified file 'lib/lp/buildmaster/manager.py'
 --- lib/lp/buildmaster/manager.py	2010-08-18 16:58:18 +0000
 +++ lib/lp/buildmaster/manager.py	2010-08-19 15:51:41 +0000
@@ -141,10 +141,55 @@
          return d
++def get_builder(name):
++    """Helper to return the builder given the slave for this request."""
++    # Avoiding circular imports.
++    from lp.buildmaster.interfaces.builder import IBuilderSet
++    return getUtility(IBuilderSet)[name]
++
++
++def assessFailureCounts(builder, fail_notes):
++    """View builder/job failure_count and work out which needs to die.
++
++    :return: True if we disabled something, False if we did not.
++    """
++    current_job = builder.currentjob
++    build_job = current_job.specific_job.build
++
++    if builder.failure_count == build_job.failure_count:
++        # If the failure count for the builder is the same as the
++        # failure count for the job being built, then we cannot
++        # tell whether the job or the builder is at fault. The  best
++        # we can do is try them both again, and hope that the job
++        # runs against a different builder.
++        current_job.reset()
++        return False
++
++    if builder.failure_count > build_job.failure_count:
++        # The builder has failed more than the jobs it's been
++        # running, so let's disable it and re-schedule the build.
++        builder.failBuilder(fail_notes)
++        current_job.reset()
++        return True
++    else:
++        # The job is the culprit!  Override its status to 'failed'
++        # to make sure it won't get automatically dispatched again,
++        # and remove the buildqueue request.  The failure should
++        # have already caused any relevant slave data to be stored
++        # on the build record so don't worry about that here.
++        build_job.status = BuildStatus.FAILEDTOBUILD
++        builder.currentjob.destroySelf()
++
++        # N.B. We could try and call _handleStatus_PACKAGEFAIL here
++        # but that would cause us to query the slave for its status
++        # again, and if the slave is non-responsive it holds up the
++        # next buildd scan.
++        return True
++
++
  class BaseDispatchResult:
      """Base class for *DispatchResult variations.
--
      It will be extended to represent dispatching results and allow
      homogeneous processing.
      """
@@ -158,51 +203,13 @@
          if job is not None:
              job.reset()
--    def _getBuilder(self):
--        # Helper to return the builder given the slave for this request.
--        # Avoiding circular imports.
--        from lp.buildmaster.interfaces.builder import IBuilderSet
--        return getUtility(IBuilderSet)[self.slave.name]
--
--    def assessFailureCounts(self, builder=None):
++    def assessFailureCounts(self):
          """View builder/job failure_count and work out which needs to die.
--
++
          :return: True if we disabled something, False if we did not.
          """
--        # Avoiding circular imports.
--        if builder is None:
--            builder = self._getBuilder()
--        build_job = builder.currentjob.specific_job.build
--
--        if builder.failure_count == build_job.failure_count:
--            # This is either the first failure for this job on this
--            # builder, or by some chance the job was re-dispatched to
--            # the same builder.  This make it impossible to determine
--            # whether the job or the builder is at fault, so don't fail
--            # either.  We reset the builder and job to try again.
--            self._cleanJob(builder.currentjob)
--            return False
--
--        if builder.failure_count > build_job.failure_count:
--            # The builder has failed more than the jobs it's been
--            # running, so let's disable it and re-schedule the build.
--            builder.failBuilder(self.info)
--            self._cleanJob(builder.currentjob)
--            return True
--        else:
--            # The job is the culprit!  Override its status to 'failed'
--            # to make sure it won't get automatically dispatched again,
--            # and remove the buildqueue request.  The failure should
--            # have already caused any relevant slave data to be stored
--            # on the build record so don't worry about that here.
--            build_job.status = BuildStatus.FAILEDTOBUILD
--            builder.currentjob.destroySelf()
--
--            # N.B. We could try and call _handleStatus_PACKAGEFAIL here
--            # but that would cause us to query the slave for its status
--            # again, and if the slave is non-responsive it holds up the
--            # next buildd scan.
--            return True
++        builder = get_builder(self.slave.name)
++        return assessFailureCounts(builder, self.info)
      def ___call__(self):
          raise NotImplementedError(
@@ -237,7 +244,7 @@
      @write_transaction
      def __call__(self):
--        builder = self._getBuilder()
++        builder = get_builder(self.slave.name)
          # Builders that fail to reset should be disabled as per bug
          # 563353.
          # XXX Julian bug=586362
@@ -283,18 +290,16 @@
              self.logger.info("Scanning failed with: %s\n%s" %
                  (error.getErrorMessage(), error.getTraceback()))
--            # Avoid circular import.
--            from lp.buildmaster.interfaces.builder import IBuilderSet
--            builder = getUtility(IBuilderSet)[self.builder_name]
++            builder = get_builder(self.builder_name)
              # Decide if we need to terminate the job or fail the
              # builder.
              self._incrementFailureCounts(builder)
              self.logger.info(
--                "builder failure count: %s, job failure count: %s" % (
++                "builder failure count: %s, job failure count: %s" % (
                      builder.failure_count,
--                    builder.currentjob.specific_job.build.failure_count))
--            BaseDispatchResult(slave=None).assessFailureCounts(builder)
++                    builder.getCurrentBuildFarmJob().failure_count))
++            assessFailureCounts(builder, error.getErrorMessage())
              transaction.commit()
              self.scheduleNextScanCycle()
@@ -311,10 +316,7 @@
          # We need to re-fetch the builder object on each cycle as the
          # Storm store is invalidated over transaction boundaries.
--        # Avoid circular import.
--        from lp.buildmaster.interfaces.builder import IBuilderSet
--        builder_set = getUtility(IBuilderSet)
--        self.builder = builder_set[self.builder_name]
++        self.builder = get_builder(self.builder_name)
          if self.builder.builderok:
              self.builder.updateStatus(self.logger)
@@ -496,9 +498,8 @@
          return self.reset_result(slave, error_text)
      def _incrementFailureCounts(self, builder):
--        # Avoid circular import.
          builder.failure_count += 1
--        builder.currentjob.specific_job.build.failure_count += 1
++        builder.getCurrentBuildFarmJob().failure_count += 1
      def checkDispatch(self, response, method, slave):
          """Verify the results of a slave xmlrpc call.
 === modified file 'lib/lp/buildmaster/model/builder.py'
 --- lib/lp/buildmaster/model/builder.py	2010-08-17 15:04:47 +0000
 +++ lib/lp/buildmaster/model/builder.py	2010-08-19 14:42:10 +0000
@@ -645,6 +645,11 @@
              Job._status == JobStatus.RUNNING,
              Job.date_started != None).one()
++    def getCurrentBuildFarmJob(self):
++        """See `IBuilder`."""
++        # Don't make this a property, it's masking a few queries.
++        return self.currentjob.specific_job.build
++
  class BuilderSet(object):
      """See IBuilderSet"""
 === modified file 'lib/lp/buildmaster/tests/test_builder.py'
 --- lib/lp/buildmaster/tests/test_builder.py	2010-08-17 15:04:47 +0000
 +++ lib/lp/buildmaster/tests/test_builder.py	2010-08-19 15:07:15 +0000
@@ -43,9 +43,18 @@
      def test_default_values(self):
          builder = self.factory.makeBuilder()
++        # Make sure the Storm cache gets the values that the database
++        # initialises.
          flush_database_updates()
          self.assertEqual(0, builder.failure_count)
++    def test_getCurrentBuildFarmJob(self):
++        bq = self.factory.makeSourcePackageRecipeBuildJob(3333)
++        builder = self.factory.makeBuilder()
++        bq.markAsBuilding(builder)
++        self.assertEqual(
++            bq, builder.getCurrentBuildFarmJob().buildqueue_record)
++
      def test_getBuildQueue(self):
          buildqueueset = getUtility(IBuildQueueSet)
          active_jobs = buildqueueset.getActiveBuildJobs()
 === modified file 'lib/lp/buildmaster/tests/test_manager.py'
 --- lib/lp/buildmaster/tests/test_manager.py	2010-08-18 16:58:18 +0000
 +++ lib/lp/buildmaster/tests/test_manager.py	2010-08-19 16:06:10 +0000
@@ -7,7 +7,6 @@
  import signal
  import time
  import transaction
--import unittest
  from twisted.internet import defer, reactor, task
  from twisted.internet.error import ConnectionClosed
@@ -38,6 +37,7 @@
  from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
  from lp.testing.factory import LaunchpadObjectFactory
  from lp.testing.fakemethod import FakeMethod
++from lp.testing.sampledata import BOB_THE_BUILDER_NAME
  from lp.testing import TestCase as LaunchpadTestCase
@@ -225,7 +225,8 @@
      def setUp(self):
          TrialTestCase.setUp(self)
--        self.manager = TestingSlaveScanner("bob", BufferLogger())
++        self.manager = TestingSlaveScanner(
++            BOB_THE_BUILDER_NAME, BufferLogger())
          # We will use an instrumented SlaveScanner instance for tests in
          # this context.
@@ -362,7 +363,8 @@
      def _setUpSlaveAndBuilder(self):
          # Helper function to set up a builder and its recording slave.
--        slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
++        slave = RecordingSlave(
++            BOB_THE_BUILDER_NAME, 'http://foo.buildd:8221/', 'foo.host')
          bob_builder = getUtility(IBuilderSet)[slave.name]
          return slave, bob_builder
@@ -509,7 +511,8 @@
              dl.addCallback(check_events)
          # A functional slave charged with some interactions.
--        slave = RecordingSlave('bob', 'http://bob.buildd:8221/', 'bob.host')
++        slave = RecordingSlave(
++            BOB_THE_BUILDER_NAME, 'http://bob.buildd:8221/', 'bob.host')
          slave.ensurepresent('arg1', 'arg2', 'arg3')
          slave.build('arg1', 'arg2', 'arg3')
@@ -541,7 +544,8 @@
          # Create a broken slave and insert interaction that will
          # cause the builder to be marked as fail.
          self.test_proxy = TestingXMLRPCProxy('very broken slave')
--        slave = RecordingSlave('bob', 'http://bob.buildd:8221/', 'bob.host')
++        slave = RecordingSlave(
++            BOB_THE_BUILDER_NAME, 'http://bob.buildd:8221/', 'bob.host')
          slave.ensurepresent('arg1', 'arg2', 'arg3')
          slave.build('arg1', 'arg2', 'arg3')
@@ -623,7 +627,7 @@
          Replace its default logging handler by a testing version.
          """
--        manager = SlaveScanner("bob", BufferLogger())
++        manager = SlaveScanner(BOB_THE_BUILDER_NAME, BufferLogger())
          manager.logger.name = 'slave-scanner'
          return manager
@@ -672,7 +676,7 @@
          # A job gets dispatched to the sampledata builder after it's reset.
          # Reset sampledata builder.
--        builder = getUtility(IBuilderSet)['bob']
++        builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
          self._resetBuilder(builder)
          # Set this to 1 here so that _checkDispatch can make sure it's
          # reset to 0 after a successful dispatch.
@@ -705,7 +709,7 @@
          # and the builder used should remain active and IDLE.
          # Reset sampledata builder.
--        builder = getUtility(IBuilderSet)['bob']
++        builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
          self._resetBuilder(builder)
          # Remove hoary/i386 chroot.
@@ -746,7 +750,7 @@
          # The job assigned to a broken builder is rescued.
          # Sampledata builder is enabled and is assigned to an active job.
--        builder = getUtility(IBuilderSet)['bob']
++        builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
          self.assertTrue(builder.builderok)
          job = builder.currentjob
          self.assertBuildingJob(job, builder)
@@ -783,7 +787,7 @@
          # Enable sampledata builder attached to an appropriate testing
          # slave. It will respond as if it was building the sampledata job.
--        builder = getUtility(IBuilderSet)['bob']
++        builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
          login('foo.bar@canonical.com')
          builder.builderok = True
@@ -804,12 +808,13 @@
      def test_scan_assesses_failure_exceptions(self):
          # If scan() fails with an exception, failure_counts should be
          # incremented and tested.
--        def fake_scan():
++        def failing_scan():
              raise Exception("fake exception")
          manager = self._getManager()
--        manager.scan = fake_scan
++        manager.scan = failing_scan
          manager.scheduleNextScanCycle = FakeMethod()
--        self.patch(BaseDispatchResult, 'assessFailureCounts', FakeMethod())
++        from lp.buildmaster import manager as manager_module
++        self.patch(manager_module, 'assessFailureCounts', FakeMethod())
          builder = getUtility(IBuilderSet)[manager.builder_name]
          # Failure counts start at zero.
@@ -828,10 +833,10 @@
 , builder.currentjob.specific_job.build.failure_count)
          self.assertEqual(
--            1, BaseDispatchResult.assessFailureCounts.call_count)
--
--
--class TestDispatchResult(unittest.TestCase):
++            1, manager_module.assessFailureCounts.call_count)
++
++
++class TestDispatchResult(LaunchpadTestCase):
      """Tests `BaseDispatchResult` variations.
      Variations of `BaseDispatchResult` when evaluated update the database
@@ -874,12 +879,12 @@
      def assertBuilderIsClean(self, builder):
          # Check that the builder is ready for a new build.
          self.assertTrue(builder.builderok)
--        self.assertTrue(builder.failnotes is None)
--        self.assertTrue(builder.currentjob is None)
++        self.assertIs(None, builder.failnotes)
++        self.assertIs(None, builder.currentjob)
      def testResetDispatchResult(self):
          # Test that `ResetDispatchResult` resets the builder and job.
--        builder, job_id = self._getBuilder('bob')
++        builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME)
          buildqueue_id = builder.currentjob.id
          builder.builderok = True
          builder.failure_count = 1
@@ -899,8 +904,11 @@
          self.assertBuilderIsClean(builder)
      def testFailDispatchResult(self):
--        # Test that `FailDispatchResult` calls assessFailureCounts().
--        builder, job_id = self._getBuilder('bob')
++        # Test that `FailDispatchResult` calls assessFailureCounts() so
++        # that we know the builders and jobs are failed as necessary
++        # when a FailDispatchResult is called at the end of the dispatch
++        # chain.
++        builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME)
          # Setup a interaction to satisfy 'write_transaction' decorator.
          login(ANONYMOUS)
@@ -914,7 +922,7 @@
      def _setup_failing_dispatch_result(self):
          # assessFailureCounts should fail jobs or builders depending on
          # whether it sees the failure_counts on each increasing.
--        builder, job_id = self._getBuilder('bob')
++        builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME)
          slave = RecordingSlave(builder.name, builder.url, builder.vm_host)
          result = FailDispatchResult(slave, 'does not work!')
          return builder, result
 === modified file 'lib/lp/soyuz/configure.zcml'
 --- lib/lp/soyuz/configure.zcml	2010-08-17 15:04:47 +0000
 +++ lib/lp/soyuz/configure.zcml	2010-08-19 15:32:16 +0000
@@ -513,6 +513,9 @@
                              status dependencies upload_log"/>
          <!-- XXX bigjools 2010-07-27 bug=570939
++             Work around the fact that not all BuildFarmJobs are concrete
++             objects.
++
               This should not be required once the old BuildFarmJob stuff is
               removed when the Translation Template jobs and Recipe jobs
               use the new infrastructure -->
 === modified file 'lib/lp/testing/sampledata.py'
 --- lib/lp/testing/sampledata.py	2010-08-05 02:29:25 +0000
 +++ lib/lp/testing/sampledata.py	2010-08-19 15:15:27 +0000
@@ -9,9 +9,11 @@
  __metaclass__ = type
  __all__ = [
++    'BOB_THE_BUILDER_NAME',
      'BUILDD_ADMIN_USERNAME',
      'CHROOT_LIBRARYFILEALIAS',
      'COMMERCIAL_ADMIN_EMAIL',
++    'FROG_THE_BUILDER_NAME',
      'HOARY_DISTROSERIES_NAME',
      'I386_ARCHITECTURE_NAME',
      'LAUNCHPAD_DBUSER_NAME',
@@ -34,6 +36,9 @@
  # A user with buildd admin rights and upload rights to Ubuntu.
  BUILDD_ADMIN_USERNAME = 'cprov'
++# A couple of builders.
++BOB_THE_BUILDER_NAME = 'bob'
++FROG_THE_BUILDER_NAME = 'frog'
  # The LibraryFileAlias of a chroot for attaching to a DistroArchSeries
  CHROOT_LIBRARYFILEALIAS = 1
  HOARY_DISTROSERIES_NAME = 'hoary'
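
For readers following the incremental diff to lib/lp/buildmaster/manager.py, the decision that assessFailureCounts makes can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the branch's code. FakeBuilder, FakeJob and the status strings are hypothetical stand-ins. The failure count is read directly off the job rather than via current_job.specific_job.build. The equal-counts condition on the first branch is an assumption, because the diff elides that line.

```python
class FakeJob:
    """Hypothetical stand-in for the queue entry and its build."""

    def __init__(self, failure_count=0):
        self.failure_count = failure_count
        self.status = "BUILDING"  # illustrative status string
        self.reset_calls = 0

    def reset(self):
        # Put the job back in the queue so it can be dispatched again.
        self.reset_calls += 1
        self.status = "NEEDSBUILD"  # illustrative status string


class FakeBuilder:
    """Hypothetical stand-in for a builder record."""

    def __init__(self, failure_count=0, job=None):
        self.failure_count = failure_count
        self.currentjob = job
        self.builderok = True
        self.failnotes = None

    def failBuilder(self, reason):
        # Mark the builder as broken and remember why.
        self.builderok = False
        self.failnotes = reason


def assess_failure_counts(builder, fail_notes):
    """Decide whether the builder or the job takes the blame."""
    job = builder.currentjob
    # Assumed condition: the diff only shows the body of this branch.
    if builder.failure_count == job.failure_count:
        # Cannot tell which is at fault, so retry and hope the job
        # lands on a different builder.
        job.reset()
        return
    if builder.failure_count > job.failure_count:
        # The builder has failed more often than the job it is running.
        builder.failBuilder(fail_notes)
        job.reset()
    else:
        # The job is the culprit, so stop it being dispatched again.
        job.status = "FAILEDTOBUILD"  # illustrative status string
```

As in the revised function, the sketch returns nothing in every branch. Callers check the state of the builder and the job afterwards instead of relying on a boolean result.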

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py 2010-08-20 08:53:14 +0000
+++ lib/lp/buildmaster/manager.py 2010-08-20 11:50:42 +0000
@@ -149,10 +149,7 @@


 def assessFailureCounts(builder, fail_notes):
-    """View builder/job failure_count and work out which needs to die.
-
-    :return: True if we disabled something, False if we did not.
-    """
+    """View builder/job failure_count and work out which needs to die. """
     current_job = builder.currentjob
     build_job = current_job.specific_job.build

@@ -163,14 +160,13 @@
         # we can do is try them both again, and hope that the job
         # runs against a different builder.
         current_job.reset()
-        return False
+        return

     if builder.failure_count > build_job.failure_count:
         # The builder has failed more than the jobs it's been
         # running, so let's disable it and re-schedule the build.
         builder.failBuilder(fail_notes)
         current_job.reset()
-        return True
     else:
         # The job is the culprit! Override its status to 'failed'
         # to make sure it won't get automatically dispatched again,
@@ -184,7 +180,6 @@
         # but that would cause us to query the slave for its status
         # again, and if the slave is non-responsive it holds up the
         # next buildd scan.
-        return True


 class BaseDispatchResult:
@@ -209,7 +204,7 @@