Merge lp:~julian-edwards/launchpad/buildd-failure-counting into lp:launchpad/db-devel

Proposed by Julian Edwards
Status: Merged
Merged at revision: 9702
Proposed branch: lp:~julian-edwards/launchpad/buildd-failure-counting
Merge into: lp:launchpad/db-devel
Prerequisite: lp:~julian-edwards/launchpad/buildd-manager-parallel-scan
Diff against target: 0 lines
To merge this branch: bzr merge lp:~julian-edwards/launchpad/buildd-failure-counting
Reviewer Review Type Date Requested Status
Jonathan Lange (community) Approve
Stuart Bishop (community) db Approve
Robert Collins db Pending
Review via email: mp+31556@code.launchpad.net

Description of the change

Add failure counting on builder and jobs in the build farm, to work out which one is the nasty one when Stuff Goes Wrong. The builder failure count is reset when we know the dispatch was successul, so it's only there to catch actual dispatch problems, not problems while building.

To post a comment you must log in.
Revision history for this message
Stuart Bishop (stub) wrote :

Fine. patch-2207-78-0.sql.

review: Approve (db)
Revision history for this message
Jonathan Lange (jml) wrote :
Download full text (11.0 KiB)

Hey Julian,

This looks like a big step toward build farm reliability -- thank you.

Some of my comments are asking for further explanation, a few make suggestions
for moving some code around, and the rest are the normal stylistic gotchas.

Out of curiosity, did you discuss this change in detail with anyone before
landing?

cheers,
jml

> === modified file 'lib/lp/buildmaster/configure.zcml'
> --- lib/lp/buildmaster/configure.zcml 2010-06-15 14:34:50 +0000
> +++ lib/lp/buildmaster/configure.zcml 2010-08-18 17:01:23 +0000
> @@ -50,6 +50,9 @@
> <require
> permission="launchpad.Edit"
> set_attributes="status"/>
> + <require
> + permission="launchpad.Admin"
> + set_attributes="failure_count"/>
> </class>
> <securedutility
> component="lp.buildmaster.model.buildfarmjob.BuildFarmJob"
>

I guess this means that whatever increments this runs with admin-level
privileges. Is that really such a good idea?

> === modified file 'lib/lp/buildmaster/manager.py'
> --- lib/lp/buildmaster/manager.py 2010-08-02 16:00:50 +0000
> +++ lib/lp/buildmaster/manager.py 2010-08-18 17:01:23 +0000
...
> @@ -157,6 +158,52 @@
> if job is not None:
> job.reset()
>
> + def _getBuilder(self):
> + # Helper to return the builder given the slave for this request.
> + # Avoiding circular imports.
> + from lp.buildmaster.interfaces.builder import IBuilderSet
> + return getUtility(IBuilderSet)[self.slave.name]
> +

I notice that there's very similar code below.

I'd suggest making a top-level function like this:

  def get_builder(slave_name):
      # Helper to return the builder given the slave for this request.
      # Avoiding circular imports.
      from lp.buildmaster.interfaces.builder import IBuilderSet
      return getUtility(IBuilderSet)[self.slave.name]

And using that here instead, as well as in SlaveScanner.

> + def assessFailureCounts(self, builder=None):
> + """View builder/job failure_count and work out which needs to die.
> +

There's spurious whitespace on this line.

> + :return: True if we disabled something, False if we did not.

This seems like an odd thing to return. It doesn't seem to be used in the
code. What's it for?

> + """
> + # Avoiding circular imports.

Not avoiding circular imports here any more :)

> + if builder is None:
> + builder = self._getBuilder()
> + build_job = builder.currentjob.specific_job.build
> +
> + if builder.failure_count == build_job.failure_count:
> + # This is either the first failure for this job on this
> + # builder, or by some chance the job was re-dispatched to
> + # the same builder. This make it impossible to determine

"makes"

> + # whether the job or the builder is at fault, so don't fail
> + # either. We reset the builder and job to try again.

It's unclear to me why the two counts being equal could imply that this is the
first failure for the job -- surely a count of 1 would signify that?

Perhaps the comment should say something like:
  # If the failure count ...

review: Needs Fixing
Revision history for this message
Julian Edwards (julian-edwards) wrote :
Download full text (14.2 KiB)

On Thursday 19 August 2010 13:37:40 you wrote:
> Review: Needs Fixing
> Hey Julian,
>
> This looks like a big step toward build farm reliability -- thank you.
>
> Some of my comments are asking for further explanation, a few make
> suggestions for moving some code around, and the rest are the normal
> stylistic gotchas.
>
> Out of curiosity, did you discuss this change in detail with anyone before
> landing?

It's been so long since I started it I honestly can't remember. I have vague
recollections of mentioning failure counts to various people.

> > === modified file 'lib/lp/buildmaster/configure.zcml'
> > --- lib/lp/buildmaster/configure.zcml 2010-06-15 14:34:50 +0000
> > +++ lib/lp/buildmaster/configure.zcml 2010-08-18 17:01:23 +0000
> > @@ -50,6 +50,9 @@
> >
> > <require
> >
> > permission="launchpad.Edit"
> > set_attributes="status"/>
> >
> > + <require
> > + permission="launchpad.Admin"
> > + set_attributes="failure_count"/>
> >
> > </class>
> > <securedutility
> >
> > component="lp.buildmaster.model.buildfarmjob.BuildFarmJob"
>
> I guess this means that whatever increments this runs with admin-level
> privileges. Is that really such a good idea?

For now yes, because it's only ever changed in "zopeless" scripts. That is,
Zope requres this to work, but doesn't actually apply any level of security to
it. :/

I made it admin to discourage changing it in appservers. There are other
script-only-changing declarations like this, but zope.Public, with warnings in
the zcml about exposing it externally! I prefer the extra protection up
front...

> > === modified file 'lib/lp/buildmaster/manager.py'
> > --- lib/lp/buildmaster/manager.py 2010-08-02 16:00:50 +0000
> > +++ lib/lp/buildmaster/manager.py 2010-08-18 17:01:23 +0000
>
> ...
>
> > @@ -157,6 +158,52 @@
> >
> > if job is not None:
> > job.reset()
> >
> > + def _getBuilder(self):
> > + # Helper to return the builder given the slave for this request.
> > + # Avoiding circular imports.
> > + from lp.buildmaster.interfaces.builder import IBuilderSet
> > + return getUtility(IBuilderSet)[self.slave.name]
> > +
>
> I notice that there's very similar code below.
>
> I'd suggest making a top-level function like this:
>
> def get_builder(slave_name):
> # Helper to return the builder given the slave for this request.
> # Avoiding circular imports.
> from lp.buildmaster.interfaces.builder import IBuilderSet
> return getUtility(IBuilderSet)[self.slave.name]
>
> And using that here instead, as well as in SlaveScanner.

Meh, I meant to do that and forgot, thanks.

>
> > + def assessFailureCounts(self, builder=None):
> > + """View builder/job failure_count and work out which needs to
> > die. +
>
> There's spurious whitespace on this line.

Grar, my vim setup has gone weird on me, it always used to tell me about
trailing spaces.

>
> > + :return: True if we disabled something, False if we did not.
>
> This seems like an odd thing to return. It doesn't seem to be used in the
> code. W...

1=== modified file 'lib/lp/buildmaster/interfaces/builder.py'
2--- lib/lp/buildmaster/interfaces/builder.py 2010-08-17 15:04:47 +0000
3+++ lib/lp/buildmaster/interfaces/builder.py 2010-08-19 14:43:04 +0000
4@@ -281,6 +281,9 @@
5 :return: A BuildQueue, or None.
6 """
7
8+ def getCurrentBuildFarmJob():
9+ """Return a `BuildFarmJob` for this builder."""
10+
11
12 class IBuilderSet(Interface):
13 """Collections of builders.
14
15=== modified file 'lib/lp/buildmaster/manager.py'
16--- lib/lp/buildmaster/manager.py 2010-08-18 16:58:18 +0000
17+++ lib/lp/buildmaster/manager.py 2010-08-19 15:51:41 +0000
18@@ -141,10 +141,55 @@
19 return d
20
21
22+def get_builder(name):
23+ """Helper to return the builder given the slave for this request."""
24+ # Avoiding circular imports.
25+ from lp.buildmaster.interfaces.builder import IBuilderSet
26+ return getUtility(IBuilderSet)[name]
27+
28+
29+def assessFailureCounts(builder, fail_notes):
30+ """View builder/job failure_count and work out which needs to die.
31+
32+ :return: True if we disabled something, False if we did not.
33+ """
34+ current_job = builder.currentjob
35+ build_job = current_job.specific_job.build
36+
37+ if builder.failure_count == build_job.failure_count:
38+ # If the failure count for the builder is the same as the
39+ # failure count for the job being built, then we cannot
40+ # tell whether the job or the builder is at fault. The best
41+ # we can do is try them both again, and hope that the job
42+ # runs against a different builder.
43+ current_job.reset()
44+ return False
45+
46+ if builder.failure_count > build_job.failure_count:
47+ # The builder has failed more than the jobs it's been
48+ # running, so let's disable it and re-schedule the build.
49+ builder.failBuilder(fail_notes)
50+ current_job.reset()
51+ return True
52+ else:
53+ # The job is the culprit! Override its status to 'failed'
54+ # to make sure it won't get automatically dispatched again,
55+ # and remove the buildqueue request. The failure should
56+ # have already caused any relevant slave data to be stored
57+ # on the build record so don't worry about that here.
58+ build_job.status = BuildStatus.FAILEDTOBUILD
59+ builder.currentjob.destroySelf()
60+
61+ # N.B. We could try and call _handleStatus_PACKAGEFAIL here
62+ # but that would cause us to query the slave for its status
63+ # again, and if the slave is non-responsive it holds up the
64+ # next buildd scan.
65+ return True
66+
67+
68 class BaseDispatchResult:
69 """Base class for *DispatchResult variations.
70
71-
72 It will be extended to represent dispatching results and allow
73 homogeneous processing.
74 """
75@@ -158,51 +203,13 @@
76 if job is not None:
77 job.reset()
78
79- def _getBuilder(self):
80- # Helper to return the builder given the slave for this request.
81- # Avoiding circular imports.
82- from lp.buildmaster.interfaces.builder import IBuilderSet
83- return getUtility(IBuilderSet)[self.slave.name]
84-
85- def assessFailureCounts(self, builder=None):
86+ def assessFailureCounts(self):
87 """View builder/job failure_count and work out which needs to die.
88-
89+
90 :return: True if we disabled something, False if we did not.
91 """
92- # Avoiding circular imports.
93- if builder is None:
94- builder = self._getBuilder()
95- build_job = builder.currentjob.specific_job.build
96-
97- if builder.failure_count == build_job.failure_count:
98- # This is either the first failure for this job on this
99- # builder, or by some chance the job was re-dispatched to
100- # the same builder. This make it impossible to determine
101- # whether the job or the builder is at fault, so don't fail
102- # either. We reset the builder and job to try again.
103- self._cleanJob(builder.currentjob)
104- return False
105-
106- if builder.failure_count > build_job.failure_count:
107- # The builder has failed more than the jobs it's been
108- # running, so let's disable it and re-schedule the build.
109- builder.failBuilder(self.info)
110- self._cleanJob(builder.currentjob)
111- return True
112- else:
113- # The job is the culprit! Override its status to 'failed'
114- # to make sure it won't get automatically dispatched again,
115- # and remove the buildqueue request. The failure should
116- # have already caused any relevant slave data to be stored
117- # on the build record so don't worry about that here.
118- build_job.status = BuildStatus.FAILEDTOBUILD
119- builder.currentjob.destroySelf()
120-
121- # N.B. We could try and call _handleStatus_PACKAGEFAIL here
122- # but that would cause us to query the slave for its status
123- # again, and if the slave is non-responsive it holds up the
124- # next buildd scan.
125- return True
126+ builder = get_builder(self.slave.name)
127+ return assessFailureCounts(builder, self.info)
128
129 def ___call__(self):
130 raise NotImplementedError(
131@@ -237,7 +244,7 @@
132
133 @write_transaction
134 def __call__(self):
135- builder = self._getBuilder()
136+ builder = get_builder(self.slave.name)
137 # Builders that fail to reset should be disabled as per bug
138 # 563353.
139 # XXX Julian bug=586362
140@@ -283,18 +290,16 @@
141 self.logger.info("Scanning failed with: %s\n%s" %
142 (error.getErrorMessage(), error.getTraceback()))
143
144- # Avoid circular import.
145- from lp.buildmaster.interfaces.builder import IBuilderSet
146- builder = getUtility(IBuilderSet)[self.builder_name]
147+ builder = get_builder(self.builder_name)
148
149 # Decide if we need to terminate the job or fail the
150 # builder.
151 self._incrementFailureCounts(builder)
152 self.logger.info(
153- "builder failure count: %s, job failure count: %s" % (
154+ "builder failure count: %s, job failure count: %s" % (
155 builder.failure_count,
156- builder.currentjob.specific_job.build.failure_count))
157- BaseDispatchResult(slave=None).assessFailureCounts(builder)
158+ builder.getCurrentBuildFarmJob().failure_count))
159+ assessFailureCounts(builder, error.getErrorMessage())
160 transaction.commit()
161
162 self.scheduleNextScanCycle()
163@@ -311,10 +316,7 @@
164 # We need to re-fetch the builder object on each cycle as the
165 # Storm store is invalidated over transaction boundaries.
166
167- # Avoid circular import.
168- from lp.buildmaster.interfaces.builder import IBuilderSet
169- builder_set = getUtility(IBuilderSet)
170- self.builder = builder_set[self.builder_name]
171+ self.builder = get_builder(self.builder_name)
172
173 if self.builder.builderok:
174 self.builder.updateStatus(self.logger)
175@@ -496,9 +498,8 @@
176 return self.reset_result(slave, error_text)
177
178 def _incrementFailureCounts(self, builder):
179- # Avoid circular import.
180 builder.failure_count += 1
181- builder.currentjob.specific_job.build.failure_count += 1
182+ builder.getCurrentBuildFarmJob().failure_count += 1
183
184 def checkDispatch(self, response, method, slave):
185 """Verify the results of a slave xmlrpc call.
186
187=== modified file 'lib/lp/buildmaster/model/builder.py'
188--- lib/lp/buildmaster/model/builder.py 2010-08-17 15:04:47 +0000
189+++ lib/lp/buildmaster/model/builder.py 2010-08-19 14:42:10 +0000
190@@ -645,6 +645,11 @@
191 Job._status == JobStatus.RUNNING,
192 Job.date_started != None).one()
193
194+ def getCurrentBuildFarmJob(self):
195+ """See `IBuilder`."""
196+ # Don't make this a property, it's masking a few queries.
197+ return self.currentjob.specific_job.build
198+
199
200 class BuilderSet(object):
201 """See IBuilderSet"""
202
203=== modified file 'lib/lp/buildmaster/tests/test_builder.py'
204--- lib/lp/buildmaster/tests/test_builder.py 2010-08-17 15:04:47 +0000
205+++ lib/lp/buildmaster/tests/test_builder.py 2010-08-19 15:07:15 +0000
206@@ -43,9 +43,18 @@
207
208 def test_default_values(self):
209 builder = self.factory.makeBuilder()
210+ # Make sure the Storm cache gets the values that the database
211+ # initialises.
212 flush_database_updates()
213 self.assertEqual(0, builder.failure_count)
214
215+ def test_getCurrentBuildFarmJob(self):
216+ bq = self.factory.makeSourcePackageRecipeBuildJob(3333)
217+ builder = self.factory.makeBuilder()
218+ bq.markAsBuilding(builder)
219+ self.assertEqual(
220+ bq, builder.getCurrentBuildFarmJob().buildqueue_record)
221+
222 def test_getBuildQueue(self):
223 buildqueueset = getUtility(IBuildQueueSet)
224 active_jobs = buildqueueset.getActiveBuildJobs()
225
226=== modified file 'lib/lp/buildmaster/tests/test_manager.py'
227--- lib/lp/buildmaster/tests/test_manager.py 2010-08-18 16:58:18 +0000
228+++ lib/lp/buildmaster/tests/test_manager.py 2010-08-19 16:06:10 +0000
229@@ -7,7 +7,6 @@
230 import signal
231 import time
232 import transaction
233-import unittest
234
235 from twisted.internet import defer, reactor, task
236 from twisted.internet.error import ConnectionClosed
237@@ -38,6 +37,7 @@
238 from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
239 from lp.testing.factory import LaunchpadObjectFactory
240 from lp.testing.fakemethod import FakeMethod
241+from lp.testing.sampledata import BOB_THE_BUILDER_NAME
242 from lp.testing import TestCase as LaunchpadTestCase
243
244
245@@ -225,7 +225,8 @@
246
247 def setUp(self):
248 TrialTestCase.setUp(self)
249- self.manager = TestingSlaveScanner("bob", BufferLogger())
250+ self.manager = TestingSlaveScanner(
251+ BOB_THE_BUILDER_NAME, BufferLogger())
252
253 # We will use an instrumented SlaveScanner instance for tests in
254 # this context.
255@@ -362,7 +363,8 @@
256
257 def _setUpSlaveAndBuilder(self):
258 # Helper function to set up a builder and its recording slave.
259- slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host')
260+ slave = RecordingSlave(
261+ BOB_THE_BUILDER_NAME, 'http://foo.buildd:8221/', 'foo.host')
262 bob_builder = getUtility(IBuilderSet)[slave.name]
263 return slave, bob_builder
264
265@@ -509,7 +511,8 @@
266 dl.addCallback(check_events)
267
268 # A functional slave charged with some interactions.
269- slave = RecordingSlave('bob', 'http://bob.buildd:8221/', 'bob.host')
270+ slave = RecordingSlave(
271+ BOB_THE_BUILDER_NAME, 'http://bob.buildd:8221/', 'bob.host')
272 slave.ensurepresent('arg1', 'arg2', 'arg3')
273 slave.build('arg1', 'arg2', 'arg3')
274
275@@ -541,7 +544,8 @@
276 # Create a broken slave and insert interaction that will
277 # cause the builder to be marked as fail.
278 self.test_proxy = TestingXMLRPCProxy('very broken slave')
279- slave = RecordingSlave('bob', 'http://bob.buildd:8221/', 'bob.host')
280+ slave = RecordingSlave(
281+ BOB_THE_BUILDER_NAME, 'http://bob.buildd:8221/', 'bob.host')
282 slave.ensurepresent('arg1', 'arg2', 'arg3')
283 slave.build('arg1', 'arg2', 'arg3')
284
285@@ -623,7 +627,7 @@
286
287 Replace its default logging handler by a testing version.
288 """
289- manager = SlaveScanner("bob", BufferLogger())
290+ manager = SlaveScanner(BOB_THE_BUILDER_NAME, BufferLogger())
291 manager.logger.name = 'slave-scanner'
292
293 return manager
294@@ -672,7 +676,7 @@
295 # A job gets dispatched to the sampledata builder after it's reset.
296
297 # Reset sampledata builder.
298- builder = getUtility(IBuilderSet)['bob']
299+ builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
300 self._resetBuilder(builder)
301 # Set this to 1 here so that _checkDispatch can make sure it's
302 # reset to 0 after a successful dispatch.
303@@ -705,7 +709,7 @@
304 # and the builder used should remain active and IDLE.
305
306 # Reset sampledata builder.
307- builder = getUtility(IBuilderSet)['bob']
308+ builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
309 self._resetBuilder(builder)
310
311 # Remove hoary/i386 chroot.
312@@ -746,7 +750,7 @@
313 # The job assigned to a broken builder is rescued.
314
315 # Sampledata builder is enabled and is assigned to an active job.
316- builder = getUtility(IBuilderSet)['bob']
317+ builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
318 self.assertTrue(builder.builderok)
319 job = builder.currentjob
320 self.assertBuildingJob(job, builder)
321@@ -783,7 +787,7 @@
322
323 # Enable sampledata builder attached to an appropriate testing
324 # slave. It will respond as if it was building the sampledata job.
325- builder = getUtility(IBuilderSet)['bob']
326+ builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
327
328 login('foo.bar@canonical.com')
329 builder.builderok = True
330@@ -804,12 +808,13 @@
331 def test_scan_assesses_failure_exceptions(self):
332 # If scan() fails with an exception, failure_counts should be
333 # incremented and tested.
334- def fake_scan():
335+ def failing_scan():
336 raise Exception("fake exception")
337 manager = self._getManager()
338- manager.scan = fake_scan
339+ manager.scan = failing_scan
340 manager.scheduleNextScanCycle = FakeMethod()
341- self.patch(BaseDispatchResult, 'assessFailureCounts', FakeMethod())
342+ from lp.buildmaster import manager as manager_module
343+ self.patch(manager_module, 'assessFailureCounts', FakeMethod())
344 builder = getUtility(IBuilderSet)[manager.builder_name]
345
346 # Failure counts start at zero.
347@@ -828,10 +833,10 @@
348 1, builder.currentjob.specific_job.build.failure_count)
349
350 self.assertEqual(
351- 1, BaseDispatchResult.assessFailureCounts.call_count)
352-
353-
354-class TestDispatchResult(unittest.TestCase):
355+ 1, manager_module.assessFailureCounts.call_count)
356+
357+
358+class TestDispatchResult(LaunchpadTestCase):
359 """Tests `BaseDispatchResult` variations.
360
361 Variations of `BaseDispatchResult` when evaluated update the database
362@@ -874,12 +879,12 @@
363 def assertBuilderIsClean(self, builder):
364 # Check that the builder is ready for a new build.
365 self.assertTrue(builder.builderok)
366- self.assertTrue(builder.failnotes is None)
367- self.assertTrue(builder.currentjob is None)
368+ self.assertIs(None, builder.failnotes)
369+ self.assertIs(None, builder.currentjob)
370
371 def testResetDispatchResult(self):
372 # Test that `ResetDispatchResult` resets the builder and job.
373- builder, job_id = self._getBuilder('bob')
374+ builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME)
375 buildqueue_id = builder.currentjob.id
376 builder.builderok = True
377 builder.failure_count = 1
378@@ -899,8 +904,11 @@
379 self.assertBuilderIsClean(builder)
380
381 def testFailDispatchResult(self):
382- # Test that `FailDispatchResult` calls assessFailureCounts().
383- builder, job_id = self._getBuilder('bob')
384+ # Test that `FailDispatchResult` calls assessFailureCounts() so
385+ # that we know the builders and jobs are failed as necessary
386+ # when a FailDispatchResult is called at the end of the dispatch
387+ # chain.
388+ builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME)
389
390 # Setup a interaction to satisfy 'write_transaction' decorator.
391 login(ANONYMOUS)
392@@ -914,7 +922,7 @@
393 def _setup_failing_dispatch_result(self):
394 # assessFailureCounts should fail jobs or builders depending on
395 # whether it sees the failure_counts on each increasing.
396- builder, job_id = self._getBuilder('bob')
397+ builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME)
398 slave = RecordingSlave(builder.name, builder.url, builder.vm_host)
399 result = FailDispatchResult(slave, 'does not work!')
400 return builder, result
401
402=== modified file 'lib/lp/soyuz/configure.zcml'
403--- lib/lp/soyuz/configure.zcml 2010-08-17 15:04:47 +0000
404+++ lib/lp/soyuz/configure.zcml 2010-08-19 15:32:16 +0000
405@@ -513,6 +513,9 @@
406 status dependencies upload_log"/>
407
408 <!-- XXX bigjools 2010-07-27 bug=570939
409+ Work around the fact that not all BuildFarmJobs are concrete
410+ objects.
411+
412 This should not be required once the old BuildFarmJob stuff is
413 removed when the Translation Template jobs and Recipe jobs
414 use the new infrastructure -->
415
416=== modified file 'lib/lp/testing/sampledata.py'
417--- lib/lp/testing/sampledata.py 2010-08-05 02:29:25 +0000
418+++ lib/lp/testing/sampledata.py 2010-08-19 15:15:27 +0000
419@@ -9,9 +9,11 @@
420
421 __metaclass__ = type
422 __all__ = [
423+ 'BOB_THE_BUILDER_NAME',
424 'BUILDD_ADMIN_USERNAME',
425 'CHROOT_LIBRARYFILEALIAS',
426 'COMMERCIAL_ADMIN_EMAIL',
427+ 'FROG_THE_BUILDER_NAME',
428 'HOARY_DISTROSERIES_NAME',
429 'I386_ARCHITECTURE_NAME',
430 'LAUNCHPAD_DBUSER_NAME',
431@@ -34,6 +36,9 @@
432
433 # A user with buildd admin rights and upload rights to Ubuntu.
434 BUILDD_ADMIN_USERNAME = 'cprov'
435+# A couple of builders.
436+BOB_THE_BUILDER_NAME = 'bob'
437+FROG_THE_BUILDER_NAME = 'frog'
438 # The LibraryFileAlias of a chroot for attaching to a DistroArchSeries
439 CHROOT_LIBRARYFILEALIAS = 1
440 HOARY_DISTROSERIES_NAME = 'hoary'
Revision history for this message
Stuart Bishop (stub) wrote :

Allocated new DB patch number patch-2208-04-0.sql

Revision history for this message
Jonathan Lange (jml) wrote :
Download full text (10.0 KiB)

On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
<email address hidden> wrote:
> On Thursday 19 August 2010 13:37:40 you wrote:
>> Review: Needs Fixing
>> Hey Julian,
>>
>> This looks like a big step toward build farm reliability -- thank you.
>>

Thanks for the changes. There are a few questions below, but I reckon
this'll be the last round.

>> Some of my comments are asking for further explanation, a few make
>> suggestions for moving some code around, and the rest are the normal
>> stylistic gotchas.
>>
>> Out of curiosity, did you discuss this change in detail with anyone before
>> landing?
>
> It's been so long since I started it I honestly can't remember.  I have vague
> recollections of mentioning failure counts to various people.
>

No worries. I ask because we have failure counting mechanisms in other
places in Launchpad (e.g. the branch puller) and it might have been
interesting to share patterns, variable names and the like.

>> > === modified file 'lib/lp/buildmaster/configure.zcml'
>> > --- lib/lp/buildmaster/configure.zcml       2010-06-15 14:34:50 +0000
>> > +++ lib/lp/buildmaster/configure.zcml       2010-08-18 17:01:23 +0000
>> > @@ -50,6 +50,9 @@
>> >
>> >          <require
>> >
>> >              permission="launchpad.Edit"
>> >              set_attributes="status"/>
>> >
>> > +        <require
>> > +            permission="launchpad.Admin"
>> > +            set_attributes="failure_count"/>
>> >
>> >      </class>
>> >      <securedutility
>> >
>> >          component="lp.buildmaster.model.buildfarmjob.BuildFarmJob"
>>
>> I guess this means that whatever increments this runs with admin-level
>> privileges.  Is that really such a good idea?
>
> For now yes, because it's only ever changed in "zopeless" scripts.  That is,
> Zope requres this to work, but doesn't actually apply any level of security to
> it.  :/
>
> I made it admin to discourage changing it in appservers.  There are other
> script-only-changing declarations like this, but zope.Public, with warnings in
> the zcml about exposing it externally!  I prefer the extra protection up
> front...
>

Makes sense. I mean, a better permission would be some kind of
"Forbidden" permission, I guess.

>> > === modified file 'lib/lp/buildmaster/manager.py'
>> > --- lib/lp/buildmaster/manager.py   2010-08-02 16:00:50 +0000
>> > +++ lib/lp/buildmaster/manager.py   2010-08-18 17:01:23 +0000
...
>> > +        :return: True if we disabled something, False if we did not.
>>
>> This seems like an odd thing to return. It doesn't seem to be used in the
>> code. What's it for?
>
> For the tests.
>

Yeah, but what about the tests? Or rather, why do the tests care about
something that the code doesn't?

>> > +            # Decide if we need to terminate the job or fail the
>> > +            # builder.
>> > +            self._incrementFailureCounts(builder)
>> > +            self.logger.info(
>> > +                "builder failure count: %s, job failure count: %s" % (
>> > +                    builder.failure_count,
>> > +
>> > builder.currentjob.specific_job.build.failure_count)) +
>> > BaseDispatchResult(slave=None).assessFailureCounts(builder)
>>
>> This line makes me wonder why ass...

Revision history for this message
Julian Edwards (julian-edwards) wrote :
Download full text (11.6 KiB)

On Friday 20 August 2010 11:43:40 you wrote:
> On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
>
> <email address hidden> wrote:
> > On Thursday 19 August 2010 13:37:40 you wrote:
> >> Review: Needs Fixing
> >> Hey Julian,
> >>
> >> This looks like a big step toward build farm reliability -- thank you.
>
> Thanks for the changes. There are a few questions below, but I reckon
> this'll be the last round.

I'm not so sure :)

>
> >> Some of my comments are asking for further explanation, a few make
> >> suggestions for moving some code around, and the rest are the normal
> >> stylistic gotchas.
> >>
> >> Out of curiosity, did you discuss this change in detail with anyone
> >> before landing?
> >
> > It's been so long since I started it I honestly can't remember. I have
> > vague recollections of mentioning failure counts to various people.
>
> No worries. I ask because we have failure counting mechanisms in other
> places in Launchpad (e.g. the branch puller) and it might have been
> interesting to share patterns, variable names and the like.

I had no idea about that, that's a shame :(

> >> > === modified file 'lib/lp/buildmaster/manager.py'
> >> > --- lib/lp/buildmaster/manager.py 2010-08-02 16:00:50 +0000
> >> > +++ lib/lp/buildmaster/manager.py 2010-08-18 17:01:23 +0000
>
> ...
>
> >> > + :return: True if we disabled something, False if we did not.
> >>
> >> This seems like an odd thing to return. It doesn't seem to be used in
> >> the code. What's it for?
> >
> > For the tests.
>
> Yeah, but what about the tests? Or rather, why do the tests care about
> something that the code doesn't?

Well lots of our tests return stuff that the code doesn't need, like Deferreds
for example. But fine, I've got rid of it.

> >> > + # Decide if we need to terminate the job or fail the
> >> > + # builder.
> >> > + self._incrementFailureCounts(builder)
> >> > + self.logger.info(
> >> > + "builder failure count: %s, job failure count: %s" %
> >> > ( + builder.failure_count,
> >> > +
> >> > builder.currentjob.specific_job.build.failure_count)) +
> >> > BaseDispatchResult(slave=None).assessFailureCounts(builder)
> >>
> >> This line makes me wonder why assessFailureCounts is on
> >> BaseDispatchResult at all. Perhaps it should be a method on Builder
> >> that takes an optional 'info' parameter?
> >>
> >> The above code would then read:
> >>
> >> builder.gotFailure()
> >> self.logger.info(...)
> >> builder.assessFailureCounts()
> >
> > I don't think we should continue to bloat model classes with code that is
> > only used in one place. This is a buildd-manager-specific function.
> >
> > Anyway, bugger, I had intended to factor it out to a standalone function
> > as I also realised that that line is ridiculous and again forgot :/
>
> A standalone function would be a definite improvement, and fine by me.
> I personally don't think it's bloat to say "a builder knows how to
> handle builder failures", but I'm not going to push it.

By the same argument, anything that deals with object X should be a method on
that object X.

The difference he...

1=== modified file 'lib/lp/buildmaster/manager.py'
2--- lib/lp/buildmaster/manager.py 2010-08-20 08:53:14 +0000
3+++ lib/lp/buildmaster/manager.py 2010-08-20 11:50:42 +0000
4@@ -149,10 +149,7 @@
5
6
7 def assessFailureCounts(builder, fail_notes):
8- """View builder/job failure_count and work out which needs to die.
9-
10- :return: True if we disabled something, False if we did not.
11- """
12+ """View builder/job failure_count and work out which needs to die. """
13 current_job = builder.currentjob
14 build_job = current_job.specific_job.build
15
16@@ -163,14 +160,13 @@
17 # we can do is try them both again, and hope that the job
18 # runs against a different builder.
19 current_job.reset()
20- return False
21+ return
22
23 if builder.failure_count > build_job.failure_count:
24 # The builder has failed more than the jobs it's been
25 # running, so let's disable it and re-schedule the build.
26 builder.failBuilder(fail_notes)
27 current_job.reset()
28- return True
29 else:
30 # The job is the culprit! Override its status to 'failed'
31 # to make sure it won't get automatically dispatched again,
32@@ -184,7 +180,6 @@
33 # but that would cause us to query the slave for its status
34 # again, and if the slave is non-responsive it holds up the
35 # next buildd scan.
36- return True
37
38
39 class BaseDispatchResult:
40@@ -209,7 +204,7 @@
41 :return: True if we disabled something, False if we did not.
42 """
43 builder = get_builder(self.slave.name)
44- return assessFailureCounts(builder, self.info)
45+ assessFailureCounts(builder, self.info)
46
47 def ___call__(self):
48 raise NotImplementedError(
49
50=== modified file 'lib/lp/buildmaster/tests/test_manager.py'
51--- lib/lp/buildmaster/tests/test_manager.py 2010-08-20 08:53:14 +0000
52+++ lib/lp/buildmaster/tests/test_manager.py 2010-08-20 12:07:15 +0000
53@@ -228,6 +228,9 @@
54 self.manager = TestingSlaveScanner(
55 BOB_THE_BUILDER_NAME, BufferLogger())
56
57+ self.fake_builder_url = 'http://bob.buildd:8221/'
58+ self.fake_builder_host = 'bob.host'
59+
60 # We will use an instrumented SlaveScanner instance for tests in
61 # this context.
62
63@@ -361,11 +364,19 @@
64 self.assertFalse(result.processed)
65 return d.addCallback(check_result)
66
67- def _setUpSlaveAndBuilder(self):
68+ def _setUpSlaveAndBuilder(self, builder_failure_count=None,
69+ job_failure_count=None):
70 # Helper function to set up a builder and its recording slave.
71+ if builder_failure_count is None:
72+ builder_failure_count = 0
73+ if job_failure_count is None:
74+ job_failure_count = 0
75 slave = RecordingSlave(
76- BOB_THE_BUILDER_NAME, 'http://foo.buildd:8221/', 'foo.host')
77+ BOB_THE_BUILDER_NAME, self.fake_builder_url,
78+ self.fake_builder_host)
79 bob_builder = getUtility(IBuilderSet)[slave.name]
80+ bob_builder.failure_count = builder_failure_count
81+ bob_builder.getCurrentBuildFarmJob().failure_count = job_failure_count
82 return slave, bob_builder
83
84 def test_checkDispatch_success(self):
85@@ -395,9 +406,8 @@
86
87 On success dispatching it returns None.
88 """
89- slave, bob_builder = self._setUpSlaveAndBuilder()
90- bob_builder.failure_count = 0
91- bob_builder.currentjob.specific_job.build.failure_count = 0
92+ slave, bob_builder = self._setUpSlaveAndBuilder(
93+ builder_failure_count=0, job_failure_count=0)
94
95 # Successful legitimate response, None is returned.
96 successful_response = [
97@@ -410,9 +420,8 @@
98 def test_checkDispatch_first_fail(self):
99 # Failed legitimate response, results in FailDispatchResult and
100 # failure_count on the job and the builder are both incremented.
101- slave, bob_builder = self._setUpSlaveAndBuilder()
102- bob_builder.failure_count = 0
103- bob_builder.currentjob.specific_job.build.failure_count = 0
104+ slave, bob_builder = self._setUpSlaveAndBuilder(
105+ builder_failure_count=0, job_failure_count=0)
106
107 failed_response = [False, 'uncool builder']
108 result = self.manager.checkDispatch(
109@@ -420,50 +429,48 @@
110 self.assertIsDispatchFail(result)
111 self.assertEqual(
112 repr(result),
113- '<bob:http://foo.buildd:8221/> failure (uncool builder)')
114+ '<bob:%s> failure (uncool builder)' % self.fake_builder_url)
115 self.assertEqual(1, bob_builder.failure_count)
116 self.assertEqual(
117- 1, bob_builder.currentjob.specific_job.build.failure_count)
118+ 1, bob_builder.getCurrentBuildFarmJob().failure_count)
119
120 def test_checkDispatch_second_reset_fail_by_builder(self):
121 # Twisted Failure response, results in a `FailDispatchResult`.
122- slave, bob_builder = self._setUpSlaveAndBuilder()
123- bob_builder.failure_count = 1
124- bob_builder.currentjob.specific_job.build.failure_count = 0
125+ slave, bob_builder = self._setUpSlaveAndBuilder(
126+ builder_failure_count=1, job_failure_count=0)
127
128 twisted_failure = Failure(ConnectionClosed('Boom!'))
129 result = self.manager.checkDispatch(
130 twisted_failure, 'ensurepresent', slave)
131 self.assertIsDispatchFail(result)
132 self.assertEqual(
133- '<bob:http://foo.buildd:8221/> failure (None)', repr(result))
134+ '<bob:%s> failure (None)' % self.fake_builder_url, repr(result))
135 self.assertEqual(2, bob_builder.failure_count)
136 self.assertEqual(
137- 1, bob_builder.currentjob.specific_job.build.failure_count)
138+ 1, bob_builder.getCurrentBuildFarmJob().failure_count)
139
140 def test_checkDispatch_second_comms_fail_by_builder(self):
141 # Unexpected response, results in a `FailDispatchResult`.
142- slave, bob_builder = self._setUpSlaveAndBuilder()
143- bob_builder.failure_count = 1
144- bob_builder.currentjob.specific_job.build.failure_count = 0
145+ slave, bob_builder = self._setUpSlaveAndBuilder(
146+ builder_failure_count=1, job_failure_count=0)
147
148 unexpected_response = [1, 2, 3]
149 result = self.manager.checkDispatch(
150 unexpected_response, 'build', slave)
151 self.assertIsDispatchFail(result)
152 self.assertEqual(
153- '<bob:http://foo.buildd:8221/> failure '
154- '(Unexpected response: [1, 2, 3])', repr(result))
155+ '<bob:%s> failure '
156+ '(Unexpected response: [1, 2, 3])' % self.fake_builder_url,
157+ repr(result))
158 self.assertEqual(2, bob_builder.failure_count)
159 self.assertEqual(
160- 1, bob_builder.currentjob.specific_job.build.failure_count)
161+ 1, bob_builder.getCurrentBuildFarmJob().failure_count)
162
163 def test_checkDispatch_second_comms_fail_by_job(self):
164 # Unknown method was given, results in a `FailDispatchResult`.
165 # This could be caused by a faulty job which would fail the job.
166- slave, bob_builder = self._setUpSlaveAndBuilder()
167- bob_builder.failure_count = 0
168- bob_builder.currentjob.specific_job.build.failure_count = 1
169+ slave, bob_builder = self._setUpSlaveAndBuilder(
170+ builder_failure_count=0, job_failure_count=1)
171
172 successful_response = [
173 buildd_success_result_map.get('ensurepresent'), 'cool builder']
174@@ -471,11 +478,12 @@
175 successful_response, 'unknown-method', slave)
176 self.assertIsDispatchFail(result)
177 self.assertEqual(
178- '<bob:http://foo.buildd:8221/> failure '
179- '(Unknown slave method: unknown-method)', repr(result))
180+ '<bob:%s> failure '
181+ '(Unknown slave method: unknown-method)' % self.fake_builder_url,
182+ repr(result))
183 self.assertEqual(1, bob_builder.failure_count)
184 self.assertEqual(
185- 2, bob_builder.currentjob.specific_job.build.failure_count)
186+ 2, bob_builder.getCurrentBuildFarmJob().failure_count)
187
188 def test_initiateDispatch(self):
189 """Check `dispatchBuild` in various scenarios.
190@@ -498,7 +506,8 @@
191 def check_events(results):
192 [error] = [r for s, r in results if r is not None]
193 self.assertEqual(
194- '<bob:http://bob.buildd:8221/> failure (very broken slave)',
195+ '<bob:%s> failure (very broken slave)'
196+ % self.fake_builder_url,
197 repr(error))
198 self.assertTrue(error.processed)
199
200@@ -512,7 +521,8 @@
201
202 # A functional slave charged with some interactions.
203 slave = RecordingSlave(
204- BOB_THE_BUILDER_NAME, 'http://bob.buildd:8221/', 'bob.host')
205+ BOB_THE_BUILDER_NAME, self.fake_builder_url,
206+ self.fake_builder_host)
207 slave.ensurepresent('arg1', 'arg2', 'arg3')
208 slave.build('arg1', 'arg2', 'arg3')
209
210@@ -545,7 +555,8 @@
211 # cause the builder to be marked as fail.
212 self.test_proxy = TestingXMLRPCProxy('very broken slave')
213 slave = RecordingSlave(
214- BOB_THE_BUILDER_NAME, 'http://bob.buildd:8221/', 'bob.host')
215+ BOB_THE_BUILDER_NAME, self.fake_builder_url,
216+ self.fake_builder_host)
217 slave.ensurepresent('arg1', 'arg2', 'arg3')
218 slave.build('arg1', 'arg2', 'arg3')
219
220@@ -935,9 +946,8 @@
221 build = buildqueue.specific_job.build
222 builder.failure_count = 2
223 build.failure_count = 2
224- disabled = result.assessFailureCounts()
225+ result.assessFailureCounts()
226
227- self.assertFalse(disabled)
228 self.assertBuilderIsClean(builder)
229 self.assertEqual('NEEDSBUILD', build.status.name)
230 self.assertBuildqueueIsClean(buildqueue)
231@@ -949,9 +959,8 @@
232 build = buildqueue.specific_job.build
233 build.failure_count = 2
234 builder.failure_count = 1
235- disabled = result.assessFailureCounts()
236+ result.assessFailureCounts()
237
238- self.assertTrue(disabled)
239 self.assertBuilderIsClean(builder)
240 self.assertEqual('FAILEDTOBUILD', build.status.name)
241 # The buildqueue should have been removed entirely.
242@@ -966,9 +975,8 @@
243 build = buildqueue.specific_job.build
244 build.failure_count = 2
245 builder.failure_count = 3
246- disabled = result.assessFailureCounts()
247+ result.assessFailureCounts()
248
249- self.assertTrue(disabled)
250 self.assertFalse(builder.builderok)
251 self.assertEqual('does not work!', builder.failnotes)
252 self.assertTrue(builder.currentjob is None)
Revision history for this message
Jonathan Lange (jml) wrote :
Download full text (9.6 KiB)

On Fri, Aug 20, 2010 at 1:16 PM, Julian Edwards
<email address hidden> wrote:
> On Friday 20 August 2010 11:43:40 you wrote:
>> On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
>> <email address hidden> wrote:
>> > On Thursday 19 August 2010 13:37:40 you wrote:

...
>> >> > === modified file 'lib/lp/buildmaster/manager.py'
>> >> > --- lib/lp/buildmaster/manager.py   2010-08-02 16:00:50 +0000
>> >> > +++ lib/lp/buildmaster/manager.py   2010-08-18 17:01:23 +0000
>>
>> ...
>>
>> >> > +        :return: True if we disabled something, False if we did not.
>> >>
>> >> This seems like an odd thing to return. It doesn't seem to be used in
>> >> the code. What's it for?
>> >
>> > For the tests.
>>
>> Yeah, but what about the tests? Or rather, why do the tests care about
>> something that the code doesn't?
>
> Well lots of our tests return stuff that the code doesn't need, like Deferreds
> for example.  But fine, I've got rid of it.
>

Thanks.

>> >> > +            # Decide if we need to terminate the job or fail the
>> >> > +            # builder.
>> >> > +            self._incrementFailureCounts(builder)
>> >> > +            self.logger.info(
>> >> > +                "builder failure count: %s, job failure count: %s" %
>> >> > ( +                    builder.failure_count,
>> >> > +
>> >> > builder.currentjob.specific_job.build.failure_count)) +
>> >> > BaseDispatchResult(slave=None).assessFailureCounts(builder)
>> >>
>> >> This line makes me wonder why assessFailureCounts is on
>> >> BaseDispatchResult at all.  Perhaps it should be a method on Builder
>> >> that takes an optional 'info' parameter?
>> >>
>> >> The above code would then read:
>> >>
>> >>   builder.gotFailure()
>> >>   self.logger.info(...)
>> >>   builder.assessFailureCounts()
>> >
>> > I don't think we should continue to bloat model classes with code that is
>> > only used in one place.  This is a buildd-manager-specific function.
>> >
>> > Anyway, bugger, I had intended to factor it out to a standalone function
>> > as I also realised that that line is ridiculous and again forgot :/
>>
>> A standalone function would be a definite improvement, and fine by me.
>> I personally don't think it's bloat to say "a builder knows how to
>> handle builder failures", but I'm not going to push it.
>
> By the same argument, anything that deals with object X should be a method on
> that object X.
>

I would argue that most things that manipulate state on object X
should be methods of object X. I would not say it's a universal rule.

> The difference here is that it's not dealing with builder failures
> exclusively, it's dealing with a combination of builder and job failures, and
> I would personally find it odd that a builder was trying to modify job
> properties.
>
> Generally, I think the boundaries of responsibility in our code are poor and
> should be less like this.
>
> Did I convince you yet?  Probably not :)
>

I think you have, in this case, particularly given the concessions
below to have methods on Job (however it's spelled) and Builder that
do the failure incrementing. That seems like a clean division of
responsibility.

>>
>> >> > +        builder.failure_count += 1
>> >> ...

Read more...

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Monday 23 August 2010 18:34:43 Jonathan Lange wrote:
> I would argue that most things that manipulate state on object X
> should be methods of object X. I would not say it's a universal rule.

If we boil it down to the basic manipulation required, I completely agree.

> I think you have, in this case, particularly given the concessions
> below to have methods on Job (however it's spelled) and Builder that
> do the failure incrementing. That seems like a clean division of
> responsibility.

Woot! :)

> > I'd prefer to have a gotFailure() on each of Builder and BuildFarmJob and
> > have the build manager call each in turn.
> >
> > Let me know if you think that is acceptable and I'll do that change (it's
> > not in the attached diff).
>
> I think that's a great idea.

Ok it's done.

> Re robustness, I think you lose one kind of robustness and gain another.

I guess there are different ways of looking at it...

> "Was this method called?" tests are robust against behaviour change.
> If you change what assessFailureCounts is supposed to do, then you
> won't have to change these tests, only the assessFailureCounts ones.
> (Some people call this "Mockist")
>
> Tests that check state are robust against implementation change. If
> you change the details of how you've implemented the code (as you've
> done during this review process), you won't have to change any tests.
> (Some people call this "Classic")
>
> As a rule, I lean heavily toward the latter, since I change
> implementation details much more frequently than behaviour. I use
> mocks in situations where there's a very clearly defined, fairly
> stable interface and when using some other kind of double would be
> prohibitively difficult.
>
> See http://martinfowler.com/articles/mocksArentStubs.html for a good
> discussion on the whole topic.
>
> As far as this diff goes, I think using a mock is less preferable but
> still sound, and I'm happy to see it merged.

I think I'm in total agreement with the sentiment above. In my case here I am
trying to abstract away from the "how failures are counted" as much as
possible since I forsee some changes in how we make it work. I suspect that
this branch will not be the final word on the matter :)

> It seems to me that Builder.currentjob ought to be replaced with a
> method asap, since it's doing a query, and naive readers might well
> miss that fact.

Agreed, I've filed a bug though as after grepping around, it will be a big
diff and I don't want to pollute this branch.

> If it's doing dupe queries and you aren't going to change currentjob
> to a method in this branch, then please add a comment saying something
> like:
>
> # builder.currentjob runs a query. Don't run it twice.

Done!

See partial diff.

Thanks for sticking with me :)

1=== modified file 'lib/lp/buildmaster/interfaces/builder.py'
2--- lib/lp/buildmaster/interfaces/builder.py 2010-08-20 13:47:47 +0000
3+++ lib/lp/buildmaster/interfaces/builder.py 2010-08-24 09:55:13 +0000
4@@ -154,6 +154,12 @@
5 title=u"The current behavior of the builder for the current job.",
6 required=False)
7
8+ def gotFailure():
9+ """Increment failure_count on the builder."""
10+
11+ def resetFailureCount():
12+ """Set the failure_count back to zero."""
13+
14 def checkSlaveAlive():
15 """Check that the buildd slave is alive.
16
17
18=== modified file 'lib/lp/buildmaster/interfaces/buildfarmjob.py'
19--- lib/lp/buildmaster/interfaces/buildfarmjob.py 2010-07-23 20:27:27 +0000
20+++ lib/lp/buildmaster/interfaces/buildfarmjob.py 2010-08-24 10:01:21 +0000
21@@ -270,6 +270,9 @@
22 returned.
23 """
24
25+ def gotFailure():
26+ """Increment the failure_count for this job."""
27+
28 title = exported(TextLine(title=_("Title"), required=False))
29
30 was_built = Attribute("Whether or not modified by the builddfarm.")
31
32=== modified file 'lib/lp/buildmaster/manager.py'
33--- lib/lp/buildmaster/manager.py 2010-08-20 13:42:19 +0000
34+++ lib/lp/buildmaster/manager.py 2010-08-24 10:27:37 +0000
35@@ -150,6 +150,8 @@
36
37 def assessFailureCounts(builder, fail_notes):
38 """View builder/job failure_count and work out which needs to die. """
39+ # builder.currentjob hides a complicated query, don't run it twice.
40+ # See bug 623281.
41 current_job = builder.currentjob
42 build_job = current_job.specific_job.build
43
44@@ -361,7 +363,7 @@
45 if self.builder.currentjob is not None:
46 # After a successful dispatch we can reset the
47 # failure_count.
48- self.builder.failure_count = 0
49+ self.builder.resetFailureCount()
50 transaction.commit()
51 return slave
52
53@@ -493,8 +495,8 @@
54 return self.reset_result(slave, error_text)
55
56 def _incrementFailureCounts(self, builder):
57- builder.failure_count += 1
58- builder.getCurrentBuildFarmJob().failure_count += 1
59+ builder.gotFailure()
60+ builder.getCurrentBuildFarmJob().gotFailure()
61
62 def checkDispatch(self, response, method, slave):
63 """Verify the results of a slave xmlrpc call.
64
65=== modified file 'lib/lp/buildmaster/model/builder.py'
66--- lib/lp/buildmaster/model/builder.py 2010-08-20 13:47:47 +0000
67+++ lib/lp/buildmaster/model/builder.py 2010-08-24 10:15:31 +0000
68@@ -294,6 +294,14 @@
69 current_build_behavior = property(
70 _getCurrentBuildBehavior, _setCurrentBuildBehavior)
71
72+ def gotFailure(self):
73+ """See `IBuilder`."""
74+ self.failure_count += 1
75+
76+ def resetFailureCount(self):
77+ """See `IBuilder`."""
78+ self.failure_count = 0
79+
80 def checkSlaveAlive(self):
81 """See IBuilder."""
82 if self.slave.echo("Test")[0] != "Test":
83@@ -311,6 +319,8 @@
84 """See IBuilder."""
85 return self.slave.clean()
86
87+ # XXX 2010-08-24 Julian bug=623281
88+ # This should not be a property! It's masking a complicated query.
89 @property
90 def currentjob(self):
91 """See IBuilder"""
92
93=== modified file 'lib/lp/buildmaster/model/buildfarmjob.py'
94--- lib/lp/buildmaster/model/buildfarmjob.py 2010-08-17 15:04:47 +0000
95+++ lib/lp/buildmaster/model/buildfarmjob.py 2010-08-24 10:00:34 +0000
96@@ -344,6 +344,10 @@
97
98 return build_without_outer_proxy
99
100+ def gotFailure(self):
101+ """See `IBuildFarmJob`."""
102+ self.failure_count += 1
103+
104
105 class BuildFarmJobDerived:
106 implements(IBuildFarmJob)
Revision history for this message
Jonathan Lange (jml) wrote :

On Tue, Aug 24, 2010 at 11:34 AM, Julian Edwards
<email address hidden> wrote:
> On Monday 23 August 2010 18:34:43 Jonathan Lange wrote:
>> I would argue that most things that manipulate state on object X
>> should be methods of object X. I would not say it's a universal rule.
>
> If we boil it down to the basic manipulation required, I completely agree.
>
>> I think you have, in this case, particularly given the concessions
>> below to have methods on Job (however it's spelled) and Builder that
>> do the failure incrementing. That seems like a clean division of
>> responsibility.
>
> Woot! :)
...
> See partial diff.
>
>

Rockin. I'd spell resetFailureCount() as gotSuccessfulDispatch(), but
ok to land as is.

jml

Revision history for this message
Jonathan Lange (jml) :
review: Approve

Preview Diff

Empty

Subscribers

People subscribed via source and target branches

to status/vote changes: