Merge lp:~julian-edwards/launchpad/buildd-failure-counting into lp:launchpad/db-devel
Status: Merged
Merged at revision: 9702
Proposed branch: lp:~julian-edwards/launchpad/buildd-failure-counting
Merge into: lp:launchpad/db-devel
Prerequisite: lp:~julian-edwards/launchpad/buildd-manager-parallel-scan
Diff against target: 0 lines
To merge this branch: bzr merge lp:~julian-edwards/launchpad/buildd-failure-counting
Related bugs: (none listed)

Reviewer | Review Type | Date Requested | Status
---|---|---|---
Jonathan Lange (community) | | | Approve
Stuart Bishop (community) | db | | Approve
Robert Collins | db | | Pending

Review via email: mp+31556@code.launchpad.net
Commit message
Description of the change
Add failure counting on builders and jobs in the build farm, to work out which one is the nasty one when Stuff Goes Wrong. The builder failure count is reset when we know the dispatch was successful, so it's only there to catch actual dispatch problems, not problems while building.
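The triage this branch implements can be sketched as follows. This is a simplified, self-contained model of the `assessFailureCounts` logic that appears in the diffs on this page — the `Fake*` classes and method names here are stand-ins, not the real Launchpad objects:

```python
# Minimal sketch of the failure-count triage. Class and attribute
# names are simplified stand-ins for the real Launchpad model
# objects (the real code lives in lp.buildmaster.manager).

class FakeJob:
    def __init__(self, failure_count=0):
        self.failure_count = failure_count
        self.status = "BUILDING"
        self.resets = 0

    def reset(self):
        # Put the job back on the queue for another builder.
        self.resets += 1


class FakeBuilder:
    def __init__(self, job, failure_count=0):
        self.failure_count = failure_count
        self.current_job = job
        self.builderok = True
        self.failnotes = None

    def fail_builder(self, notes):
        self.builderok = False
        self.failnotes = notes


def assess_failure_counts(builder, fail_notes):
    """Decide whether the builder or the job is at fault."""
    job = builder.current_job
    if builder.failure_count == job.failure_count:
        # Equal counts: impossible to tell which one is at fault,
        # so reset the job and hope it runs on a different builder.
        job.reset()
    elif builder.failure_count > job.failure_count:
        # The builder has failed more often than the jobs it ran:
        # disable it and reschedule the build elsewhere.
        builder.fail_builder(fail_notes)
        job.reset()
    else:
        # The job is the culprit: fail it so it is not
        # automatically dispatched again.
        job.status = "FAILEDTOBUILD"
```

The point of the equal-counts branch (debated in the review below) is that it covers both a first failure and a re-dispatch to the same builder.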
Jonathan Lange (jml) wrote:
Hey Julian,
This looks like a big step toward build farm reliability -- thank you.
Some of my comments are asking for further explanation, a few make suggestions
for moving some code around, and the rest are the normal stylistic gotchas.
Out of curiosity, did you discuss this change in detail with anyone before
landing?
cheers,
jml
> === modified file 'lib/lp/
> --- lib/lp/
> +++ lib/lp/
> @@ -50,6 +50,9 @@
> <require
> permission=
> set_attributes=
> + <require
> + permission=
> + set_attributes=
> </class>
> <securedutility
> component=
>
I guess this means that whatever increments this runs with admin-level
privileges. Is that really such a good idea?
> === modified file 'lib/lp/
> --- lib/lp/
> +++ lib/lp/
...
> @@ -157,6 +158,52 @@
> if job is not None:
> job.reset()
>
> + def _getBuilder(self):
> + # Helper to return the builder given the slave for this request.
> + # Avoiding circular imports.
> + from lp.buildmaster.
> + return getUtility(
> +
I notice that there's very similar code below.
I'd suggest making a top-level function like this:
def get_builder(
# Helper to return the builder given the slave for this request.
# Avoiding circular imports.
from lp.buildmaster.
return getUtility(
And using that here instead, as well as in SlaveScanner.
> + def assessFailureCo
> + """View builder/job failure_count and work out which needs to die.
> +
There's spurious whitespace on this line.
> + :return: True if we disabled something, False if we did not.
This seems like an odd thing to return. It doesn't seem to be used in the
code. What's it for?
> + """
> + # Avoiding circular imports.
Not avoiding circular imports here any more :)
> + if builder is None:
> + builder = self._getBuilder()
> + build_job = builder.
> +
> + if builder.
> + # This is either the first failure for this job on this
> + # builder, or by some chance the job was re-dispatched to
> + # the same builder. This make it impossible to determine
"makes"
> + # whether the job or the builder is at fault, so don't fail
> + # either. We reset the builder and job to try again.
It's unclear to me why the two counts being equal could imply that this is the
first failure for the job -- surely a count of 1 would signify that?
Perhaps the comment should say something like:
# If the failure count ...
Julian Edwards (julian-edwards) wrote:
On Thursday 19 August 2010 13:37:40 you wrote:
> Review: Needs Fixing
> Hey Julian,
>
> This looks like a big step toward build farm reliability -- thank you.
>
> Some of my comments are asking for further explanation, a few make
> suggestions for moving some code around, and the rest are the normal
> stylistic gotchas.
>
> Out of curiosity, did you discuss this change in detail with anyone before
> landing?
It's been so long since I started it I honestly can't remember. I have vague
recollections of mentioning failure counts to various people.
> > === modified file 'lib/lp/
> > --- lib/lp/
> > +++ lib/lp/
> > @@ -50,6 +50,9 @@
> >
> > <require
> >
> > permission=
> > set_attributes=
> >
> > + <require
> > + permission=
> > + set_attributes=
> >
> > </class>
> > <securedutility
> >
> > component=
>
> I guess this means that whatever increments this runs with admin-level
> privileges. Is that really such a good idea?
For now yes, because it's only ever changed in "zopeless" scripts. That is,
Zope requires this to work, but doesn't actually apply any level of security to
it. :/
I made it admin to discourage changing it in appservers. There are other
script-
the zcml about exposing it externally! I prefer the extra protection up
front...
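For context, the truncated ZCML quoted above is a `<require>` stanza of roughly this shape. The class, permission, and attribute names here are illustrative guesses reconstructed from the discussion (admin-only `set_attributes` on the new counter), since the quoted file paths and values are cut off:

```xml
<class class="lp.buildmaster.model.builder.Builder">
    <require
        permission="launchpad.Admin"
        set_attributes="failure_count"/>
</class>
```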
> > === modified file 'lib/lp/
> > --- lib/lp/
> > +++ lib/lp/
>
> ...
>
> > @@ -157,6 +158,52 @@
> >
> > if job is not None:
> > job.reset()
> >
> > + def _getBuilder(self):
> > + # Helper to return the builder given the slave for this request.
> > + # Avoiding circular imports.
> > + from lp.buildmaster.
> > + return getUtility(
> > +
>
> I notice that there's very similar code below.
>
> I'd suggest making a top-level function like this:
>
> def get_builder(
> # Helper to return the builder given the slave for this request.
> # Avoiding circular imports.
> from lp.buildmaster.
> return getUtility(
>
> And using that here instead, as well as in SlaveScanner.
Meh, I meant to do that and forgot, thanks.
>
> > + def assessFailureCo
> > + """View builder/job failure_count and work out which needs to
> > die. +
>
> There's spurious whitespace on this line.
Grar, my vim setup has gone weird on me, it always used to tell me about
trailing spaces.
>
> > + :return: True if we disabled something, False if we did not.
>
> This seems like an odd thing to return. It doesn't seem to be used in the
> code. W...
1 | === modified file 'lib/lp/buildmaster/interfaces/builder.py' |
2 | --- lib/lp/buildmaster/interfaces/builder.py 2010-08-17 15:04:47 +0000 |
3 | +++ lib/lp/buildmaster/interfaces/builder.py 2010-08-19 14:43:04 +0000 |
4 | @@ -281,6 +281,9 @@ |
5 | :return: A BuildQueue, or None. |
6 | """ |
7 | |
8 | + def getCurrentBuildFarmJob(): |
9 | + """Return a `BuildFarmJob` for this builder.""" |
10 | + |
11 | |
12 | class IBuilderSet(Interface): |
13 | """Collections of builders. |
14 | |
15 | === modified file 'lib/lp/buildmaster/manager.py' |
16 | --- lib/lp/buildmaster/manager.py 2010-08-18 16:58:18 +0000 |
17 | +++ lib/lp/buildmaster/manager.py 2010-08-19 15:51:41 +0000 |
18 | @@ -141,10 +141,55 @@ |
19 | return d |
20 | |
21 | |
22 | +def get_builder(name): |
23 | + """Helper to return the builder given the slave for this request.""" |
24 | + # Avoiding circular imports. |
25 | + from lp.buildmaster.interfaces.builder import IBuilderSet |
26 | + return getUtility(IBuilderSet)[name] |
27 | + |
28 | + |
29 | +def assessFailureCounts(builder, fail_notes): |
30 | + """View builder/job failure_count and work out which needs to die. |
31 | + |
32 | + :return: True if we disabled something, False if we did not. |
33 | + """ |
34 | + current_job = builder.currentjob |
35 | + build_job = current_job.specific_job.build |
36 | + |
37 | + if builder.failure_count == build_job.failure_count: |
38 | + # If the failure count for the builder is the same as the |
39 | + # failure count for the job being built, then we cannot |
40 | + # tell whether the job or the builder is at fault. The best |
41 | + # we can do is try them both again, and hope that the job |
42 | + # runs against a different builder. |
43 | + current_job.reset() |
44 | + return False |
45 | + |
46 | + if builder.failure_count > build_job.failure_count: |
47 | + # The builder has failed more than the jobs it's been |
48 | + # running, so let's disable it and re-schedule the build. |
49 | + builder.failBuilder(fail_notes) |
50 | + current_job.reset() |
51 | + return True |
52 | + else: |
53 | + # The job is the culprit! Override its status to 'failed' |
54 | + # to make sure it won't get automatically dispatched again, |
55 | + # and remove the buildqueue request. The failure should |
56 | + # have already caused any relevant slave data to be stored |
57 | + # on the build record so don't worry about that here. |
58 | + build_job.status = BuildStatus.FAILEDTOBUILD |
59 | + builder.currentjob.destroySelf() |
60 | + |
61 | + # N.B. We could try and call _handleStatus_PACKAGEFAIL here |
62 | + # but that would cause us to query the slave for its status |
63 | + # again, and if the slave is non-responsive it holds up the |
64 | + # next buildd scan. |
65 | + return True |
66 | + |
67 | + |
68 | class BaseDispatchResult: |
69 | """Base class for *DispatchResult variations. |
70 | |
71 | - |
72 | It will be extended to represent dispatching results and allow |
73 | homogeneous processing. |
74 | """ |
75 | @@ -158,51 +203,13 @@ |
76 | if job is not None: |
77 | job.reset() |
78 | |
79 | - def _getBuilder(self): |
80 | - # Helper to return the builder given the slave for this request. |
81 | - # Avoiding circular imports. |
82 | - from lp.buildmaster.interfaces.builder import IBuilderSet |
83 | - return getUtility(IBuilderSet)[self.slave.name] |
84 | - |
85 | - def assessFailureCounts(self, builder=None): |
86 | + def assessFailureCounts(self): |
87 | """View builder/job failure_count and work out which needs to die. |
88 | - |
89 | + |
90 | :return: True if we disabled something, False if we did not. |
91 | """ |
92 | - # Avoiding circular imports. |
93 | - if builder is None: |
94 | - builder = self._getBuilder() |
95 | - build_job = builder.currentjob.specific_job.build |
96 | - |
97 | - if builder.failure_count == build_job.failure_count: |
98 | - # This is either the first failure for this job on this |
99 | - # builder, or by some chance the job was re-dispatched to |
100 | - # the same builder. This make it impossible to determine |
101 | - # whether the job or the builder is at fault, so don't fail |
102 | - # either. We reset the builder and job to try again. |
103 | - self._cleanJob(builder.currentjob) |
104 | - return False |
105 | - |
106 | - if builder.failure_count > build_job.failure_count: |
107 | - # The builder has failed more than the jobs it's been |
108 | - # running, so let's disable it and re-schedule the build. |
109 | - builder.failBuilder(self.info) |
110 | - self._cleanJob(builder.currentjob) |
111 | - return True |
112 | - else: |
113 | - # The job is the culprit! Override its status to 'failed' |
114 | - # to make sure it won't get automatically dispatched again, |
115 | - # and remove the buildqueue request. The failure should |
116 | - # have already caused any relevant slave data to be stored |
117 | - # on the build record so don't worry about that here. |
118 | - build_job.status = BuildStatus.FAILEDTOBUILD |
119 | - builder.currentjob.destroySelf() |
120 | - |
121 | - # N.B. We could try and call _handleStatus_PACKAGEFAIL here |
122 | - # but that would cause us to query the slave for its status |
123 | - # again, and if the slave is non-responsive it holds up the |
124 | - # next buildd scan. |
125 | - return True |
126 | + builder = get_builder(self.slave.name) |
127 | + return assessFailureCounts(builder, self.info) |
128 | |
129 | def ___call__(self): |
130 | raise NotImplementedError( |
131 | @@ -237,7 +244,7 @@ |
132 | |
133 | @write_transaction |
134 | def __call__(self): |
135 | - builder = self._getBuilder() |
136 | + builder = get_builder(self.slave.name) |
137 | # Builders that fail to reset should be disabled as per bug |
138 | # 563353. |
139 | # XXX Julian bug=586362 |
140 | @@ -283,18 +290,16 @@ |
141 | self.logger.info("Scanning failed with: %s\n%s" % |
142 | (error.getErrorMessage(), error.getTraceback())) |
143 | |
144 | - # Avoid circular import. |
145 | - from lp.buildmaster.interfaces.builder import IBuilderSet |
146 | - builder = getUtility(IBuilderSet)[self.builder_name] |
147 | + builder = get_builder(self.builder_name) |
148 | |
149 | # Decide if we need to terminate the job or fail the |
150 | # builder. |
151 | self._incrementFailureCounts(builder) |
152 | self.logger.info( |
153 | - "builder failure count: %s, job failure count: %s" % ( |
154 | + "builder failure count: %s, job failure count: %s" % ( |
155 | builder.failure_count, |
156 | - builder.currentjob.specific_job.build.failure_count)) |
157 | - BaseDispatchResult(slave=None).assessFailureCounts(builder) |
158 | + builder.getCurrentBuildFarmJob().failure_count)) |
159 | + assessFailureCounts(builder, error.getErrorMessage()) |
160 | transaction.commit() |
161 | |
162 | self.scheduleNextScanCycle() |
163 | @@ -311,10 +316,7 @@ |
164 | # We need to re-fetch the builder object on each cycle as the |
165 | # Storm store is invalidated over transaction boundaries. |
166 | |
167 | - # Avoid circular import. |
168 | - from lp.buildmaster.interfaces.builder import IBuilderSet |
169 | - builder_set = getUtility(IBuilderSet) |
170 | - self.builder = builder_set[self.builder_name] |
171 | + self.builder = get_builder(self.builder_name) |
172 | |
173 | if self.builder.builderok: |
174 | self.builder.updateStatus(self.logger) |
175 | @@ -496,9 +498,8 @@ |
176 | return self.reset_result(slave, error_text) |
177 | |
178 | def _incrementFailureCounts(self, builder): |
179 | - # Avoid circular import. |
180 | builder.failure_count += 1 |
181 | - builder.currentjob.specific_job.build.failure_count += 1 |
182 | + builder.getCurrentBuildFarmJob().failure_count += 1 |
183 | |
184 | def checkDispatch(self, response, method, slave): |
185 | """Verify the results of a slave xmlrpc call. |
186 | |
187 | === modified file 'lib/lp/buildmaster/model/builder.py' |
188 | --- lib/lp/buildmaster/model/builder.py 2010-08-17 15:04:47 +0000 |
189 | +++ lib/lp/buildmaster/model/builder.py 2010-08-19 14:42:10 +0000 |
190 | @@ -645,6 +645,11 @@ |
191 | Job._status == JobStatus.RUNNING, |
192 | Job.date_started != None).one() |
193 | |
194 | + def getCurrentBuildFarmJob(self): |
195 | + """See `IBuilder`.""" |
196 | + # Don't make this a property, it's masking a few queries. |
197 | + return self.currentjob.specific_job.build |
198 | + |
199 | |
200 | class BuilderSet(object): |
201 | """See IBuilderSet""" |
202 | |
203 | === modified file 'lib/lp/buildmaster/tests/test_builder.py' |
204 | --- lib/lp/buildmaster/tests/test_builder.py 2010-08-17 15:04:47 +0000 |
205 | +++ lib/lp/buildmaster/tests/test_builder.py 2010-08-19 15:07:15 +0000 |
206 | @@ -43,9 +43,18 @@ |
207 | |
208 | def test_default_values(self): |
209 | builder = self.factory.makeBuilder() |
210 | + # Make sure the Storm cache gets the values that the database |
211 | + # initialises. |
212 | flush_database_updates() |
213 | self.assertEqual(0, builder.failure_count) |
214 | |
215 | + def test_getCurrentBuildFarmJob(self): |
216 | + bq = self.factory.makeSourcePackageRecipeBuildJob(3333) |
217 | + builder = self.factory.makeBuilder() |
218 | + bq.markAsBuilding(builder) |
219 | + self.assertEqual( |
220 | + bq, builder.getCurrentBuildFarmJob().buildqueue_record) |
221 | + |
222 | def test_getBuildQueue(self): |
223 | buildqueueset = getUtility(IBuildQueueSet) |
224 | active_jobs = buildqueueset.getActiveBuildJobs() |
225 | |
226 | === modified file 'lib/lp/buildmaster/tests/test_manager.py' |
227 | --- lib/lp/buildmaster/tests/test_manager.py 2010-08-18 16:58:18 +0000 |
228 | +++ lib/lp/buildmaster/tests/test_manager.py 2010-08-19 16:06:10 +0000 |
229 | @@ -7,7 +7,6 @@ |
230 | import signal |
231 | import time |
232 | import transaction |
233 | -import unittest |
234 | |
235 | from twisted.internet import defer, reactor, task |
236 | from twisted.internet.error import ConnectionClosed |
237 | @@ -38,6 +37,7 @@ |
238 | from lp.soyuz.tests.test_publishing import SoyuzTestPublisher |
239 | from lp.testing.factory import LaunchpadObjectFactory |
240 | from lp.testing.fakemethod import FakeMethod |
241 | +from lp.testing.sampledata import BOB_THE_BUILDER_NAME |
242 | from lp.testing import TestCase as LaunchpadTestCase |
243 | |
244 | |
245 | @@ -225,7 +225,8 @@ |
246 | |
247 | def setUp(self): |
248 | TrialTestCase.setUp(self) |
249 | - self.manager = TestingSlaveScanner("bob", BufferLogger()) |
250 | + self.manager = TestingSlaveScanner( |
251 | + BOB_THE_BUILDER_NAME, BufferLogger()) |
252 | |
253 | # We will use an instrumented SlaveScanner instance for tests in |
254 | # this context. |
255 | @@ -362,7 +363,8 @@ |
256 | |
257 | def _setUpSlaveAndBuilder(self): |
258 | # Helper function to set up a builder and its recording slave. |
259 | - slave = RecordingSlave('bob', 'http://foo.buildd:8221/', 'foo.host') |
260 | + slave = RecordingSlave( |
261 | + BOB_THE_BUILDER_NAME, 'http://foo.buildd:8221/', 'foo.host') |
262 | bob_builder = getUtility(IBuilderSet)[slave.name] |
263 | return slave, bob_builder |
264 | |
265 | @@ -509,7 +511,8 @@ |
266 | dl.addCallback(check_events) |
267 | |
268 | # A functional slave charged with some interactions. |
269 | - slave = RecordingSlave('bob', 'http://bob.buildd:8221/', 'bob.host') |
270 | + slave = RecordingSlave( |
271 | + BOB_THE_BUILDER_NAME, 'http://bob.buildd:8221/', 'bob.host') |
272 | slave.ensurepresent('arg1', 'arg2', 'arg3') |
273 | slave.build('arg1', 'arg2', 'arg3') |
274 | |
275 | @@ -541,7 +544,8 @@ |
276 | # Create a broken slave and insert interaction that will |
277 | # cause the builder to be marked as fail. |
278 | self.test_proxy = TestingXMLRPCProxy('very broken slave') |
279 | - slave = RecordingSlave('bob', 'http://bob.buildd:8221/', 'bob.host') |
280 | + slave = RecordingSlave( |
281 | + BOB_THE_BUILDER_NAME, 'http://bob.buildd:8221/', 'bob.host') |
282 | slave.ensurepresent('arg1', 'arg2', 'arg3') |
283 | slave.build('arg1', 'arg2', 'arg3') |
284 | |
285 | @@ -623,7 +627,7 @@ |
286 | |
287 | Replace its default logging handler by a testing version. |
288 | """ |
289 | - manager = SlaveScanner("bob", BufferLogger()) |
290 | + manager = SlaveScanner(BOB_THE_BUILDER_NAME, BufferLogger()) |
291 | manager.logger.name = 'slave-scanner' |
292 | |
293 | return manager |
294 | @@ -672,7 +676,7 @@ |
295 | # A job gets dispatched to the sampledata builder after it's reset. |
296 | |
297 | # Reset sampledata builder. |
298 | - builder = getUtility(IBuilderSet)['bob'] |
299 | + builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME] |
300 | self._resetBuilder(builder) |
301 | # Set this to 1 here so that _checkDispatch can make sure it's |
302 | # reset to 0 after a successful dispatch. |
303 | @@ -705,7 +709,7 @@ |
304 | # and the builder used should remain active and IDLE. |
305 | |
306 | # Reset sampledata builder. |
307 | - builder = getUtility(IBuilderSet)['bob'] |
308 | + builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME] |
309 | self._resetBuilder(builder) |
310 | |
311 | # Remove hoary/i386 chroot. |
312 | @@ -746,7 +750,7 @@ |
313 | # The job assigned to a broken builder is rescued. |
314 | |
315 | # Sampledata builder is enabled and is assigned to an active job. |
316 | - builder = getUtility(IBuilderSet)['bob'] |
317 | + builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME] |
318 | self.assertTrue(builder.builderok) |
319 | job = builder.currentjob |
320 | self.assertBuildingJob(job, builder) |
321 | @@ -783,7 +787,7 @@ |
322 | |
323 | # Enable sampledata builder attached to an appropriate testing |
324 | # slave. It will respond as if it was building the sampledata job. |
325 | - builder = getUtility(IBuilderSet)['bob'] |
326 | + builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME] |
327 | |
328 | login('foo.bar@canonical.com') |
329 | builder.builderok = True |
330 | @@ -804,12 +808,13 @@ |
331 | def test_scan_assesses_failure_exceptions(self): |
332 | # If scan() fails with an exception, failure_counts should be |
333 | # incremented and tested. |
334 | - def fake_scan(): |
335 | + def failing_scan(): |
336 | raise Exception("fake exception") |
337 | manager = self._getManager() |
338 | - manager.scan = fake_scan |
339 | + manager.scan = failing_scan |
340 | manager.scheduleNextScanCycle = FakeMethod() |
341 | - self.patch(BaseDispatchResult, 'assessFailureCounts', FakeMethod()) |
342 | + from lp.buildmaster import manager as manager_module |
343 | + self.patch(manager_module, 'assessFailureCounts', FakeMethod()) |
344 | builder = getUtility(IBuilderSet)[manager.builder_name] |
345 | |
346 | # Failure counts start at zero. |
347 | @@ -828,10 +833,10 @@ |
348 | 1, builder.currentjob.specific_job.build.failure_count) |
349 | |
350 | self.assertEqual( |
351 | - 1, BaseDispatchResult.assessFailureCounts.call_count) |
352 | - |
353 | - |
354 | -class TestDispatchResult(unittest.TestCase): |
355 | + 1, manager_module.assessFailureCounts.call_count) |
356 | + |
357 | + |
358 | +class TestDispatchResult(LaunchpadTestCase): |
359 | """Tests `BaseDispatchResult` variations. |
360 | |
361 | Variations of `BaseDispatchResult` when evaluated update the database |
362 | @@ -874,12 +879,12 @@ |
363 | def assertBuilderIsClean(self, builder): |
364 | # Check that the builder is ready for a new build. |
365 | self.assertTrue(builder.builderok) |
366 | - self.assertTrue(builder.failnotes is None) |
367 | - self.assertTrue(builder.currentjob is None) |
368 | + self.assertIs(None, builder.failnotes) |
369 | + self.assertIs(None, builder.currentjob) |
370 | |
371 | def testResetDispatchResult(self): |
372 | # Test that `ResetDispatchResult` resets the builder and job. |
373 | - builder, job_id = self._getBuilder('bob') |
374 | + builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME) |
375 | buildqueue_id = builder.currentjob.id |
376 | builder.builderok = True |
377 | builder.failure_count = 1 |
378 | @@ -899,8 +904,11 @@ |
379 | self.assertBuilderIsClean(builder) |
380 | |
381 | def testFailDispatchResult(self): |
382 | - # Test that `FailDispatchResult` calls assessFailureCounts(). |
383 | - builder, job_id = self._getBuilder('bob') |
384 | + # Test that `FailDispatchResult` calls assessFailureCounts() so |
385 | + # that we know the builders and jobs are failed as necessary |
386 | + # when a FailDispatchResult is called at the end of the dispatch |
387 | + # chain. |
388 | + builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME) |
389 | |
390 | # Setup a interaction to satisfy 'write_transaction' decorator. |
391 | login(ANONYMOUS) |
392 | @@ -914,7 +922,7 @@ |
393 | def _setup_failing_dispatch_result(self): |
394 | # assessFailureCounts should fail jobs or builders depending on |
395 | # whether it sees the failure_counts on each increasing. |
396 | - builder, job_id = self._getBuilder('bob') |
397 | + builder, job_id = self._getBuilder(BOB_THE_BUILDER_NAME) |
398 | slave = RecordingSlave(builder.name, builder.url, builder.vm_host) |
399 | result = FailDispatchResult(slave, 'does not work!') |
400 | return builder, result |
401 | |
402 | === modified file 'lib/lp/soyuz/configure.zcml' |
403 | --- lib/lp/soyuz/configure.zcml 2010-08-17 15:04:47 +0000 |
404 | +++ lib/lp/soyuz/configure.zcml 2010-08-19 15:32:16 +0000 |
405 | @@ -513,6 +513,9 @@ |
406 | status dependencies upload_log"/> |
407 | |
408 | <!-- XXX bigjools 2010-07-27 bug=570939 |
409 | + Work around the fact that not all BuildFarmJobs are concrete |
410 | + objects. |
411 | + |
412 | This should not be required once the old BuildFarmJob stuff is |
413 | removed when the Translation Template jobs and Recipe jobs |
414 | use the new infrastructure --> |
415 | |
416 | === modified file 'lib/lp/testing/sampledata.py' |
417 | --- lib/lp/testing/sampledata.py 2010-08-05 02:29:25 +0000 |
418 | +++ lib/lp/testing/sampledata.py 2010-08-19 15:15:27 +0000 |
419 | @@ -9,9 +9,11 @@ |
420 | |
421 | __metaclass__ = type |
422 | __all__ = [ |
423 | + 'BOB_THE_BUILDER_NAME', |
424 | 'BUILDD_ADMIN_USERNAME', |
425 | 'CHROOT_LIBRARYFILEALIAS', |
426 | 'COMMERCIAL_ADMIN_EMAIL', |
427 | + 'FROG_THE_BUILDER_NAME', |
428 | 'HOARY_DISTROSERIES_NAME', |
429 | 'I386_ARCHITECTURE_NAME', |
430 | 'LAUNCHPAD_DBUSER_NAME', |
431 | @@ -34,6 +36,9 @@ |
432 | |
433 | # A user with buildd admin rights and upload rights to Ubuntu. |
434 | BUILDD_ADMIN_USERNAME = 'cprov' |
435 | +# A couple of builders. |
436 | +BOB_THE_BUILDER_NAME = 'bob' |
437 | +FROG_THE_BUILDER_NAME = 'frog' |
438 | # The LibraryFileAlias of a chroot for attaching to a DistroArchSeries |
439 | CHROOT_LIBRARYFILEALIAS = 1 |
440 | HOARY_DISTROSERIES_NAME = 'hoary' |
Stuart Bishop (stub) wrote:
Allocated new DB patch number patch-2208-04-0.sql
Jonathan Lange (jml) wrote:
On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
<email address hidden> wrote:
> On Thursday 19 August 2010 13:37:40 you wrote:
>> Review: Needs Fixing
>> Hey Julian,
>>
>> This looks like a big step toward build farm reliability -- thank you.
>>
Thanks for the changes. There are a few questions below, but I reckon
this'll be the last round.
>> Some of my comments are asking for further explanation, a few make
>> suggestions for moving some code around, and the rest are the normal
>> stylistic gotchas.
>>
>> Out of curiosity, did you discuss this change in detail with anyone before
>> landing?
>
> It's been so long since I started it I honestly can't remember. I have vague
> recollections of mentioning failure counts to various people.
>
No worries. I ask because we have failure counting mechanisms in other
places in Launchpad (e.g. the branch puller) and it might have been
interesting to share patterns, variable names and the like.
>> > === modified file 'lib/lp/
>> > --- lib/lp/
>> > +++ lib/lp/
>> > @@ -50,6 +50,9 @@
>> >
>> > <require
>> >
>> > permission=
>> > set_attributes=
>> >
>> > + <require
>> > + permission=
>> > + set_attributes=
>> >
>> > </class>
>> > <securedutility
>> >
>> > component=
>>
>> I guess this means that whatever increments this runs with admin-level
>> privileges. Is that really such a good idea?
>
> For now yes, because it's only ever changed in "zopeless" scripts. That is,
> Zope requres this to work, but doesn't actually apply any level of security to
> it. :/
>
> I made it admin to discourage changing it in appservers. There are other
> script-
> the zcml about exposing it externally! I prefer the extra protection up
> front...
>
Makes sense. I mean, a better permission would be some kind of
"Forbidden" permission, I guess.
>> > === modified file 'lib/lp/
>> > --- lib/lp/
>> > +++ lib/lp/
...
>> > + :return: True if we disabled something, False if we did not.
>>
>> This seems like an odd thing to return. It doesn't seem to be used in the
>> code. What's it for?
>
> For the tests.
>
Yeah, but what about the tests? Or rather, why do the tests care about
something that the code doesn't?
>> > + # Decide if we need to terminate the job or fail the
>> > + # builder.
>> > + self._increment
>> > + self.logger.info(
>> > + "builder failure count: %s, job failure count: %s" % (
>> > + builder.
>> > +
>> > builder.
>> > BaseDispatchRes
>>
>> This line makes me wonder why ass...
Julian Edwards (julian-edwards) wrote:
On Friday 20 August 2010 11:43:40 you wrote:
> On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
>
> <email address hidden> wrote:
> > On Thursday 19 August 2010 13:37:40 you wrote:
> >> Review: Needs Fixing
> >> Hey Julian,
> >>
> >> This looks like a big step toward build farm reliability -- thank you.
>
> Thanks for the changes. There are a few questions below, but I reckon
> this'll be the last round.
I'm not so sure :)
>
> >> Some of my comments are asking for further explanation, a few make
> >> suggestions for moving some code around, and the rest are the normal
> >> stylistic gotchas.
> >>
> >> Out of curiosity, did you discuss this change in detail with anyone
> >> before landing?
> >
> > It's been so long since I started it I honestly can't remember. I have
> > vague recollections of mentioning failure counts to various people.
>
> No worries. I ask because we have failure counting mechanisms in other
> places in Launchpad (e.g. the branch puller) and it might have been
> interesting to share patterns, variable names and the like.
I had no idea about that, that's a shame :(
> >> > === modified file 'lib/lp/
> >> > --- lib/lp/
> >> > +++ lib/lp/
>
> ...
>
> >> > + :return: True if we disabled something, False if we did not.
> >>
> >> This seems like an odd thing to return. It doesn't seem to be used in
> >> the code. What's it for?
> >
> > For the tests.
>
> Yeah, but what about the tests? Or rather, why do the tests care about
> something that the code doesn't?
Well lots of our tests return stuff that the code doesn't need, like Deferreds
for example. But fine, I've got rid of it.
> >> > + # Decide if we need to terminate the job or fail the
> >> > + # builder.
> >> > + self._increment
> >> > + self.logger.info(
> >> > + "builder failure count: %s, job failure count: %s" %
> >> > ( + builder.
> >> > +
> >> > builder.
> >> > BaseDispatchRes
> >>
> >> This line makes me wonder why assessFailureCounts is on
> >> BaseDispatchResult at all. Perhaps it should be a method on Builder
> >> that takes an optional 'info' parameter?
> >>
> >> The above code would then read:
> >>
> >> builder.
> >> self.logger.
> >> builder.
> >
> > I don't think we should continue to bloat model classes with code that is
> > only used in one place. This is a buildd-
> >
> > Anyway, bugger, I had intended to factor it out to a standalone function
> > as I also realised that that line is ridiculous and again forgot :/
>
> A standalone function would be a definite improvement, and fine by me.
> I personally don't think it's bloat to say "a builder knows how to
> handle builder failures", but I'm not going to push it.
By the same argument, anything that deals with object X should be a method on
that object X.
The difference he...
1 | === modified file 'lib/lp/buildmaster/manager.py' |
2 | --- lib/lp/buildmaster/manager.py 2010-08-20 08:53:14 +0000 |
3 | +++ lib/lp/buildmaster/manager.py 2010-08-20 11:50:42 +0000 |
4 | @@ -149,10 +149,7 @@ |
5 | |
6 | |
7 | def assessFailureCounts(builder, fail_notes): |
8 | - """View builder/job failure_count and work out which needs to die. |
9 | - |
10 | - :return: True if we disabled something, False if we did not. |
11 | - """ |
12 | + """View builder/job failure_count and work out which needs to die. """ |
13 | current_job = builder.currentjob |
14 | build_job = current_job.specific_job.build |
15 | |
16 | @@ -163,14 +160,13 @@ |
17 | # we can do is try them both again, and hope that the job |
18 | # runs against a different builder. |
19 | current_job.reset() |
20 | - return False |
21 | + return |
22 | |
23 | if builder.failure_count > build_job.failure_count: |
24 | # The builder has failed more than the jobs it's been |
25 | # running, so let's disable it and re-schedule the build. |
26 | builder.failBuilder(fail_notes) |
27 | current_job.reset() |
28 | - return True |
29 | else: |
30 | # The job is the culprit! Override its status to 'failed' |
31 | # to make sure it won't get automatically dispatched again, |
32 | @@ -184,7 +180,6 @@ |
33 | # but that would cause us to query the slave for its status |
34 | # again, and if the slave is non-responsive it holds up the |
35 | # next buildd scan. |
36 | - return True |
37 | |
38 | |
39 | class BaseDispatchResult: |
40 | @@ -209,7 +204,7 @@ |
41 | :return: True if we disabled something, False if we did not. |
42 | """ |
43 | builder = get_builder(self.slave.name) |
44 | - return assessFailureCounts(builder, self.info) |
45 | + assessFailureCounts(builder, self.info) |
46 | |
47 | def ___call__(self): |
48 | raise NotImplementedError( |
49 | |
50 | === modified file 'lib/lp/buildmaster/tests/test_manager.py' |
51 | --- lib/lp/buildmaster/tests/test_manager.py 2010-08-20 08:53:14 +0000 |
52 | +++ lib/lp/buildmaster/tests/test_manager.py 2010-08-20 12:07:15 +0000 |
53 | @@ -228,6 +228,9 @@ |
54 | self.manager = TestingSlaveScanner( |
55 | BOB_THE_BUILDER_NAME, BufferLogger()) |
56 | |
57 | + self.fake_builder_url = 'http://bob.buildd:8221/' |
58 | + self.fake_builder_host = 'bob.host' |
59 | + |
60 | # We will use an instrumented SlaveScanner instance for tests in |
61 | # this context. |
62 | |
63 | @@ -361,11 +364,19 @@ |
64 | self.assertFalse(result.processed) |
65 | return d.addCallback(check_result) |
66 | |
67 | - def _setUpSlaveAndBuilder(self): |
68 | + def _setUpSlaveAndBuilder(self, builder_failure_count=None, |
69 | + job_failure_count=None): |
70 | # Helper function to set up a builder and its recording slave. |
71 | + if builder_failure_count is None: |
72 | + builder_failure_count = 0 |
73 | + if job_failure_count is None: |
74 | + job_failure_count = 0 |
75 | slave = RecordingSlave( |
76 | - BOB_THE_BUILDER_NAME, 'http://foo.buildd:8221/', 'foo.host') |
77 | + BOB_THE_BUILDER_NAME, self.fake_builder_url, |
78 | + self.fake_builder_host) |
79 | bob_builder = getUtility(IBuilderSet)[slave.name] |
80 | + bob_builder.failure_count = builder_failure_count |
81 | + bob_builder.getCurrentBuildFarmJob().failure_count = job_failure_count |
82 | return slave, bob_builder |
83 | |
84 | def test_checkDispatch_success(self): |
85 | @@ -395,9 +406,8 @@ |
86 | |
87 | On success dispatching it returns None. |
88 | """ |
89 | - slave, bob_builder = self._setUpSlaveAndBuilder() |
90 | - bob_builder.failure_count = 0 |
91 | - bob_builder.currentjob.specific_job.build.failure_count = 0 |
92 | + slave, bob_builder = self._setUpSlaveAndBuilder( |
93 | + builder_failure_count=0, job_failure_count=0) |
94 | |
95 | # Successful legitimate response, None is returned. |
96 | successful_response = [ |
97 | @@ -410,9 +420,8 @@ |
98 | def test_checkDispatch_first_fail(self): |
99 | # Failed legitimate response, results in FailDispatchResult and |
100 | # failure_count on the job and the builder are both incremented. |
101 | - slave, bob_builder = self._setUpSlaveAndBuilder() |
102 | - bob_builder.failure_count = 0 |
103 | - bob_builder.currentjob.specific_job.build.failure_count = 0 |
104 | + slave, bob_builder = self._setUpSlaveAndBuilder( |
105 | + builder_failure_count=0, job_failure_count=0) |
106 | |
107 | failed_response = [False, 'uncool builder'] |
108 | result = self.manager.checkDispatch( |
109 | @@ -420,50 +429,48 @@ |
110 | self.assertIsDispatchFail(result) |
111 | self.assertEqual( |
112 | repr(result), |
113 | - '<bob:http://foo.buildd:8221/> failure (uncool builder)') |
114 | + '<bob:%s> failure (uncool builder)' % self.fake_builder_url) |
115 | self.assertEqual(1, bob_builder.failure_count) |
116 | self.assertEqual( |
117 | - 1, bob_builder.currentjob.specific_job.build.failure_count) |
118 | + 1, bob_builder.getCurrentBuildFarmJob().failure_count) |
119 | |
120 | def test_checkDispatch_second_reset_fail_by_builder(self): |
121 | # Twisted Failure response, results in a `FailDispatchResult`. |
122 | - slave, bob_builder = self._setUpSlaveAndBuilder() |
123 | - bob_builder.failure_count = 1 |
124 | - bob_builder.currentjob.specific_job.build.failure_count = 0 |
125 | + slave, bob_builder = self._setUpSlaveAndBuilder( |
126 | + builder_failure_count=1, job_failure_count=0) |
127 | |
128 | twisted_failure = Failure(ConnectionClosed('Boom!')) |
129 | result = self.manager.checkDispatch( |
130 | twisted_failure, 'ensurepresent', slave) |
131 | self.assertIsDispatchFail(result) |
132 | self.assertEqual( |
133 | - '<bob:http://foo.buildd:8221/> failure (None)', repr(result)) |
134 | + '<bob:%s> failure (None)' % self.fake_builder_url, repr(result)) |
135 | self.assertEqual(2, bob_builder.failure_count) |
136 | self.assertEqual( |
137 | - 1, bob_builder.currentjob.specific_job.build.failure_count) |
138 | + 1, bob_builder.getCurrentBuildFarmJob().failure_count) |
139 | |
140 | def test_checkDispatch_second_comms_fail_by_builder(self): |
141 | # Unexpected response, results in a `FailDispatchResult`. |
142 | - slave, bob_builder = self._setUpSlaveAndBuilder() |
143 | - bob_builder.failure_count = 1 |
144 | - bob_builder.currentjob.specific_job.build.failure_count = 0 |
145 | + slave, bob_builder = self._setUpSlaveAndBuilder( |
146 | + builder_failure_count=1, job_failure_count=0) |
147 | |
148 | unexpected_response = [1, 2, 3] |
149 | result = self.manager.checkDispatch( |
150 | unexpected_response, 'build', slave) |
151 | self.assertIsDispatchFail(result) |
152 | self.assertEqual( |
153 | - '<bob:http://foo.buildd:8221/> failure ' |
154 | - '(Unexpected response: [1, 2, 3])', repr(result)) |
155 | + '<bob:%s> failure ' |
156 | + '(Unexpected response: [1, 2, 3])' % self.fake_builder_url, |
157 | + repr(result)) |
158 | self.assertEqual(2, bob_builder.failure_count) |
159 | self.assertEqual( |
160 | - 1, bob_builder.currentjob.specific_job.build.failure_count) |
161 | + 1, bob_builder.getCurrentBuildFarmJob().failure_count) |
162 | |
163 | def test_checkDispatch_second_comms_fail_by_job(self): |
164 | # Unknown method was given, results in a `FailDispatchResult`. |
165 | # This could be caused by a faulty job which would fail the job. |
166 | - slave, bob_builder = self._setUpSlaveAndBuilder() |
167 | - bob_builder.failure_count = 0 |
168 | - bob_builder.currentjob.specific_job.build.failure_count = 1 |
169 | + slave, bob_builder = self._setUpSlaveAndBuilder( |
170 | + builder_failure_count=0, job_failure_count=1) |
171 | |
172 | successful_response = [ |
173 | buildd_success_result_map.get('ensurepresent'), 'cool builder'] |
174 | @@ -471,11 +478,12 @@ |
175 | successful_response, 'unknown-method', slave) |
176 | self.assertIsDispatchFail(result) |
177 | self.assertEqual( |
178 | - '<bob:http://foo.buildd:8221/> failure ' |
179 | - '(Unknown slave method: unknown-method)', repr(result)) |
180 | + '<bob:%s> failure ' |
181 | + '(Unknown slave method: unknown-method)' % self.fake_builder_url, |
182 | + repr(result)) |
183 | self.assertEqual(1, bob_builder.failure_count) |
184 | self.assertEqual( |
185 | - 2, bob_builder.currentjob.specific_job.build.failure_count) |
186 | + 2, bob_builder.getCurrentBuildFarmJob().failure_count) |
187 | |
188 | def test_initiateDispatch(self): |
189 | """Check `dispatchBuild` in various scenarios. |
190 | @@ -498,7 +506,8 @@ |
191 | def check_events(results): |
192 | [error] = [r for s, r in results if r is not None] |
193 | self.assertEqual( |
194 | - '<bob:http://bob.buildd:8221/> failure (very broken slave)', |
195 | + '<bob:%s> failure (very broken slave)' |
196 | + % self.fake_builder_url, |
197 | repr(error)) |
198 | self.assertTrue(error.processed) |
199 | |
200 | @@ -512,7 +521,8 @@ |
201 | |
202 | # A functional slave charged with some interactions. |
203 | slave = RecordingSlave( |
204 | - BOB_THE_BUILDER_NAME, 'http://bob.buildd:8221/', 'bob.host') |
205 | + BOB_THE_BUILDER_NAME, self.fake_builder_url, |
206 | + self.fake_builder_host) |
207 | slave.ensurepresent('arg1', 'arg2', 'arg3') |
208 | slave.build('arg1', 'arg2', 'arg3') |
209 | |
210 | @@ -545,7 +555,8 @@ |
211 | # cause the builder to be marked as fail. |
212 | self.test_proxy = TestingXMLRPCProxy('very broken slave') |
213 | slave = RecordingSlave( |
214 | - BOB_THE_BUILDER_NAME, 'http://bob.buildd:8221/', 'bob.host') |
215 | + BOB_THE_BUILDER_NAME, self.fake_builder_url, |
216 | + self.fake_builder_host) |
217 | slave.ensurepresent('arg1', 'arg2', 'arg3') |
218 | slave.build('arg1', 'arg2', 'arg3') |
219 | |
220 | @@ -935,9 +946,8 @@ |
221 | build = buildqueue.specific_job.build |
222 | builder.failure_count = 2 |
223 | build.failure_count = 2 |
224 | - disabled = result.assessFailureCounts() |
225 | + result.assessFailureCounts() |
226 | |
227 | - self.assertFalse(disabled) |
228 | self.assertBuilderIsClean(builder) |
229 | self.assertEqual('NEEDSBUILD', build.status.name) |
230 | self.assertBuildqueueIsClean(buildqueue) |
231 | @@ -949,9 +959,8 @@ |
232 | build = buildqueue.specific_job.build |
233 | build.failure_count = 2 |
234 | builder.failure_count = 1 |
235 | - disabled = result.assessFailureCounts() |
236 | + result.assessFailureCounts() |
237 | |
238 | - self.assertTrue(disabled) |
239 | self.assertBuilderIsClean(builder) |
240 | self.assertEqual('FAILEDTOBUILD', build.status.name) |
241 | # The buildqueue should have been removed entirely. |
242 | @@ -966,9 +975,8 @@ |
243 | build = buildqueue.specific_job.build |
244 | build.failure_count = 2 |
245 | builder.failure_count = 3 |
246 | - disabled = result.assessFailureCounts() |
247 | + result.assessFailureCounts() |
248 | |
249 | - self.assertTrue(disabled) |
250 | self.assertFalse(builder.builderok) |
251 | self.assertEqual('does not work!', builder.failnotes) |
252 | self.assertTrue(builder.currentjob is None) |
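For readers skimming the hunks above, the culprit-picking logic that the tests exercise boils down to a three-way comparison of failure counts. A minimal standalone sketch of that decision (with hypothetical `Job`/`Builder` stand-ins rather than Launchpad's real model classes):

```python
# Sketch of the assessFailureCounts decision logic tested above.
# Job and Builder are illustrative stand-ins, not the Launchpad classes.

class Job:
    def __init__(self):
        self.failure_count = 0
        self.status = 'NEEDSBUILD'

    def reset(self):
        # Put the job back on the queue so another builder can pick it up.
        self.status = 'NEEDSBUILD'


class Builder:
    def __init__(self, current_job):
        self.failure_count = 0
        self.builderok = True
        self.failnotes = None
        self.currentjob = current_job

    def failBuilder(self, fail_notes):
        # Disable the builder, recording why.
        self.builderok = False
        self.failnotes = fail_notes


def assess_failure_counts(builder, fail_notes):
    """Decide whether the builder or the job is the likely culprit."""
    job = builder.currentjob
    if builder.failure_count == job.failure_count:
        # Equal counts: no way to tell which is at fault, so retry both
        # and hope the job lands on a different builder next time.
        job.reset()
    elif builder.failure_count > job.failure_count:
        # The builder has failed more than the jobs it has been running:
        # disable it and re-schedule the build.
        builder.failBuilder(fail_notes)
        job.reset()
    else:
        # The job has failed on more than one builder: mark it failed so
        # it is not automatically dispatched again.
        job.status = 'FAILEDTOBUILD'
```

This mirrors the three test scenarios above: equal counts leave both clean, a higher builder count disables the builder, and a higher job count fails the build.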
Jonathan Lange (jml) wrote:
On Fri, Aug 20, 2010 at 1:16 PM, Julian Edwards
<email address hidden> wrote:
> On Friday 20 August 2010 11:43:40 you wrote:
>> On Thu, Aug 19, 2010 at 5:13 PM, Julian Edwards
>> <email address hidden> wrote:
>> > On Thursday 19 August 2010 13:37:40 you wrote:
...
>> >> > === modified file 'lib/lp/
>> >> > --- lib/lp/
>> >> > +++ lib/lp/
>>
>> ...
>>
>> >> > + :return: True if we disabled something, False if we did not.
>> >>
>> >> This seems like an odd thing to return. It doesn't seem to be used in
>> >> the code. What's it for?
>> >
>> > For the tests.
>>
>> Yeah, but what about the tests? Or rather, why do the tests care about
>> something that the code doesn't?
>
> Well lots of our tests return stuff that the code doesn't need, like Deferreds
> for example. But fine, I've got rid of it.
>
Thanks.
>> >> > + # Decide if we need to terminate the job or fail the
>> >> > + # builder.
>> >> > + self._increment
>> >> > + self.logger.info(
>> >> > + "builder failure count: %s, job failure count: %s" %
>> >> > ( + builder.
>> >> > +
>> >> > builder.
>> >> > BaseDispatchRes
>> >>
>> >> This line makes me wonder why assessFailureCounts is on
>> >> BaseDispatchResult at all. Perhaps it should be a method on Builder
>> >> that takes an optional 'info' parameter?
>> >>
>> >> The above code would then read:
>> >>
>> >> builder.
>> >> self.logger.
>> >> builder.
>> >
>> > I don't think we should continue to bloat model classes with code that is
>> > only used in one place. This is a buildd-
>> >
>> > Anyway, bugger, I had intended to factor it out to a standalone function
>> > as I also realised that that line is ridiculous and again forgot :/
>>
>> A standalone function would be a definite improvement, and fine by me.
>> I personally don't think it's bloat to say "a builder knows how to
>> handle builder failures", but I'm not going to push it.
>
> By the same argument, anything that deals with object X should be a method on
> that object X.
>
I would argue that most things that manipulate state on object X
should be methods of object X. I would not say it's a universal rule.
> The difference here is that it's not dealing with builder failures
> exclusively, it's dealing with a combination of builder and job failures, and
> I would personally find it odd that a builder was trying to modify job
> properties.
>
> Generally, I think the boundaries of responsibility in our code are poor and
> should be less like this.
>
> Did I convince you yet? Probably not :)
>
I think you have, in this case, particularly given the concessions
below to have methods on Job (however it's spelled) and Builder that
do the failure incrementing. That seems like a clean division of
responsibility.
>>
>> >> > + builder.
>> >> ...
Julian Edwards (julian-edwards) wrote:
On Monday 23 August 2010 18:34:43 Jonathan Lange wrote:
> I would argue that most things that manipulate state on object X
> should be methods of object X. I would not say it's a universal rule.
If we boil it down to the basic manipulation required, I completely agree.
> I think you have, in this case, particularly given the concessions
> below to have methods on Job (however it's spelled) and Builder that
> do the failure incrementing. That seems like a clean division of
> responsibility.
Woot! :)
> > I'd prefer to have a gotFailure() on each of Builder and BuildFarmJob and
> > have the build manager call each in turn.
> >
> > Let me know if you think that is acceptable and I'll do that change (it's
> > not in the attached diff).
>
> I think that's a great idea.
Ok it's done.
> Re robustness, I think you lose one kind of robustness and gain another.
I guess there are different ways of looking at it...
> "Was this method called?" tests are robust against behaviour change.
> If you change what assessFailureCounts is supposed to do, then you
> won't have to change these tests, only the assessFailureCounts ones.
> (Some people call this "Mockist")
>
> Tests that check state are robust against implementation change. If
> you change the details of how you've implemented the code (as you've
> done during this review process), you won't have to change any tests.
> (Some people call this "Classic")
>
> As a rule, I lean heavily toward the latter, since I change
> implementation details much more frequently than behaviour. I use
> mocks in situations where there's a very clearly defined, fairly
> stable interface and when using some other kind of double would be
> prohibitively difficult.
>
> See http://
> discussion on the whole topic.
>
> As far as this diff goes, I think using a mock is less preferable but
> still sound, and I'm happy to see it merged.
I think I'm in total agreement with the sentiment above. In my case here I am
trying to abstract away from the "how failures are counted" as much as
possible since I foresee some changes in how we make it work. I suspect that
this branch will not be the final word on the matter :)
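The "mockist" versus "classic" distinction discussed above can be made concrete with a small sketch (plain `unittest.mock`; `Counter` and `got_failure` are hypothetical names, not Launchpad's test infrastructure):

```python
from unittest import mock


class Counter:
    """A tiny hypothetical object with a failure count."""

    def __init__(self):
        self.count = 0

    def got_failure(self):
        self.count += 1


def test_state_style():
    # "Classic": assert on the resulting state.  Robust against
    # implementation change -- refactoring the internals of
    # got_failure() does not break this test.
    counter = Counter()
    counter.got_failure()
    assert counter.count == 1


def test_mock_style():
    # "Mockist": assert that the method was called.  Robust against
    # behaviour change -- got_failure() can do anything internally;
    # the test only pins down that it was invoked exactly once.
    counter = Counter()
    counter.got_failure = mock.Mock()
    counter.got_failure()
    counter.got_failure.assert_called_once_with()
```

The state-style test survives rewrites of `got_failure`; the mock-style test survives changes to what `got_failure` does, at the cost of pinning the caller's call pattern.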
> It seems to me that Builder.currentjob ought to be replaced with a
> method asap, since it's doing a query, and naive readers might well
> miss that fact.
Agreed, I've filed a bug though as after grepping around, it will be a big
diff and I don't want to pollute this branch.
> If it's doing dupe queries and you aren't going to change currentjob
> to a method in this branch, then please add a comment saying something
> like:
>
> # builder.currentjob runs a query. Don't run it twice.
Done!
See partial diff.
Thanks for sticking with me :)
1 | === modified file 'lib/lp/buildmaster/interfaces/builder.py' |
2 | --- lib/lp/buildmaster/interfaces/builder.py 2010-08-20 13:47:47 +0000 |
3 | +++ lib/lp/buildmaster/interfaces/builder.py 2010-08-24 09:55:13 +0000 |
4 | @@ -154,6 +154,12 @@ |
5 | title=u"The current behavior of the builder for the current job.", |
6 | required=False) |
7 | |
8 | + def gotFailure(): |
9 | + """Increment failure_count on the builder.""" |
10 | + |
11 | + def resetFailureCount(): |
12 | + """Set the failure_count back to zero.""" |
13 | + |
14 | def checkSlaveAlive(): |
15 | """Check that the buildd slave is alive. |
16 | |
17 | |
18 | === modified file 'lib/lp/buildmaster/interfaces/buildfarmjob.py' |
19 | --- lib/lp/buildmaster/interfaces/buildfarmjob.py 2010-07-23 20:27:27 +0000 |
20 | +++ lib/lp/buildmaster/interfaces/buildfarmjob.py 2010-08-24 10:01:21 +0000 |
21 | @@ -270,6 +270,9 @@ |
22 | returned. |
23 | """ |
24 | |
25 | + def gotFailure(): |
26 | + """Increment the failure_count for this job.""" |
27 | + |
28 | title = exported(TextLine(title=_("Title"), required=False)) |
29 | |
30 | was_built = Attribute("Whether or not modified by the builddfarm.") |
31 | |
32 | === modified file 'lib/lp/buildmaster/manager.py' |
33 | --- lib/lp/buildmaster/manager.py 2010-08-20 13:42:19 +0000 |
34 | +++ lib/lp/buildmaster/manager.py 2010-08-24 10:27:37 +0000 |
35 | @@ -150,6 +150,8 @@ |
36 | |
37 | def assessFailureCounts(builder, fail_notes): |
38 | """View builder/job failure_count and work out which needs to die. """ |
39 | + # builder.currentjob hides a complicated query, don't run it twice. |
40 | + # See bug 623281. |
41 | current_job = builder.currentjob |
42 | build_job = current_job.specific_job.build |
43 | |
44 | @@ -361,7 +363,7 @@ |
45 | if self.builder.currentjob is not None: |
46 | # After a successful dispatch we can reset the |
47 | # failure_count. |
48 | - self.builder.failure_count = 0 |
49 | + self.builder.resetFailureCount() |
50 | transaction.commit() |
51 | return slave |
52 | |
53 | @@ -493,8 +495,8 @@ |
54 | return self.reset_result(slave, error_text) |
55 | |
56 | def _incrementFailureCounts(self, builder): |
57 | - builder.failure_count += 1 |
58 | - builder.getCurrentBuildFarmJob().failure_count += 1 |
59 | + builder.gotFailure() |
60 | + builder.getCurrentBuildFarmJob().gotFailure() |
61 | |
62 | def checkDispatch(self, response, method, slave): |
63 | """Verify the results of a slave xmlrpc call. |
64 | |
65 | === modified file 'lib/lp/buildmaster/model/builder.py' |
66 | --- lib/lp/buildmaster/model/builder.py 2010-08-20 13:47:47 +0000 |
67 | +++ lib/lp/buildmaster/model/builder.py 2010-08-24 10:15:31 +0000 |
68 | @@ -294,6 +294,14 @@ |
69 | current_build_behavior = property( |
70 | _getCurrentBuildBehavior, _setCurrentBuildBehavior) |
71 | |
72 | + def gotFailure(self): |
73 | + """See `IBuilder`.""" |
74 | + self.failure_count += 1 |
75 | + |
76 | + def resetFailureCount(self): |
77 | + """See `IBuilder`.""" |
78 | + self.failure_count = 0 |
79 | + |
80 | def checkSlaveAlive(self): |
81 | """See IBuilder.""" |
82 | if self.slave.echo("Test")[0] != "Test": |
83 | @@ -311,6 +319,8 @@ |
84 | """See IBuilder.""" |
85 | return self.slave.clean() |
86 | |
87 | + # XXX 2010-08-24 Julian bug=623281 |
88 | + # This should not be a property! It's masking a complicated query. |
89 | @property |
90 | def currentjob(self): |
91 | """See IBuilder""" |
92 | |
93 | === modified file 'lib/lp/buildmaster/model/buildfarmjob.py' |
94 | --- lib/lp/buildmaster/model/buildfarmjob.py 2010-08-17 15:04:47 +0000 |
95 | +++ lib/lp/buildmaster/model/buildfarmjob.py 2010-08-24 10:00:34 +0000 |
96 | @@ -344,6 +344,10 @@ |
97 | |
98 | return build_without_outer_proxy |
99 | |
100 | + def gotFailure(self): |
101 | + """See `IBuildFarmJob`.""" |
102 | + self.failure_count += 1 |
103 | + |
104 | |
105 | class BuildFarmJobDerived: |
106 | implements(IBuildFarmJob) |
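Taken together, the methods introduced in this partial diff give the manager a small failure-bookkeeping protocol: a dispatch failure counts against both the builder and its current job, while a successful dispatch clears only the builder's count. A rough sketch (with hypothetical `FakeBuilder`/`FakeJob` stand-ins, not the real model classes):

```python
# Sketch of the failure-bookkeeping protocol the diff above introduces.
# FakeJob and FakeBuilder are illustrative stand-ins for the model classes.

class FakeJob:
    def __init__(self):
        self.failure_count = 0

    def gotFailure(self):
        self.failure_count += 1


class FakeBuilder:
    def __init__(self, job):
        self.failure_count = 0
        self._job = job

    def gotFailure(self):
        self.failure_count += 1

    def resetFailureCount(self):
        self.failure_count = 0

    def getCurrentBuildFarmJob(self):
        return self._job


def record_dispatch_failure(builder):
    # A dispatch failure counts against both the builder and the job
    # it was running -- neither can yet be blamed alone.
    builder.gotFailure()
    builder.getCurrentBuildFarmJob().gotFailure()


def record_dispatch_success(builder):
    # A successful dispatch clears the builder's count, so only
    # consecutive dispatch failures accumulate against a builder.
    builder.resetFailureCount()
```

This is why the builder's count only ever reflects dispatch problems, as the commit message notes: any successful dispatch resets it, while the job keeps its history across builders.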
Jonathan Lange (jml) wrote:
On Tue, Aug 24, 2010 at 11:34 AM, Julian Edwards
<email address hidden> wrote:
> On Monday 23 August 2010 18:34:43 Jonathan Lange wrote:
>> I would argue that most things that manipulate state on object X
>> should be methods of object X. I would not say it's a universal rule.
>
> If we boil it down to the basic manipulation required, I completely agree.
>
>> I think you have, in this case, particularly given the concessions
>> below to have methods on Job (however it's spelled) and Builder that
>> do the failure incrementing. That seems like a clean division of
>> responsibility.
>
> Woot! :)
...
> See partial diff.
>
>
Rockin. I'd spell resetFailureCount() as gotSuccessfulDi
ok to land as is.
jml
Jonathan Lange (jml):
Fine. patch-2207-78-0.sql.