Merge lp:~jml/launchpad/test-timeout-505913 into lp:launchpad

Proposed by Jonathan Lange
Status: Merged
Approved by: Aaron Bentley
Approved revision: no longer in the source branch.
Merged at revision: 12660
Proposed branch: lp:~jml/launchpad/test-timeout-505913
Merge into: lp:launchpad
Diff against target: 373 lines (+143/-62)
4 files modified
lib/lp/services/job/runner.py (+45/-11)
lib/lp/services/job/tests/test_job.py (+4/-9)
lib/lp/services/job/tests/test_runner.py (+93/-40)
lib/lp/testing/__init__.py (+1/-2)
To merge this branch: bzr merge lp:~jml/launchpad/test-timeout-505913
Reviewer Review Type Date Requested Status
Aaron Bentley (community) Approve
j.c.sackett (community) Approve
Review via email: mp+54598@code.launchpad.net

Commit message

[r=abentley,jcsackett][bug=505913] Fix the intermittent test failure in TestTwistedJobRunner.test_timeout by handling the case where the timeout occurs during framework code.

Description of the change

This branch, I believe, fixes one of the intermittently failing tests. The linked bug has most of the relevant details. Essentially, the timeout was happening at random times. When it happened while we were actually running the job, everything was good. When it happened during some of the surrounding code (e.g. while the child process was unmarshalling input to turn it into a potential request to do a job), then it was ignored.

This is because the timeout mechanism was to raise an exception. In Twisted, when a SIGHUP signal handler raises an exception, it just raises in the code that's currently running. If we're in the reactor loop at the time, then it is logged as an unhandled error.

Instead of raising an exception, this branch just exits the worker process with a particular exit code (TwistedJobRunner.TIMEOUT_CODE). When the runner sees a process exit with that code, we ignore the ProcessDone "failure" and log a timeout OOPS instead.

The testing strategy for this was to duplicate the test_timeout test that was failing intermittently. One of them would have the timeout bumped up a little (from 1 to 5) to make it more likely to exercise the case where the timeout happens during an actual job and not in surrounding machinery. The other has a much, much shorter timeout (0.05), which is guaranteed to hit the machinery.

In the branch, you'll see lots of other comments and cleanups. It's been a meandering experience, and I've tried to leave what I've learned marked on the walls of the labyrinth.

To post a comment you must log in.
Revision history for this message
j.c.sackett (jcsackett) wrote :

This looks good. I'm curious about one piece of code though:

+ def _logTimeout(self, job):
+ try:
+ raise TimeoutError
+ except TimeoutError:
+ oops = self._doOops(job, sys.exc_info())
+ self._logOopsId(oops.id)

Given all that's happening is an exception being raised in the try block, which immediately goes to the exception, couldn't this all be replaced with the following?

+ def _logTimeout(self, job):
+ oops = self._doOops(job, sys.exc_info())
+ self._logOopsId(oops.id)

Or is there some sort of async thing here I'm missing?

The code as written certainly works, so this isn't a blocker to approval; I'm just assuming I'm missing something, and want to know what.

review: Approve
Revision history for this message
Jonathan Lange (jml) wrote :

On Thu, Mar 24, 2011 at 3:44 PM, j.c.sackett
<email address hidden> wrote:
> Review: Approve
> This looks good. I'm curious about one piece of code though:
>
> +    def _logTimeout(self, job):
> +        try:
> +            raise TimeoutError
> +        except TimeoutError:
> +            oops = self._doOops(job, sys.exc_info())
> +            self._logOopsId(oops.id)
>
> Given all that's happening is an exception being raised in the try block, which immediately goes to the exception, couldn't this all be replaced with the following?
>
> +    def _logTimeout(self, job):
> +        oops = self._doOops(job, sys.exc_info())
> +        self._logOopsId(oops.id)
>
> Or is there some sort of async thing here I'm missing?
>

It's an error conversion thing, not an async thing. We're getting a
ProcessDone error, but we want to log a TimeoutError. That's all
that's going on. I'll add some comments to make this clearer.

Thanks!
jml

Revision history for this message
Aaron Bentley (abentley) :
review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/lp/services/job/runner.py'
2--- lib/lp/services/job/runner.py 2011-02-09 17:43:54 +0000
3+++ lib/lp/services/job/runner.py 2011-03-24 18:38:33 +0000
4@@ -1,12 +1,10 @@
5-# Copyright 2009 Canonical Ltd. This software is licensed under the
6+# Copyright 2009-2011 Canonical Ltd. This software is licensed under the
7 # GNU Affero General Public License version 3 (see the file LICENSE).
8
9 """Facilities for running Jobs."""
10
11-
12 __metaclass__ = type
13
14-
15 __all__ = [
16 'BaseRunnableJob',
17 'JobCronScript',
18@@ -303,8 +301,23 @@
19
20 @classmethod
21 def __enter__(cls):
22-
23 def handler(signum, frame):
24+ # We raise an exception **and** schedule a call to exit the
25+ # process hard. This is because we cannot rely on the exception
26+ # being raised during useful code. Sometimes, it will be raised
27+ # while the reactor is looping, which means that it will be
28+ # ignored.
29+ #
30+ # If the exception is raised during the actual job, then we'll get
31+ # a nice traceback indicating what timed out, and that will be
32+ # logged as an OOPS.
33+ #
34+ # Regardless of where the exception is raised, we'll hard exit the
35+ # process and have a TimeoutError OOPS logged, although that will
36+ # have a crappy traceback. See the job_raised callback in
37+ # TwistedJobRunner.runJobInSubprocess for the other half of that.
38+ reactor.callFromThread(
39+ reactor.callLater, 0, os._exit, TwistedJobRunner.TIMEOUT_CODE)
40 raise TimeoutError
41 scripts.execute_zcml_for_scripts(use_web_security=False)
42 signal(SIGHUP, handler)
43@@ -340,6 +353,8 @@
44 class TwistedJobRunner(BaseJobRunner):
45 """Run Jobs via twisted."""
46
47+ TIMEOUT_CODE = 42
48+
49 def __init__(self, job_source, dbuser, logger=None, error_utility=None):
50 env = {'PATH': os.environ['PATH']}
51 for name in ('PYTHONPATH', 'LPCONFIG'):
52@@ -373,7 +388,7 @@
53 self.logger.debug(
54 'Running %r, lease expires %s', job, job.lease_expires)
55 deferred = self.pool.doWork(
56- RunJobCommand, job_id = job_id, _deadline=deadline)
57+ RunJobCommand, job_id=job_id, _deadline=deadline)
58
59 def update(response):
60 if response['success']:
61@@ -387,17 +402,36 @@
62
63 def job_raised(failure):
64 self.incomplete_jobs.append(job)
65- info = (failure.type, failure.value, failure.tb)
66- oops = self._doOops(job, info)
67- self._logOopsId(oops.id)
68+ exit_code = getattr(failure.value, 'exitCode', None)
69+ if exit_code == self.TIMEOUT_CODE:
70+ # The process ended with the error code that we have
71+ # arbitrarily chosen to indicate a timeout. Rather than log
72+ # that error (ProcessDone), we log a TimeoutError instead.
73+ self._logTimeout(job)
74+ else:
75+ info = (failure.type, failure.value, failure.tb)
76+ oops = self._doOops(job, info)
77+ self._logOopsId(oops.id)
78 deferred.addCallbacks(update, job_raised)
79 return deferred
80
81+ def _logTimeout(self, job):
82+ try:
83+ raise TimeoutError
84+ except TimeoutError:
85+ oops = self._doOops(job, sys.exc_info())
86+ self._logOopsId(oops.id)
87+
88 def getTaskSource(self):
89 """Return a task source for all jobs in job_source."""
90
91 def producer():
92 while True:
93+ # XXX: JonathanLange bug=741204: If we're getting all of the
94+ # jobs at the start anyway, we can use a DeferredSemaphore,
95+ # instead of the more complex PollingTaskSource, which is
96+ # better suited to cases where we don't know how much work
97+ # there will be.
98 jobs = list(self.job_source.iterReady())
99 if len(jobs) == 0:
100 yield None
101@@ -407,9 +441,9 @@
102
103 def doConsumer(self):
104 """Create a ParallelLimitedTaskConsumer for this job type."""
105- logger = logging.getLogger('gloop')
106- logger.addHandler(logging.StreamHandler(sys.stdout))
107- logger.setLevel(logging.DEBUG)
108+ # 1 is hard-coded for now until we're sure we'd get gains by running
109+ # more than one at a time. Note that test_timeout relies on this
110+ # being 1.
111 consumer = ParallelLimitedTaskConsumer(1, logger=None)
112 return consumer.consume(self.getTaskSource())
113
114
115=== modified file 'lib/lp/services/job/tests/test_job.py'
116--- lib/lp/services/job/tests/test_job.py 2010-10-04 19:50:45 +0000
117+++ lib/lp/services/job/tests/test_job.py 2011-03-24 18:38:33 +0000
118@@ -1,11 +1,10 @@
119-# Copyright 2009 Canonical Ltd. This software is licensed under the
120+# Copyright 2009-2011 Canonical Ltd. This software is licensed under the
121 # GNU Affero General Public License version 3 (see the file LICENSE).
122
123 __metaclass__ = type
124
125 from datetime import datetime
126 import time
127-from unittest import TestLoader
128
129 import pytz
130 from storm.locals import Store
131@@ -18,7 +17,7 @@
132 MAIN_STORE,
133 )
134 from canonical.launchpad.webapp.testing import verifyObject
135-from canonical.testing.layers import LaunchpadZopelessLayer
136+from canonical.testing.layers import ZopelessDatabaseLayer
137 from lp.services.job.interfaces.job import (
138 IJob,
139 JobStatus,
140@@ -34,7 +33,7 @@
141 class TestJob(TestCase):
142 """Ensure Job behaves as intended."""
143
144- layer = LaunchpadZopelessLayer
145+ layer = ZopelessDatabaseLayer
146
147 def test_implements_IJob(self):
148 """Job should implement IJob."""
149@@ -211,7 +210,7 @@
150 class TestReadiness(TestCase):
151 """Test the implementation of readiness."""
152
153- layer = LaunchpadZopelessLayer
154+ layer = ZopelessDatabaseLayer
155
156 def _sampleData(self):
157 store = getUtility(IStoreSelector).get(MAIN_STORE, DEFAULT_FLAVOR)
158@@ -296,7 +295,3 @@
159 job = Job()
160 job.acquireLease(-300)
161 self.assertEqual(0, job.getTimeout())
162-
163-
164-def test_suite():
165- return TestLoader().loadTestsFromName(__name__)
166
167=== modified file 'lib/lp/services/job/tests/test_runner.py'
168--- lib/lp/services/job/tests/test_runner.py 2010-12-23 00:38:29 +0000
169+++ lib/lp/services/job/tests/test_runner.py 2011-03-24 18:38:33 +0000
170@@ -7,19 +7,16 @@
171 import sys
172 from textwrap import dedent
173 from time import sleep
174-from unittest import TestLoader
175
176 import transaction
177-from zope.component import getUtility
178 from zope.interface import implements
179
180+from canonical.config import config
181 from canonical.launchpad.webapp import errorlog
182-from canonical.launchpad.webapp.interfaces import (
183- DEFAULT_FLAVOR,
184- IStoreSelector,
185- MAIN_STORE,
186+from canonical.testing.layers import (
187+ LaunchpadZopelessLayer,
188+ ZopelessDatabaseLayer,
189 )
190-from canonical.testing.layers import LaunchpadZopelessLayer
191 from lp.code.interfaces.branchmergeproposal import IUpdatePreviewDiffJobSource
192 from lp.services.job.interfaces.job import (
193 IRunnableJob,
194@@ -329,42 +326,68 @@
195
196 done = False
197
198+ # A list of jobs to run: id, lease_length, delay.
199+ #
200+ # For the first job, have a very long lease, so that it
201+ # doesn't expire and so we soak up the ZCML loading time. For the
202+ # second job, have a short lease so we hit the timeout.
203+ jobs = [
204+ (0, 10000, 0),
205+ (1, 5, 30),
206+ ]
207+
208 @classmethod
209 def iterReady(cls):
210 if not cls.done:
211- yield StuckJob(1)
212- yield StuckJob(2)
213+ for id, lease_length, delay in cls.jobs:
214+ yield cls(id, lease_length, delay)
215 cls.done = True
216
217- @staticmethod
218- def get(id):
219- return StuckJob(id)
220+ @classmethod
221+ def get(cls, id):
222+ id, lease_length, delay = cls.jobs[id]
223+ return cls(id, lease_length, delay)
224
225- def __init__(self, id):
226+ def __init__(self, id, lease_length, delay):
227 self.id = id
228+ self.lease_length = lease_length
229+ self.delay = delay
230 self.job = Job()
231
232+ def __repr__(self):
233+ return '<StuckJob(%r, lease_length=%s, delay=%s)>' % (
234+ self.id, self.lease_length, self.delay)
235+
236 def acquireLease(self):
237- if self.id == 2:
238- lease_length = 1
239- else:
240- lease_length = 10000
241- return self.job.acquireLease(lease_length)
242+ return self.job.acquireLease(self.lease_length)
243
244 def run(self):
245- if self.id == 2:
246- sleep(30)
247- else:
248- store = getUtility(IStoreSelector).get(MAIN_STORE, DEFAULT_FLAVOR)
249- assert (
250- 'user=branchscanner' in store._connection._raw_connection.dsn)
251+ sleep(self.delay)
252+
253+
254+class ShorterStuckJob(StuckJob):
255+ """Simulation of a job that stalls."""
256+
257+ jobs = [
258+ (0, 10000, 0),
259+ (1, 0.05, 30),
260+ ]
261
262
263 class TestTwistedJobRunner(ZopeTestInSubProcess, TestCaseWithFactory):
264
265- layer = LaunchpadZopelessLayer
266-
267- def test_timeout(self):
268+ layer = ZopelessDatabaseLayer
269+
270+ def setUp(self):
271+ super(TestTwistedJobRunner, self).setUp()
272+ # The test relies on _pythonpath being importable. Thus we need to add
273+ # a directory that contains _pythonpath to the sys.path. We can rely
274+ # on the root directory of the checkout containing _pythonpath.
275+ if config.root not in sys.path:
276+ sys.path.append(config.root)
277+ self.addCleanup(sys.path.remove, config.root)
278+
279+ def test_timeout_long(self):
280 """When a job exceeds its lease, an exception is raised.
281
282 Unfortunately, timeouts include the time it takes for the zope
283@@ -373,18 +396,52 @@
284 """
285 logger = BufferLogger()
286 logger.setLevel(logging.INFO)
287+ # StuckJob is actually a source of two jobs. The first is fast, the
288+ # second slow.
289 runner = TwistedJobRunner.runFromSource(
290 StuckJob, 'branchscanner', logger)
291
292- self.assertEqual(1, len(runner.completed_jobs))
293- self.assertEqual(1, len(runner.incomplete_jobs))
294- oops = errorlog.globalErrorUtility.getLastOopsReport()
295- self.assertEqual(dedent("""\
296- INFO Running through Twisted.
297- INFO Job resulted in OOPS: %s
298- """) % oops.id, logger.getLogBuffer())
299- self.assertEqual('TimeoutError', oops.type)
300- self.assertIn('Job ran too long.', oops.value)
301+ # XXX: JonathanLange 2011-03-23 bug=740443: Potential source of race
302+ # condition. Another OOPS could be logged. Also confusing because it
303+ # might be polluted by values from previous jobs.
304+ oops = errorlog.globalErrorUtility.getLastOopsReport()
305+ self.assertEqual(
306+ (1, 1), (len(runner.completed_jobs), len(runner.incomplete_jobs)))
307+ self.assertEqual(
308+ (dedent("""\
309+ INFO Running through Twisted.
310+ INFO Job resulted in OOPS: %s
311+ """) % oops.id,
312+ 'TimeoutError', 'Job ran too long.'),
313+ (logger.getLogBuffer(), oops.type, oops.value))
314+
315+ def test_timeout_short(self):
316+ """When a job exceeds its lease, an exception is raised.
317+
318+ Unfortunately, timeouts include the time it takes for the zope
319+ machinery to start up, so we run a job that will not time out first,
320+ followed by a job that is sure to time out.
321+ """
322+ logger = BufferLogger()
323+ logger.setLevel(logging.INFO)
324+ # StuckJob is actually a source of two jobs. The first is fast, the
325+ # second slow.
326+ runner = TwistedJobRunner.runFromSource(
327+ ShorterStuckJob, 'branchscanner', logger)
328+
329+ # XXX: JonathanLange 2011-03-23 bug=740443: Potential source of race
330+ # condition. Another OOPS could be logged. Also confusing because it
331+ # might be polluted by values from previous jobs.
332+ oops = errorlog.globalErrorUtility.getLastOopsReport()
333+ self.assertEqual(
334+ (1, 1), (len(runner.completed_jobs), len(runner.incomplete_jobs)))
335+ self.assertEqual(
336+ (dedent("""\
337+ INFO Running through Twisted.
338+ INFO Job resulted in OOPS: %s
339+ """) % oops.id,
340+ 'TimeoutError', 'Job ran too long.'),
341+ (logger.getLogBuffer(), oops.type, oops.value))
342
343
344 class TestJobCronScript(ZopeTestInSubProcess, TestCaseWithFactory):
345@@ -425,7 +482,3 @@
346 cronscript.main()
347 finally:
348 errorlog.globalErrorUtility = old_errorlog
349-
350-
351-def test_suite():
352- return TestLoader().loadTestsFromName(__name__)
353
354=== modified file 'lib/lp/testing/__init__.py'
355--- lib/lp/testing/__init__.py 2011-03-21 21:06:46 +0000
356+++ lib/lp/testing/__init__.py 2011-03-24 18:38:33 +0000
357@@ -12,7 +12,6 @@
358 'api_url',
359 'build_yui_unittest_suite',
360 'BrowserTestCase',
361- 'capture_events',
362 'celebrity_logged_in',
363 'FakeTime',
364 'get_lsb_information',
365@@ -29,7 +28,7 @@
366 'normalize_whitespace',
367 'oauth_access_token_for',
368 'person_logged_in',
369- 'quote_jquery_expression'
370+ 'quote_jquery_expression',
371 'record_statements',
372 'run_with_login',
373 'run_with_storm_debug',