Merge lp:~mars/launchpad/fix-test_on_merge-578886 into lp:launchpad

Proposed by Māris Fogels
Status: Superseded
Proposed branch: lp:~mars/launchpad/fix-test_on_merge-578886
Merge into: lp:launchpad
Diff against target: 321 lines (+175/-64)
2 files modified
buildout-templates/bin/test.in (+0/-9)
test_on_merge.py (+175/-55)
To merge this branch: bzr merge lp:~mars/launchpad/fix-test_on_merge-578886
Reviewer Review Type Date Requested Status
Māris Fogels (community) Needs Resubmitting
Michael Hudson-Doyle Approve
Review via email: mp+25981@code.launchpad.net

This proposal has been superseded by a proposal from 2010-07-28.

Description of the change

Hi,

This branch fixes test_on_merge.py so that it now terminates the test suite properly after 10 minutes. This means EC2 instances will no longer hang when they kill the test suite.

This branch changes how the testrunner watchdog handles the process groups of itself and its children. The diff explains why the process group fiddling takes place. In order for the testrunner watchdog to control the process groups I had to remove process group handling from bin/test.

I changed the code a fair bit by introducing many comments and some new functions to encapsulate the existing code. I found these changes necessary to understand what the code was doing, as I only had basic knowledge of Unix process handling before this change. This code should now be easier for others to understand.

I dropped the testrunner timeout to 10 minutes. No test should run that long without output.

I cleaned up the print statements a bit, personal preference.

I added some detailed output to the testrunner kill procedure. This is failing loudly, with the hope that the script's behaviour will be less opaque when things go wrong in the code that handles things going wrong.

This branch was tested with a hand-crafted windmill testrunner harness. The harness forces an hour-long sleep() in one of the windmill tests. I used this harness to test both a local test suite timeout and an ec2 test suite timeout, and to ensure that they can kill off an entire process group.

I will complete a full "make check" run and a full "ec2 test" run with this branch before landing.

Pre-implementation call with: no one
Test command: see above

Maris

To post a comment you must log in.
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Gary asked me to have a look at this.

I think it broadly looks reasonable -- certainly it terms of readability it's way better than what went before -- and it's great that you've tested that it works. The proof is in the pudding here!

Just to be clear, the reason this stopped working is because there's now a "xvfb-run" process 'between' the test_on_merge.py script and the bin/test script?

I have a little apprehension about simply dying if we're the process leader. For a start, it means you can't just run "./test_on_merge.py" from the shell! It also might not work on buildbot -- I think it actually will, unless we change our configuration to specify usePTY=True, but that seems rather fragile. Did you consider re-execing the process if we find we're the process group leader? I've pushed an implementation of this idea to lp:~mwhudson/launchpad/fix-test_on_merge-578886 and it seems to work.

If you merge my change in or do something similar, I think this is good to land.

review: Approve
Revision history for this message
Māris Fogels (mars) wrote :

As discussed on IRC, I added in a call to os.fork() that will get us a fresh PID to use as the process group leader. This enables the group-swapping trick to work when the script is run directly.

To make this new bit of complexity clearer I ended up splitting the major actions of main() out into their own functions. This revealed to yet more fixes that were hidden beneath the main() beast's bulk:

 * main()'s docstring was dead wrong. I fixed it.
 * The 'here' variable was being reused and redefined throughout the function. I made it global.
 * The tabnanny code is just plain wrong. It catches tab problems, but misses the rest of the errors that the script may generate. I marked this with an XXX.

The code also makes heavy use of function return codes, which are ugly but effective. I left them.

Everything is being re-tested before submission.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

134 + pid = os.fork()
135 + if pid != 0:
136 + # We are the parent process, so we'll wait for our child process to
137 + # do the heavy lifting for us.
138 + pid, status = os.wait()

Why call wait() and not waitpid() here? Alternatively, is it worth checking the pids match? Clearly they _should_ match, but if they didn't it could be very confusing :-)

I like the new approach overall btw.

Revision history for this message
Māris Fogels (mars) wrote :

I have had to re-examine a number of assumptions about this code, such as the need to aggressively kill the entire process tree, and the use of os.fork() instead of using Popen's preexec_fn argument.

I'm withdrawing this proposal until the code can be revisited.

review: Needs Resubmitting

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'buildout-templates/bin/test.in'
2--- buildout-templates/bin/test.in 2010-05-15 17:43:59 +0000
3+++ buildout-templates/bin/test.in 2010-05-31 14:28:31 +0000
4@@ -29,15 +29,6 @@
5 BUILD_DIR = ${buildout:directory|path-repr}
6 CUSTOM_SITE_DIR = ${scripts:parts-directory|path-repr}
7
8-if os.getsid(0) == os.getsid(os.getppid()):
9- # We need to become the process group leader so test_on_merge.py
10- # can reap its children.
11- #
12- # Note that if setpgrp() is used to move a process from one
13- # process group to another (as is done by some shells when
14- # creating pipelines), then both process groups must be part of
15- # the same session.
16- os.setpgrp()
17
18 # Make tests run in a timezone no launchpad developers live in.
19 # Our tests need to run in any timezone.
20
21=== modified file 'test_on_merge.py'
22--- test_on_merge.py 2010-04-22 17:30:35 +0000
23+++ test_on_merge.py 2010-05-31 14:28:31 +0000
24@@ -12,34 +12,52 @@
25 from StringIO import StringIO
26 import psycopg2
27 from subprocess import Popen, PIPE, STDOUT
28-from signal import SIGKILL, SIGTERM
29+from signal import SIGKILL, SIGTERM, SIGINT, SIGHUP
30 from select import select
31
32+
33 # The TIMEOUT setting (expressed in seconds) affects how long a test will run
34 # before it is deemed to be hung, and then appropriately terminated.
35 # It's principal use is preventing a PQM job from hanging indefinitely and
36 # backing up the queue.
37-# e.g. Usage: TIMEOUT = 60 * 15
38-# This will set the timeout to 15 minutes.
39-TIMEOUT = 60 * 15
40+# e.g. Usage: TIMEOUT = 60 * 10
41+# This will set the timeout to 10 minutes.
42+TIMEOUT = 60 * 10
43+
44+HERE = os.path.dirname(os.path.realpath(__file__))
45+
46
47 def main():
48 """Call bin/test with whatever arguments this script was run with.
49
50- If the tests ran ok (last line of stderr is 'OK<return>') then suppress
51- output and exit(0).
52-
53- Otherwise, print output and exit(1).
54- """
55- here = os.path.dirname(os.path.realpath(__file__))
56-
57- # Tabnanny
58- # NB. If tabnanny raises an exception, run
59- # python /usr/lib/python2.5/tabnanny.py -vv lib/canonical
60- # for more detailed output.
61+ Prior to running the tests this script checks the project files with
62+ Python2.5's tabnanny and sets up the test database.
63+
64+ Returns 1 on error, otherwise it returns the testrunner's exit code.
65+ """
66+ if run_tabnanny() != 0:
67+ return 1
68+
69+ if setup_test_database() != 0:
70+ return 1
71+
72+ return run_test_process()
73+
74+
75+def run_tabnanny():
76+ """Run the tabnanny, return its exit code.
77+
78+ If tabnanny raises an exception, run "python /usr/lib/python2.5/tabnanny.py
79+ -vv lib/canonical for more detailed output.
80+ """
81+ # XXX mars 2010-05-26
82+ # Tabnanny reports some of its errors on sys.stderr, so this code is
83+ # already wrong. subprocess.Popen.communicate() would work better.
84+ print "Checking the source tree with tabnanny..."
85 org_stdout = sys.stdout
86 sys.stdout = StringIO()
87- tabnanny.check(os.path.join(here, 'lib', 'canonical'))
88+ tabnanny.check(os.path.join(HERE, 'lib', 'canonical'))
89+ tabnanny.check(os.path.join(HERE, 'lib', 'lp'))
90 tabnanny_results = sys.stdout.getvalue()
91 sys.stdout = org_stdout
92 if len(tabnanny_results) > 0:
93@@ -47,7 +65,16 @@
94 print tabnanny_results
95 print '---- end tabnanny bitching ----'
96 return 1
97-
98+ else:
99+ print "Done"
100+ return 0
101+
102+
103+def setup_test_database():
104+ """Set up a test instance of our postgresql database.
105+
106+ Returns 0 for success, 1 for errors.
107+ """
108 # Sanity check PostgreSQL version. No point in trying to create a test
109 # database when PostgreSQL is too old.
110 con = psycopg2.connect('dbname=template1')
111@@ -91,8 +118,7 @@
112 con.close()
113
114 # Build the template database. Tests duplicate this.
115- here = os.path.dirname(os.path.realpath(__file__))
116- schema_dir = os.path.join(here, 'database', 'schema')
117+ schema_dir = os.path.join(HERE, 'database', 'schema')
118 if os.system('cd %s; make test > /dev/null' % (schema_dir)) != 0:
119 print 'Failed to create database or load sampledata.'
120 return 1
121@@ -134,70 +160,164 @@
122 con.close()
123 del con
124
125+ return 0
126+
127+
128+def run_test_process():
129+ """Start the testrunner process and return its exit code."""
130+ # Fork a child process so that we get a new process ID that we can
131+ # guarantee is not currently in use as a process group leader. This
132+ # addresses the case where this script has been started directly in the
133+ # shell using "python foo.py" or "./foo.py".
134+ pid = os.fork()
135+ if pid != 0:
136+ # We are the parent process, so we'll wait for our child process to
137+ # do the heavy lifting for us.
138+ pid, status = os.wait()
139+
140+ if os.WIFEXITED(status):
141+ return os.WEXITSTATUS(status)
142+ else:
143+ # We should not reach this code unless something segfaulted in
144+ # our child process, or it recieved a signal from some outside
145+ # force.
146+ raise RuntimeError(
147+ "Oops! The test watchdog was killed by signal %s" % (
148+ os.WTERMSIG(status)))
149+
150 print 'Running tests.'
151- os.chdir(here)
152+ os.chdir(HERE)
153+
154+ # Play shenanigans with our process group. We want to kill off our child
155+ # groups while at the same time not slaughtering ourselves!
156+ original_process_group = os.getpgid(0)
157+
158+ # Make sure we are not already the process group leader. Otherwise this
159+ # trick won't work.
160+ assert original_process_group != os.getpid()
161+
162+ # Change our process group to match our PID, as per POSIX convention.
163+ os.setpgrp()
164+
165+ # We run the test suite under a virtual frame buffer server so that the
166+ # JavaScript integration test suite can run.
167 cmd = [
168 'xvfb-run',
169 '-s',
170 "'-screen 0 1024x768x24'",
171- os.path.join(here, 'bin', 'test')] + sys.argv[1:]
172+ os.path.join(HERE, 'bin', 'test')] + sys.argv[1:]
173+
174 command_line = ' '.join(cmd)
175- print command_line
176+ print "Running command:", command_line
177
178 # Run the test suite and return the error code
179- #return call(cmd)
180-
181- proc = Popen(
182+ xvfb_proc = Popen(
183 command_line, stdin=PIPE, stdout=PIPE, stderr=STDOUT, shell=True)
184- proc.stdin.close()
185-
186- # Do proc.communicate(), but timeout if there's no activity on stdout or
187- # stderr for too long.
188- open_readers = set([proc.stdout])
189+ xvfb_proc.stdin.close()
190+
191+ # Restore our original process group, thus removing ourselves from
192+ # os.killpg's target list. Our child process and its children will retain
193+ # the process group number matching our PID.
194+ os.setpgid(0, original_process_group)
195+
196+ # This code is very similar to what takes place in Popen._communicate(),
197+ # but this code times out if there is no activity on STDOUT for too long.
198+ open_readers = set([xvfb_proc.stdout])
199 while open_readers:
200 rlist, wlist, xlist = select(open_readers, [], [], TIMEOUT)
201
202 if len(rlist) == 0:
203- if proc.poll() is not None:
204+ # The select() statement timed out!
205+
206+ if xvfb_proc.poll() is not None:
207+ # The process we were watching died.
208 break
209- print ("\nA test appears to be hung. There has been no output for"
210- " %d seconds. Sending SIGTERM." % TIMEOUT)
211- killem(proc.pid, SIGTERM)
212- time.sleep(3)
213- if proc.poll() is not None:
214- print ("\nSIGTERM did not work. Sending SIGKILL.")
215- killem(proc.pid, SIGKILL)
216- # Drain the subprocess's stdout and stderr.
217- sys.stdout.write(proc.stdout.read())
218+
219+ cleanup_hung_testrunner(xvfb_proc)
220 break
221
222- if proc.stdout in rlist:
223- chunk = os.read(proc.stdout.fileno(), 1024)
224+ if xvfb_proc.stdout in rlist:
225+ # Read a chunk of output from STDOUT.
226+ chunk = os.read(xvfb_proc.stdout.fileno(), 1024)
227 sys.stdout.write(chunk)
228 if chunk == "":
229- open_readers.remove(proc.stdout)
230-
231- rv = proc.wait()
232+ # Gracefully exit the loop if STDOUT is empty.
233+ open_readers.remove(xvfb_proc.stdout)
234+
235+ rv = xvfb_proc.wait()
236+
237 if rv == 0:
238- print '\nSuccessfully ran all tests.'
239+ print
240+ print 'Successfully ran all tests.'
241 else:
242- print '\nTests failed (exit code %d)' % rv
243+ print
244+ print 'Tests failed (exit code %d)' % rv
245
246 return rv
247
248
249-def killem(pid, signal):
250- """Kill the process group leader identified by pid and other group members
251-
252- Note that bin/test sets its process to a process group leader.
253- """
254+def cleanup_hung_testrunner(process):
255+ """Kill and clean up the testrunner process and its children."""
256+ print
257+ print
258+ print ("WARNING: A test appears to be hung. There has been no "
259+ "output for %d seconds." % TIMEOUT)
260+ print "Forcibly shutting down the test suite"
261+
262+ # This guarantees the process' group will die. In rare cases
263+ # a child process may survive this if they are in a different
264+ # process group and they ignore the signals we send their parent.
265+ nice_killpg(process)
266+
267+ # Drain the subprocess's stdout and stderr.
268+ print "The dying processes left behind the following output:"
269+ print "--------------- BEGIN OUTPUT ---------------"
270+ sys.stdout.write(process.stdout.read())
271+ print
272+ print "---------------- END OUTPUT ----------------"
273+
274+
275+def nice_killpg(process):
276+ """Kill a Unix process group using increasingly harmful signals."""
277+ pgid = os.getpgid(process.pid)
278+
279 try:
280- os.killpg(os.getpgid(pid), signal)
281- except OSError, x:
282- if x.errno == errno.ESRCH:
283+ print "Process group %d will be killed" % pgid
284+
285+ # Attempt a series of increasingly brutal methods of killing the
286+ # process.
287+ for signum in [SIGTERM, SIGINT, SIGHUP, SIGKILL]:
288+ print "Sending signal %s to process group %d" % (signum, pgid)
289+ os.killpg(pgid, signum)
290+
291+ # Give the processes some time to shut down.
292+ time.sleep(3)
293+
294+ # Poll our original child process so that the Popen object can
295+ # capture the process' exit code. If we do not do this now it
296+ # will be lost by the following call to os.waitpid(). Note that
297+ # this also reaps every process in the process group!
298+ process.poll()
299+
300+ # This call will raise ESRCH if the group is empty, or ECHILD if
301+ # the group has already been reaped. The exception will exit the
302+ # loop for us.
303+ os.waitpid(-pgid, os.WNOHANG) # Check for survivors.
304+
305+ print "Some processes ignored our signal!"
306+
307+ except OSError, exc:
308+ if exc.errno == errno.ESRCH:
309+ # We tried to call os.killpg() and found the group to be empty.
310+ pass
311+ elif exc.errno == errno.ECHILD:
312+ # We tried to poll the process group with os.waitpid() and found
313+ # it was empty.
314 pass
315 else:
316 raise
317+ print "Process group %d is now empty." % pgid
318+
319
320 if __name__ == '__main__':
321 sys.exit(main())