Merge lp:~mars/launchpad/fix-test_on_merge-578886 into lp:launchpad

Proposed by Māris Fogels
Status: Merged
Approved by: Māris Fogels
Approved revision: no longer in the source branch.
Merged at revision: 11372
Proposed branch: lp:~mars/launchpad/fix-test_on_merge-578886
Merge into: lp:launchpad
Diff against target: 254 lines (+108/-73)
2 files modified
buildout-templates/bin/test.in (+0/-9)
test_on_merge.py (+108/-64)
To merge this branch: bzr merge lp:~mars/launchpad/fix-test_on_merge-578886
Reviewer Review Type Date Requested Status
Michael Hudson-Doyle Approve
Robert Collins Pending
Māris Fogels Pending
Review via email: mp+31207@code.launchpad.net

This proposal supersedes a proposal from 2010-05-25.

Commit message

Clean up the test_on_merge.py code to make it handle suite errors correctly. Removed the tabnanny checker.

Description of the change

Hello,

This branch contains simplified code for playing process group shenanigans, and it simplifies the nice_killpg() function. After discussion with Robert, this is a best effort approach.

I am particularly keen to see that the cleanup_hung_testrunner() and nice_killpg() functions work together, and that cleanup_hung_testrunner() does NOT hang if some child process is holding a pipe open. I hope code inspection will help tell me if this is true.

I tested this code as best I could with harnesses that reproduce test suite hangs. However, I think the process group shenanigans alone should be enough to keep the watchdog from dying, and therefore to keep the suite sane.

I have included the original proposal text below. It still holds true for this patch.

Maris

=====
Hi,

This branch fixes test_on_merge.py so that it now terminates the test suite properly after 10 minutes. This means EC2 instances will no longer hang when they kill the test suite.

This branch changes how the testrunner watchdog handles the process groups of itself and its children. The diff explains why the process group fiddling takes place. In order for the testrunner watchdog to control the process groups I had to remove process group handling from bin/test.

I changed the code a fair bit by introducing many comments and some new functions to encapsulate the existing code. I found these changes necessary to understand what the code was doing, as I only had basic knowledge of Unix process handling before this change. This code should now be easier for others to understand.

I dropped the testrunner timeout to 10 minutes. No test should run that long without output.

I cleaned up the print statements a bit, personal preference.

I added some detailed output to the testrunner kill procedure. This is failing loudly, with the hope that the script's behaviour will be less opaque when things go wrong in the code that handles things going wrong.

This branch was tested with a hand-crafted windmill testrunner harness. The harness forces an hour-long sleep() in one of the windmill tests. I used this harness to test both a local test suite timeout and an ec2 test suite timeout, and to ensure that they can kill off an entire process group.

I will complete a full "make check" run and a full "ec2 test" run with this branch before landing.

Pre-implementation call with: no one
Test command: see above

Maris

To post a comment you must log in.
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote : Posted in a previous version of this proposal

Gary asked me to have a look at this.

I think it broadly looks reasonable -- certainly it terms of readability it's way better than what went before -- and it's great that you've tested that it works. The proof is in the pudding here!

Just to be clear, the reason this stopped working is because there's now a "xvfb-run" process 'between' the test_on_merge.py script and the bin/test script?

I have a little apprehension about simply dying if we're the process leader. For a start, it means you can't just run "./test_on_merge.py" from the shell! It also might not work on buildbot -- I think it actually will, unless we change our configuration to specify usePTY=True, but that seems rather fragile. Did you consider re-execing the process if we find we're the process group leader? I've pushed an implementation of this idea to lp:~mwhudson/launchpad/fix-test_on_merge-578886 and it seems to work.

If you merge my change in or do something similar, I think this is good to land.

review: Approve
Revision history for this message
Māris Fogels (mars) wrote : Posted in a previous version of this proposal

As discussed on IRC, I added in a call to os.fork() that will get us a fresh PID to use as the process group leader. This enables the group-swapping trick to work when the script is run directly.

To make this new bit of complexity clearer I ended up splitting the major actions of main() out into their own functions. This revealed to yet more fixes that were hidden beneath the main() beast's bulk:

 * main()'s docstring was dead wrong. I fixed it.
 * The 'here' variable was being reused and redefined throughout the function. I made it global.
 * The tabnanny code is just plain wrong. It catches tab problems, but misses the rest of the errors that the script may generate. I marked this with an XXX.

The code also makes heavy use of function return codes, which are ugly but effective. I left them.

Everything is being re-tested before submission.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote : Posted in a previous version of this proposal

134 + pid = os.fork()
135 + if pid != 0:
136 + # We are the parent process, so we'll wait for our child process to
137 + # do the heavy lifting for us.
138 + pid, status = os.wait()

Why call wait() and not waitpid() here? Alternatively, is it worth checking the pids match? Clearly they _should_ match, but if they didn't it could be very confusing :-)

I like the new approach overall btw.

Revision history for this message
Māris Fogels (mars) wrote : Posted in a previous version of this proposal

I have had to re-examine a number of assumptions about this code, such as the need to aggressively kill the entire process tree, and the use of os.fork() instead of using Popen's preexec_fn argument.

I'm withdrawing this proposal until the code can be revisited.

review: Needs Resubmitting
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

I think this looks fine. I assume you've tested it -- I haven't, this time around.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'buildout-templates/bin/test.in'
--- buildout-templates/bin/test.in 2010-07-30 11:52:48 +0000
+++ buildout-templates/bin/test.in 2010-08-17 13:34:50 +0000
@@ -45,15 +45,6 @@
45BUILD_DIR = ${buildout:directory|path-repr}45BUILD_DIR = ${buildout:directory|path-repr}
46CUSTOM_SITE_DIR = ${scripts:parts-directory|path-repr}46CUSTOM_SITE_DIR = ${scripts:parts-directory|path-repr}
4747
48if os.getsid(0) == os.getsid(os.getppid()):
49 # We need to become the process group leader so test_on_merge.py
50 # can reap its children.
51 #
52 # Note that if setpgrp() is used to move a process from one
53 # process group to another (as is done by some shells when
54 # creating pipelines), then both process groups must be part of
55 # the same session.
56 os.setpgrp()
5748
58# Make tests run in a timezone no launchpad developers live in.49# Make tests run in a timezone no launchpad developers live in.
59# Our tests need to run in any timezone.50# Our tests need to run in any timezone.
6051
=== modified file 'test_on_merge.py'
--- test_on_merge.py 2010-07-28 14:28:57 +0000
+++ test_on_merge.py 2010-08-17 13:34:50 +0000
@@ -12,42 +12,39 @@
12from StringIO import StringIO12from StringIO import StringIO
13import psycopg213import psycopg2
14from subprocess import Popen, PIPE, STDOUT14from subprocess import Popen, PIPE, STDOUT
15from signal import SIGKILL, SIGTERM15from signal import SIGKILL, SIGTERM, SIGINT, SIGHUP
16from select import select16from select import select
1717
18
18# The TIMEOUT setting (expressed in seconds) affects how long a test will run19# The TIMEOUT setting (expressed in seconds) affects how long a test will run
19# before it is deemed to be hung, and then appropriately terminated.20# before it is deemed to be hung, and then appropriately terminated.
20# It's principal use is preventing a PQM job from hanging indefinitely and21# It's principal use is preventing a PQM job from hanging indefinitely and
21# backing up the queue.22# backing up the queue.
22# e.g. Usage: TIMEOUT = 60 * 1523# e.g. Usage: TIMEOUT = 60 * 10
23# This will set the timeout to 15 minutes.24# This will set the timeout to 10 minutes.
24TIMEOUT = 60 * 1525TIMEOUT = 60 * 10
26
27HERE = os.path.dirname(os.path.realpath(__file__))
28
2529
26def main():30def main():
27 """Call bin/test with whatever arguments this script was run with.31 """Call bin/test with whatever arguments this script was run with.
2832
29 If the tests ran ok (last line of stderr is 'OK<return>') then suppress33 Prior to running the tests this script sets up the test database.
30 output and exit(0).
3134
32 Otherwise, print output and exit(1).35 Returns 1 on error, otherwise it returns the testrunner's exit code.
33 """36 """
34 here = os.path.dirname(os.path.realpath(__file__))37 if setup_test_database() != 0:
35
36 # Tabnanny
37 # NB. If tabnanny raises an exception, run
38 # python /usr/lib/python2.5/tabnanny.py -vv lib/canonical
39 # for more detailed output.
40 org_stdout = sys.stdout
41 sys.stdout = StringIO()
42 tabnanny.check(os.path.join(here, 'lib', 'canonical'))
43 tabnanny_results = sys.stdout.getvalue()
44 sys.stdout = org_stdout
45 if len(tabnanny_results) > 0:
46 print '---- tabnanny bitching ----'
47 print tabnanny_results
48 print '---- end tabnanny bitching ----'
49 return 138 return 1
5039
40 return run_test_process()
41
42
43def setup_test_database():
44 """Set up a test instance of our postgresql database.
45
46 Returns 0 for success, 1 for errors.
47 """
51 # Sanity check PostgreSQL version. No point in trying to create a test48 # Sanity check PostgreSQL version. No point in trying to create a test
52 # database when PostgreSQL is too old.49 # database when PostgreSQL is too old.
53 con = psycopg2.connect('dbname=template1')50 con = psycopg2.connect('dbname=template1')
@@ -91,8 +88,7 @@
91 con.close()88 con.close()
9289
93 # Build the template database. Tests duplicate this.90 # Build the template database. Tests duplicate this.
94 here = os.path.dirname(os.path.realpath(__file__))91 schema_dir = os.path.join(HERE, 'database', 'schema')
95 schema_dir = os.path.join(here, 'database', 'schema')
96 if os.system('cd %s; make test > /dev/null' % (schema_dir)) != 0:92 if os.system('cd %s; make test > /dev/null' % (schema_dir)) != 0:
97 print 'Failed to create database or load sampledata.'93 print 'Failed to create database or load sampledata.'
98 return 194 return 1
@@ -134,70 +130,118 @@
134 con.close()130 con.close()
135 del con131 del con
136132
133 return 0
134
135
136def run_test_process():
137 """Start the testrunner process and return its exit code."""
137 print 'Running tests.'138 print 'Running tests.'
138 os.chdir(here)139 os.chdir(HERE)
140
141 # We run the test suite under a virtual frame buffer server so that the
142 # JavaScript integration test suite can run.
139 cmd = [143 cmd = [
140 'xvfb-run',144 '/usr/bin/xvfb-run',
141 "--error-file=/var/tmp/xvfb-errors.log",145 "--error-file=/var/tmp/xvfb-errors.log",
142 "--server-args='-screen 0 1024x768x24'",146 "--server-args='-screen 0 1024x768x24'",
143 os.path.join(here, 'bin', 'test')] + sys.argv[1:]147 os.path.join(HERE, 'bin', 'test')] + sys.argv[1:]
144 command_line = ' '.join(cmd)148 command_line = ' '.join(cmd)
145 print command_line149 print "Running command:", command_line
146150
147 # Run the test suite and return the error code151 # Run the test suite. Make the suite the leader of a new process group
148 #return call(cmd)152 # so that we can signal the group without signaling ourselves.
149153 xvfb_proc = Popen(
150 proc = Popen(154 command_line,
151 command_line, stdin=PIPE, stdout=PIPE, stderr=STDOUT, shell=True)155 stdout=PIPE,
152 proc.stdin.close()156 stderr=STDOUT,
153157 preexec_fn=os.setpgrp,
154 # Do proc.communicate(), but timeout if there's no activity on stdout or158 shell=True)
155 # stderr for too long.159
156 open_readers = set([proc.stdout])160 # This code is very similar to what takes place in Popen._communicate(),
161 # but this code times out if there is no activity on STDOUT for too long.
162 # This keeps us from blocking when reading from a hung testrunner, allows
163 # us to time out if the child process hangs, and avoids issues when using
164 # Popen.communicate() with large data sets.
165 open_readers = set([xvfb_proc.stdout])
157 while open_readers:166 while open_readers:
158 rlist, wlist, xlist = select(open_readers, [], [], TIMEOUT)167 rlist, wlist, xlist = select(open_readers, [], [], TIMEOUT)
159168
160 if len(rlist) == 0:169 if len(rlist) == 0:
161 if proc.poll() is not None:170 # The select() statement timed out!
171
172 if xvfb_proc.poll() is not None:
173 # The process we were watching died.
162 break174 break
163 print ("\nA test appears to be hung. There has been no output for"175
164 " %d seconds. Sending SIGTERM." % TIMEOUT)176 cleanup_hung_testrunner(xvfb_proc)
165 killem(proc.pid, SIGTERM)
166 time.sleep(3)
167 if proc.poll() is not None:
168 print ("\nSIGTERM did not work. Sending SIGKILL.")
169 killem(proc.pid, SIGKILL)
170 # Drain the subprocess's stdout and stderr.
171 sys.stdout.write(proc.stdout.read())
172 break177 break
173178
174 if proc.stdout in rlist:179 if xvfb_proc.stdout in rlist:
175 chunk = os.read(proc.stdout.fileno(), 1024)180 # Read a chunk of output from STDOUT.
181 chunk = os.read(xvfb_proc.stdout.fileno(), 1024)
176 sys.stdout.write(chunk)182 sys.stdout.write(chunk)
177 if chunk == "":183 if chunk == "":
178 open_readers.remove(proc.stdout)184 # Gracefully exit the loop if STDOUT is empty.
179185 open_readers.remove(xvfb_proc.stdout)
180 rv = proc.wait()186
187 rv = xvfb_proc.wait()
188
181 if rv == 0:189 if rv == 0:
182 print '\nSuccessfully ran all tests.'190 print
191 print 'Successfully ran all tests.'
183 else:192 else:
184 print '\nTests failed (exit code %d)' % rv193 print
194 print 'Tests failed (exit code %d)' % rv
185195
186 return rv196 return rv
187197
188198
189def killem(pid, signal):199def cleanup_hung_testrunner(process):
190 """Kill the process group leader identified by pid and other group members200 """Kill and clean up the testrunner process and its children."""
191201 print
192 Note that bin/test sets its process to a process group leader.202 print
193 """203 print ("WARNING: A test appears to be hung. There has been no "
204 "output for %d seconds." % TIMEOUT)
205 print "Forcibly shutting down the test suite"
206
207 # This guarantees the process will die. In rare cases
208 # a child process may survive this if they are in a different
209 # process group and they ignore the signals we send their parent.
210 nice_killpg(process.pid)
211
212 # The process should absolutely be dead now.
213 assert process.poll() is not None
214
215 # Drain the subprocess's stdout and stderr.
216 print "The dying processes left behind the following output:"
217 print "--------------- BEGIN OUTPUT ---------------"
218 sys.stdout.write(process.stdout.read())
219 print
220 print "---------------- END OUTPUT ----------------"
221
222
223def nice_killpg(pgid):
224 """Kill a Unix process group using increasingly harmful signals."""
194 try:225 try:
195 os.killpg(os.getpgid(pid), signal)226 print "Process group %d will be killed" % pgid
196 except OSError, x:227
197 if x.errno == errno.ESRCH:228 # Attempt a series of increasingly brutal methods of killing the
229 # process.
230 for signum in [SIGTERM, SIGINT, SIGHUP, SIGKILL]:
231 print "Sending signal %s to process group %d" % (signum, pgid)
232 os.killpg(pgid, signum)
233
234 # Give the processes some time to shut down.
235 time.sleep(3)
236
237 except OSError, exc:
238 if exc.errno == errno.ESRCH:
239 # We tried to call os.killpg() and found the group to be empty.
198 pass240 pass
199 else:241 else:
200 raise242 raise
243 print "Process group %d is now empty." % pgid
244
201245
202if __name__ == '__main__':246if __name__ == '__main__':
203 sys.exit(main())247 sys.exit(main())