Launchpad itself

Reviewer	Date Requested	Status
Māris Fogels (community)		Needs Resubmitting on 2010-06-20
Michael Hudson-Doyle	2010-05-25	Approve on 2010-05-26
Review via email: mp+25981@code.launchpad.net

This proposal has been superseded by a proposal from 2010-07-28.

Description of the change

Hi,

This branch fixes test_on_merge.py so that it now terminates the test suite properly after 10 minutes. This means EC2 instances will no longer hang when they kill the test suite.

This branch changes how the testrunner watchdog handles the process groups of itself and its children. The diff explains why the process group fiddling takes place. In order for the testrunner watchdog to control the process groups I had to remove process group handling from bin/test.

I changed the code a fair bit by introducing many comments and some new functions to encapsulate the existing code. I found these changes necessary to understand what the code was doing, as I only had basic knowledge of Unix process handling before this change. This code should now be easier for others to understand.

I dropped the testrunner timeout to 10 minutes. No test should run that long without output.
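
As an illustration of the "no output for too long" rule, here is a minimal Python 3 sketch of an inactivity watchdog built on select(). It is not the code from this branch: the function name run_with_output_timeout and its return shape are made up for the example, and it kills only the direct child rather than a whole process group.

```python
import os
import select
import subprocess
import sys


def run_with_output_timeout(cmd, timeout):
    """Run cmd and give up if it produces no output for `timeout` seconds.

    Hypothetical helper. Returns (exit_code, timed_out, captured_output).
    """
    proc = subprocess.Popen(
        cmd, stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    fd = proc.stdout.fileno()
    chunks = []
    timed_out = False
    while True:
        readable, _, _ = select.select([fd], [], [], timeout)
        if not readable:
            # select() timed out: nothing was written, so assume a hang.
            timed_out = True
            proc.kill()
            break
        chunk = os.read(fd, 1024)
        if chunk == b"":
            # End of file: the child closed its output.
            break
        chunks.append(chunk)
    exit_code = proc.wait()
    proc.stdout.close()
    return exit_code, timed_out, b"".join(chunks)
```

With TIMEOUT = 60 * 10 this would be called as run_with_output_timeout(cmd, TIMEOUT); the timer restarts every time the child writes something.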

I cleaned up the print statements a bit; this is personal preference.

I added some detailed output to the testrunner kill procedure. This is failing loudly, with the hope that the script's behaviour will be less opaque when things go wrong in the code that handles things going wrong.

This branch was tested with a hand-crafted windmill testrunner harness. The harness forces an hour-long sleep() in one of the windmill tests. I used this harness to test both a local test suite timeout and an ec2 test suite timeout, and to ensure that they can kill off an entire process group.

I will complete a full "make check" run and a full "ec2 test" run with this branch before landing.

Pre-implementation call with: no one
Test command: see above

Maris

Michael Hudson-Doyle (mwhudson) wrote on 2010-05-26:

Gary asked me to have a look at this.

I think it broadly looks reasonable -- certainly in terms of readability it's way better than what went before -- and it's great that you've tested that it works. The proof is in the pudding here!

Just to be clear, the reason this stopped working is because there's now a "xvfb-run" process 'between' the test_on_merge.py script and the bin/test script?

I have a little apprehension about simply dying if we're the process group leader. For a start, it means you can't just run "./test_on_merge.py" from the shell! It also might not work on buildbot -- I think it actually will, unless we change our configuration to specify usePTY=True, but that seems rather fragile. Did you consider re-execing the process if we find we're the process group leader? I've pushed an implementation of this idea to lp:~mwhudson/launchpad/fix-test_on_merge-578886 and it seems to work.
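
To make the re-exec idea concrete, here is one possible reading of it as a Python 3 sketch. It is an assumption, not the implementation pushed to the branch named above; is_process_group_leader and run_as_non_leader are hypothetical names. It relies on the fact that a child started without any process group changes inherits its parent's group, so its own PID cannot equal its group ID.

```python
import os
import subprocess
import sys


def is_process_group_leader():
    """Return True when this process's PID equals its process group ID."""
    return os.getpgrp() == os.getpid()


def run_as_non_leader(argv):
    """If we lead our process group, run argv in a child instead.

    Hypothetical helper. Returns the child's exit code, or None when we
    are not a group leader and the caller should simply carry on.
    """
    if not is_process_group_leader():
        return None
    # The child inherits our group, so it is not a leader itself.
    return subprocess.call(argv)
```

A script could call run_as_non_leader([sys.executable] + sys.argv) at startup and pass any non-None result to sys.exit().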

If you merge my change in or do something similar, I think this is good to land.

review: Approve

Māris Fogels (mars) wrote on 2010-05-26:

As discussed on IRC, I added in a call to os.fork() that will get us a fresh PID to use as the process group leader. This enables the group-swapping trick to work when the script is run directly.

To make this new bit of complexity clearer I ended up splitting the major actions of main() out into their own functions. This revealed yet more fixes that were hidden beneath the main() beast's bulk:

* main()'s docstring was dead wrong. I fixed it.
* The 'here' variable was being reused and redefined throughout the function. I made it global.
* The tabnanny code is just plain wrong. It catches tab problems, but misses the rest of the errors that the script may generate. I marked this with an XXX.
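
For the tabnanny point, a hedged Python 3 sketch of the subprocess-based alternative is shown here. It is an assumption about what the fix could look like rather than code from the branch; this run_tabnanny takes a path argument, and it treats any output at all as a failure because tabnanny reports problems by printing them.

```python
import subprocess
import sys


def run_tabnanny(path):
    """Run tabnanny on path in a subprocess.

    Hypothetical variant of run_tabnanny(). Returns (status, output),
    where status is 0 for a clean tree and 1 otherwise, and output is the
    combined stdout and stderr of the tabnanny process.
    """
    proc = subprocess.run(
        [sys.executable, "-m", "tabnanny", path],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    # Complaints may arrive on either stream, so both are captured together.
    failed = proc.returncode != 0 or proc.stdout.strip() != ""
    return (1 if failed else 0), proc.stdout
```

Merging stderr into stdout means that problems such as an unreadable file are no longer missed by the check.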

The code also makes heavy use of function return codes, which are ugly but effective. I left them.

Everything is being re-tested before submission.

Michael Hudson-Doyle (mwhudson) wrote on 2010-06-01:

+    pid = os.fork()
+    if pid != 0:
+        # We are the parent process, so we'll wait for our child process to
+        # do the heavy lifting for us.
+        pid, status = os.wait()

Why call wait() and not waitpid() here? Alternatively, is it worth checking the pids match? Clearly they _should_ match, but if they didn't it could be very confusing :-)
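
A small Python 3 sketch of the waitpid() variant, for comparison. The helper name run_in_forked_child is hypothetical and the function is a simplified stand-in for run_test_process(); the point is only that os.waitpid(pid, 0) waits for one specific child, so the PID check becomes a cheap sanity assertion.

```python
import os


def run_in_forked_child(func):
    """Fork, run func() in the child, and return the child's exit code.

    Hypothetical helper; func must return an integer exit code.
    """
    pid = os.fork()
    if pid == 0:
        # Child: do the work, then exit without returning to the caller.
        exit_code = 1
        try:
            exit_code = func()
        finally:
            os._exit(exit_code)
    # Parent: wait for this child only, not for whichever child exits first.
    waited_pid, status = os.waitpid(pid, 0)
    assert waited_pid == pid
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    raise RuntimeError(
        "child was killed by signal %d" % os.WTERMSIG(status))
```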

I like the new approach overall btw.

Māris Fogels (mars) wrote on 2010-06-20:

I have had to re-examine a number of assumptions about this code, such as the need to aggressively kill the entire process tree, and the use of os.fork() instead of using Popen's preexec_fn argument.
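
For reference, the preexec_fn alternative mentioned here could look roughly like this in Python 3. This is a sketch under assumptions, not code from any revision of the branch; start_in_new_group and kill_group are hypothetical names.

```python
import os
import signal
import subprocess
import sys


def start_in_new_group(cmd):
    """Start cmd as the leader of a brand-new process group.

    preexec_fn runs in the child between fork and exec, so only the child
    changes groups and the parent needs no fork() of its own.
    """
    return subprocess.Popen(cmd, preexec_fn=os.setpgrp)


def kill_group(proc, sig=signal.SIGTERM):
    """Signal every process in proc's group, then reap proc itself."""
    os.killpg(os.getpgid(proc.pid), sig)
    return proc.wait()
```

Because the parent never joins the child's group, os.killpg() cannot hit the watchdog. Popen also accepts start_new_session=True, which calls setsid() rather than setpgrp() and so creates a new session as well as a new group.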

I'm withdrawing this proposal until the code can be revisited.

review: Needs Resubmitting

Preview Diff

 === modified file 'buildout-templates/bin/test.in'
 --- buildout-templates/bin/test.in	2010-05-15 17:43:59 +0000
 +++ buildout-templates/bin/test.in	2010-05-31 14:28:31 +0000
@@ -29,15 +29,6 @@
  BUILD_DIR = ${buildout:directory|path-repr}
  CUSTOM_SITE_DIR = ${scripts:parts-directory|path-repr}
--if os.getsid(0) == os.getsid(os.getppid()):
--    # We need to become the process group leader so test_on_merge.py
--    # can reap its children.
--    #
--    # Note that if setpgrp() is used to move a process from one
--    # process group to another (as is done by some shells when
--    # creating pipelines), then both process groups must be part of
--    # the same session.
--    os.setpgrp()
  # Make tests run in a timezone no launchpad developers live in.
  # Our tests need to run in any timezone.
 === modified file 'test_on_merge.py'
 --- test_on_merge.py	2010-04-22 17:30:35 +0000
 +++ test_on_merge.py	2010-05-31 14:28:31 +0000
@@ -12,34 +12,52 @@
  from StringIO import StringIO
  import psycopg2
  from subprocess import Popen, PIPE, STDOUT
--from signal import SIGKILL, SIGTERM
++from signal import SIGKILL, SIGTERM, SIGINT, SIGHUP
  from select import select
++
  # The TIMEOUT setting (expressed in seconds) affects how long a test will run
  # before it is deemed to be hung, and then appropriately terminated.
  # It's principal use is preventing a PQM job from hanging indefinitely and
  # backing up the queue.
--# e.g. Usage: TIMEOUT = 60 * 15
--# This will set the timeout to 15 minutes.
--TIMEOUT = 60 * 15
++# e.g. Usage: TIMEOUT = 60 * 10
++# This will set the timeout to 10 minutes.
++TIMEOUT = 60 * 10
++
++HERE = os.path.dirname(os.path.realpath(__file__))
++
  def main():
      """Call bin/test with whatever arguments this script was run with.
--    If the tests ran ok (last line of stderr is 'OK<return>') then suppress
--    output and exit(0).
--
--    Otherwise, print output and exit(1).
--    """
--    here = os.path.dirname(os.path.realpath(__file__))
--
--    # Tabnanny
--    # NB. If tabnanny raises an exception, run
--    # python /usr/lib/python2.5/tabnanny.py -vv lib/canonical
--    # for more detailed output.
++    Prior to running the tests this script checks the project files with
++    Python2.5's tabnanny and sets up the test database.
++
++    Returns 1 on error, otherwise it returns the testrunner's exit code.
++    """
++    if run_tabnanny() != 0:
++        return 1
++
++    if setup_test_database() != 0:
++        return 1
++
++    return run_test_process()
++
++
++def run_tabnanny():
++    """Run the tabnanny, return its exit code.
++
++    If tabnanny raises an exception, run "python /usr/lib/python2.5/tabnanny.py
++    -vv lib/canonical" for more detailed output.
++    """
++    # XXX mars 2010-05-26
++    # Tabnanny reports some of its errors on sys.stderr, so this code is
++    # already wrong.  subprocess.Popen.communicate() would work better.
++    print "Checking the source tree with tabnanny..."
      org_stdout = sys.stdout
      sys.stdout = StringIO()
--    tabnanny.check(os.path.join(here, 'lib', 'canonical'))
++    tabnanny.check(os.path.join(HERE, 'lib', 'canonical'))
++    tabnanny.check(os.path.join(HERE, 'lib', 'lp'))
      tabnanny_results = sys.stdout.getvalue()
      sys.stdout = org_stdout
      if len(tabnanny_results) > 0:
@@ -47,7 +65,16 @@
          print tabnanny_results
          print '---- end tabnanny bitching ----'
          return 1
--
++    else:
++        print "Done"
++        return 0
++
++
++def setup_test_database():
++    """Set up a test instance of our postgresql database.
++
++    Returns 0 for success, 1 for errors.
++    """
      # Sanity check PostgreSQL version. No point in trying to create a test
      # database when PostgreSQL is too old.
      con = psycopg2.connect('dbname=template1')
@@ -91,8 +118,7 @@
      con.close()
      # Build the template database. Tests duplicate this.
--    here = os.path.dirname(os.path.realpath(__file__))
--    schema_dir = os.path.join(here, 'database', 'schema')
++    schema_dir = os.path.join(HERE, 'database', 'schema')
      if os.system('cd %s; make test > /dev/null' % (schema_dir)) != 0:
          print 'Failed to create database or load sampledata.'
          return 1
@@ -134,70 +160,164 @@
      con.close()
      del con
++    return 0
++
++
++def run_test_process():
++    """Start the testrunner process and return its exit code."""
++    # Fork a child process so that we get a new process ID that we can
++    # guarantee is not currently in use as a process group leader. This
++    # addresses the case where this script has been started directly in the
++    # shell using "python foo.py" or "./foo.py".
++    pid = os.fork()
++    if pid != 0:
++        # We are the parent process, so we'll wait for our child process to
++        # do the heavy lifting for us.
++        pid, status = os.wait()
++
++        if os.WIFEXITED(status):
++            return os.WEXITSTATUS(status)
++        else:
++            # We should not reach this code unless something segfaulted in
++            # our child process, or it received a signal from some outside
++            # force.
++            raise RuntimeError(
++                "Oops!  The test watchdog was killed by signal %s" % (
++                    os.WTERMSIG(status)))
++
      print 'Running tests.'
--    os.chdir(here)
++    os.chdir(HERE)
++
++    # Play shenanigans with our process group. We want to kill off our child
++    # groups while at the same time not slaughtering ourselves!
++    original_process_group = os.getpgid(0)
++
++    # Make sure we are not already the process group leader.  Otherwise this
++    # trick won't work.
++    assert original_process_group != os.getpid()
++
++    # Change our process group to match our PID, as per POSIX convention.
++    os.setpgrp()
++
++    # We run the test suite under a virtual frame buffer server so that the
++    # JavaScript integration test suite can run.
      cmd = [
          'xvfb-run',
          '-s',
          "'-screen 0 1024x768x24'",
--        os.path.join(here, 'bin', 'test')] + sys.argv[1:]
++        os.path.join(HERE, 'bin', 'test')] + sys.argv[1:]
++
      command_line = ' '.join(cmd)
--    print command_line
++    print "Running command:", command_line
      # Run the test suite and return the error code
--    #return call(cmd)
--
--    proc = Popen(
++    xvfb_proc = Popen(
          command_line, stdin=PIPE, stdout=PIPE, stderr=STDOUT, shell=True)
--    proc.stdin.close()
--
--    # Do proc.communicate(), but timeout if there's no activity on stdout or
--    # stderr for too long.
--    open_readers = set([proc.stdout])
++    xvfb_proc.stdin.close()
++
++    # Restore our original process group, thus removing ourselves from
++    # os.killpg's target list.  Our child process and its children will retain
++    # the process group number matching our PID.
++    os.setpgid(0, original_process_group)
++
++    # This code is very similar to what takes place in Popen._communicate(),
++    # but this code times out if there is no activity on STDOUT for too long.
++    open_readers = set([xvfb_proc.stdout])
      while open_readers:
          rlist, wlist, xlist = select(open_readers, [], [], TIMEOUT)
          if len(rlist) == 0:
--            if proc.poll() is not None:
++            # The select() statement timed out!
++
++            if xvfb_proc.poll() is not None:
++                # The process we were watching died.
                  break
--            print ("\nA test appears to be hung. There has been no output for"
--                " %d seconds. Sending SIGTERM." % TIMEOUT)
--            killem(proc.pid, SIGTERM)
--            time.sleep(3)
--            if proc.poll() is not None:
--                print ("\nSIGTERM did not work. Sending SIGKILL.")
--                killem(proc.pid, SIGKILL)
--            # Drain the subprocess's stdout and stderr.
--            sys.stdout.write(proc.stdout.read())
++
++            cleanup_hung_testrunner(xvfb_proc)
              break
--        if proc.stdout in rlist:
--            chunk = os.read(proc.stdout.fileno(), 1024)
++        if xvfb_proc.stdout in rlist:
++            # Read a chunk of output from STDOUT.
++            chunk = os.read(xvfb_proc.stdout.fileno(), 1024)
              sys.stdout.write(chunk)
              if chunk == "":
--                open_readers.remove(proc.stdout)
--
--    rv = proc.wait()
++                # Gracefully exit the loop if STDOUT is empty.
++                open_readers.remove(xvfb_proc.stdout)
++
++    rv = xvfb_proc.wait()
++
      if rv == 0:
--        print '\nSuccessfully ran all tests.'
++        print
++        print 'Successfully ran all tests.'
      else:
--        print '\nTests failed (exit code %d)' % rv
++        print
++        print 'Tests failed (exit code %d)' % rv
      return rv
--def killem(pid, signal):
--    """Kill the process group leader identified by pid and other group members
--
--    Note that bin/test sets its process to a process group leader.
--    """
++def cleanup_hung_testrunner(process):
++    """Kill and clean up the testrunner process and its children."""
++    print
++    print
++    print ("WARNING: A test appears to be hung. There has been no "
++        "output for %d seconds." % TIMEOUT)
++    print "Forcibly shutting down the test suite"
++
++    # This guarantees the process' group will die.  In rare cases
++    # a child process may survive this if they are in a different
++    # process group and they ignore the signals we send their parent.
++    nice_killpg(process)
++
++    # Drain the subprocess's stdout and stderr.
++    print "The dying processes left behind the following output:"
++    print "--------------- BEGIN OUTPUT ---------------"
++    sys.stdout.write(process.stdout.read())
++    print
++    print "---------------- END OUTPUT ----------------"
++
++
++def nice_killpg(process):
++    """Kill a Unix process group using increasingly harmful signals."""
++    pgid = os.getpgid(process.pid)
++
      try:
--        os.killpg(os.getpgid(pid), signal)
--    except OSError, x:
--        if x.errno == errno.ESRCH:
++        print "Process group %d will be killed" % pgid
++
++        # Attempt a series of increasingly brutal methods of killing the
++        # process.
++        for signum in [SIGTERM, SIGINT, SIGHUP, SIGKILL]:
++            print "Sending signal %s to process group %d" % (signum, pgid)
++            os.killpg(pgid, signum)
++
++            # Give the processes some time to shut down.
++            time.sleep(3)
++
++            # Poll our original child process so that the Popen object can
++            # capture the process' exit code. If we do not do this now it
++            # will be lost by the following call to os.waitpid(). Note that
++            # this also reaps every process in the process group!
++            process.poll()
++
++            # This call will raise ESRCH if the group is empty, or ECHILD if
++            # the group has already been reaped. The exception will exit the
++            # loop for us.
++            os.waitpid(-pgid, os.WNOHANG)   # Check for survivors.
++
++            print "Some processes ignored our signal!"
++
++    except OSError, exc:
++        if exc.errno == errno.ESRCH:
++            # We tried to call os.killpg() and found the group to be empty.
++            pass
++        elif exc.errno == errno.ECHILD:
++            # We tried to poll the process group with os.waitpid() and found
++            # it was empty.
              pass
          else:
              raise
++    print "Process group %d is now empty." % pgid
++
  if __name__ == '__main__':
      sys.exit(main())
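
The escalating-signal loop in nice_killpg() can also be sketched in Python 3 as below. This is a simplified illustration, not a port of the diff: it assumes the child was started as the leader of its own process group (so its PID is the group ID), it probes the group with signal 0 instead of calling os.waitpid(-pgid, ...), and group_is_alive is a hypothetical helper.

```python
import os
import signal
import time


def group_is_alive(pgid):
    """Return True if at least one process remains in the group."""
    try:
        os.killpg(pgid, 0)  # signal 0 only checks that the group exists
    except ProcessLookupError:
        return False
    return True


def nice_killpg(proc, grace=3.0):
    """Signal proc's group with increasingly harsh signals.

    Assumes proc.pid is also the process group ID. Returns the last
    signal sent, or None if the group was already empty.
    """
    pgid = proc.pid
    last_sent = None
    for signum in (signal.SIGTERM, signal.SIGINT,
                   signal.SIGHUP, signal.SIGKILL):
        try:
            os.killpg(pgid, signum)
        except ProcessLookupError:
            break  # nothing left to signal
        last_sent = signum
        deadline = time.monotonic() + grace
        while time.monotonic() < deadline:
            proc.poll()  # reap the leader and record its exit code
            if not group_is_alive(pgid):
                return last_sent
            time.sleep(0.05)
    return last_sent
```

Polling during the grace period lets the loop stop as soon as the group is empty instead of always sleeping for the full interval.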

Merge lp:~mars/launchpad/fix-test_on_merge-578886 into lp:launchpad
