For starters, thanks for the thorough review. It is really nice to get
feedback on it, and you have certainly thought a lot about the code.

On 10/6/2010 12:02 AM, Michael Hudson-Doyle wrote:
> Review: Needs Information
> Wow, what a branch. I've been quite picky in this; that's not because
> I hate you but because it's subtle code and has potential to cause
> issues, so I wanted to make sure all the comments make sense and
> there's no redundant code, etc.
>
> I worry slightly about what would happen if the forking service fell
> over -- but clearly, we rely on the sftp server itself staying up, so
> this probably isn't such a big deal. I guess we also need to make
> sure the forking service is running in production.

We could trap failures to connect to the forking server and fall back to
regular spawnProcess. The code is already a simple if/else block.
However, I wouldn't really want the service to degrade silently. If we
want the fallback, it could probably be done with:

  transport = None
  if config.codehosting.use_forking_server:
      try:
          transport = ...
      except ???:
          pass  # We failed, we'll try again
  if transport is None:
      transport = reactor.spawnProcess(...)

> All that said, I have no real problems with the code and look forward
> to getting my branches onto and off of Launchpad that bit faster!
>
> I'm marking this needs information because I want to see a response to
> my comments. But I don't think any of them are deep.
>
>> === modified file 'Makefile'
>> --- Makefile    2010-09-07 18:15:01 +0000
>> +++ Makefile    2010-10-06 00:19:31 +0000
>> @@ -253,7 +253,7 @@
>>
>>  run_codehosting: check_schema inplace stop
>>      $(RM) thread*.request
>> -    bin/run -r librarian,sftp,codebrowse -i $(LPCONFIG)
>> +    bin/run -r librarian,sftp,forker,codebrowse -i $(LPCONFIG)
>>
>>
>>  start_librarian: compile
>
> You should add the forker service to the run_all target too.

Done.
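Coming back to the fall-back idea above, here is a slightly fuller
sketch of the control flow I mean (the helper names `connect_to_forker`
and `spawn_process` are invented stand-ins for the real reactor calls;
the point is only the if/else shape and what we would catch):

```python
import socket


def get_transport(connect_to_forker, spawn_process, use_forking_server):
    """Try the forking service first; fall back to spawnProcess.

    connect_to_forker and spawn_process are stand-ins for the real
    calls; only the control flow is the point here.
    """
    transport = None
    if use_forking_server:
        try:
            transport = connect_to_forker()
        except socket.error:
            # Connection refused, daemon not running, etc. We fall
            # back silently -- which is exactly the degradation I am
            # uneasy about.
            transport = None
    if transport is None:
        transport = spawn_process()
    return transport
```

So the question is really whether `except socket.error` there is the
right net to cast, and whether we want the silent fallback at all.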
>> === added directory 'bzrplugins/lpserve'
>> === renamed file 'bzrplugins/lpserve.py' => 'bzrplugins/lpserve/__init__.py'
>> --- bzrplugins/lpserve.py    2010-04-19 06:35:23 +0000
>> +++ bzrplugins/lpserve/__init__.py    2010-10-06 00:19:31 +0000
>> @@ -8,15 +8,33 @@
>>
>>  __metaclass__ = type
>>
>> -__all__ = ['cmd_launchpad_server']
>> -
>> -
>> +__all__ = ['cmd_launchpad_server',
>> +           'cmd_launchpad_forking_service',
>> +           ]
>
> This should be formatted like this:
>
> __all__ = [
>     'cmd_launchpad_server',
>     'cmd_launchpad_forking_service',
>     ]

check

...

>> +class LPForkingService(object):
>> +    """A service that can be asked to start a new bzr subprocess via fork.
>> +
>> +    The basic idea is that python startup is very expensive. For example,
>> +    the original 'lp-serve' command could take 2.5s just to start up,
>> +    before any actual actions could be performed.
>
> Well, it's not really Python startup, it's more the importing that's
> slow. Maybe phrase this as "starting Python and importing Launchpad
> is very...".
>
>> +    This class provides a service sitting on a socket, which can then be
>> +    requested to fork and run a given bzr command.
>> +
>> +    Clients connect to the socket and make a simple request, which then
>> +    receives a response. The possible requests are:
>> +
>> +        "hello\n": Trigger a heartbeat to report that the program is still
>> +                   running, and write status information to the log file.
>> +        "quit\n": Stop the service, but do so 'nicely', waiting for
>> +                  children to exit, etc. Once this is received the service
>> +                  will stop taking new requests on the port.
>> +        "fork <command>\n": Request a new subprocess to be started.
>> +            <command> is the bzr command to be run, such as "rocks" or
>> +            "lp-serve --inet 12".
>> +            The immediate response will be the path-on-disk to a directory
>> +            full of named pipes (fifos) that will be the
>> +            stdout/stderr/stdin of the new process.
>
> This doesn't quite make it clear what the names of the files will be.
> The obvious guess is correct, but I'd rather be certain. > >> + If a client holds the socket open, when the child process exits, >> + the exit status (as given by 'wait()') will be written to the >> + socket. > > This appears to be out of date -- there's also a fork-env command now. > Is the fork command still useful? It has been useful in testing, it isn't used in the 'production' code. ... > >> + DEFAULT_PATH = '/var/run/launchpad_forking_service.sock' > > I'm not sure of the value of providing a default here, as it seems > that its always in practice going to be overridden by the config. If > it makes testing easier, it's worth it I guess. Well, we override it for testing as well. I wanted to support "bzr lp-forking-service" as something you could run. Also, that seemed to be the 'best' place to put it, which gives hints as to where you would want to put it. > >> + DEFAULT_PERMISSIONS = 00660 # Permissions on the master socket (rw-rw----) >> + WAIT_FOR_CHILDREN_TIMEOUT = 5*60 # Wait no more than 5 min for children >> + SOCKET_TIMEOUT = 1.0 >> + SLEEP_FOR_CHILDREN_TIMEOUT = 1.0 >> + WAIT_FOR_REQUEST_TIMEOUT = 1.0 # No request should take longer than this to >> + # be read >> + >> + _fork_function = os.fork >> + >> + def __init__(self, path=DEFAULT_PATH, perms=DEFAULT_PERMISSIONS): >> + self.master_socket_path = path >> + self._perms = perms >> + self._start_time = time.time() > > It's pedantry in the extreme, but does it make more sense to set this > in main_loop than here? > >> + self._should_terminate = threading.Event() > > The use of a threading primitive is a bit surprising as this process > is single threaded. Is this to make testing easier? We do run it in another thread in testing. And it seemed more deterministic to trigger it via an event, rather than setting the attribute. It is probably a bit of overkill, but I don't think it is wrong, so I kept it. 
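As an aside on the protocol documented above: the client side of the
fork exchange really is tiny. A hypothetical helper to pull the pid and
fifo paths out of the daemon's reply might look like this (the wire
format is as in the docstring; the helper itself is invented, not code
from the branch):

```python
import os


def parse_fork_response(response):
    """Split the daemon's reply, three newline-terminated fields:
    'ok', the child pid, and the directory holding the fifos.

    The fifo names are the obvious ones: 'stdin', 'stdout' and
    'stderr' inside the returned directory.
    """
    ok, pid, path = response.rstrip('\n').split('\n')
    if ok != 'ok':
        raise ValueError('fork failed: %r' % (response,))
    fifos = dict((name, os.path.join(path, name))
                 for name in ('stdin', 'stdout', 'stderr'))
    return int(pid), fifos
```

Which also answers the "what are the files called" question in code
form; I'll spell the names out in the docstring too.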
> >> + # We address these locally, in case of shutdown socket may be gc'd >> + # before we are >> + self._socket_timeout = socket.timeout >> + self._socket_error = socket.error > > This must have been fun to figure out :/ Copied from other bzr service code. But yes, I'm sure it was a pain there. > >> + self._socket_timeout = socket.timeout >> + self._socket_error = socket.error > > You don't need to do it twice though :-) > >> + # Map from pid => information >> + self._child_processes = {} > > I'd love to know more than just 'information' here. Done. > >> + self._children_spawned = 0 >> + >> + def _create_master_socket(self): >> + self._server_socket = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) >> + self._server_socket.bind(self.master_socket_path) >> + if self._perms is not None: >> + os.chmod(self.master_socket_path, self._perms) > > The pedant in me thinks this is a bit racy. But I don't think it's > important (in fact, I wonder if the permission stuff is necessary > really). I'm sure it isn't necessary in the "what do we need in Launchpad" sense. It seemed reasonable from a "what do you want from a service that allows arbitrary commands to be run from data written to its socket". > >> + self._server_socket.setsockopt(socket.SOL_SOCKET, >> + socket.SO_REUSEADDR, 1) > > Does this work with unix domain sockets? I didn't think it did, and > your comments in _cleanup_master_socket lend weight to my position :-) > > Another option would be to use a socket in the 'abstract namespace', > see man 7 unix, although that might be a bit magical. I think it is carry over from a) When this was a port (AF_INET) b) trying to figure out why I couldn't start a new service. It does seem that you have to delete the socket on disk, or you can't bind to that address. So we can just get rid of it. >> + self._sockname = self._server_socket.getsockname() >> + # self.host = self._sockname[0] >> + self.port = self._sockname[1] > > Is there any point to these three lines? 
It isn't used, and I got rid of them (as above). It certainly was used when it was an actual random port during testing. ... >> + def _register_signals(self): >> + """Register a SIGCHILD and SIGTERM handler. >> + >> + If we have a trigger for SIGCHILD then we can quickly respond to >> + clients when their process exits. The main risk is getting more EAGAIN >> + errors elsewhere. >> + >> + SIGTERM allows us to cleanup nicely before we exit. >> + """ >> + signal.signal(signal.SIGCHLD, self._handle_sigchld) >> + signal.signal(signal.SIGTERM, self._handle_sigterm) > > When you run 'make run_codehosting' and C-c it, do you know which > signal gets sent? I don't, I have to admit. I guess SIGINT will work > fine by default... Killing the master process is done via SIGINT, it has an atexit hook to fire SIGTERM into the running processes. I believe it fires SIGTERM. > >> + def _unregister_signals(self): >> + signal.signal(signal.SIGCHLD, signal.SIG_DFL) >> + signal.signal(signal.SIGTERM, signal.SIG_DFL) >> + >> + def _create_child_file_descriptors(self, base_path): >> + stdin_path = os.path.join(base_path, 'stdin') >> + stdout_path = os.path.join(base_path, 'stdout') >> + stderr_path = os.path.join(base_path, 'stderr') >> + os.mkfifo(stdin_path) >> + os.mkfifo(stdout_path) >> + os.mkfifo(stderr_path) >> + >> + def _bind_child_file_descriptors(self, base_path): >> + import logging >> + from bzrlib import ui > > Any reason to do these imports at function level? If there is, can > you explain it in a comment? Moved to global, ui was already there. > >> + stdin_path = os.path.join(base_path, 'stdin') >> + stdout_path = os.path.join(base_path, 'stdout') >> + stderr_path = os.path.join(base_path, 'stderr') >> + # Opening for writing blocks (or fails), so do those last > > man 3 mkfifo says that opening for both reading and writing blocks. I > guess this means it's very important that the client and child process > open the fifos in the same order! 
Well, you can open(O_NBLOCK) for reading. if you do that for writing, it raises an exception if something isn't on the other end. However, it ends up having the file handle in non-blocking mode for the rest of the time, so future bzr+ssh reads fail with "EAGAIN" (would have blocked). I can clarify. ... >> + def _close_child_file_descriptons(self): >> + sys.stdin.close() >> + sys.stderr.close() >> + sys.stdout.close() > > This is surely meant to be called _close_child_file_descriptoRs ? > well, not capitalized :). ... >> + # We force os._exit() here, because we don't want to unwind the stack, >> + # which has complex results. (We can get it to unwind back to the >> + # cmd_launchpad_forking_service code, and even back to main() reporting >> + # thereturn code, but after that, suddenly the return code changes from >> + # a '0' to a '1', with no logging of info. >> + # TODO: Should we call sys.exitfunc() here? it allows atexit functions >> + # to fire, however, some of those may be still around from the >> + # parent process, which we don't really want. > > Does bzr install atexit functions? I know the test suite does. I have very little clue as to what the launchpad oops code does. And the python logging module uses atexit functionality as well. ... >> + def fork_one_request(self, conn, client_addr, command_argv, env): >> + """Fork myself and serve a request.""" >> + temp_name = tempfile.mkdtemp(prefix='lp-forking-service-child-') >> + # Now that we've set everything up, send the response to the client we >> + # create them first, so the client can start trying to connect to them, >> + # while we fork and have the child do the same. 
>> +        self._children_spawned += 1
>> +        pid = self._fork_function()
>> +        if pid == 0:
>> +            pid = os.getpid()
>> +            trace.mutter('%d spawned' % (pid,))
>> +            self._server_socket.close()
>> +            for env_var, value in env.iteritems():
>> +                osutils.set_or_unset_env(env_var, value)
>
> It's basically by chance, but I wanted to mention that this provides a
> place to do something I've wanted to do for a long time: have bzr
> lp-serve processes have BZR_LOG set to something based on their pid.
> Not in this branch though :-)

It is slightly more involved than that, since you have to call
"trace._push_log_filename()", etc., and not just set BZR_LOG. Also, in
discussing with Martin, it is probably better to just have mutter()
include the os.getpid() information. Consider Apache logs, where
everything is put into a single file, but you can still easily filter it
back out again into individual requests.

>
>> +            # See [Decision #3]
>> +            self._create_child_file_descriptors(temp_name)
>> +            conn.sendall('ok\n%d\n%s\n' % (pid, temp_name))
>> +            conn.close()
>> +            self.become_child(command_argv, temp_name)
>> +            trace.warning('become_child returned!!!')
>> +            sys.exit(1)
>
> Would os._exit() not be more appropriate here?

If we get here, something bad has happened (we didn't hit the earlier
os._exit() state). I don't have any strong hints, but sys.exit() means
we can get a traceback.

...

>> +                except self._socket_error, e:
>> +                    if e.args[0] == errno.EINTR:
>> +                        pass # run shutdown and children checks
>> +                    elif e.args[0] != errno.EBADF:
>> +                        # We can get EBADF here while we are shutting down
>> +                        # So we just ignore it for now
>> +                        pass
>> +                    else:
>> +                        # Log any other failure mode
>> +                        trace.warning("listening socket error: %s", e)
>> +                else:
>> +                    self.log(client_addr, 'connected')
>> +                    # TODO: We should probably trap exceptions coming out
>> +                    #       of this and log them, so that we don't kill
>> +                    #       the service because of an unhandled error
>
> Er. Yes.
I'm not 100% sure that having a service just start logging "unhandled
exception" is going to get attention. If you are sure that there will be
a way for this information to become visible, then I'm more willing to
suppress it at this point. hmm... maybe having "hello" report back if
there have been unhandled errors?

...

>> +    def _wait_for_children(self, secs):
>> +        start = time.time()
>> +        end = start + secs
>> +        while self._child_processes:
>> +            self._poll_children()
>> +            if secs > 0 and time.time() > end:
>> +                break
>> +            time.sleep(self.SLEEP_FOR_CHILDREN_TIMEOUT)
>
> I find the practice of calling this function with an argument of
> self.SLEEP_FOR_CHILDREN_TIMEOUT a little strange. Isn't that just the
> same as calling self._poll_children()?

The important calls use self.WAIT_FOR_CHILDREN_TIMEOUT. I can see your
point, though. Though actually it isn't the same. If you read the logic,
the "break" comes *after* _poll_children(). So the code does:

  _poll_children()
  sleep(SLEEP_FOR_CHILDREN_TIMEOUT)
  _poll_children()
  break

>> +    def _shutdown_children(self):
>> +        self._wait_for_children(self.WAIT_FOR_CHILDREN_TIMEOUT)
>> +        if self._child_processes:
>> +            trace.warning('Failed to stop children: %s'
>> +                          % ', '.join(map(str, self._child_processes)))
>
> This comment seems a bit wrong -- there's not actually been an attempt
> to close the children down yet has there? All we did was wait, and
> they haven't exited.
>
> A nice feature of the general architecture is that it would be quite
> easy to disconnect listening for more connections from exiting the
> daemon, something that will help one day with reducing the downtime
> associated with an update. But that can be sorted out later.

Changed it to "Children still running: %s". I agree on both points. It
would also potentially be useful to change the other setup code to check
more frequently for whether "use_forking_daemon = True" is set. Which
would mean you could disable it without restarting the daemon.
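(As an aside, to make the _wait_for_children ordering above easy to see,
and test, here is the loop shape with the time source injected; the
parameter names are mine, standing in for _poll_children(),
time.sleep(SLEEP_FOR_CHILDREN_TIMEOUT), time.time() and the
self._child_processes check:)

```python
def wait_for_children(poll, sleep, now, secs, has_children):
    """Poll children until none remain or the deadline passes.

    Note the order: poll first, then check the deadline, then sleep.
    So even once the deadline is past we still get one final poll
    before breaking out. Returns how many polls happened, so the
    ordering is observable.
    """
    end = now() + secs
    polls = 0
    while has_children():
        poll()
        polls += 1
        # The deadline check comes *after* the poll, which is why
        # this is not just equivalent to a single poll() call.
        if secs > 0 and now() > end:
            break
        sleep()
    return polls
```

With a half-second deadline and a one-second fake sleep, you get exactly
two polls: one before the sleep and one after the deadline has passed.)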
Also, if you used SIGKILL on the daemon, it won't try to kill its children (I think, unless SIGKILL gets inherited?). But yes a "stop without killing children" would be useful. Or a "stop accepting new connections", bring up the new daemon which will accept new connections, and the other slowly exits. Actually, I think it would do that now if you SIGINT/SIGTERM this process, and start another one right away. With the main caveat that this process needs to delete the socket before the children are gone, so the new one can create it. ... >> + def _parse_fork_request(self, conn, client_addr, request): >> + if request.startswith('fork-env '): >> + while not request.endswith('end\n'): >> + request += osutils.read_bytes_from_socket(conn) > > Hang on. You don't have any guarantee here that > read_bytes_from_socket won't return 'end\n' and then the start of the > next request do you? Hmm, I guess you do really, because 'end\n' is > always the last thing the client sends. But that does seem a little > delicate -- it should at least be documented. Every connection gets a single request, and doesn't have much more to say. If we want to say more, then we can, but that will just be an evolution of the interface. I changed the docstring from: Clients connect to the socket and make a simple request, which then ... to be Clients connect to the socket and make a single request, which then ... > >> + request = request.replace('\r\n', '\n') >> + command, env = request[9:].split('\n', 1) >> + else: >> + command = request[5:].strip() >> + env = 'end\n' # No env set >> + try: >> + command_argv = self.command_to_argv(command) >> + env = self.parse_env(env) >> + except Exception, e: >> + # TODO: Log the traceback? 
>> + self.log(client_addr, 'command or env parsing failed: %r' >> + % (str(e),)) >> + conn.sendall('FAILURE\ncommand or env parsing failed: %r' >> + % (str(e),)) >> + else: >> + return command_argv, env >> + return None, None > > I think it would be clearer if the "return None, None" was in the > except clause (which *is* equivalent, I think?). So it is equivalent, but because falling off the end of a function == "return None" I prefer to be explicit that all return paths return a tuple. I can put a redundant "return None, None" if you think that would be more obvious. > >> + def serve_one_connection(self, conn, client_addr): >> + request = '' >> + while '\n' not in request: >> + request += osutils.read_bytes_from_socket(conn) >> + # telnet likes to use '\r\n' rather than '\n', and it is nice to have >> + # an easy way to debug. >> + request = request.replace('\r\n', '\n') > > This must be a relic of listening to an INET port? Can it be deleted now? It can, though being generous in what you accept does make it easier to play with the system. You seem to feel strongly, so I've removed it and the associated tests. > >> + self.log(client_addr, 'request: %r' % (request,)) >> + if request == 'hello\n': >> + conn.sendall('ok\nyep, still alive\n') >> + self.log_information() >> + elif request == 'quit\n': >> + self._should_terminate.set() >> + conn.sendall('ok\nquit command requested... exiting\n') >> + elif request.startswith('fork ') or request.startswith('fork-env '): >> + command_argv, env = self._parse_fork_request(conn, client_addr, >> + request) >> + if command_argv is not None: >> + # See [Decision #7] >> + # TODO: Do we want to limit the number of children? And/or >> + # prefork additional instances? (the design will need to >> + # change if we prefork and run arbitrary commands.) >> + self.fork_one_request(conn, client_addr, command_argv, env) >> + # We don't close the conn like other code paths, since we use >> + # it again later. 
>> + return >> + else: >> + self.log(client_addr, 'FAILURE: unknown request: %r' % (request,)) >> + # See [Decision #8] >> + conn.sendall('FAILURE\nunknown request: %r\n' % (request,)) >> + conn.close() > > I find this logic around closing the connections slightly obtuse. I > realize it would be longer but I think closing the connection or not > in each branch of the elif tree would be easier to follow. > Done. Originally all paths closed it right away, so it made the most sense to only do it once. >> +class cmd_launchpad_forking_service(Command): >> + """Launch a long-running process, where you can ask for new processes. >> + >> + The process will block on a given --port waiting for requests to be made. > > INET legacy again! > >> + When a request is made, it will fork itself and redirect stdout/in/err to >> + fifos on the filesystem, and start running the requested command. The >> + caller will be informed where those file handles can be found. Thus it only >> + makes sense that the process connecting to the port must be on the same >> + system. >> + """ >> + >> + aliases = ['lp-service'] >> + >> + takes_options = [Option('path', >> + help='Listen for connections at PATH', >> + type=str), >> + Option('perms', >> + help='Set the mode bits for the socket, interpreted' >> + ' as an octal integer (same as chmod)'), >> + Option('preload', >> + help="Do/don't preload libraries before startup."), >> + Option('children-timeout', type=int, >> + help="Only wait XX seconds for children to exit"), > > s/XX/this many/ ? ARG would be valid, since that is the default, but I went with: argname='SEC', and s/XX/SEC/ ... >> + def run(self): >> + # Just read line-by-line from stdin, and write out to stdout or stderr >> + # depending on the prefix >> + for line in sys.stdin: > > In some situations > > while True: > line = sys.stdin.readline() > if not line: > break > > will behave better wrt buffering. 
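To make that concrete, the restructured dispatch reads roughly like this
(a sketch of the shape, not the exact branch code; `conn` is anything
with sendall/close, and _parse_fork_request/fork_one_request are the
methods quoted above):

```python
def serve_one_request(self, conn, client_addr, request):
    """Dispatch one request, closing conn explicitly in each branch.

    Every branch now says what happens to the connection, instead of
    all paths sharing a single close() at the bottom. The fork branch
    is the odd one out: it keeps the connection open so the exit
    status can be written back later.
    """
    if request == 'hello\n':
        conn.sendall('ok\nyep, still alive\n')
        conn.close()
    elif request == 'quit\n':
        self._should_terminate.set()
        conn.sendall('ok\nquit command requested... exiting\n')
        conn.close()
    elif request.startswith('fork ') or request.startswith('fork-env '):
        command_argv, env = self._parse_fork_request(
            conn, client_addr, request)
        if command_argv is not None:
            # fork_one_request uses conn again later, to report the
            # child's exit status, so it is not closed here.
            self.fork_one_request(conn, client_addr, command_argv, env)
        else:
            conn.close()
    else:
        conn.sendall('FAILURE\nunknown request: %r\n' % (request,))
        conn.close()
```

Longer, as you say, but each path is explicit about the connection's
fate.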
But I guess what you have works for > what you need it for :-) I imagine that if anything else reads from the file handle, it could have very different effects. But yes, this is a test object, and it works for what I was hoping for. ... >> +def load_tests(standard_tests, module, loader): >> + standard_tests.addTests(loader.loadTestsFromModuleNames( >> + [__name__ + '.' + x for x in [ >> + 'test_lpserve', >> + ]])) >> + return standard_tests > > Is this actually necessary/useful in the launchpad tree? BZR_PLUGIN_PATH=... bzr selftest -s bt.test_lpserve is a lot faster to start up than "bin/test ..." So I've been using it *a lot*. Whether we want to continue to maintain it... > >> === added file 'bzrplugins/lpserve/test_lpserve.py' >> --- bzrplugins/lpserve/test_lpserve.py 1970-01-01 00:00:00 +0000 >> +++ bzrplugins/lpserve/test_lpserve.py 2010-10-06 00:19:31 +0000 >> @@ -0,0 +1,541 @@ >> +# Copyright 2010 Canonical Ltd. This software is licensed under the >> +# GNU Affero General Public License version 3 (see the file LICENSE). >> + >> +import os >> +import signal >> +import socket >> +import subprocess >> +import tempfile >> +import threading >> +import time >> + >> +from testtools import content >> + >> +from bzrlib import ( >> + osutils, >> + tests, >> + trace, >> + ) >> +from bzrlib.plugins import lpserve >> + >> +from canonical.config import config >> +from lp.codehosting import get_bzr_path, get_BZR_PLUGIN_PATH_for_subprocess >> + >> + >> +class TestingLPForkingServiceInAThread(lpserve.LPForkingService): >> + """Wrap starting and stopping an LPForkingService instance in a thread.""" > > I think this class does a bit more than that, can you expand a bit? > For example, the way forking is nobbled. 
> >> + # For testing, we set the timeouts much lower, because we want the tests to >> + # run quickly >> + WAIT_FOR_CHILDREN_TIMEOUT = 0.5 >> + SOCKET_TIMEOUT = 0.01 >> + SLEEP_FOR_CHILDREN_TIMEOUT = 0.01 >> + WAIT_FOR_REQUEST_TIMEOUT = 0.1 > > It's a shame you can't use a deterministic fake time source. But > settimeout doesn't support that idea :( > >> + _fork_function = None > > This is a bit surprising. Do you just do this so that any attempt to > fork will blow up, rather than do strange things to the test suite? > If so, comment please. correct. Added a comment. ... >> + def test_multiple_stops(self): >> + service = TestingLPForkingServiceInAThread.start_service(self) >> + service.stop_service() >> + service.stop_service() > > This test could do with an intent revealing comment ("after > stop_service is called, calling it again does nothing, silently" > maybe? -- if I've guessed the intent right) > done ... ... >> +class TestCaseWithSubprocess(tests.TestCaseWithTransport): >> + """Override the bzr start_bzr_subprocess command. >> + >> + The launchpad infrastructure requires a fair amount of configuration to get >> + paths, etc correct. So this provides that work. > > provides ... what? Missing word, I think. And then I realize that > this probably isn't your typo :-) "that work", which was just described. But I'll update. ... >> + os.remove(path) # service wants create it as a socket >> + env_changes = {'BZR_PLUGIN_PATH': lpserve.__path__[0], >> + 'BZR_LOG': tempname} >> + proc = self.start_bzr_subprocess( >> + ['lp-service', '--path', path, '--no-preload', >> + '--children-timeout=1'], >> + env_changes=env_changes) >> + trace.mutter('started lp-service subprocess') >> + # preload_line = proc.stderr.readline() >> + # self.assertStartsWith(preload_line, 'Preloading') > > Do you want this or not? Removed. If you remove "--no-preload" you need it, but for the test suite, only 1 test benefits at all from preloading. 
If we had a "Layer" or "Resource" for the collection of tests, then it
would make more sense to allow it to preload.

...

>> +    def _get_fork_handles(self, path):
>> +        trace.mutter('getting handles for: %s' % (path,))
>> +        stdin_path = os.path.join(path, 'stdin')
>> +        stdout_path = os.path.join(path, 'stdout')
>> +        stderr_path = os.path.join(path, 'stderr')
>> +        # Consider the ordering, the other side should open 'stdin'
>> +        # first, but we want it to block until we open the last one, or
>> +        # we race and it can delete the other handles before we get to
>> +        # open them.
>
> As above, on my reading of the man pages, this reasoning is bogus.

It was slightly different with O_NBLOCK, but I've cleaned up the
comment.

...

>> === modified file 'lib/canonical/config/schema-lazr.conf'
>> --- lib/canonical/config/schema-lazr.conf    2010-09-24 22:30:48 +0000
>> +++ lib/canonical/config/schema-lazr.conf    2010-10-06 00:19:31 +0000
>> @@ -301,6 +301,18 @@
>>  # datatype: string
>>  logfile: -
>>
>> +# The location of the log file used by the LaunchpadForkingService
>> +# datatype: string
>> +forker_logfile: -
>> +
>> +# Should we be using the forking daemon? Or should we be calling
>> +# spawnProcess instead?
>> +# datatype: boolean
>> +use_forking_daemon: False
>> +# What disk path will the daemon listen on
>> +# datatype: string
>> +forking_daemon_socket: /var/tmp/launchpad_forking_service.sock
>
> I think I'd leave this defaulting to empty or something invalid in
> the schema.

It is hard to create something invalid (unless I put it in /root or
something...) I could change it to something in the local directory:

  forking_daemon_socket: launchpad_forking_service.sock

Having it be the "standard" place seemed reasonable.

...

>> +class ForkingSessionService(Service):
>> +    """A lp-forking-service for handling ssh access."""
>> +
>> +    # TODO: The SFTP (and bzr+ssh) server depends fairly heavily on this
>
> The /codehosting/ sftp &c.
We have others now :) Well, it is called "sftp" in the rest of the code
(make run_codehosting), but I'll change it.

...

>>
>> +class _WaitForExit(process.ProcessReader):
>
> I don't really understand why this inherits from ProcessReader. But
> I'm 1600 lines into this review now, so maybe the problem is at my end
> :-)

Because ProcessReader already calls back to its accompanying "Process"
object to let it know that data has been read, the handle has been
closed, etc. It also allows me to implement "dataReceived" instead of
doRead(), IIRC.

>
>> +    """Wait on a socket for the exit status."""
>> +
>> +    def __init__(self, reactor, proc, sock):
>> +        super(_WaitForExit, self).__init__(reactor, proc, 'exit',
>> +                                           sock.fileno())
>> +        self._sock = sock
>> +        self.connected = 1
>> +
>> +    def close(self):
>> +        self._sock.close()
>> +
>> +    def dataReceived(self, data):
>> +        # TODO: how do we handle getting only *some* of the content?
>> +        #       Maybe we need to read more bytes first...
>
> Yeah. You want lineReceiver's logic really.

Which is... where?

>
>> +        # This is the only thing we do differently from the standard
>> +        # ProcessReader. When we get data on this socket, we need to
>> +        # treat it as a return code, or a failure.
>> +        if not data.startswith('exited'):
>> +            # Bad data, we want to signal that we are closing the
>> +            # connection
>> +            # TODO: How?
>> +            # self.proc.?
>
> childConnectionLost ?

I'll call it; I really don't know how to test it, or how this sort of
thing would happen in reality, etc.

>
...

>> +    def _sendMessageToService(self, message):
>> +        """Send a message to the Forking service and get the response"""
>> +        path = config.codehosting.forking_daemon_socket
>> +        client_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
>> +        log.msg('Connecting to Forking Service @ socket: %s for %r'
>> +                % (path, message))
>> +        try:
>> +            client_sock.connect(path)
>> +            client_sock.sendall(message)
>> +            # We define the requests to be no bigger than 1kB.
(For now) >> + response = client_sock.recv(1024) >> + except socket.error, e: >> + # TODO: What exceptions should be raised? >> + # Raising the raw exception seems to kill the twisted reactor >> + # Note that if the connection is refused, we *could* just >> + # fall back on a regular 'spawnProcess' call. > > What does the client see if the forking service is not running? Unable to "open channel 0" or some other such ssh failure. ("exec request on channel 0" ?) ... >> + # Implemented because of childConnectionLost >> + # Adapted from twisted.internet.Process >> + # Note: Process.maybeCallProcessEnded() tries to reapProcess() at this >> + # point, but the daemon will be doing the reaping for us (we can't >> + # because the process isn't a direct child.) >> + def maybeCallProcessEnded(self): >> + if self.pipes: >> + # Not done if we still have open pipes >> + return >> + if not self.lostProcess: >> + return >> + process.BaseProcess.maybeCallProcessEnded(self) >> + >> + # pauseProducing, present in process.py, not a IProcessTransport interface > > It's a shame there's so much copy & paste here. Also, the use of > blocking code in a Twisted application always makes me uneasy, but I > don't know how serious a problem it is here. I could do a little bit less by inheriting from Process and not calling Process.__init__ (since that auto-registers with reapProcess which doesn't work here.) What blocking code concerns you? At some point, you *do* need to write to the socket. And these paths are generally called once socket.poll() tells us it is ok. I followed pretty closely to what Process et al are doing. I'm certainly open to suggestions, as the Twisted code is still a bit beyond my understanding. ... >> + def _spawn(self, protocol, executable, arguments, env): > > I would really like to see this method grow a docstring. > ... >> # Extract the hostname from the supermirror root config. 
>> hostname = urlparse.urlparse(config.codehosting.supermirror_root)[1] >> environment['BZR_EMAIL'] = '%s@%s' % (avatar.username, hostname) >> - return RestrictedExecOnlySession( >> + klass = RestrictedExecOnlySession >> + # TODO: Use a FeatureFlag to enable this in a more fine-grained approach. >> + # If the forking daemon has been spawned, then we can use it if the >> + # feature is set to true for the given user, etc. >> + # A global config is a good first step, but does require restarting >> + # the service to change the flag. Or does 'config' support SIGHUP? > > To answer the question, no. > >> + if config.codehosting.use_forking_daemon: >> + klass = ForkingRestrictedExecOnlySession > > It's good that we can so simply disable the new feature, if it is > unstable in production. ... >> + >> +# TODO: What base *should* I be using? >> +class Test_WaitForExit(AvatarTestCase): > > I think testtools.TestCase or lp.testing.TestCase is probably enough > for these tests. Thanks. ... >> + def test_avatarAdaptsToOnlyRestrictedSession(self): >> + config.push('codehosting-no-forking', >> + "[codehosting]\nuse_forking_daemon: False\n") >> + self.addCleanup(config.pop, 'codehosting-no-forking') > > If you have lp.testing.TestCase in your test class hierarchy, you can > say self.pushConfig(...) instead of this. AvatarTestCase inherits from the twisted test case runners, and *doesn't* have pushConfig. > >> + session = ISession(self.avatar) >> + self.failIf(isinstance(session, ForkingRestrictedExecOnlySession), >> + "ISession(avatar) shouldn't adapt to " >> + " ForkingRestrictedExecOnlySession when forking is disabled. ") >> + >> + def test_avatarAdaptsToForkingRestrictedExecOnlySession(self): >> + # config.push('codehosting-forking', >> + # "[codehosting]\nuse_forking_daemon: True\n") >> + # self.addCleanup(config.pop, 'codehosting-forking') > > I think you can leave this here, to be super explicit. > restored. ... 
>> def test_suite(): >> - return unittest.TestLoader().loadTestsFromName(__name__) >> + from bzrlib import tests >> + from bzrlib.plugins import lpserve >> + >> + loader = tests.TestLoader() >> + suite = loader.loadTestsFromName(__name__) >> + suite = lpserve.load_tests(suite, lpserve, loader) >> + return suite > > Ah ok, so load_tests does get called :-) > > Cheers, > mwh > Right, I made it so that you can run the normal "bzr selftest -s bp.lp_serve" tests, or you can run it in 'bin/test'. John =:->