There are many unrelated changes in this diff, which is part of why the line
count is so enormous.  I guess the target branch isn't the one you branched
from?  It does make the review harder than it should be, and certainly much 
more intimidating at a glance than it deserves, so it'd be worth fixing :)

SIGCHLD: is this handler uninstalled in the child?  If not, there's a risk of
EINTR issues if the child spawns its own subprocesses.  Ah, yes, it is.  Good :)

> +    # [Decision #4]
> +    #   Exit information
> +    #       How do we inform the client process that the child has exited?

This section seems to describe the questions without clearly stating what the
answer was.  I have to read through the wall of text fairly carefully to see
what the choice was.

> +    #       There is some possibility that files won't get flushed, etc. So we
> +    #       may want to be calling sys.exitfunc() first. Note that bzr itself

Yes, I think it would be good to call sys.exitfunc() for the reasons you give.
I'd be particularly worried about incompletely written log files... it would be
a shame to have the most interesting log content never get flushed!  (The end of
the log is often the most interesting bit.)

> +    # [Decision #8]
> +    #   env vars

This section doesn't make clear what the issue is, nor what the conclusion is.
What does “this data” refer to?  Are env vars the problem, or solution to some
problem?

> +    # [Decision #9]
> +    #   nicer errors to clients
> +    #       This service is meant to be run only on the local system. As such,
> +    #       we don't try to be extra defensive about leaking information to
> +    #       clients. Instead we try to be helpful, and tell them as much as we
> +    #       know about what went wrong.

The client is connected to a remote user though.  So the information returned
here won't be repeated back to that remote client?

I'm worried about this becase we already echo more information back to the
remote client than really makes sense, e.g. random warnings on stderr are sent
to the remote client, and tend to confuse users.

> +    _fork_function = os.fork

I assume this is to make unit testing easier, which is fine.  Peeking ahead in
the diff I see that you test this at least in part by subclassing, which isn't
so great: I feel it violates the “Use the front door first” guideline for tests,
because this isn't a class intended for subclassing in production code, so the
tests are assuming a more intimate (and not sharply defined) interface with this
code than is ideal.  I'm not sure it's worth spending effort trying to find a
better way, but hopefully you finding this interesting food for thought.

> +        # XXX: Cheap hack. by this point bzrlib has opened stderr for logging
> +        #      (as part of starting the service process in the first place). As
> +        #      such, it has a stream handler that writes to stderr. logging
> +        #      tries to flush and close that, but the file is already closed.
> +        #      This just supresses that exception
> +        logging.raiseExceptions = False

IIRC the Launchpad coding standard requires XXX comments to have a name and date
attached, and possibly a bug number too.

On the other hand, the logging module docs actually recommend setting
raiseExceptions to False in production!  So perhaps this can just be a comment
without an XXX?

> +    def become_child(self, command_argv, path):
> +        """We are in the spawned child code, do our magic voodoo."""
> +        # Stop tracking new signals
> +        self._unregister_signals()
> +        # Reset the start time
> +        trace._bzr_log_start_time = time.time()
> +        trace.mutter('%d starting %r'
> +                     % (os.getpid(), command_argv,))
> +        self.host = None
> +        self.port = None
> +        self._sockname = None
> +        self._bind_child_file_descriptors(path)
> +        self._run_child_command(command_argv)

Two questions, somewhat related:

1) Won't the server and child share a bzr log file descriptor?   That seems
likely to be confusing, particularly when different processes compete to rotate
it, quite possibly leaving some processes holding a file descriptor to an
unlinked file.

2) What about any other file descriptors other than 0,1,2... should the child
close those too, to try avoid surprises with shared state?

> +        # TODO: Should we call sys.exitfunc() here? it allows atexit functions
> +        #       to fire, however, some of those may be still around from the
> +        #       parent process, which we don't really want.
> +        ## sys.exitfunc()

It's a bit like you want to run the __exit__() function from bzrlib.initialize's
return value before you fork... but then some of the work that bzrlib.initialize
does is presumably some of the work you want to avoid redoing in every child...

> +    @staticmethod
> +    def command_to_argv(command_str):
> +        """Convert a 'foo bar' style command to [u'foo', u'bar']"""
> +        # command_str must be a utf-8 string
> +        return [s.decode('utf-8') for s in shlex.split(command_str)]

Personally, I'd have been tempted to use bencode or \0 delimited values here
(instead of space/newline delimited) so that you can reliably represent all
possible argvs.

> +    def log_information(self):
> +        """Log the status information.
> +
> +        This includes stuff like number of children, and ... ?
> +        """

I expect your LP reviewer will request a more precise comment here :)

> +        self._poll_children()

It's a bit odd for a method called “log_information” to first invoke a method
that attempts to actively mutate the object state.  I guess it's probably ok,
but... it feels a bit odd.  I'm not sure it's a good idea.

> +    def _poll_children(self):
> +        """See if children are still running, etc.
> +
> +        One interesting hook here would be to track memory consumption, etc.
> +        """
> +        to_remove = []

That variable appears to be unused.  pyflakes is your friend :)

> +                shutil.rmtree(c_path)

Maybe worth passing ignore_errors=True here?

> +    def _shutdown_children(self):
[...]
> +                if os.path.exists(c_path):
> +                    trace.warning('Cleaning up after immortal child %d: %s\n'
> +                                  % (c_id, c_path))
> +                    shutil.rmtree(c_path)

This is duplicated somewhat with _poll_children, which seems bad.

> +libraries_to_preload = [

I wonder how we'll notice when the libraries_to_preload ought to be updated?

A comment explaining how you derived this list in the first place would be good.

> +        try:
> +            cleanup_environment()
[...]
> +        finally:
> +            restore_environment()

That's not a reliable use of try-finally... and reliability is what try-finally
is for.  It should be:

    cleanup_environment()
    try:
        ...
    finally:
        restore_environment()

I'm afraid in this instance that probably means multiple, nested try-finally :(

> +    TODO: This should probably use testresources, or layers somehow...

Nothing should use layers.  Just ask LP's technical architect :)

> +        if one_byte_at_a_time:
> +            for byte in message:
> +                client_sock.send(byte)
> +        else:
> +            client_sock.sendall(message)

Haven't I seen that code already in this diff?

> +    def test_fork_multiple_children(self):
> +        paths = []
> +        for idx in range(4):
> +            paths.append(self.send_fork_request('launchpad-replay'))
> +        for idx in [3, 2, 0, 1]:

Why in that order?

> +    @property
> +    def should_launch(self):
> +        # We tie this directly into the SFTP service. If one should launch,
> +        # then both of them should
> +        return (config.codehosting.launch and
> +                config.codehosting.use_forking_daemon)

I'm confused: the comment is phrased as an OR, but the code uses an AND.
Perhaps the variables in the code don't directly correspond to the terms in the
comment?  The phrasing certainly leads me to expect correspondence.

lib/lp/codehosting/sshserver/session.py:

> +    def dataReceived(self, data):
> +        # TODO: how do we handle getting only *some* of the content?, Maybe we
> +        #       need to read more bytes first...

!

Please, please, please fix this before deploying.

If you don't have enough bytes yet, just append them to a buffer and return.

> +    # Design decisions
> +    # [Decision #a]

Counting in hex? :)

> +    #   Inherit from process.BaseProcess
> +    #       This seems slightly risky, as process.BaseProcess is actually
> +    #       imported from twisted.internet._baseprocess.BaseProcess. The
> +    #       real-world Process then actually inherits from process._BaseProcess
> +    #       I've also had to copy a fair amount from the actual Process
> +    #       command.
> +    #       One option would be to inherit from process.Process, and just
> +    #       override stuff like __init__ and reapProcess which I don't want to
> +    #       do in the same way. (Is it ok not to call your Base classes
> +    #       __init__ if you don't want to do that exact work?)

Please file bugs on Twisted about the limitations in its API you encounter so
they can be fixed (and then add a link to them in this comment).  We have lots
of Twisted devs in this company.

> +        # Note: _exiter forms a GC cycle, since it points to us, and we hold a
> +        #       reference to it

This is true, but I'm not sure why it's noteworthy.  Lots of Python code creates
cycles.

> +    def test_dataReceived_bad_data(self):
> +        # XXX: The dataReceived code calls 'log.err' which ends up getting
> +        #      printed during the test run. How do I suppress that or even
> +        #      better, check that it does so?

Perhaps twisted.trial.unittest.TestCase.flushLoggedErrors?

> === added file 'lp_service_interface.txt'

I don't think this belongs in the root of the tree.