John Arbash Meinel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Unfortunately this won't get put on the MP, because Launchpad now > rejects all my emails. But I really wanted to reply to your comments inline. Ok. I won't trim the quoted text for other people's benefit. Warm up those scrolling fingers! > > > > ... > > >> + # [Decision #4] > >> + # Exit information > >> + # How do we inform the client process that the child has exited? > > > > This section seems to describe the questions without clearly stating what the > > answer was. I have to read through the wall of text fairly carefully to see > > what the choice was. > > New summary. > > > > >> + # There is some possibility that files won't get flushed, etc. So we > >> + # may want to be calling sys.exitfunc() first. Note that bzr itself > > > > Yes, I think it would be good to call sys.exitfunc() for the reasons you give. > > I'd be particularly worried about incompletely written log files... it would be > > a shame to have the most interesting log content never get flushed! (The end of > > the log is often the most interesting bit.) > > I switched to this, though I'm not really sure it is the right thing. > (When *do* buffered files get synced to disk? It seems that it might be > during stack unwinding, which we explicitly avoid via os._exitfunc(), > which bzr has been doing for a long time now...) We call os._exitfunc from the outermost frame, so we don't avoid any stack unwinding. We do it to avoid wasting time that would be spent garbage collecting perhaps 100000s of objects when the OS can free that memory for us more cheaply. > > > > > >> + # [Decision #8] > >> + # env vars > > > > This section doesn't make clear what the issue is, nor what the conclusion is. > > What does “this data” refer to? Are env vars the problem, or solution to some > > problem? > > Removed. Passing env vars from the Twisted process to the spawned child > is a bit tricky. I did end up going with a syntax that can be parsed > without ambiguity, though it is limited to 8-bit without encoding > information. (AFAIK, so are real env vars) > > > > >> + # [Decision #9] > >> + # nicer errors to clients > >> + # This service is meant to be run only on the local system. As such, > >> + # we don't try to be extra defensive about leaking information to > >> + # clients. Instead we try to be helpful, and tell them as much as we > >> + # know about what went wrong. > > > > The client is connected to a remote user though. So the information returned > > here won't be repeated back to that remote client? > > Wrong client. The 'client' here is twisted. I clarified it with: > nicer errors on the request socket Ok. > > > > > I'm worried about this becase we already echo more information back to the > > remote client than really makes sense, e.g. random warnings on stderr are sent > > to the remote client, and tend to confuse users. > > That is true, though often the alternative would be to tell them > nothing, which would have been worse. (Getting Permission Error can be > confusing when it is a remote thing, but just getting failures is worse.) Well, giving a generic “server failed” error tends avoid confusing users into thinking that the problem is on their end. > > > > >> + _fork_function = os.fork > > > > I assume this is to make unit testing easier, which is fine. Peeking ahead in > > the diff I see that you test this at least in part by subclassing, which isn't > > so great: I feel it violates the “Use the front door first” guideline for tests, > > because this isn't a class intended for subclassing in production code, so the > > tests are assuming a more intimate (and not sharply defined) interface with this > > code than is ideal. I'm not sure it's worth spending effort trying to find a > > better way, but hopefully you finding this interesting food for thought. > > > > This wasn't actually used anywhere, instead I subclassed and overrode > specific functions. > > 1) There are a few direct tests that subclass to be able to not fork > and run the service in just a local thread. > 2) There are other tests that exercise the full stack by actually > spawning a process that then forks. (Which is about as front-door as > you can get) > However, it costs a lot of startup time (roughly 1s per test, up to > 3s on my machine) > 3) While you aren't supposed to subclass, you also aren't supposed to > replace the forking function, either. Perhaps you didn't intend this, but these points make it sound like a choice between “subclass” or “run full stack in new process”. That's not true; if nothing else you can monkey-patch individual attributes on instances (or even module-globals, although replacing the module-wide 'os' with one with a test double for 'fork' would be a bit gross). And often there's a way to refactor to improve testability. As I said in the initial review, you don't have to change this, but my experience has been that I've usually regretted using this technique. > > >> + # XXX: Cheap hack. by this point bzrlib has opened stderr for logging > >> + # (as part of starting the service process in the first place). As > >> + # such, it has a stream handler that writes to stderr. logging > >> + # tries to flush and close that, but the file is already closed. > >> + # This just supresses that exception > >> + logging.raiseExceptions = False > > > > IIRC the Launchpad coding standard requires XXX comments to have a name and date > > attached, and possibly a bug number too. > > > > On the other hand, the logging module docs actually recommend setting > > raiseExceptions to False in production! So perhaps this can just be a comment > > without an XXX? > > updated. > > > > > >> + def become_child(self, command_argv, path): > >> + """We are in the spawned child code, do our magic voodoo.""" > >> + # Stop tracking new signals > >> + self._unregister_signals() > >> + # Reset the start time > >> + trace._bzr_log_start_time = time.time() > >> + trace.mutter('%d starting %r' > >> + % (os.getpid(), command_argv,)) > >> + self.host = None > >> + self.port = None > >> + self._sockname = None > >> + self._bind_child_file_descriptors(path) > >> + self._run_child_command(command_argv) > > > > Two questions, somewhat related: > > > > 1) Won't the server and child share a bzr log file descriptor? That seems > > likely to be confusing, particularly when different processes compete to rotate > > it, quite possibly leaving some processes holding a file descriptor to an > > unlinked file. > > We only rotate on open. Yes the handle is shared. I never got any useful > feedback about how things should be logging, and it does work fine in > practice. > > I suppose a problem would be that if this service is never stopped, we > will just keep logging to the same file forever. (Even if the sysadmins > rotate the file outside of the process.) > > > > > 2) What about any other file descriptors other than 0,1,2... should the child > > close those too, to try avoid surprises with shared state? > > Possibly, though you have to watch out for the ones you actually want to > keep, like the log file. > > > >> + # TODO: Should we call sys.exitfunc() here? it allows atexit functions > >> + # to fire, however, some of those may be still around from the > >> + # parent process, which we don't really want. > >> + ## sys.exitfunc() > > > > It's a bit like you want to run the __exit__() function from bzrlib.initialize's > > return value before you fork... but then some of the work that bzrlib.initialize > > does is presumably some of the work you want to avoid redoing in every child... > > I would be fine running bzrlib.initialize() again (I don't think it is > very expensive yet), but also that state isn't available anywhere to > __exit__ on... Yeah, that's a shame. Perhaps something to fix in bzrlib at some point. > > > > >> + @staticmethod > >> + def command_to_argv(command_str): > >> + """Convert a 'foo bar' style command to [u'foo', u'bar']""" > >> + # command_str must be a utf-8 string > >> + return [s.decode('utf-8') for s in shlex.split(command_str)] > > > > Personally, I'd have been tempted to use bencode or \0 delimited values here > > (instead of space/newline delimited) so that you can reliably represent all > > possible argvs. > > That is true, but bencode is pretty terrible for something you ever need > to look at manually, and it was *very* useful to be able to manually > connect to the service and chat with it. That's a fair point. It'd be nice to find a way to allow the full range of valid values, though. We pretty rapidly regretted this sort of limitation in the initial HPSS protocol. Or perhaps allow some sort of protocol versioning so it can be addressed later with less pain? Although given you control how the child is spawned that can probably be taken care of fairly easily out-of-band without cramming it into the protocol for this socket. > > > > > >> + def log_information(self): > >> + """Log the status information. > >> + > >> + This includes stuff like number of children, and ... ? > >> + """ > > > > I expect your LP reviewer will request a more precise comment here :) > > > > Not from my experience so far with lp reviewers. But honestly, I don't Back in my day they would :P > know what we want to put here, and expect it to change over time. So I > could remove the last sentence and lose any specific information, or it > could be clear that it is a bit under defined. Well, what you just said there is not clearly communicated by that docstring. Perhaps add this text to the docstring? > > > >> + self._poll_children() > > > > It's a bit odd for a method called “log_information” to first invoke a method > > that attempts to actively mutate the object state. I guess it's probably ok, > > but... it feels a bit odd. I'm not sure it's a good idea. > > To get he most accurate information, it is good to check the state of > the information you are going to write.... > > > > >> + def _poll_children(self): > >> + """See if children are still running, etc. > >> + > >> + One interesting hook here would be to track memory consumption, etc. > >> + """ > >> + to_remove = [] > > > > That variable appears to be unused. pyflakes is your friend :) > > > >> + shutil.rmtree(c_path) > > > > Maybe worth passing ignore_errors=True here? > > I generally avoid suppressing errors, but if I'm not going to do > anything with them either... (it would be nice to at least log them) You can pass an error handling function to rmtree instead, if you'd like to do that. > > > > >> + def _shutdown_children(self): > > [...] > >> + if os.path.exists(c_path): > >> + trace.warning('Cleaning up after immortal child %d: %s\n' > >> + % (c_id, c_path)) > >> + shutil.rmtree(c_path) > > > > This is duplicated somewhat with _poll_children, which seems bad. > > With a comment just before it that we should never reach this point > because it we just sent SIGKILL? Oh, I missed that. Duplicated code apparently stands out more for me than comments! > > > > >> +libraries_to_preload = [ > > > > I wonder how we'll notice when the libraries_to_preload ought to be updated? > > > > A comment explaining how you derived this list in the first place would be good. > > > > Done. > > >> + try: > >> + cleanup_environment() > > [...] > >> + finally: > >> + restore_environment() > > > > That's not a reliable use of try-finally... and reliability is what try-finally > > is for. It should be: > > > > cleanup_environment() > > try: > > ... > > finally: > > restore_environment() > > > > I'm afraid in this instance that probably means multiple, nested try-finally :( > > > > I don't think this is my code. Ah, I think it was code you moved (or perhaps diff noise?) > > > >> + TODO: This should probably use testresources, or layers somehow... > > > > Nothing should use layers. Just ask LP's technical architect :) > > > >> + if one_byte_at_a_time: > >> + for byte in message: > >> + client_sock.send(byte) > >> + else: > >> + client_sock.sendall(message) > > > > Haven't I seen that code already in this diff? > > One is talking to the in-process service, the other is in a subprocess. > The code is similar but not identical. I don't really like local > functions for what is logically a class attribute, and TestCase > inheritance makes it difficult to pull out a common ancestor. > > > > >> + def test_fork_multiple_children(self): > >> + paths = [] > >> + for idx in range(4): > >> + paths.append(self.send_fork_request('launchpad-replay')) > >> + for idx in [3, 2, 0, 1]: > > > > Why in that order? > > Something not in-order and not in reverse order, but not a different > order every test run. (random.shuffle()) Again, what you just told me there (the *intent*) is not communicated to me by the code. Please add a comment with that info. > > > > >> + @property > >> + def should_launch(self): > >> + # We tie this directly into the SFTP service. If one should launch, > >> + # then both of them should > >> + return (config.codehosting.launch and > >> + config.codehosting.use_forking_daemon) > > > > I'm confused: the comment is phrased as an OR, but the code uses an AND. > > Perhaps the variables in the code don't directly correspond to the terms in the > > comment? The phrasing certainly leads me to expect correspondence. > > > > The comment isn't an OR. (if one then both => AND). However, I can just > remove the comment. Yes, not sure what I was thinking! I guess I misread it. > > > lib/lp/codehosting/sshserver/session.py: > > > >> + def dataReceived(self, data): > >> + # TODO: how do we handle getting only *some* of the content?, Maybe we > >> + # need to read more bytes first... > > > > ! > > > > Please, please, please fix this before deploying. > > > > If you don't have enough bytes yet, just append them to a buffer and return. > > > > I've never had this fail, in forking 50 concurrent requests, etc. All > messages are a single "send()" over the local socket, and generally < 10 > bytes. I can do it, but I don't see it being particularly easy to get > right. The total content is: > > "exited 500\n" > > The main issue is detecting that the message is complete, I suppose I > could check for a newline... Yes, do that. I know that in the current situation it will work, but it's making assumptions that aren't strictly guaranteed by the OS or libraries, and it's a problem waiting to happen if someone repurposes this code for a more demanding situation, or if e.g. someone decides to transmit some extra info, or something else strange changes. FTP cannot distinguish between “data transfer complete” and “data transfer died a bit before completion”. It also mostly works... but we don't have to repeat the mistakes of the past :) > > > >> + # Design decisions > >> + # [Decision #a] > > > > Counting in hex? :) > > Interesting that it worked out that way, but no, just trying to use a > different "numbering" scheme to not have to worry about mixed > allocations or confusion. > > > > >> + # Inherit from process.BaseProcess > >> + # This seems slightly risky, as process.BaseProcess is actually > >> + # imported from twisted.internet._baseprocess.BaseProcess. The > >> + # real-world Process then actually inherits from process._BaseProcess > >> + # I've also had to copy a fair amount from the actual Process > >> + # command. > >> + # One option would be to inherit from process.Process, and just > >> + # override stuff like __init__ and reapProcess which I don't want to > >> + # do in the same way. (Is it ok not to call your Base classes > >> + # __init__ if you don't want to do that exact work?) > > > > Please file bugs on Twisted about the limitations in its API you encounter so > > they can be fixed (and then add a link to them in this comment). We have lots > > of Twisted devs in this company. > > > >> + # Note: _exiter forms a GC cycle, since it points to us, and we hold a > >> + # reference to it > > > > This is true, but I'm not sure why it's noteworthy. Lots of Python code creates > > cycles. > > *I* try to avoid it. > > > > >> + def test_dataReceived_bad_data(self): > >> + # XXX: The dataReceived code calls 'log.err' which ends up getting > >> + # printed during the test run. How do I suppress that or even > >> + # better, check that it does so? > > > > Perhaps twisted.trial.unittest.TestCase.flushLoggedErrors? > > When I asked Aaron his comment was "I think we run a script in a > subprocess so that we can redirect stderr." Which didn't seem a great > answer. > > self.flushLoggedErrors() also doesn't change anything. The output looks > like: > test_dataReceived_bad_data > (lp.codehosting.sshserver.tests.test_session.Test_WaitForExit)"Got > invalid exit information: 'bogus\\n'" > > test_dataReceived_ends_cleanly > (lp.codehosting.sshserver.tests.test_session.Test_WaitForExit) > Strange. > I think flushLoggedErrors() is more about something with the test case > itself, not the twisted "log" object. > > > > >> === added file 'lp_service_interface.txt' > > > > I don't think this belongs in the root of the tree. > > > > Removed. > > John > =:-> > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.9 (Cygwin) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAkymV3IACgkQJdeBCYSNAAOPcACghJY4KHVHT6viSLNbH4BTmxBA > fCEAn0hIgjDkpE1s6IWZ10C2kSI+0Z9A > =NM3n > -----END PGP SIGNATURE----- > -Andrew.