On 12-10-02 02:13 PM, Barry Warsaw wrote: > On Oct 01, 2012, at 09:30 PM, Robert Bruce Park wrote: > >> Twitter branch mostly done. Some TODOs remain. I am mp'ing a little bit >> prematurely because it's still the easiest way to see an overall diff ;-) > > The branch is looking pretty good. I'm glad you were able to verify that the > OAuth signatures were working, at least for public tweets. I actually verified that the OAuth signatures worked for *every* API endpoint *except* sending new direct messages (we can receive them ok, too). That's what debug_twitter_live.py was all about, every time I'd write a new protocol operation, I'd modify it to do that new operation, then run it, and I'd see on my live twitter account that it worked. > I guess more > research will have to be done to figure out why private replies are broken. Yeah, I still have no idea. > Do you intend to include the debug_twitter_*.py scripts in the branch? If so, > let's put them in a tools subdirectory (i.e. not in the gwibber Python > package). That way they'll be easier to omit from the Debian package. I hadn't originally. At first it just started off as 'from gwibber.utils.account import AccountManager; a = AccountManager(None); a._accounts['6/twitter'].protocol.receive()' because I was sick of typing that out 10,000 times in an interactive python shell every time I wanted to test a change I made to the code. But once the script was born, it grew, and then ken expanded it for his own testing purposes, which is included here as well. In fact, *all* of the style nits you mention were written by ken ;-) I had assumed that you would just not merge them. If you think there's value in keeping them in a tools directory, I guess that's ok. I don't have a strong opinion either way. >> + account.protocol.user() >> + account.protocol.delete('252823527978311680') > > What does this number correspond to? Is it private information? That is a public tweet I made with a previous incarnation of this script, that I've now deleted by running this script as-is. No worries. >> + def _get_url(self, url, data=None): >> + """Access the Twitter API with correct OAuth signed headers.""" >> + # TODO start enforcing some rate limiting. > > What are your thoughts on how to do this rate limiting? Is it something that > Twitter provides through their API? (in a similar manner to Facebook's > pagination support?). I wonder if there's some commonality refactoring that > should go on here, at least for determining the constants for the rate > limiting? I haven't looked closely at Twitter's rate limiting standards yet, but I wrote that comment because I ran into Twitter's limit at one point and they stopped me from downloading tweets that I wanted. All I know is that each API endpoint has a different maximum number of requests that can be made "per window" and the length of time of the window is different, too. So basically I'm envisioning a dict that maps URLs to ints, the int for a given URL will increase once per each _get_url invocation, and then we just need to figure out how long to pause before making the request in order not to hit the limit. Pausing shouldn't be a big deal because everything is threaded anyway, so unrelated requests can continue in other threads while a rate-limited request can sleep peacefully in it's own thread. >> + # "Client" == "Consumer" in oauthlib parlance. >> + client_key = self._account.auth.parameters['ConsumerKey'] >> + client_secret = self._account.auth.parameters['ConsumerSecret'] > > Can these fail? Are the other parts of your code prepared to handle the > resulting KeyErrors if these parameters are missing? Do you have any tests > for those situations? No, these are constants defined by Twitter at app-registration time, stored in libaccounts-sso package. Can never fail. They may change in the event that somebody nefarious compromises them (which ought to be easy to do for an open-source package), but they won't change during runtime. >> + def _publish_tweet(self, tweet): >> + """Publish a single tweet into the Dee.SharedModel.""" >> + tweet_id = tweet.get('id_str', '') > > Does it still make sense to publish this tweet if there is no id_str? Won't > that possibly result in row collisions, since the only unique item in the key > will be the empty string for multiple id_str-less entries? > > I think in my Facebook branch, I just skip the JSON response entry if there's > no message_id. Yeah, that's probably reasonable. I just write 'adict.get' out of habit for the sake of being as flexible as possible. >> +# https://dev.twitter.com/docs/api/1.1/get/statuses/mentions_timeline >> + @feature >> + def mentions(self): >> + """Gather the tweets that mention us.""" >> + url = self._timeline.format('mentions') >> + for tweet in self._get_url(url): >> + self._publish_tweet(tweet) > > Do you have tests for each of these? I know they *look* ridiculously similar, > but each should have a test. I wasn't really sure how to test them, to be honest. 'test_home' is really an indirect test of _get_url and _publish_tweet, and then these individual operations, I wasn't sure how to test them without touching the live service. Or the opposite way, a nearly no-op test that mocks out *everything* so aggressively and then just tests that it calls the method that it calls. I guess that's what you're looking for, then? Mock out self._get_url and self._publish_tweet and just test that _get_url was called with the right URL, and then _publish_tweet was called one or more times? >> +# https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline >> + @feature >> + def user(self, screen_name=''): >> + """Gather the tweets from a specific user. >> + >> + If screen_name is not specified, then gather the tweets >> + written by the currently authenticated user. >> + """ >> + url = self._user_timeline.format(screen_name) > > So, if screen_name is not given, the resulting url will be something like: > > https://api.twitter.com/1.1/statuses/user_timeline?screen_name= > > is that a valid url for getting the tweets of the current user? Or do you > need to check for screen_name missing and insert self.user_id or some such? > If the latter is the case, then you should default the screen_name argument to > None and check for that. Nope, '...?screen_name=' is a valid way of requesting the current user's tweets. I tested this all with debug_twitter_live.py as I was writing it ;-) >> +# https://dev.twitter.com/docs/api/1.1/get/lists/list >> + @feature >> + def lists(self): >> + """Gather the tweets from the lists that the we are subscribed to.""" >> + url = self._api_base.format(endpoint='lists/list') >> + for twitlist in self._get_url(url): >> + self.list(twitlist.get('id_str', '')) > > I think these methods are untested. > > Also, what happens if get_json() returns an error? Are these methods prepared > to handle problems in communicating with the service? Of course, they're > going to be run in a sub-thread, but is there something better you can do than > let the sub-thread exceptions just fill up the log file? Can you maybe catch > exceptions and print something more useful to the log file? Well, I forget the exact error code, but in the event that we hit the rate limit, get_json will raise an exception with an appropriate HTTP error code and a meaningful string along with it, such as 'rate limit exceeded' or something similar. I didn't really see any issue with that just bubbling straight up into the log file. No need to duplicate the same error-checking code directly in every single method. > Also, this seems like quite a common idiom: > > for tweet in self._get_url(url): > self._publish_tweet(tweet) > > Is there any opportunity for refactoring some commonality? Originally I had written the method as _publish_tweets (plural) and it was built-in with the ability to iterate over multiple tweets. That worked for a good long while, but then there were just a couple API endpoints that only returned a single tweet object rather than a list of tweets, and so then I needed to refactor the method to be able to handle both cases, and this is what I came up with. If you really want, we can have something like this: def _publish_tweets(self, url): for tweet in self._get_url(url): self._publish_tweet(tweet) But I just thought it was too simple to even bother. >> + @feature >> + def receive(self): >> + """Gather and publish all incoming messages.""" >> + # TODO I know mentions and lists are actually incorporated >> + # within the home timeline, but calling them explicitely will >> + # ensure that more of those types of messages appear in the >> + # timeline (eg, so they don't get drown out by everything >> + # else). I'm not sure how necessary it is, though. >> + self.home() >> + self.mentions() >> + self.lists() >> + self.private() > > What happens if you get an exception in one of these streams? Does it make > sense to still try to publish the other streams? What kind of exceptions are you suspecting there to be? The only thing I can think of would be in the event of a failed login, but if that was the case then they would all fail equally, and really the user would have to go back to UOA and re-register his account properly. Not a lot we can do about that here. > >> + >> + @feature >> + def send_private(self, screen_name, message): >> + # TODO: This is utterly broken. 403s no matter what. > > :( Does it work in Quantal gwibber? It seems to. As you said, more research is necessary. One problem is that quantal gwibber is using exclusively Twitter API v1, which is deprecated and will be going away actually quite soon (good timing for this rewrite!). My code is all using the newer API v1.1, so it's not always helpful to look at what quantal gwibber does, because the API isn't always the same. >> +# https://dev.twitter.com/docs/api/1.1/post/statuses/update >> + @feature >> + def send_thread(self, message_id, message): >> + """Send a reply message to message_id. >> + >> + Note that you have to @mention the message_id owner's screen name in >> + order for Twitter to actually accept this as a reply. Otherwise it will >> + just be an ordinary tweet.""" > > Closing triple-quote should be on the next line, by itself, under the 'j' in > 'just'. Oh yeah, oops. >> +# https://dev.twitter.com/docs/api/1.1/post/statuses/destroy/%3Aid >> + @feature >> + def delete(self, message_id): >> + url = self._api_base.format(endpoint='statuses/destroy/%s' % message_id) > > Any reason you're using % interpolation here? Perhaps either use {} and > .format(), or just concatenation (e.g.: 'statuses/destroy/' + message_id)? Yeah, it looks really ugly to me when I see 'string{}'.format('foo{}'.format('bar')), but I also wanted to allow message_id to be an int rather than requiring it be a string, which is what + concatenation would have done. % interpolation just made the int/str conversion easy without being too verbose. >> +# https://dev.twitter.com/docs/api/1.1/post/statuses/retweet/%3Aid >> + @feature >> + def retweet(self, message_id): >> + url = self._api_base.format(endpoint='statuses/retweet/%s' % message_id) > > Same. Same. :-P >> @mock.patch('gwibber.utils.base.Model', TestModel) >> @mock.patch('gwibber.utils.base._seen_messages', {}) >> + @mock.patch('gwibber.utils.base._seen_ids', {}) > > Do you really need to mock both _seen_messages and _seen_ids? What's the > difference between these two? Can they ever get out of sync? Do we need > additional tests to ensure that the constraints of these two data structures > are always correct? Do we need additional tests for when they contain cache > hits? Sorry, I was in a rush when I wrote this. Base._publish now adds to both of these so it seemed safer to mock both for consistency, but it's not clear to me right off the top of my head whether it's strictly necessary for the test itself. >> + def test_unpublish(self): >> + base = Base(FakeAccount()) >> + self.assertEqual(0, TestModel.get_n_rows()) >> + self.assertTrue(base._publish( >> + message_id='1234', >> + stream='messages', >> + sender='fred', >> + sender_nick='freddy', >> + from_me=True, >> + timestamp='today', >> + message='hello, @jimmy', >> + likes=10, >> + liked=True)) > > Since this is specifically a test of deleting rows, you probably don't > necessarily need to test the return value of _publish(). I guess it can't > hurt either. Yeah, this was just a quick copy&paste from a different test, in order to set up an environment in which to test message deletion. >> def tearDown(self): >> + # Ensure that any log entries we haven't tested just get consumed so >> + # as to isolate out test logger from other tests. >> self.log_mock.stop() > > The diff doesn't show it, but does setUp() set Base._SYNCHRONIZE = True? No, I don't need the synchronize stuff because I write the tests to just test the synchronous methods directly. I never invoke the thread-starting __call__ in the tests. > === modified file 'gwibber/gwibber/utils/base.py' > --- gwibber/gwibber/utils/base.py 2012-09-25 20:25:42 +0000 > +++ gwibber/gwibber/utils/base.py 2012-10-02 14:33:58 +0000 >> @@ -42,6 +42,7 @@ >> # representing the rows matching those keys. It is used for quickly finding >> # duplicates when we want to insert new rows into the model. >> _seen_messages = {} >> +_seen_ids = {} > > What is _seen_ids for and how is it used differently than _seen_messages? Do > we still need both of these? If so, can you adjust the comment to explain > what these are for? _seen_ids maps message ids to rowiters. It was the only way I could think to allow deletion of messages without iterating over the whole SharedModel terribly. Because at delete-time, we only have access to the message_id, not the sender & message data that would be required to look up the rowiter in _seen_messages. >> + def _unpublish(self, message_id): >> + """Remove message_id from the Dee.SharedModel.""" >> + triple = [self.__class__.__name__.lower(), >> + self._account.id, >> + message_id] >> + >> + row_iter = _seen_ids.pop(message_id) >> + if row_iter is None: >> + log.error('Tried to delete an invalid message id.') >> + return > > Have you tested the 'miss' logic? What happens when you .pop() with a missing > key? (this is a trick question :) Well, I haven't specified a default, which means it will raise KeyError if the id is not present. This doesn't particularly bother me, because a) the threading logic will safely log and ignore the exception, and b) this can only be invoked by clicking on the delete button on a message that's being displayed in gwibber, so the message_id being valid is almost *tautological*: if message_id wasn't valid, then how the hell did this method even get invoked? > What happens if the .pop() does work? Does anything need to happen to > _seen_messages to keep it in sync? Isn't the row associated with the row_iter > going to be removed? If so, what happens to the same row_iter sitting as a > value in the _seen_messages mapping? Right, they would become out of sync. But only insofar as there will be a sender/message key pointing at a rowiter that points at a row that doesn't exist. I'm not certain that this is particularly harmful, beyond being a minor memory leak. I suppose in some extreme case if a person deletes a message and then tries to submit an identical message, it would cause us to attempt to append that message onto a row that doesn't exist. I don't really see that being a useful real-world scenario, because in my mind the use-cases for deleting a message would be to either a) leave it deleted, or b) to submit a *different* message in it's place. Anyway, do you know of an easy way to do a dict.pop() that pops based on values instead of keys? Or are we going to have to scan the whole dict in an inefficient way? Or is there maybe some kind of alternate data structure we should be looking at? Some kind two-way dict where both keys and values are hashed and indexed? > >> + >> + row = Model.get_row(row_iter) >> + if triple in row[IDS_IDX]: >> + if len(row[IDS_IDX]) == 1: >> + # Message only exists on Twitter, delete it > > Actually, the method isn't Twitter specific is it? If you think _unpublish() > is only useful for Twitter, should it really go in the base class? Or maybe > the comment just needs updating? Sorry, yeah, comment needs updating. It is a general-purpose method. It would need to be defined in the base class anyway because of it's use of IDS_IDX, Model, etc. Those are things that aren't imported into twitter.py, and would be sloppy to do so when the imports are all already here. > === modified file 'gwibber/gwibber/utils/download.py' > --- gwibber/gwibber/utils/download.py 2012-09-14 20:29:11 +0000 > +++ gwibber/gwibber/utils/download.py 2012-10-02 14:33:58 +0000 >> @@ -51,11 +51,11 @@ >> url = self.url >> headers = self.headers.copy() >> if self.params is not None: >> - params = urlencode(self.params) >> + params = urlencode(self.params).replace('+', '%20') > > Why do this? (Yes, I know you explained it to me, but pretend I'm a > first-time reader of the code. Explain it to them! :). Because urlencode does not provide any option to use quote() instead of quote_plus() which it enforces, but Twitter requires percent-encoded spaces, and this is harmless to any other protocol. Also it's easier than re-implementing urlencode from scratch. hooray for code reuse! > >> if self.post: >> data = params.encode('utf-8') >> headers['Content-Type'] = ( >> - 'application/x-www-form-urlencoded; charset=utf-8') >> + 'application/x-www-form-urlencoded') > > Why do this too? This one you haven't explained even to me! :) It's because oauthlib is incredibly brittle. It scans for precisely the string 'application/x-www-form-urlencoded' and raises an exception if it's not exactly that. So it seems this was necessary in order to get OAuth signatures to even function. On second thought however, it's not clear to me what oauthlib actually *does* with that header when it's signing (I'm pretty sure the value of that header is not included in the base string that gets signed). So it might be worthwhile to undo this change and then see if anything breaks. This was just one of those things that I did back when I was really struggling to get any signature working at all, and then after I got it working, I kept it because a) I forgot that I'd done it, and b) it didn't occur to me until just now that it might not have been necessary to make the signatures work. So in other words, I'll test this ;-) Sorry I'm not on IRC, I just got back into town from my big train trip and I'm having some troubles with emacs24/system updaty type problems. I'll get it working shortly and then fix up this mp for you by tonight.