On 12-10-02 02:13 PM, Barry Warsaw wrote:
> On Oct 01, 2012, at 09:30 PM, Robert Bruce Park wrote:
>
>> Twitter branch mostly done. Some TODOs remain. I am mp'ing a little bit
>> prematurely because it's still the easiest way to see an overall diff ;-)
>
> The branch is looking pretty good.  I'm glad you were able to verify that the
> OAuth signatures were working, at least for public tweets.

I actually verified that the OAuth signatures worked for *every* API 
endpoint *except* sending new direct messages (we can receive them ok, 
too). That's what debug_twitter_live.py was all about, every time I'd 
write a new protocol operation, I'd modify it to do that new operation, 
then run it, and I'd see on my live twitter account that it worked.

>  I guess more
> research will have to be done to figure out why private replies are broken.

Yeah, I still have no idea.

> Do you intend to include the debug_twitter_*.py scripts in the branch?  If so,
> let's put them in a tools subdirectory (i.e. not in the gwibber Python
> package).  That way they'll be easier to omit from the Debian package.

I hadn't originally. At first it just started off as 'from 
gwibber.utils.account import AccountManager; a = AccountManager(None); 
a._accounts['6/twitter'].protocol.receive()' because I was sick of 
typing that out 10,000 times in an interactive python shell every time I 
wanted to test a change I made to the code. But once the script was 
born, it grew, and then ken expanded it for his own testing purposes, 
which is included here as well.

In fact, *all* of the style nits you mention were written by ken ;-)

I had assumed that you would just not merge them. If you think there's 
value in keeping them in a tools directory, I guess that's ok. I don't 
have a strong opinion either way.

>> +    account.protocol.user()
>> +    account.protocol.delete('252823527978311680')
>
> What does this number correspond to?  Is it private information?

That is a public tweet I made with a previous incarnation of this 
script, that I've now deleted by running this script as-is. No worries.

>> +    def _get_url(self, url, data=None):
>> +        """Access the Twitter API with correct OAuth signed headers."""
>> +        # TODO start enforcing some rate limiting.
>
> What are your thoughts on how to do this rate limiting?  Is it something that
> Twitter provides through their API?  (in a similar manner to Facebook's
> pagination support?).  I wonder if there's some commonality refactoring that
> should go on here, at least for determining the constants for the rate
> limiting?

I haven't looked closely at Twitter's rate limiting standards yet, but I 
wrote that comment because I ran into Twitter's limit at one point and 
they stopped me from downloading tweets that I wanted.

All I know is that each API endpoint has a different maximum number of 
requests that can be made "per window" and the length of time of the 
window is different, too. So basically I'm envisioning a dict that maps 
URLs to ints, the int for a given URL will increase once per each 
_get_url invocation, and then we just need to figure out how long to 
pause before making the request in order not to hit the limit. Pausing 
shouldn't be a big deal because everything is threaded anyway, so 
unrelated requests can continue in other threads while a rate-limited 
request can sleep peacefully in it's own thread.

>> +        # "Client" == "Consumer" in oauthlib parlance.
>> +        client_key = self._account.auth.parameters['ConsumerKey']
>> +        client_secret = self._account.auth.parameters['ConsumerSecret']
>
> Can these fail?  Are the other parts of your code prepared to handle the
> resulting KeyErrors if these parameters are missing?  Do you have any tests
> for those situations?

No, these are constants defined by Twitter at app-registration time, 
stored in libaccounts-sso package. Can never fail. They may change in 
the event that somebody nefarious compromises them (which ought to be 
easy to do for an open-source package), but they won't change during 
runtime.

>> +    def _publish_tweet(self, tweet):
>> +        """Publish a single tweet into the Dee.SharedModel."""
>> +        tweet_id = tweet.get('id_str', '')
>
> Does it still make sense to publish this tweet if there is no id_str?  Won't
> that possibly result in row collisions, since the only unique item in the key
> will be the empty string for multiple id_str-less entries?
>
> I think in my Facebook branch, I just skip the JSON response entry if there's
> no message_id.

Yeah, that's probably reasonable. I just write 'adict.get' out of habit 
for the sake of being as flexible as possible.

>> +# https://dev.twitter.com/docs/api/1.1/get/statuses/mentions_timeline
>> +    @feature
>> +    def mentions(self):
>> +        """Gather the tweets that mention us."""
>> +        url = self._timeline.format('mentions')
>> +        for tweet in self._get_url(url):
>> +            self._publish_tweet(tweet)
>
> Do you have tests for each of these?  I know they *look* ridiculously similar,
> but each should have a test.

I wasn't really sure how to test them, to be honest. 'test_home' is 
really an indirect test of _get_url and _publish_tweet, and then these 
individual operations, I wasn't sure how to test them without touching 
the live service. Or the opposite way, a nearly no-op test that mocks 
out *everything* so aggressively and then just tests that it calls the 
method that it calls. I guess that's what you're looking for, then? Mock 
out self._get_url and self._publish_tweet and just test that _get_url 
was called with the right URL, and then _publish_tweet was called one or 
more times?

>> +# https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline
>> +    @feature
>> +    def user(self, screen_name=''):
>> +        """Gather the tweets from a specific user.
>> +
>> +        If screen_name is not specified, then gather the tweets
>> +        written by the currently authenticated user.
>> +        """
>> +        url = self._user_timeline.format(screen_name)
>
> So, if screen_name is not given, the resulting url will be something like:
>
> https://api.twitter.com/1.1/statuses/user_timeline?screen_name=
>
> is that a valid url for getting the tweets of the current user?  Or do you
> need to check for screen_name missing and insert self.user_id or some such?
> If the latter is the case, then you should default the screen_name argument to
> None and check for that.

Nope, '...?screen_name=' is a valid way of requesting the current user's 
tweets. I tested this all with debug_twitter_live.py as I was writing it ;-)

>> +# https://dev.twitter.com/docs/api/1.1/get/lists/list
>> +    @feature
>> +    def lists(self):
>> +        """Gather the tweets from the lists that the we are subscribed to."""
>> +        url = self._api_base.format(endpoint='lists/list')
>> +        for twitlist in self._get_url(url):
>> +            self.list(twitlist.get('id_str', ''))
>
> I think these methods are untested.
>
> Also, what happens if get_json() returns an error?  Are these methods prepared
> to handle problems in communicating with the service?  Of course, they're
> going to be run in a sub-thread, but is there something better you can do than
> let the sub-thread exceptions just fill up the log file?  Can you maybe catch
> exceptions and print something more useful to the log file?

Well, I forget the exact error code, but in the event that we hit the 
rate limit, get_json will raise an exception with an appropriate HTTP 
error code and a meaningful string along with it, such as 'rate limit 
exceeded' or something similar. I didn't really see any issue with that 
just bubbling straight up into the log file. No need to duplicate the 
same error-checking code directly in every single method.

> Also, this seems like quite a common idiom:
>
>          for tweet in self._get_url(url):
>              self._publish_tweet(tweet)
>
> Is there any opportunity for refactoring some commonality?

Originally I had written the method as _publish_tweets (plural) and it 
was built-in with the ability to iterate over multiple tweets. That 
worked for a good long while, but then there were just a couple API 
endpoints that only returned a single tweet object rather than a list of 
tweets, and so then I needed to refactor the method to be able to handle 
both cases, and this is what I came up with.

If you really want, we can have something like this:

def _publish_tweets(self, url):
     for tweet in self._get_url(url):
         self._publish_tweet(tweet)

But I just thought it was too simple to even bother.

>> +    @feature
>> +    def receive(self):
>> +        """Gather and publish all incoming messages."""
>> +        # TODO I know mentions and lists are actually incorporated
>> +        # within the home timeline, but calling them explicitely will
>> +        # ensure that more of those types of messages appear in the
>> +        # timeline (eg, so they don't get drown out by everything
>> +        # else). I'm not sure how necessary it is, though.
>> +        self.home()
>> +        self.mentions()
>> +        self.lists()
>> +        self.private()
>
> What happens if you get an exception in one of these streams?  Does it make
> sense to still try to publish the other streams?

What kind of exceptions are you suspecting there to be? The only thing I 
can think of would be in the event of a failed login, but if that was 
the case then they would all fail equally, and really the user would 
have to go back to UOA and re-register his account properly. Not a lot 
we can do about that here.

>
>> +
>> +    @feature
>> +    def send_private(self, screen_name, message):
>> +        # TODO: This is utterly broken. 403s no matter what.
>
> :(  Does it work in Quantal gwibber?

It seems to. As you said, more research is necessary.

One problem is that quantal gwibber is using exclusively Twitter API v1, 
which is deprecated and will be going away actually quite soon (good 
timing for this rewrite!). My code is all using the newer API v1.1, so 
it's not always helpful to look at what quantal gwibber does, because 
the API isn't always the same.

>> +# https://dev.twitter.com/docs/api/1.1/post/statuses/update
>> +    @feature
>> +    def send_thread(self, message_id, message):
>> +        """Send a reply message to message_id.
>> +
>> +        Note that you have to @mention the message_id owner's screen name in
>> +        order for Twitter to actually accept this as a reply. Otherwise it will
>> +        just be an ordinary tweet."""
>
> Closing triple-quote should be on the next line, by itself, under the 'j' in
> 'just'.

Oh yeah, oops.

>> +# https://dev.twitter.com/docs/api/1.1/post/statuses/destroy/%3Aid
>> +    @feature
>> +    def delete(self, message_id):
>> +        url = self._api_base.format(endpoint='statuses/destroy/%s' % message_id)
>
> Any reason you're using % interpolation here?  Perhaps either use {} and
> .format(), or just concatenation (e.g.: 'statuses/destroy/' + message_id)?

Yeah, it looks really ugly to me when I see 
'string{}'.format('foo{}'.format('bar')), but I also wanted to allow 
message_id to be an int rather than requiring it be a string, which is 
what + concatenation would have done.

% interpolation just made the int/str conversion easy without being too 
verbose.

>> +# https://dev.twitter.com/docs/api/1.1/post/statuses/retweet/%3Aid
>> +    @feature
>> +    def retweet(self, message_id):
>> +        url = self._api_base.format(endpoint='statuses/retweet/%s' % message_id)
>
> Same.

Same. :-P

>>       @mock.patch('gwibber.utils.base.Model', TestModel)
>>       @mock.patch('gwibber.utils.base._seen_messages', {})
>> +    @mock.patch('gwibber.utils.base._seen_ids', {})
>
> Do you really need to mock both _seen_messages and _seen_ids?  What's the
> difference between these two?  Can they ever get out of sync?  Do we need
> additional tests to ensure that the constraints of these two data structures
> are always correct?  Do we need additional tests for when they contain cache
> hits?

Sorry, I was in a rush when I wrote this. Base._publish now adds to both 
of these so it seemed safer to mock both for consistency, but it's not 
clear to me right off the top of my head whether it's strictly necessary 
for the test itself.

>> +    def test_unpublish(self):
>> +        base = Base(FakeAccount())
>> +        self.assertEqual(0, TestModel.get_n_rows())
>> +        self.assertTrue(base._publish(
>> +            message_id='1234',
>> +            stream='messages',
>> +            sender='fred',
>> +            sender_nick='freddy',
>> +            from_me=True,
>> +            timestamp='today',
>> +            message='hello, @jimmy',
>> +            likes=10,
>> +            liked=True))
>
> Since this is specifically a test of deleting rows, you probably don't
> necessarily need to test the return value of _publish().  I guess it can't
> hurt either.

Yeah, this was just a quick copy&paste from a different test, in order 
to set up an environment in which to test message deletion.

>>       def tearDown(self):
>> +        # Ensure that any log entries we haven't tested just get consumed so
>> +        # as to isolate out test logger from other tests.
>>           self.log_mock.stop()
>
> The diff doesn't show it, but does setUp() set Base._SYNCHRONIZE = True?

No, I don't need the synchronize stuff because I write the tests to just 
test the synchronous methods directly. I never invoke the 
thread-starting __call__ in the tests.

> === modified file 'gwibber/gwibber/utils/base.py'
> --- gwibber/gwibber/utils/base.py	2012-09-25 20:25:42 +0000
> +++ gwibber/gwibber/utils/base.py	2012-10-02 14:33:58 +0000
>> @@ -42,6 +42,7 @@
>>   # representing the rows matching those keys.  It is used for quickly finding
>>   # duplicates when we want to insert new rows into the model.
>>   _seen_messages = {}
>> +_seen_ids = {}
>
> What is _seen_ids for and how is it used differently than _seen_messages?  Do
> we still need both of these?  If so, can you adjust the comment to explain
> what these are for?

_seen_ids maps message ids to rowiters. It was the only way I could 
think to allow deletion of messages without iterating over the whole 
SharedModel terribly. Because at delete-time, we only have access to the 
message_id, not the sender & message data that would be required to look 
up the rowiter in _seen_messages.

>> +    def _unpublish(self, message_id):
>> +        """Remove message_id from the Dee.SharedModel."""
>> +        triple = [self.__class__.__name__.lower(),
>> +                  self._account.id,
>> +                  message_id]
>> +
>> +        row_iter = _seen_ids.pop(message_id)
>> +        if row_iter is None:
>> +            log.error('Tried to delete an invalid message id.')
>> +            return
>
> Have you tested the 'miss' logic?  What happens when you .pop() with a missing
> key?  (this is a trick question :)

Well, I haven't specified a default, which means it will raise KeyError 
if the id is not present. This doesn't particularly bother me, because 
a) the threading logic will safely log and ignore the exception, and b) 
this can only be invoked by clicking on the delete button on a message 
that's being displayed in gwibber, so the message_id being valid is 
almost *tautological*: if message_id wasn't valid, then how the hell did 
this method even get invoked?

> What happens if the .pop() does work?  Does anything need to happen to
> _seen_messages to keep it in sync?  Isn't the row associated with the row_iter
> going to be removed?  If so, what happens to the same row_iter sitting as a
> value in the _seen_messages mapping?

Right, they would become out of sync. But only insofar as there will be 
a sender/message key pointing at a rowiter that points at a row that 
doesn't exist. I'm not certain that this is particularly harmful, beyond 
being a minor memory leak.

I suppose in some extreme case if a person deletes a message and then 
tries to submit an identical message, it would cause us to attempt to 
append that message onto a row that doesn't exist. I don't really see 
that being a useful real-world scenario, because in my mind the 
use-cases for deleting a message would be to either a) leave it deleted, 
or b) to submit a *different* message in it's place.

Anyway, do you know of an easy way to do a dict.pop() that pops based on 
values instead of keys? Or are we going to have to scan the whole dict 
in an inefficient way? Or is there maybe some kind of alternate data 
structure we should be looking at? Some kind two-way dict where both 
keys and values are hashed and indexed?

>
>> +
>> +        row = Model.get_row(row_iter)
>> +        if triple in row[IDS_IDX]:
>> +            if len(row[IDS_IDX]) == 1:
>> +                # Message only exists on Twitter, delete it
>
> Actually, the method isn't Twitter specific is it?  If you think _unpublish()
> is only useful for Twitter, should it really go in the base class?  Or maybe
> the comment just needs updating?

Sorry, yeah, comment needs updating. It is a general-purpose method. It 
would need to be defined in the base class anyway because of it's use of 
IDS_IDX, Model, etc. Those are things that aren't imported into 
twitter.py, and would be sloppy to do so when the imports are all 
already here.

> === modified file 'gwibber/gwibber/utils/download.py'
> --- gwibber/gwibber/utils/download.py	2012-09-14 20:29:11 +0000
> +++ gwibber/gwibber/utils/download.py	2012-10-02 14:33:58 +0000
>> @@ -51,11 +51,11 @@
>>           url = self.url
>>           headers = self.headers.copy()
>>           if self.params is not None:
>> -            params = urlencode(self.params)
>> +            params = urlencode(self.params).replace('+', '%20')
>
> Why do this?  (Yes, I know you explained it to me, but pretend I'm a
> first-time reader of the code.  Explain it to them! :).

Because urlencode does not provide any option to use quote() instead of 
quote_plus() which it enforces, but Twitter requires percent-encoded 
spaces, and this is harmless to any other protocol. Also it's easier 
than re-implementing urlencode from scratch. hooray for code reuse!

>
>>               if self.post:
>>                   data = params.encode('utf-8')
>>                   headers['Content-Type'] = (
>> -                    'application/x-www-form-urlencoded; charset=utf-8')
>> +                    'application/x-www-form-urlencoded')
>
> Why do this too?  This one you haven't explained even to me! :)

It's because oauthlib is incredibly brittle. It scans for precisely the 
string 'application/x-www-form-urlencoded' and raises an exception if 
it's not exactly that. So it seems this was necessary in order to get 
OAuth signatures to even function.

On second thought however, it's not clear to me what oauthlib actually 
*does* with that header when it's signing (I'm pretty sure the value of 
that header is not included in the base string that gets signed). So it 
might be worthwhile to undo this change and then see if anything breaks. 
This was just one of those things that I did back when I was really 
struggling to get any signature working at all, and then after I got it 
working, I kept it because a) I forgot that I'd done it, and b) it 
didn't occur to me until just now that it might not have been necessary 
to make the signatures work.

So in other words, I'll test this ;-)

Sorry I'm not on IRC, I just got back into town from my big train trip 
and I'm having some troubles with emacs24/system updaty type problems. 
I'll get it working shortly and then fix up this mp for you by tonight.