Merge lp:~jonathanjgster/ibid/jjg into lp:~ibid-core/ibid/old-trunk-pack-0.92

Proposed by Jonathan Groll
Status: Merged
Approved by: Stefano Rivera
Approved revision: 582
Merged at revision: 591
Proposed branch: lp:~jonathanjgster/ibid/jjg
Merge into: lp:~ibid-core/ibid/old-trunk-pack-0.92
Diff against target: None lines
To merge this branch: bzr merge lp:~jonathanjgster/ibid/jjg
Reviewer Review Type Date Requested Status
Michael Gorven Approve
Jonathan Hitchcock Approve
Stefano Rivera Approve
Review via email: mp+4880@code.launchpad.net

This proposal supersedes a proposal from 2009-03-22.

Revision history for this message
Jonathan Groll (jonathanjgster) wrote : Posted in a previous version of this proposal

Added new delicious plugin that grabs all URLs and posts to a delicious account defined in config. Plugin can coexist with url plugin.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

I haven't tested it yet, but just from a read of the code:
* Rather than "delname" and "delpwd", make them "username" and "password".
* You don't need to pass the username and password to _add_post; they are in self.
* Instead of x.find(y) == -1 you can say y not in x.
* Ibid favours pre-compiling regexes and saving them in the Processor object.
* You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.
* Be careful: your url comes in as unicode, but you transform it into a string.
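The membership-test and regex points above can be sketched as follows (a minimal illustration only; the Grab class here is a stand-in, not Ibid's actual Processor API):

```python
import re

# Ibid convention: compile regexes once, at class definition time, and
# keep them on the object rather than recompiling inside the handler.
class Grab(object):
    url_re = re.compile(r'\S+://\S+')

    def grab(self, message):
        # Preferred membership test: "y in x" rather than "x.find(y) != -1"
        if '://' not in message:
            return None
        match = self.url_re.search(message)
        return match.group(0) if match else None
```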

Otherwise, looks good.

review: Needs Fixing
Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

How about putting the context in the description?

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

Very good first shot at an Ibid module :-) Just a couple of things, though:
 * I prefer to put imports into three groups (separated with a newline):
stdlib modules, other modules, and then Ibid modules.
 * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
be changed to accept any URL. The current url plugin is very liberal in what
it detects. It might be an idea to capture anything which looks like a URL,
but then validate it in some way (check that the domain exists, for example).
 * Don't catch exceptions unless they're expected errors and you're going to
do something useful with them. Ibid catches and logs exceptions raised by
plugins.
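For reference, the grab pattern in the preview diff below is already fairly liberal; a quick check of what it captures (the pattern is copied verbatim from the proposed url.py, the helper name is illustrative):

```python
import re

# The grab pattern from the proposed url.py: scheme-prefixed URLs,
# www./ftp. prefixes, or bare names ending in one of the listed TLDs.
GRAB_RE = re.compile(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')

def grab_urls(message):
    """Return everything in the message that looks like a URL."""
    return GRAB_RE.findall(message)
```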

review: Needs Fixing
Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

As I see was suggested on IRC, I think that this should be incorporated in the
url plugin, since it's doing the same thing as the current Grab processor,
just to a different location.

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

Adding the channel and source names as tags might also be useful.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Hi Michael (2009.03.14_23:36:11_+0200)
> * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
> accept any URL. The current url plugin is very liberal in what it detects. It
> might be an idea to capture anything which looks like a URL, but then
> validate it in some way (check that domain exists for example).

This looks like it uses the same approach as the current url module:
* Detect things with http:// - they must be URLs
* Detect things with that list of TLDs that look URL-ish. This should probably
  be made more general, when it gets merged with url.py

SR

Revision history for this message
Jonathan Hitchcock (vhata) wrote : Posted in a previous version of this proposal

Not sure I like the idea of checking whether a domain exists, or checking for what the bot thinks is a "valid TLD" - if it looks like a URL, it should be logged, I think. Somebody might say "My site is going to be at http://foo.com/ when my DNS goes through", and the bot might not know about that TLD.

Anyhoo, that aside, it looks like a good module, but the changes above (especially the merging into the existing URL grabber) should be done before the final merge.

review: Needs Fixing
Revision history for this message
Jonathan Groll (jjg-groll) wrote : Posted in a previous version of this proposal

Hi Stefano,

On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> * You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.

Not sure if I follow your thinking on this one here - is the intention
to get rid of a BeautifulSoup dependency?

Personally, I find this code:
            soup = BeautifulSoup(urlopen(url))
            title = soup.title.string

easier to read, and simpler than calling get_html_parse_tree to
retrieve an etree (and the resultant iteration on that); also, what
would be the benefit of using get_html_parse_tree to get back a soup?

All that is required here is the (first) title.

Cheers,
Jonathan

P.S. of course, the following is also elegant:
import lxml.html
t = lxml.html.parse(url)
title = t.find(".//title").text

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

On Saturday 21 March 2009 16:03:10 Jonathan Groll wrote:
> On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> > * You should probably use get_soup from utils rather than creating your
> > own soup. It knows about encodings and compression.
>
> Not sure if I follow your thinking on this one here - is the intention
> to get rid of a BeautifulSoup dependency?

get_html_parse_tree() (which used to be called get_soup()) returns a
BeautifulSoup tree by default. So your line would simply be:

soup = get_html_parse_tree(url)

Revision history for this message
Jonathan Groll (jonathanjgster) wrote : Posted in a previous version of this proposal

Code has been adjusted as requested. The code is now in url.py, and most of the cruft has been removed; regexes are compiled; delname and delpwd are now username and password; the line context ('message_raw') is added as delicious extended information, and the source and channel are added to the tags; most of the exception handling logic has been removed to let Ibid take care of it; finally, the data posted to delicious is utf-8 urlencoded.

Existing code in url.py that wasn't touched, but was pointed out as needing to be fixed, was left alone until after this merge:
= The regex to match URLs could do with some love
= Instead of x.find(y) == -1 you can say y not in x

I've changed the URL title retrieval logic to use element tree rather than beautiful soup. At least it can do xe.com now.

Anyway, please merge. I'll leave PetitChouChou (my ibid instance) running in #ibid for a day or so; it'll log URLs to delicious.com/ibidtest.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Ah. Now that it's in URL, the username and password fields need to be renamed again. Sorry.
How about: delicious_username and delicious_password

Line 72 - should have spaces around the +

The obfusc section looks like it's very IRC specific. Have you tested with other protocols?

I don't think the try-except logic in _get_title is necessary. It goes against ibid-style (I make the same mistake all the time).

I'd put the delicious regexs inside the Delicious class.

Aside from that, no complaints. Will test it on hardy now.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

> Existing code in url.py that wasn't touched, but was pointed out as needing to be fixed was left
> alone, until after this merge

Feel free to fix that stuff - this branch looks about ready and we might as well merge it all in one.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

> Will test it on hardy now.

Works like a charm

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

> I don't think the try-except logic in _get_title is necessary. It goes against ibid-style

Sorry, I misunderstood the reason for that. But you should change it to log.exception rather than log.error

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

> The obfusc section looks like it's very IRC specific. Have you tested with other protocols?

In my jabber test, this resulted in my JID being attached as a tag. My JID happens to be my email address, like many people.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Almost there

review: Needs Fixing
Revision history for this message
Jonathan Groll (jonathanjgster) wrote :

Minor changes following Stefano's recommendations; regexes are now in the Delicious class; the code has now been tested with both individual and groupchat jabber; failure to extract a title results in log.exception instead of log.error.

Also, tags that could contain valueless information are now not posted, e.g. stefanor@4F6873FA.3661FC01.FAE4B284.IP; similarly, there seems to be no value in tagging the connection name with a jabber source.
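The masking described above can be sketched like this (the two patterns are copied from the url.py diff below; the helper name and the partition-based split are illustrative):

```python
import re

# Patterns from the proposed url.py: strip the '@host.' portion and drop
# masked '.IP' hosts so hostmasks are not published verbatim as tags.
at_re = re.compile(r'@\S+?\.')
ip_re = re.compile(r'\.IP$')

def obfuscate(connection):
    """Mask the host part of an IRC-style nick!user@host string."""
    nick, _, host = connection.partition('!')
    host = host or connection
    if ip_re.search(host):       # masked IP hosts carry no useful info
        return ''
    return at_re.sub('^', host)  # user@example.com -> user^com
```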

Revision history for this message
Stefano Rivera (stefanor) wrote :

I approve this, conditional on e.message being changed to unicode(e) during the merge.

.message is deprecated in Python 2.6
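The portable spelling formats the exception object itself; a minimal sketch (Python 3 spells it str, the Python 2 code in this proposal would use unicode):

```python
def describe(exc):
    # exc.message was deprecated in Python 2.6 and removed in Python 3;
    # format the exception object instead (unicode(exc) on Python 2).
    return str(exc)

try:
    raise ValueError('could not fetch title')
except ValueError as e:
    detail = describe(e)
```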

review: Approve
Revision history for this message
Stefano Rivera (stefanor) wrote :

I presume you know that different IRC networks do masking differently - the IP regex may not be sufficient - but I can live with that for now.

Revision history for this message
Jonathan Hitchcock (vhata) :
review: Approve
Revision history for this message
Michael Gorven (mgorven) wrote :

On Wednesday 25 March 2009 14:14:36 Stefano Rivera wrote:
> I approve this conditional on e.message being changed to unicode(e) during
> merge.

The message shouldn't be explicitly logged at all -- log.exception() will do
that automatically.

You also can't rely on the source name (the Jabber source could be called
anything). Instead of "event.source == 'jabber'" you should
use "ibid.sources[event.source].type == 'jabber'".

I approve if these two things are changed.
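The source-type check can be sketched with stand-in objects (ibid.sources is the real registry; the Source class and names here are illustrative only):

```python
# Stand-in for ibid.sources: a registry mapping configured source names
# (which can be anything the admin chooses) to objects with a .type.
class Source(object):
    def __init__(self, type):
        self.type = type

sources = {'my-jabber': Source('jabber'), 'atrum': Source('irc')}

def is_jabber(source_name):
    # Compare the source's type, not its arbitrary configured name
    return sources[source_name].type == 'jabber'
```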

review: Approve

Preview Diff

1=== modified file 'ibid/config.ini'
2--- ibid/config.ini 2009-03-17 07:19:20 +0000
3+++ ibid/config.ini 2009-03-25 10:35:11 +0000
4@@ -69,6 +69,9 @@
5 source = atrum
6 channel = "#ibid"
7 server = localhost
8+ [[url]]
9+ delicious_username = ibidtest
10+ delicious_password = a123456
11
12 [databases]
13 ibid = sqlite:///ibid.db
14
15=== modified file 'ibid/plugins/url.py'
16--- ibid/plugins/url.py 2009-03-08 13:16:28 +0000
17+++ ibid/plugins/url.py 2009-03-25 10:35:11 +0000
18@@ -1,6 +1,7 @@
19 from datetime import datetime
20 from urllib import urlencode
21-from urllib2 import urlopen, HTTPRedirectHandler, build_opener, HTTPError
22+from urllib2 import urlopen, HTTPRedirectHandler, build_opener, HTTPError, HTTPBasicAuthHandler, install_opener
23+import logging
24 import re
25
26 from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table
27@@ -9,8 +10,11 @@
28 from ibid.plugins import Processor, match, handler
29 from ibid.config import Option
30 from ibid.models import Base
31-
32-help = {'url': u'Captures URLs seen in channel, and shortens and lengthens URLs'}
33+from ibid.utils import get_html_parse_tree
34+
35+help = {'url': u'Captures URLs seen in channel to database and/or to delicious, and shortens and lengthens URLs'}
36+
37+log = logging.getLogger('plugins.url')
38
39 class URL(Base):
40 __table__ = Table('urls', Base.metadata,
41@@ -27,10 +31,79 @@
42 self.identity_id = identity_id
43 self.time = datetime.now()
44
45+class Delicious():
46+
47+ at_re = re.compile('@\S+?\.')
48+ ip_re = re.compile('\.IP$')
49+
50+ def add_post(self,username,password,event,url=None):
51+ "Posts a URL to delicious.com"
52+
53+ date = datetime.now()
54+ title = self._get_title(url)
55+
56+ connection_body = event.sender['connection'].split('!')
57+ if len(connection_body) == 1:
58+ connection_body.append(event.sender['connection'])
59+
60+ if self.ip_re.search(connection_body[1]) != None:
61+ connection_body[1] = ""
62+
63+ if event.source == 'jabber':
64+ obfusc_conn = ""
65+ obfusc_chan = event.channel.replace('@', '^')
66+ else:
67+ obfusc_conn = self.at_re.sub('^', connection_body[1])
68+ obfusc_chan = self.at_re.sub('^', event.channel)
69+
70+
71+ tags = event.sender['nick'] + " " + obfusc_conn + " " + obfusc_chan \
72+ + " " + event.source
73+
74+ data = {
75+ 'url' : url,
76+ 'description' : title,
77+ 'tags' : tags,
78+ 'replace' : u"yes",
79+ 'dt' : date,
80+ 'extended' : event.message_raw
81+ }
82+
83+ self._set_auth(username,password)
84+ posturl = "https://api.del.icio.us/v1/posts/add?" \
85+ + urlencode(data, 'utf-8')
86+ resp = urlopen(posturl).read()
87+ if 'done' in resp:
88+ log.info(u"Successfully posted url %s to delicious, posted in channel %s by nick %s at time %s", \
89+ url, event.channel, event.sender['nick'], date)
90+ else:
91+ log.error(u"Error posting url %s to delicious: %s", url, resp)
92+
93+ def _get_title(self,url):
94+ "Gets the title of a page"
95+ try:
96+ headers = {'User-Agent': 'Mozilla/5.0'}
97+ etree = get_html_parse_tree(url, None, headers, 'etree')
98+ title = etree.findtext("head/title")
99+ return title
100+ except Exception, e:
101+ log.exception(u"Delicious logic - error determining the title for url %s: %s", url, e.message)
102+ return url
103+
104+ def _set_auth(self,username,password):
105+ "Provides HTTP authentication on username and password"
106+ auth_handler = HTTPBasicAuthHandler()
107+ auth_handler.add_password('del.icio.us API', 'https://api.del.icio.us', username, password)
108+ opener = build_opener(auth_handler)
109+ install_opener(opener)
110+
111 class Grab(Processor):
112
113 addressed = False
114 processed = True
115+ username = Option('delicious_username', 'delicious account name')
116+ password = Option('delicious_password', 'delicious account password')
117+ delicious = Delicious()
118
119 @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
120 def grab(self, event, url):
121@@ -46,6 +119,9 @@
122 session.flush()
123 session.close()
124
125+ if self.username != None:
126+ self.delicious.add_post(self.username, self.password, event, url)
127+
128 class Shorten(Processor):
129 u"""shorten <url>"""
130 feature = 'url'
131@@ -75,7 +151,7 @@
132
133 def setup(self):
134 self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I)
135-
136+
137 @handler
138 def lengthen(self, event, url):
139 opener = build_opener(NullRedirect())
140@@ -88,5 +164,5 @@
141
142 f.close()
143 event.addresponse(u"No redirect")
144-
145+
146 # vi: set et sta sw=4 ts=4:
