Merge lp:~jonathanjgster/ibid/jjg into lp:~ibid-core/ibid/old-trunk-pack-0.92
- jjg
- Merge into old-trunk-pack-0.92
Status: | Merged |
---|---|
Approved by: | Stefano Rivera |
Approved revision: | 582 |
Merged at revision: | 591 |
Proposed branch: | lp:~jonathanjgster/ibid/jjg |
Merge into: | lp:~ibid-core/ibid/old-trunk-pack-0.92 |
Diff against target: | None lines |
To merge this branch: | bzr merge lp:~jonathanjgster/ibid/jjg |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Michael Gorven | Approve | ||
Jonathan Hitchcock | Approve | ||
Stefano Rivera | Approve | ||
Review via email: mp+4880@code.launchpad.net |
This proposal supersedes a proposal from 2009-03-22.
Commit message
Description of the change
Jonathan Groll (jonathanjgster) wrote : Posted in a previous version of this proposal | # |
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal | # |
I haven't tested it yet, but just from a read of the code:
* I don't like "delname" and "delpwd"; rather, make them "username" and "password".
* You don't need to pass the username and password to _add_post, they are in self.
* Instead of x.find(y) == -1 you can say y not in x
* Ibid favours pre-compiling regexes and saving them in the Processor object.
* You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.
* Be careful - your url comes in as unicode, but you transform it into a string.
Otherwise, looks good.
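Two of the points above (the membership test and the pre-compiled regex) can be sketched in a few lines of Python; `UrlSpotter` here is a hypothetical stand-in for an Ibid Processor, not code from the branch:

```python
import re

text = "check out http://example.com for details"

# Membership test instead of x.find(y) == -1:
assert "http://" in text            # preferred
assert text.find("ftp://") == -1    # old style; "'ftp://' not in text" is clearer

# Pre-compiling a regex once, as a class attribute,
# instead of re.search()ing with a pattern string on every event:
class UrlSpotter:
    url_re = re.compile(r'\S+://\S+')  # compiled once, shared by instances

    def spot(self, message):
        match = self.url_re.search(message)
        return match.group(0) if match else None

assert UrlSpotter().spot(text) == "http://example.com"
```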
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal | # |
How about putting the context in the description?
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal | # |
Very good first shot at an Ibid module :-) Just a couple of things, though:
* I prefer to put imports into three groups (separated with a newline):
stdlib modules, other modules, and then Ibid modules.
* Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
accept any URL. The current url plugin is very liberal in what it detects. It
might be an idea to capture anything which looks like a URL, but then
validate it in some way (check that domain exists for example).
* Don't catch exceptions unless they're expected errors and you're going to
do something useful with them. Ibid catches and logs exceptions raised by
plugins.
review needs_fixing
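The import grouping Michael describes might look like this at the top of a plugin; the third-party and Ibid lines are shown as comments since they aren't importable outside an Ibid checkout:

```python
# Group 1: standard library modules
import logging
import re
from datetime import datetime

# Group 2: other (third-party) modules
# from sqlalchemy import Column, Integer, Unicode

# Group 3: Ibid modules
# from ibid.plugins import Processor, match
# from ibid.config import Option
```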
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal | # |
As was suggested on IRC, I think that this should be incorporated into the
url plugin, since it does the same thing as the current Grab processor,
just to a different location.
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal | # |
Adding the channel and source names as tags might also be useful.
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal | # |
Hi Michael,
> * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
> accept any URL. The current url plugin is very liberal in what it detects. It
> might be an idea to capture anything which looks like a URL, but then
> validate it in some way (check that domain exists for example).
This looks like it uses the same approach as the current url module:
* Detect things with http:// - they must be URLs
* Detect things with that list of TLDs that look URL-ish. This should probably
be made more general, when it gets merged with url.py
SR
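For reference, the Grab pattern from the branch (it appears in the preview diff below) detects exactly the three shapes Stefano describes; a bare hostname with a TLD outside the list slips through unless it carries a scheme:

```python
import re

# The @match pattern from the branch: scheme-prefixed URLs, www./ftp. hosts,
# or bare hostnames ending in a short list of TLDs.
url_re = re.compile(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')

assert url_re.search('see http://xe.com/ucc').group(1) == 'http://xe.com/ucc'
assert url_re.search('try www.example.org today').group(1) == 'www.example.org'
assert url_re.search('plain example.net works').group(1) == 'example.net'
# An unlisted TLD with no scheme is missed:
assert url_re.search('visit example.io') is None
```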
Jonathan Hitchcock (vhata) wrote : Posted in a previous version of this proposal | # |
Not sure I like the idea of checking whether a domain exists, or checking for what the bot thinks is a "valid TLD" - if it looks like a URL, it should be logged, I think. Somebody might say "My site is going to be at http://
Anyhoo, that aside, it looks like a good module, but the changes above (especially the merging into the existing URL grabber) should be done before the final merge.
Jonathan Groll (jjg-groll) wrote : Posted in a previous version of this proposal | # |
Hi Stefano,
On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> * You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.
Not sure if I follow your thinking on this one here - is the intention
to get rid of a beautifulsoup dependency?
Personally, I find this code:
soup = BeautifulSoup(urlopen(url).read())
title = soup.title.string
easier to read, and simpler than calling get_html_parse_tree to
retrieve an etree (and the resultant iteration on that); also what
would be the benefit of using get_html_parse_tree to get back a soup?
All that is required is the (1st) title here.
Cheers,
Jonathan
P.S. of course, the following is also elegant:
import lxml.html
t = lxml.html.parse(url)
title = t.find(".//title").text
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal | # |
On Saturday 21 March 2009 16:03:10 Jonathan Groll wrote:
> On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> > * You should probably use get_soup from utils rather than creating your
> > own soup. It knows about encodings and compression.
>
> Not sure if I follow your thinking on this one here - is the intention
> to get rid of a beautifulsoup dependency?
get_html_parse_tree returns a
BeautifulSoup tree by default. So your line would simply be:
soup = get_html_parse_tree(url)
Jonathan Groll (jonathanjgster) wrote : Posted in a previous version of this proposal | # |
Code has been adjusted as requested. The code is now in url.py, and most of the cruft has been removed: regexes are compiled; delname and delpwd are now username and password; the line context ('message_raw') is added as delicious extended information, along with the source and channel as tags; most of the exception handling logic has been removed to let ibid take care of it; and finally the data posted to delicious is utf-8 urlencoded.
Existing code in url.py that wasn't touched, but was pointed out as needing to be fixed was left alone, until after this merge:
* The regex to match URLs could do with some love
* Instead of x.find(y) == -1 you can say y not in x
I've changed the URL title retrieval logic to use element tree rather than beautiful soup. At least it can do xe.com now.
Anyway, please merge. I'll leave PetitChouChou (my ibid instance) running in #ibid for a day or so; it'll log URLs to delicious.
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal | # |
Ah. Now that it's in URL, the username and password fields need to be renamed again. Sorry.
How about: delicious_username and delicious_password
Line 72 - should have spaces around the +
The obfusc section looks like it's very IRC specific. Have you tested with other protocols?
I don't think the try-except logic in _get_title is necessary. It goes against ibid-style (I make the same mistake all the time).
I'd put the delicious regexes inside the Delicious class.
Aside from that, no complaints. Will test it on hardy now.
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal | # |
> Existing code in url.py that wasn't touched, but was pointed out as needing to be fixed was left
> alone, until after this merge
Feel free to fix that stuff - this branch looks about ready and we might as well merge it all in one.
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal | # |
> Will test it on hardy now.
Works like a charm
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal | # |
> I don't think the try-except logic in _get_title is necessary. It goes against ibid-style
Sorry, I misunderstood the reason for that. But you should change it to log.exception rather than log.error
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal | # |
> The obfusc section looks like it's very IRC specific. Have you tested with other protocols?
In my jabber test, this resulted in my JID being attached as a tag. My JID happens to be my email address, like many people.
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal | # |
Almost there
Jonathan Groll (jonathanjgster) wrote : | # |
Minor changes following Stefano's recommendations: regexes are now in the Delicious class; the code has now been tested with both individual and groupchat jabber; failure to extract a title results in log.exception instead of log.error.
Also, tags that could contain valueless information are now not posted, e.g. stefanor@
Stefano Rivera (stefanor) wrote : | # |
I approve this, conditional on e.message being changed to unicode(e) during the merge.
.message is deprecated in Python 2.6
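For illustration, the suggested change looks like this; the snippet below uses Python 3's str(e), the modern equivalent of the unicode(e) that Stefano suggests for this Python 2 codebase:

```python
# Getting an exception's text portably: e.message was deprecated in
# Python 2.6, so unicode(e) (Python 2) / str(e) (Python 3) is preferred.
try:
    raise ValueError("could not fetch title")
except ValueError as e:
    msg = str(e)  # Python 2 code would use unicode(e) here

assert msg == "could not fetch title"
```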
Stefano Rivera (stefanor) wrote : | # |
I presume you know that different IRC networks do masking differently - the IP regex may not be sufficient - but I can live with that for now.
Jonathan Hitchcock (vhata) : | # |
Michael Gorven (mgorven) wrote : | # |
On Wednesday 25 March 2009 14:14:36 Stefano Rivera wrote:
> I approve this conditional on e.message being changed to unicode(e) during
> merge.
The message shouldn't be explicitly logged at all -- log.exception() will do
that automatically.
You also can't rely on the source name (the Jabber source could be called
anything). Instead of "event.source == 'jabber'" you should
use "ibid.sources[event.source].type == 'jabber'".
I approve if these two things are changed.
review approve
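Michael's point about log.exception() can be demonstrated standalone: it appends the active traceback to the log record automatically, so the handler never needs to interpolate the exception message itself. The failing dict lookup below is just a stand-in for a failed title fetch:

```python
import io
import logging

buf = io.StringIO()
log = logging.getLogger('plugins.url.demo')
log.setLevel(logging.ERROR)
log.addHandler(logging.StreamHandler(buf))

try:
    title = {}['title']  # stand-in for a fetch that raises
except KeyError:
    # No need to pass the exception message: the traceback is appended.
    log.exception(u"error determining the title")

output = buf.getvalue()
assert "error determining the title" in output
assert "Traceback (most recent call last)" in output
assert "KeyError" in output
```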
Preview Diff
1 | === modified file 'ibid/config.ini' |
2 | --- ibid/config.ini 2009-03-17 07:19:20 +0000 |
3 | +++ ibid/config.ini 2009-03-25 10:35:11 +0000 |
4 | @@ -69,6 +69,9 @@ |
5 | source = atrum |
6 | channel = "#ibid" |
7 | server = localhost |
8 | + [[url]] |
9 | + delicious_username = ibidtest |
10 | + delicious_password = a123456 |
11 | |
12 | [databases] |
13 | ibid = sqlite:///ibid.db |
14 | |
15 | === modified file 'ibid/plugins/url.py' |
16 | --- ibid/plugins/url.py 2009-03-08 13:16:28 +0000 |
17 | +++ ibid/plugins/url.py 2009-03-25 10:35:11 +0000 |
18 | @@ -1,6 +1,7 @@ |
19 | from datetime import datetime |
20 | from urllib import urlencode |
21 | -from urllib2 import urlopen, HTTPRedirectHandler, build_opener, HTTPError |
22 | +from urllib2 import urlopen, HTTPRedirectHandler, build_opener, HTTPError, HTTPBasicAuthHandler, install_opener |
23 | +import logging |
24 | import re |
25 | |
26 | from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table |
27 | @@ -9,8 +10,11 @@ |
28 | from ibid.plugins import Processor, match, handler |
29 | from ibid.config import Option |
30 | from ibid.models import Base |
31 | - |
32 | -help = {'url': u'Captures URLs seen in channel, and shortens and lengthens URLs'} |
33 | +from ibid.utils import get_html_parse_tree |
34 | + |
35 | +help = {'url': u'Captures URLs seen in channel to database and/or to delicious, and shortens and lengthens URLs'} |
36 | + |
37 | +log = logging.getLogger('plugins.url') |
38 | |
39 | class URL(Base): |
40 | __table__ = Table('urls', Base.metadata, |
41 | @@ -27,10 +31,79 @@ |
42 | self.identity_id = identity_id |
43 | self.time = datetime.now() |
44 | |
45 | +class Delicious(): |
46 | + |
47 | + at_re = re.compile('@\S+?\.') |
48 | + ip_re = re.compile('\.IP$') |
49 | + |
50 | + def add_post(self,username,password,event,url=None): |
51 | + "Posts a URL to delicious.com" |
52 | + |
53 | + date = datetime.now() |
54 | + title = self._get_title(url) |
55 | + |
56 | + connection_body = event.sender['connection'].split('!') |
57 | + if len(connection_body) == 1: |
58 | + connection_body.append(event.sender['connection']) |
59 | + |
60 | + if self.ip_re.search(connection_body[1]) != None: |
61 | + connection_body[1] = "" |
62 | + |
63 | + if event.source == 'jabber': |
64 | + obfusc_conn = "" |
65 | + obfusc_chan = event.channel.replace('@', '^') |
66 | + else: |
67 | + obfusc_conn = self.at_re.sub('^', connection_body[1]) |
68 | + obfusc_chan = self.at_re.sub('^', event.channel) |
69 | + |
70 | + |
71 | + tags = event.sender['nick'] + " " + obfusc_conn + " " + obfusc_chan \ |
72 | + + " " + event.source |
73 | + |
74 | + data = { |
75 | + 'url' : url, |
76 | + 'description' : title, |
77 | + 'tags' : tags, |
78 | + 'replace' : u"yes", |
79 | + 'dt' : date, |
80 | + 'extended' : event.message_raw |
81 | + } |
82 | + |
83 | + self._set_auth(username,password) |
84 | + posturl = "https://api.del.icio.us/v1/posts/add?" \ |
85 | + + urlencode(data, 'utf-8') |
86 | + resp = urlopen(posturl).read() |
87 | + if 'done' in resp: |
88 | + log.info(u"Successfully posted url %s to delicious, posted in channel %s by nick %s at time %s", \ |
89 | + url, event.channel, event.sender['nick'], date) |
90 | + else: |
91 | + log.error(u"Error posting url %s to delicious: %s", url, resp) |
92 | + |
93 | + def _get_title(self,url): |
94 | + "Gets the title of a page" |
95 | + try: |
96 | + headers = {'User-Agent': 'Mozilla/5.0'} |
97 | + etree = get_html_parse_tree(url, None, headers, 'etree') |
98 | + title = etree.findtext("head/title") |
99 | + return title |
100 | + except Exception, e: |
101 | + log.exception(u"Delicious logic - error determining the title for url %s: %s", url, e.message) |
102 | + return url |
103 | + |
104 | + def _set_auth(self,username,password): |
105 | + "Provides HTTP authentication on username and password" |
106 | + auth_handler = HTTPBasicAuthHandler() |
107 | + auth_handler.add_password('del.icio.us API', 'https://api.del.icio.us', username, password) |
108 | + opener = build_opener(auth_handler) |
109 | + install_opener(opener) |
110 | + |
111 | class Grab(Processor): |
112 | |
113 | addressed = False |
114 | processed = True |
115 | + username = Option('delicious_username', 'delicious account name') |
116 | + password = Option('delicious_password', 'delicious account password') |
117 | + delicious = Delicious() |
118 | |
119 | @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)') |
120 | def grab(self, event, url): |
121 | @@ -46,6 +119,9 @@ |
122 | session.flush() |
123 | session.close() |
124 | |
125 | + if self.username != None: |
126 | + self.delicious.add_post(self.username, self.password, event, url) |
127 | + |
128 | class Shorten(Processor): |
129 | u"""shorten <url>""" |
130 | feature = 'url' |
131 | @@ -75,7 +151,7 @@ |
132 | |
133 | def setup(self): |
134 | self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I) |
135 | - |
136 | + |
137 | @handler |
138 | def lengthen(self, event, url): |
139 | opener = build_opener(NullRedirect()) |
140 | @@ -88,5 +164,5 @@ |
141 | |
142 | f.close() |
143 | event.addresponse(u"No redirect") |
144 | - |
145 | + |
146 | # vi: set et sta sw=4 ts=4: |
Added a new delicious plugin that grabs all URLs and posts them to a delicious account defined in the config. The plugin can coexist with the url plugin.