Merge lp:~jonathanjgster/ibid/jjg into lp:~ibid-core/ibid/old-trunk-pack-0.92

Proposed by Jonathan Groll
Status: Merged
Approved by: Stefano Rivera
Approved revision: 582
Merged at revision: 591
Proposed branch: lp:~jonathanjgster/ibid/jjg
Merge into: lp:~ibid-core/ibid/old-trunk-pack-0.92
Diff against target: None lines
To merge this branch: bzr merge lp:~jonathanjgster/ibid/jjg
Reviewers: Michael Gorven (Approve), Jonathan Hitchcock (Approve), Stefano Rivera (Approve)
Review via email: mp+4880@code.launchpad.net

This proposal supersedes a proposal from 2009-03-22.

Revision history for this message
Jonathan Groll (jonathanjgster) wrote : Posted in a previous version of this proposal

Added new delicious plugin that grabs all URLs and posts to a delicious account defined in config. Plugin can coexist with url plugin.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

I haven't tested it yet, but just from a read of the code:
* I don't like "delname" and "delpwd"; rather make them "username" and "password".
* You don't need to pass the username and password to _add_post, they are in self.
* Instead of x.find(y) == -1 you can say y not in x
* Ibid favours pre-compiling regexes and saving them in the Processor object.
* You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.
* Be careful - your url comes in as unicode, but you transform it into a string.
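The last three bullets can be sketched together (a hypothetical stand-alone Grab, not ibid's real Processor class, and a deliberately simple URL pattern):

```python
import re

class Grab:
    # pre-compile once at class definition time, as ibid style prefers
    url_re = re.compile(r'\S+://\S+')

    def handle(self, message):
        # "y not in x" instead of message.find(y) == -1
        if '://' not in message:
            return None
        match = self.url_re.search(message)
        return match.group(0) if match else None

grab = Grab()
assert grab.handle(u'see http://example.com/page') == u'http://example.com/page'
assert grab.handle(u'no links here') is None  # the URL stays unicode throughout
```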

Otherwise, looks good.

review: Needs Fixing
Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

How about putting the context in the description?

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

Very good first shot at an Ibid module :-) Just a couple things though:
 * I prefer to put imports into three groups (separated with a newline):
stdlib modules, other modules, and then Ibid modules.
 * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
accept any URL. The current url plugin is very liberal in what it detects. It
might be an idea to capture anything which looks like a URL, but then
validate it in some way (check that domain exists for example).
 * Don't catch exceptions unless they're expected errors and you're going to
do something useful with them. Ibid catches and logs exceptions raised by
plugins.
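For reference, the grab pattern in the branch only falls back to a fixed TLD list when there is no scheme or www/ftp prefix, which is what makes it miss other bare domains (a quick check of the pattern itself, outside ibid):

```python
import re

# the @match pattern from the proposed url.py
url_re = re.compile(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')

assert url_re.search('try http://foo.example/x')  # any scheme-prefixed URL
assert url_re.search('see www.example.museum')    # a www. prefix is enough
assert url_re.search('plain foo.com/bar')         # bare TLD from the list
assert not url_re.search('plain foo.dev/bar')     # unlisted bare TLD is missed
```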
 review needs_fixing

review: Needs Fixing
Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

As I see was suggested on IRC, I think that this should be incorporated in the
url plugin, since it's doing the same thing as the current Grab processor,
just to a different location.

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

Adding the channel and source names as tags might also be useful.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Hi Michael (2009.03.14_23:36:11_+0200)
> * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
> accept any URL. The current url plugin is very liberal in what it detects. It
> might be an idea to capture anything which looks like a URL, but then
> validate it in some way (check that domain exists for example).

This looks like it uses the same approach as the current url module:
* Detect things with http:// - they must be URLs
* Detect things with that list of TLDs that look URL-ish. This should probably
  be made more general, when it gets merged with url.py

SR

Revision history for this message
Jonathan Hitchcock (vhata) wrote : Posted in a previous version of this proposal

Not sure I like the idea of checking whether a domain exists, or checking for what the bot thinks is a "valid TLD" - if it looks like a URL, it should be logged, I think. Somebody might say "My site is going to be at http://foo.com/ when my DNS goes through", and the bot might not know about that TLD.

Anyhoo, that aside, it looks like a good module, but the changes above (especially the merging into the existing URL grabber) should be done before the final merge.

review: Needs Fixing
Revision history for this message
Jonathan Groll (jjg-groll) wrote : Posted in a previous version of this proposal

Hi Stefano,

On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> * You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.

Not sure if I follow your thinking on this one here - is the intention
to get rid of a BeautifulSoup dependency?

Personally, I find this code:
            soup = BeautifulSoup(urlopen(url))
            title = soup.title.string

easier to read, and simpler than calling get_html_parse_tree to
retrieve an etree (and the resultant iteration on that); also what
would be the benefit of using get_html_parse_tree to get back a soup?

All that is required is the (1st) title here.

Cheers,
Jonathan

P.S. of course, the following is also elegant:
import lxml.html
t = lxml.html.parse(url)
title = t.find(".//title").text
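For comparison, even the standard library can do the "first title only" job without a third-party dependency (a Python 3 sketch that parses a string rather than fetching a URL):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    # grabs the text of the first <title> element, like soup.title.string
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and self.title is None:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.title = data
            self.in_title = False

p = TitleParser()
p.feed('<html><head><title>Ibid</title></head><body>hi</body></html>')
assert p.title == 'Ibid'
```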

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

On Saturday 21 March 2009 16:03:10 Jonathan Groll wrote:
> On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> > * You should probably use get_soup from utils rather than creating your
> > own soup. It knows about encodings and compression.
>
> Not sure if I follow your thinking on this one here - is the intention
> to get rid of a BeautifulSoup dependency?

get_html_parse_tree() (which used to be called get_soup()) returns a
BeautifulSoup tree by default. So your line would simply be:

soup = get_html_parse_tree(url)

Revision history for this message
Jonathan Groll (jonathanjgster) wrote : Posted in a previous version of this proposal

Code has been adjusted as requested. The code is now in url.py, and most of the cruft has been removed; regexes are compiled; delname and delpwd are now username and password; the line context ('message_raw') is added as delicious extended information, and the source and channel are added to the tags; most of the exception handling logic has been removed for ibid to take care of; finally, the data posted to delicious is utf-8 urlencoded.

Existing code in url.py that wasn't touched, but was pointed out as needing to be fixed was left alone, until after this merge:
= The regex to match URL's could do with some love
= Instead of x.find(y) == -1 you can say y not in x

I've changed the URL title retrieval logic to use ElementTree rather than BeautifulSoup. At least it can do xe.com now.

Anyway, please merge. I'll leave PetitChouChou (my ibid instance) running in #ibid for a day or so; it'll log URLs to delicious.com/ibidtest.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Ah. Now that it's in URL, the username and password fields need to be renamed again. Sorry.
How about: delicious_username and delicious_password

Line 72 - should have spaces around the +

The obfusc section looks like it's very IRC specific. Have you tested with other protocols?

I don't think the try-except logic in _get_title is necessary. It goes against ibid-style (I make the same mistake all the time).

I'd put the delicious regexes inside the Delicious class.

Aside from that, no complaints. Will test it on hardy now.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

> Existing code in url.py that wasn't touched, but was pointed out as needing to be fixed was left
> alone, until after this merge

Feel free to fix that stuff - this branch looks about ready and we might as well merge it all in one.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

> Will test it on hardy now.

Works like a charm

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

> I don't think the try-except logic in _get_title is necessary. It goes against ibid-style

Sorry, I misunderstood the reason for that. But you should change it to log.exception rather than log.error

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

> The obfusc section looks like it's very IRC specific. Have you tested with other protocols?

In my jabber test, this resulted in my JID being attached as a tag. My JID happens to be my email address, like many people.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Almost there

review: Needs Fixing
Revision history for this message
Jonathan Groll (jonathanjgster) wrote :

Minor changes following Stefano's recommendations: regexes are now in the Delicious class; the code has now been tested with both individual and groupchat Jabber; failure to extract a title results in log.exception instead of log.error.

Also, tags that could contain valueless information are now not posted, e.g. stefanor@4F6873FA.3661FC01.FAE4B284.IP; similarly, there seems to be no value in tagging the connection name with a Jabber source.
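The masking involved can be seen with the two small regexes the branch adds to the Delicious class:

```python
import re

# the obfuscation regexes from the Delicious class in the branch
at_re = re.compile(r'@\S+?\.')
ip_re = re.compile(r'\.IP$')

# hostname-ish parts have their '@domain.' piece collapsed to '^'
assert at_re.sub('^', 'nick@staff.example.com') == 'nick^example.com'

# masked IRC hosts like this carry no useful information, so they are dropped
host = 'stefanor@4F6873FA.3661FC01.FAE4B284.IP'
assert ip_re.search(host) is not None
```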

Revision history for this message
Stefano Rivera (stefanor) wrote :

I approve this conditional on e.message being changed to unicode(e) during merge.

.message is deprecated in Python 2.6
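A minimal illustration; in Python 3 the attribute is gone entirely, so str(e) (unicode(e) on Python 2) is the portable spelling:

```python
try:
    raise ValueError('no title element')
except ValueError as e:
    # Python 2 ibid code would use unicode(e); e.message is deprecated/removed
    detail = str(e)

assert detail == 'no title element'
```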

review: Approve
Revision history for this message
Stefano Rivera (stefanor) wrote :

I presume you know that different IRC networks do masking differently - the IP regex may not be sufficient - but I can live with that for now.

Revision history for this message
Jonathan Hitchcock (vhata) :
review: Approve
Revision history for this message
Michael Gorven (mgorven) wrote :

On Wednesday 25 March 2009 14:14:36 Stefano Rivera wrote:
> I approve this conditional on e.message being changed to unicode(e) during
> merge.

The message shouldn't be explicitly logged at all -- log.exception() will do
that automatically.

You also can't rely on the source name (the Jabber source could be called
anything). Instead of "event.source == 'jabber'" you should
use "ibid.sources[event.source].type == 'jabber'".
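Michael's distinction can be sketched with a toy registry (a hypothetical Source stand-in, not ibid's actual classes): the source *name* is operator-chosen configuration, while the *type* identifies the protocol.

```python
class Source:
    def __init__(self, type):
        self.type = type

# source names are arbitrary; a Jabber source need not be called 'jabber'
sources = {'jabber-work': Source('jabber'), 'atrum': Source('irc')}

def is_jabber(source_name):
    # robust: compare the protocol type, not the configurable name
    return sources[source_name].type == 'jabber'

assert is_jabber('jabber-work')   # name check would have missed this
assert not is_jabber('atrum')
```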

I approve if these two things are changed.
 review approve

review: Approve

Preview Diff

=== modified file 'ibid/config.ini'
--- ibid/config.ini 2009-03-17 07:19:20 +0000
+++ ibid/config.ini 2009-03-25 10:35:11 +0000
@@ -69,6 +69,9 @@
 source = atrum
 channel = "#ibid"
 server = localhost
+[[url]]
+delicious_username = ibidtest
+delicious_password = a123456
 
 [databases]
 ibid = sqlite:///ibid.db
 
=== modified file 'ibid/plugins/url.py'
--- ibid/plugins/url.py 2009-03-08 13:16:28 +0000
+++ ibid/plugins/url.py 2009-03-25 10:35:11 +0000
@@ -1,6 +1,7 @@
 from datetime import datetime
 from urllib import urlencode
-from urllib2 import urlopen, HTTPRedirectHandler, build_opener, HTTPError
+from urllib2 import urlopen, HTTPRedirectHandler, build_opener, HTTPError, HTTPBasicAuthHandler, install_opener
+import logging
 import re
 
 from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table
@@ -9,8 +10,11 @@
 from ibid.plugins import Processor, match, handler
 from ibid.config import Option
 from ibid.models import Base
+from ibid.utils import get_html_parse_tree
 
-help = {'url': u'Captures URLs seen in channel, and shortens and lengthens URLs'}
+help = {'url': u'Captures URLs seen in channel to database and/or to delicious, and shortens and lengthens URLs'}
+
+log = logging.getLogger('plugins.url')
 
 class URL(Base):
     __table__ = Table('urls', Base.metadata,
@@ -27,10 +31,79 @@
         self.identity_id = identity_id
         self.time = datetime.now()
 
+class Delicious():
+
+    at_re = re.compile('@\S+?\.')
+    ip_re = re.compile('\.IP$')
+
+    def add_post(self,username,password,event,url=None):
+        "Posts a URL to delicious.com"
+
+        date = datetime.now()
+        title = self._get_title(url)
+
+        connection_body = event.sender['connection'].split('!')
+        if len(connection_body) == 1:
+            connection_body.append(event.sender['connection'])
+
+        if self.ip_re.search(connection_body[1]) != None:
+            connection_body[1] = ""
+
+        if event.source == 'jabber':
+            obfusc_conn = ""
+            obfusc_chan = event.channel.replace('@', '^')
+        else:
+            obfusc_conn = self.at_re.sub('^', connection_body[1])
+            obfusc_chan = self.at_re.sub('^', event.channel)
+
+
+        tags = event.sender['nick'] + " " + obfusc_conn + " " + obfusc_chan \
+                + " " + event.source
+
+        data = {
+            'url' : url,
+            'description' : title,
+            'tags' : tags,
+            'replace' : u"yes",
+            'dt' : date,
+            'extended' : event.message_raw
+            }
+
+        self._set_auth(username,password)
+        posturl = "https://api.del.icio.us/v1/posts/add?" \
+                + urlencode(data, 'utf-8')
+        resp = urlopen(posturl).read()
+        if 'done' in resp:
+            log.info(u"Successfully posted url %s to delicious, posted in channel %s by nick %s at time %s", \
+                    url, event.channel, event.sender['nick'], date)
+        else:
+            log.error(u"Error posting url %s to delicious: %s", url, response)
+
+    def _get_title(self,url):
+        "Gets the title of a page"
+        try:
+            headers = {'User-Agent': 'Mozilla/5.0'}
+            etree = get_html_parse_tree(url, None, headers, 'etree')
+            title = etree.findtext("head/title")
+            return title
+        except Exception, e:
+            log.exception(u"Delicious logic - error determining the title for url %s: %s", url, e.message)
+            return url
+
+    def _set_auth(self,username,password):
+        "Provides HTTP authentication on username and password"
+        auth_handler = HTTPBasicAuthHandler()
+        auth_handler.add_password('del.icio.us API', 'https://api.del.icio.us', username, password)
+        opener = build_opener(auth_handler)
+        install_opener(opener)
+
 class Grab(Processor):
 
     addressed = False
     processed = True
+    username = Option('delicious_username', 'delicious account name')
+    password = Option('delicious_password', 'delicious account password')
+    delicious = Delicious()
 
     @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
     def grab(self, event, url):
@@ -46,6 +119,9 @@
         session.flush()
         session.close()
 
+        if self.username != None:
+            self.delicious.add_post(self.username, self.password, event, url)
+
 class Shorten(Processor):
     u"""shorten <url>"""
     feature = 'url'
@@ -75,7 +151,7 @@
 
     def setup(self):
         self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I)
-    
+
     @handler
     def lengthen(self, event, url):
         opener = build_opener(NullRedirect())
@@ -88,5 +164,5 @@
 
         f.close()
         event.addresponse(u"No redirect")
-    
+
 # vi: set et sta sw=4 ts=4:
