Merge lp:~jonathanjgster/ibid/jjg into lp:~ibid-core/ibid/old-trunk-pack-0.92

Proposed by Jonathan Groll
Status: Superseded
Proposed branch: lp:~jonathanjgster/ibid/jjg
Merge into: lp:~ibid-core/ibid/old-trunk-pack-0.92
Diff against target: None
To merge this branch: bzr merge lp:~jonathanjgster/ibid/jjg
Reviewer             Status
Jonathan Hitchcock   Needs Fixing
Michael Gorven       Needs Fixing
Stefano Rivera       Needs Fixing
Review via email: mp+4482@code.launchpad.net

This proposal has been superseded by a proposal from 2009-03-22.

Revision history for this message
Jonathan Groll (jonathanjgster) wrote :

Added a new delicious plugin that grabs all URLs seen in channel and posts them to a delicious account defined in the config. The plugin can coexist with the url plugin.

Revision history for this message
Stefano Rivera (stefanor) wrote :

I haven't tested it yet, but just from a read of the code:
* I don't like "delname" and "delpwd"; rather make them "username" and "password".
* You don't need to pass the username and password to _add_post, they are in self.
* Instead of x.find(y) == -1 you can say y not in x
* Ibid favours pre-compiling regexes and saving them in the Processor object (see the sketch below).
* You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.
* Be careful - your url comes in as unicode, but you transform it into a string.
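
A minimal sketch pulling those points together (names are illustrative and the actual delicious call is elided, so this is not the final plugin):

    import re
    import logging

    from ibid.plugins import Processor, match
    from ibid.config import Option

    log = logging.getLogger('plugins.delicious')

    class Grab(Processor):

        addressed = False
        processed = True

        # "username"/"password" instead of "delname"/"delpwd"
        username = Option('username', 'delicious account name')
        password = Option('password', 'delicious account password')

        # pre-compiled once and stored on the Processor
        obfusc_re = re.compile(r'@\S+?\.')

        @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
        def grab(self, event, url):
            # username/password live on self, so no need to pass them along
            self._add_post(url, event.sender['connection'],
                           event.sender['nick'], event.channel)

        def _add_post(self, url, connection, nick, channel):
            # "y not in x" instead of x.find(y) == -1; url stays unicode
            if u'://' not in url:
                url = u'http://%s' % url
            obfusc = self.obfusc_re.sub('^', connection.split('!')[-1])
            log.info(u'would post %s tagged "%s %s"', url, nick, obfusc)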

Otherwise, looks good.

review: Needs Fixing
Revision history for this message
Stefano Rivera (stefanor) wrote :

How about putting the context in the description?

Revision history for this message
Michael Gorven (mgorven) wrote :

Very good first shot at an Ibid module :-) Just a couple of things though:
 * I prefer to put imports into three groups (separated with a blank line):
stdlib modules, other modules, and then Ibid modules (regrouped below for
illustration).
 * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
accept any URL. The current url plugin is very liberal in what it detects. It
might be an idea to capture anything which looks like a URL, but then
validate it in some way (check that domain exists for example).
 * Don't catch exceptions unless they're expected errors and you're going to
do something useful with them. Ibid catches and logs exceptions raised by
plugins.
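
For illustration, the plugin's current imports regrouped that way:

    # stdlib modules
    import htmlentitydefs
    import logging
    import re
    import urllib2
    from datetime import datetime
    from urllib import urlencode

    # other modules
    from BeautifulSoup import BeautifulSoup

    # Ibid modules
    import ibid
    from ibid.plugins import Processor, match
    from ibid.config import Option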
 review needs_fixing

review: Needs Fixing
Revision history for this message
Michael Gorven (mgorven) wrote :

As I see was suggested on IRC, I think that this should be incorporated into
the url plugin, since it's doing the same thing as the current Grab processor,
just to a different location.

Revision history for this message
Michael Gorven (mgorven) wrote :

Adding the channel and source names as tags might also be useful.
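
A hypothetical one-liner along those lines, assuming the source name gets passed into _add_post alongside the channel:

    # nick and obfuscated hostmask as before, plus channel and source (hypothetical)
    tags = u' '.join((nick, obfusc, channel, source))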

Revision history for this message
Stefano Rivera (stefanor) wrote :

Hi Michael (2009.03.14_23:36:11_+0200)
> * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
> accept any URL. The current url plugin is very liberal in what it detects. It
> might be an idea to capture anything which looks like a URL, but then
> validate it in some way (check that domain exists for example).

This looks like it uses the same approach as the current url module:
* Detect things with http:// - they must be URLs
* Detect things with that list of TLDs that look URL-ish. This should probably
  be made more general when it gets merged with url.py

SR
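
For reference, the pattern in the proposed plugin does encode exactly those two prongs; here it is again with re.X annotation (the breakdown is mine):

    url_re = re.compile(r'''
        (?: \S+:// | (?:www|ftp)\. ) \S+    # prong 1: has a scheme, or starts www./ftp.
      | \S+ \. (?:com|org|net|za) \S*       # prong 2: URL-ish thing under a known TLD
    ''', re.X)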

Revision history for this message
Jonathan Hitchcock (vhata) wrote :

Not sure I like the idea of checking whether a domain exists, or checking for what the bot thinks is a "valid TLD" - if it looks like a URL, it should be logged, I think. Somebody might say "My site is going to be at http://foo.com/ when my DNS goes through", and the bot might not know about that TLD.

Anyhoo, that aside, it looks like a good module, but the changes above (especially the merging into the existing URL grabber) should be done before the final merge.

review: Needs Fixing
Revision history for this message
Jonathan Groll (jjg-groll) wrote :

Hi Stefano,

On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> * You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.

Not sure if I follow your thinking on this one here - is the intention
to get rid of a BeautifulSoup dependency?

Personally, I find this code:
            soup = BeautifulSoup(urlopen(url))
            title = soup.title.string

easier to read, and simpler than calling get_html_parse_tree to
retrieve an etree (and the resulting iteration on that); also, what
would be the benefit of using get_html_parse_tree to get back a soup?

All that is required here is the (first) title.

Cheers,
Jonathan

P.S. of course, the following is also elegant:
import lxml.html
t = lxml.html.parse(url)
title = t.find(".//title").text

lp:~jonathanjgster/ibid/jjg updated
575. By Jonathan Groll <jonathan@speedy>

Remerged with trunk

576. By Jonathan Groll <jonathan@speedy>

some changes made, more to do still

577. By Jonathan Groll <jonathan@speedy>

Further recommended changes

578. By Jonathan Groll <jonathan@speedy>

pre-release with beautiful soup untouched

579. By Jonathan Groll <jonathan@speedy>

removed old delicious.py file

Revision history for this message
Michael Gorven (mgorven) wrote :

On Saturday 21 March 2009 16:03:10 Jonathan Groll wrote:
> On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> > * You should probably use get_soup from utils rather than creating your
> > own soup. It knows about encodings and compression.
>
> Not sure if I follow your thinking on this one here - is the intention
> to get rid of a BeautifulSoup dependency?

get_html_parse_tree() (which used to be called get_soup()) returns a
BeautifulSoup tree by default. So your line would simply be:

soup = get_html_parse_tree(url)
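
With that, the title helper shrinks to something like this (a sketch; the broad except goes away too, since Ibid logs whatever plugins raise):

    def _get_title(self, url):
        "Gets the title of a page"
        # get_html_parse_tree handles encodings and compression,
        # and the title stays unicode -- no str() conversion
        soup = get_html_parse_tree(url)
        return soup.title.string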

lp:~jonathanjgster/ibid/jjg updated
580. By Jonathan Groll <jonathan@speedy>

Delicious title retrieval now uses element tree

581. By Jonathan Groll

Merged with trunk

582. By Jonathan Groll

delicious logging mark III

583. By Jonathan Groll

Delicious IV -> freenode regexs and no e.message

Preview Diff

=== modified file 'ibid/config.ini'
--- ibid/config.ini 2009-03-02 11:59:56 +0000
+++ ibid/config.ini 2009-03-14 07:38:01 +0000
@@ -1,4 +1,4 @@
-load = core, basic, info, log, http, irc, seen, identity, admin, auth, help, test, sources, config, bzr, network, factoid, roshambo, eval, crypto, tools, morse, dict, lookup, url, google, memo, karma, feeds, trac, buildbot, apt, misc, math, imdb
+load = core, basic, info, log, http, irc, seen, identity, admin, auth, help, test, sources, config, bzr, network, factoid, roshambo, eval, crypto, tools, morse, dict, lookup, url, google, memo, karma, feeds, trac, buildbot, apt, misc, math, imdb, delicious
 botname = Ibid
 logging = logging.ini
 
@@ -70,6 +70,9 @@
 source = atrum
 channel = "#ibid"
 server = localhost
+ [[delicious]]
+ delname = ibidtest
+ delpwd = a123456
 
 [databases]
 ibid = sqlite:///ibid.db
 
=== added file 'ibid/plugins/delicious.py'
--- ibid/plugins/delicious.py 1970-01-01 00:00:00 +0000
+++ ibid/plugins/delicious.py 2009-03-14 07:38:01 +0000
@@ -0,0 +1,99 @@
+from datetime import datetime
+from urllib import urlencode
+from BeautifulSoup import BeautifulSoup
+
+import urllib2
+import re
+import htmlentitydefs
+import logging
+
+import ibid
+from ibid.plugins import Processor, match, handler
+from ibid.config import Option
+
+help = {'delicious': u'Saves URLs seen in channel to configured delicious account'}
+log = logging.getLogger('plugins.delicious')
+
+class Grab(Processor):
+
+    addressed = False
+    processed = True
+    delname = Option('delname', 'delicious account name')
+    delpwd = Option('delpwd', 'delicious account password')
+
+    @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
+    def grab(self, event, url):
+        self._add_post(self.delname, self.delpwd, url, event.sender['connection'], event.sender['nick'], event.channel)
+
+    def _add_post(self, username, password, url=None, connection=None, nick=None, channel=None):
+        "Posts a URL to delicious.com"
+        if url == None:
+            return
+        if url.find('://') == -1:
+            if url.lower().startswith('ftp'):
+                url = 'ftp://%s' % url
+            else:
+                url = 'http://%s' % url
+
+        date = datetime.now()
+        title = self._get_title(url)
+
+        connection_body = re.split('!', connection)
+        if len(connection_body) == 1:
+            connection_body.append(connection)
+        obfusc = re.sub('@\S+?\.', '^', connection_body[1])
+        tags = nick + " " + obfusc
+
+        data = {
+            'url' : url,
+            'description' : title,
+            'tags' : tags,
+            'replace' : 'yes',
+            'dt' : date,
+        }
+
+        try:
+            self._set_auth(username, password)
+            posturl = "https://api.del.icio.us/v1/posts/add?" + urlencode(data)
+            resp = urllib2.urlopen(posturl).read()
+            if resp.find('done') > 0:
+                log.info(u"Successfully posted url %s seen in channel %s by nick %s at time %s", url, channel, nick, date)
+            else:
+                log.error(u"Error posting url %s: %s", url, resp)
+
+        except urllib2.URLError, e:
+            log.error(u"Error posting url %s: %s", url, e.message)
+        except Exception, e:
+            log.error(u"Error posting url %s: %s", url, e.message)
+
+    def _get_title(self, url):
+        "Gets the title of a page"
+        try:
+            soup = BeautifulSoup(urllib2.urlopen(url))
+            title = str(soup.title.string)
+            ## doing a de_entity results in > 'ascii' codec can't encode character u'\xab' etc.
+            ## leaving this code here in case someone works out how to get urllib2 to post unicode?
+            #final_title = self._de_entity(title)
+            return title
+        except Exception, e:
+            log.error(u"Error determining the title for url %s: %s", url, e.message)
+            return url
+
+    def _set_auth(self, username, password):
+        "Provides HTTP authentication on username and password"
+        auth_handler = urllib2.HTTPBasicAuthHandler()
+        auth_handler.add_password('del.icio.us API', 'https://api.del.icio.us', username, password)
+        opener = urllib2.build_opener(auth_handler)
+        urllib2.install_opener(opener)
+
+    def _de_entity(self, text):
+        "Remove HTML entities, and replace with their characters"
+        replace = lambda match: unichr(int(match.group(1)))
+        text = re.sub("&#(\d+);", replace, text)
+
+        replace = lambda match: unichr(htmlentitydefs.name2codepoint[match.group(1)])
+        text = re.sub("&(\w+);", replace, text)
+        return text
+
+
+# vi: set et sta sw=4 ts=4:
