Merge lp:~jonathanjgster/ibid/jjg into lp:~ibid-core/ibid/old-trunk-pack-0.92
Status: Superseded
Proposed branch: lp:~jonathanjgster/ibid/jjg
Merge into: lp:~ibid-core/ibid/old-trunk-pack-0.92
Diff against target: None lines
To merge this branch: bzr merge lp:~jonathanjgster/ibid/jjg
Related bugs: (none)
Reviewers: Jonathan Hitchcock (Needs Fixing), Michael Gorven (Needs Fixing), Stefano Rivera (Needs Fixing)
Review via email: mp+4482@code.launchpad.net
This proposal has been superseded by a proposal from 2009-03-22.
Commit message
Description of the change
Jonathan Groll (jonathanjgster) wrote:
Stefano Rivera (stefanor) wrote:
I haven't tested it yet, but just from a read of the code:
* I don't like "delname" and "delpwd"; rather make them "username" and "password".
* You don't need to pass the username and password to _add_post; they are in self.
* Instead of x.find(y) == -1 you can say y not in x.
* Ibid favours pre-compiling regexes and saving them in the Processor object.
* You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.
* Be careful: your url comes in as unicode, but you transform it into a string.
Otherwise, looks good.
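Two of the points above in miniature (a sketch with made-up strings; nothing here is Ibid API):

```python
import re

# "y not in x" reads better than x.find(y) == -1
resp = '<result code="done"/>'
found = "done" in resp  # instead of: resp.find("done") != -1

# Pre-compiled regex, saved once (Ibid keeps these on the Processor object)
# rather than recompiled on every event
OBFUSC_RE = re.compile(r"@\S+?\.")
obfuscated = OBFUSC_RE.sub("^", "user@host.example")
print(found, obfuscated)
```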
Stefano Rivera (stefanor) wrote:
How about putting the context in the description?
Michael Gorven (mgorven) wrote:
Very good first shot at an Ibid module :-) Just a couple things though:
* I prefer to put imports into three groups (separated with a newline):
stdlib modules, other modules, and then Ibid modules.
* Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
accept any URL. The current url plugin is very liberal in what it detects. It
might be an idea to capture anything which looks like a URL, but then
validate it in some way (check that domain exists for example).
* Don't catch exceptions unless they're expected errors and you're going to
do something useful with them. Ibid catches and logs exceptions raised by
plugins.
review needs_fixing
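The three-group import convention applied to this plugin's own imports would look roughly like this (the third-party and Ibid lines are commented out so the sketch runs standalone):

```python
# Group 1: stdlib modules
import logging
import re
from datetime import datetime

# Group 2: other (third-party) modules
# from BeautifulSoup import BeautifulSoup

# Group 3: Ibid modules
# import ibid
# from ibid.plugins import Processor, match

log = logging.getLogger("plugins.delicious")
print(log.name)
```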
Michael Gorven (mgorven) wrote:
As I see was suggested on IRC, I think that this should be incorporated in the
url plugin, since it's doing the same thing as the current Grab processor,
just to a different location.
Michael Gorven (mgorven) wrote:
Adding the channel and source names as tags might also be useful.
Stefano Rivera (stefanor) wrote:
Hi Michael (2009.03.
> * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
> accept any URL. The current url plugin is very liberal in what it detects. It
> might be an idea to capture anything which looks like a URL, but then
> validate it in some way (check that domain exists for example).
This looks like it uses the same approach as the current url module:
* Detect things with http:// - they must be URLs
* Detect things with that list of TLDs that look URL-ish. This should probably
be made more general, when it gets merged with url.py
SR
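For reference, the pattern under discussion can be exercised standalone. Using plain re.search is an assumption here; Ibid's @match decorator may anchor the pattern differently:

```python
import re

# The grab pattern from the proposed plugin, verbatim
GRAB_RE = re.compile(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')

for text in ('see http://example.org/x', 'visit example.com/page', 'example.io'):
    m = GRAB_RE.search(text)
    print(repr(text), '->', m.group(1) if m else None)
```

A bare hostname on an unlisted TLD (example.io above) is missed, which is the generalisation being asked for when this merges with url.py.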
Jonathan Hitchcock (vhata) wrote:
Not sure I like the idea of checking whether a domain exists, or checking for what the bot thinks is a "valid TLD" - if it looks like a URL, it should be logged, I think. Somebody might say "My site is going to be at http://
Anyhoo, that aside, it looks like a good module, but the changes above (especially the merging into the existing URL grabber) should be done before the final merge.
Jonathan Groll (jjg-groll) wrote:
Hi Stefano,
On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> * You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.
Not sure if I follow your thinking on this one here - is the intention
to get rid of a beautifulsoup dependency?
Personally, I find this code:
    soup = BeautifulSoup(urllib2.urlopen(url))
    title = soup.title.string
easier to read, and simpler than calling get_html_parse_tree to
retrieve an etree (and the resultant iteration on that); also, what
would be the benefit of using get_html_parse_tree to get back a soup?
All that is required is the (first) title here.
Cheers,
Jonathan
P.S. of course, the following is also elegant:
import lxml.html
t = lxml.html.
title = t.find(
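Both the BeautifulSoup snippet and the (truncated) lxml one boil down to "fetch the document, return the first <title>". As a neutral stdlib-only illustration of that step (Python 3's html.parser; purely a sketch, not Ibid or lxml API):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of the first <title> element seen."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and self.title is None:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.title = data
            self.in_title = False

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

p = TitleParser()
p.feed('<html><head><title>Example Page</title></head><body></body></html>')
print(p.title)  # -> Example Page
```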
- 575. By Jonathan Groll <jonathan@speedy>: Remerged with trunk
- 576. By Jonathan Groll <jonathan@speedy>: some changes made, more to do still
- 577. By Jonathan Groll <jonathan@speedy>: Further recommended changes
- 578. By Jonathan Groll <jonathan@speedy>: pre-release with beautiful soup untouched
- 579. By Jonathan Groll <jonathan@speedy>: removed old delicious.py file
Michael Gorven (mgorven) wrote:
On Saturday 21 March 2009 16:03:10 Jonathan Groll wrote:
> On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> > * You should probably use get_soup from utils rather than creating your
> > own soup. It knows about encodings and compression.
>
> Not sure if I follow your thinking on this one here - is the intention
> to get rid of a beautifulsoup dependency?
get_html_parse_tree returns a BeautifulSoup tree by default, so your line
would simply be:
    soup = get_html_parse_tree(url)
- 580. By Jonathan Groll <jonathan@speedy>: Delicious title retrieval now uses element tree
- 581. By Jonathan Groll: Merged with trunk
- 582. By Jonathan Groll: delicious logging mark III
- 583. By Jonathan Groll: Delicious IV -> freenode regexs and no e.message
Unmerged revisions
Preview Diff
=== modified file 'ibid/config.ini'
--- ibid/config.ini	2009-03-02 11:59:56 +0000
+++ ibid/config.ini	2009-03-14 07:38:01 +0000
@@ -1,4 +1,4 @@
-load = core, basic, info, log, http, irc, seen, identity, admin, auth, help, test, sources, config, bzr, network, factoid, roshambo, eval, crypto, tools, morse, dict, lookup, url, google, memo, karma, feeds, trac, buildbot, apt, misc, math, imdb
+load = core, basic, info, log, http, irc, seen, identity, admin, auth, help, test, sources, config, bzr, network, factoid, roshambo, eval, crypto, tools, morse, dict, lookup, url, google, memo, karma, feeds, trac, buildbot, apt, misc, math, imdb, delicious
 botname = Ibid
 logging = logging.ini
 
@@ -70,6 +70,9 @@
 source = atrum
 channel = "#ibid"
 server = localhost
+ [[delicious]]
+ delname = ibidtest
+ delpwd = a123456
 
 [databases]
 ibid = sqlite:///ibid.db
 

=== added file 'ibid/plugins/delicious.py'
--- ibid/plugins/delicious.py	1970-01-01 00:00:00 +0000
+++ ibid/plugins/delicious.py	2009-03-14 07:38:01 +0000
@@ -0,0 +1,99 @@
+from datetime import datetime
+from urllib import urlencode
+from BeautifulSoup import BeautifulSoup
+
+import urllib2
+import re
+import htmlentitydefs
+import logging
+
+import ibid
+from ibid.plugins import Processor, match, handler
+from ibid.config import Option
+
+help = {'delicious': u'Saves URLs seen in channel to configured delicious account'}
+log = logging.getLogger('plugins.delicious')
+
+class Grab(Processor):
+
+    addressed = False
+    processed = True
+    delname = Option('delname', 'delicious account name')
+    delpwd = Option('delpwd', 'delicious account password')
+
+    @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
+    def grab(self, event, url):
+        self._add_post(self.delname,self.delpwd,url,event.sender['connection'],event.sender['nick'],event.channel)
+
+    def _add_post(self,username,password,url=None,connection=None,nick=None,channel=None):
+        "Posts a URL to delicious.com"
+        if url == None:
+            return
+        if url.find('://') == -1:
+            if url.lower().startswith('ftp'):
+                url = 'ftp://%s' % url
+            else:
+                url = 'http://%s' % url
+
+        date = datetime.now()
+        title = self._get_title(url)
+
+        connection_body = re.split('!', connection)
+        if len(connection_body) == 1:
+            connection_body.append(connection)
+        obfusc = re.sub('@\S+?\.', '^', connection_body[1])
+        tags = nick + " " + obfusc
+
+        data = {
+            'url' : url,
+            'description' : title,
+            'tags' : tags,
+            'replace' : 'yes',
+            'dt' : date,
+            }
+
+        try:
+            self._set_auth(username,password)
+            posturl = "https://api.del.icio.us/v1/posts/add?"+urlencode(data)
+            resp = urllib2.urlopen(posturl).read()
+            if resp.find('done') > 0:
+                log.info(u"Successfully posted url %s posted in channel %s by nick %s at time %s", url, channel, nick, date)
+            else:
+                log.error(u"Error posting url %s: %s", url, resp)
+
+        except urllib2.URLError, e:
+            log.error(u"Error posting url %s: %s", url, e.message)
+        except Exception, e:
+            log.error(u"Error posting url %s: %s", url, e.message)
+
+    def _get_title(self,url):
+        "Gets the title of a page"
+        try:
+            soup = BeautifulSoup(urllib2.urlopen(url))
+            title = str(soup.title.string)
+            ## doing a de_entity results in > 'ascii' codec can't encode character u'\xab' etc.
+            ## leaving this code here in case someone works out how to get urllib2 to post unicode?
+            #final_title = self._de_entity(title)
+            return title
+        except Exception, e:
+            log.error(u"Error determining the title for url %s: %s", url, e.message)
+            return url
+
+    def _set_auth(self,username,password):
+        "Provides HTTP authentication on username and password"
+        auth_handler = urllib2.HTTPBasicAuthHandler()
+        auth_handler.add_password('del.icio.us API', 'https://api.del.icio.us', username, password)
+        opener = urllib2.build_opener(auth_handler)
+        urllib2.install_opener(opener)
+
+    def _de_entity(self,text):
+        "Remove HTML entities, and replace with their characters"
+        replace = lambda match: unichr(int(match.group(1)))
+        text = re.sub("&#(\d+);", replace, text)
+
+        replace = lambda match: unichr(htmlentitydefs.name2codepoint[match.group(1)])
+        text = re.sub("&(\w+);", replace, text)
+        return text
+
+
+# vi: set et sta sw=4 ts=4:
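Aside: the tag-building step in _add_post (split the connection string on '!', then obfuscate the host portion) can be exercised standalone. The nick!user@host connection format is an assumption read off the code, not a documented Ibid contract:

```python
import re

def build_tags(connection, nick):
    # Mirrors the logic in the proposed _add_post: take everything after
    # the first '!' (or the whole string if there is no '!'), then
    # obfuscate '@host.' into '^'.
    connection_body = connection.split('!')
    if len(connection_body) == 1:
        connection_body.append(connection)
    obfusc = re.sub(r'@\S+?\.', '^', connection_body[1])
    return nick + ' ' + obfusc

print(build_tags('nick!user@host.example.com', 'nick'))
```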
Added new delicious plugin that grabs all URLs and posts to a delicious account defined in config. Plugin can coexist with url plugin.