Merge lp:~jonathanjgster/ibid/jjg into lp:~ibid-core/ibid/old-trunk-pack-0.92

Proposed by Jonathan Groll
Status: Superseded
Proposed branch: lp:~jonathanjgster/ibid/jjg
Merge into: lp:~ibid-core/ibid/old-trunk-pack-0.92
Diff against target: None
To merge this branch: bzr merge lp:~jonathanjgster/ibid/jjg
Reviewer             Status
Jonathan Hitchcock   Needs Fixing
Michael Gorven       Needs Fixing
Stefano Rivera       Needs Fixing
Review via email: mp+4482@code.launchpad.net

This proposal has been superseded by a proposal from 2009-03-22.

Revision history for this message
Jonathan Groll (jonathanjgster) wrote :

Added a new delicious plugin that grabs all URLs seen in channel and posts them to a delicious account defined in the config. The plugin can coexist with the url plugin.

Revision history for this message
Stefano Rivera (stefanor) wrote :

I haven't tested it yet, but just from a read of the code:
* I don't like "delname" and "delpwd"; rather make them "username" and "password".
* You don't need to pass the username and password to _add_post, they are in self.
* Instead of x.find(y) == -1 you can say y not in x
* Ibid favours pre-compiling regexes and saving them in the Processor object (see the sketch below).
* You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.
* Be careful - your url comes in as unicode, but you transform it into a string.
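
A minimal sketch pulling those points together (names are illustrative and the actual delicious call is elided, so this is not the final plugin):

    import re
    import logging

    from ibid.plugins import Processor, match
    from ibid.config import Option

    log = logging.getLogger('plugins.delicious')

    class Grab(Processor):

        addressed = False
        processed = True

        # "username"/"password" instead of "delname"/"delpwd"
        username = Option('username', 'delicious account name')
        password = Option('password', 'delicious account password')

        # pre-compiled once and stored on the Processor
        obfusc_re = re.compile(r'@\S+?\.')

        @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
        def grab(self, event, url):
            # username/password live on self, so no need to pass them along
            self._add_post(url, event.sender['connection'],
                           event.sender['nick'], event.channel)

        def _add_post(self, url, connection, nick, channel):
            # "y not in x" instead of x.find(y) == -1; url stays unicode
            if u'://' not in url:
                url = u'http://%s' % url
            obfusc = self.obfusc_re.sub('^', connection.split('!')[-1])
            log.info(u'would post %s tagged "%s %s"', url, nick, obfusc)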

Otherwise, looks good.

review: Needs Fixing
Revision history for this message
Stefano Rivera (stefanor) wrote :

How about putting the context in the description?

Revision history for this message
Michael Gorven (mgorven) wrote :

Very good first shot at an Ibid module :-) Just a couple of things though:
 * I prefer to put imports into three groups (separated with a blank line):
stdlib modules, other modules, and then Ibid modules (regrouped below for
illustration).
 * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
accept any URL. The current url plugin is very liberal in what it detects. It
might be an idea to capture anything which looks like a URL, but then
validate it in some way (check that domain exists for example).
 * Don't catch exceptions unless they're expected errors and you're going to
do something useful with them. Ibid catches and logs exceptions raised by
plugins.
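
For illustration, the plugin's current imports regrouped that way:

    # stdlib modules
    import htmlentitydefs
    import logging
    import re
    import urllib2
    from datetime import datetime
    from urllib import urlencode

    # other modules
    from BeautifulSoup import BeautifulSoup

    # Ibid modules
    import ibid
    from ibid.plugins import Processor, match
    from ibid.config import Option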
 review needs_fixing

review: Needs Fixing
Revision history for this message
Michael Gorven (mgorven) wrote :

As I see was suggested on IRC, I think that this should be incorporated into
the url plugin, since it's doing the same thing as the current Grab processor,
just to a different location.

Revision history for this message
Michael Gorven (mgorven) wrote :

Adding the channel and source names as tags might also be useful.
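
A hypothetical one-liner along those lines, assuming the source name gets passed into _add_post alongside the channel:

    # nick and obfuscated hostmask as before, plus channel and source (hypothetical)
    tags = u' '.join((nick, obfusc, channel, source))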

Revision history for this message
Stefano Rivera (stefanor) wrote :

Hi Michael (2009.03.14_23:36:11_+0200)
> * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
> accept any URL. The current url plugin is very liberal in what it detects. It
> might be an idea to capture anything which looks like a URL, but then
> validate it in some way (check that domain exists for example).

This looks like it uses the same approach as the current url module:
* Detect things with http:// - they must be URLs
* Detect things with that list of TLDs that look URL-ish. This should probably
  be made more general when it gets merged with url.py

SR
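
For reference, the pattern in the proposed plugin does encode exactly those two prongs; here it is again with re.X annotation (the breakdown is mine):

    url_re = re.compile(r'''
        (?: \S+:// | (?:www|ftp)\. ) \S+    # prong 1: has a scheme, or starts www./ftp.
      | \S+ \. (?:com|org|net|za) \S*       # prong 2: URL-ish thing under a known TLD
    ''', re.X)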

Revision history for this message
Jonathan Hitchcock (vhata) wrote :

Not sure I like the idea of checking whether a domain exists, or checking for what the bot thinks is a "valid TLD" - if it looks like a URL, it should be logged, I think. Somebody might say "My site is going to be at http://foo.com/ when my DNS goes through", and the bot might not know about that TLD.

Anyhoo, that aside, it looks like a good module, but the changes above (especially the merging into the existing URL grabber) should be done before the final merge.

review: Needs Fixing
Revision history for this message
Jonathan Groll (jjg-groll) wrote :

Hi Stefano,

On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> * You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.

Not sure if I follow your thinking on this one here - is the intention
to get rid of a BeautifulSoup dependency?

Personally, I find this code:
            soup = BeautifulSoup(urlopen(url))
            title = soup.title.string

easier to read, and simpler than calling get_html_parse_tree to
retrieve an etree (and the resulting iteration on that); also, what
would be the benefit of using get_html_parse_tree to get back a soup?

All that is required here is the (first) title.

Cheers,
Jonathan

P.S. of course, the following is also elegant:
import lxml.html
t = lxml.html.parse(url)
title = t.find(".//title").text

lp:~jonathanjgster/ibid/jjg updated
575. By Jonathan Groll <jonathan@speedy>

Remerged with trunk

576. By Jonathan Groll <jonathan@speedy>

some changes made, more to do still

577. By Jonathan Groll <jonathan@speedy>

Further recommended changes

578. By Jonathan Groll <jonathan@speedy>

pre-release with beautiful soup untouched

579. By Jonathan Groll <jonathan@speedy>

removed old delicious.py file

Revision history for this message
Michael Gorven (mgorven) wrote :

On Saturday 21 March 2009 16:03:10 Jonathan Groll wrote:
> On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> > * You should probably use get_soup from utils rather than creating your
> > own soup. It knows about encodings and compression.
>
> Not sure if I follow your thinking on this one here - is the intention
> to get rid of a BeautifulSoup dependency?

get_html_parse_tree() (which used to be called get_soup()) returns a
BeautifulSoup tree by default. So your line would simply be:

soup = get_html_parse_tree(url)
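
With that, the title helper shrinks to something like this (a sketch; the broad except goes away too, since Ibid logs whatever plugins raise):

    def _get_title(self, url):
        "Gets the title of a page"
        # get_html_parse_tree handles encodings and compression,
        # and the title stays unicode -- no str() conversion
        soup = get_html_parse_tree(url)
        return soup.title.string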

lp:~jonathanjgster/ibid/jjg updated
580. By Jonathan Groll <jonathan@speedy>

Delicious title retrieval now uses element tree

581. By Jonathan Groll

Merged with trunk

582. By Jonathan Groll

delicious logging mark III

583. By Jonathan Groll

Delicious IV -> freenode regexs and no e.message

Preview Diff

=== modified file 'ibid/config.ini'
--- ibid/config.ini 2009-03-02 11:59:56 +0000
+++ ibid/config.ini 2009-03-14 07:38:01 +0000
@@ -1,4 +1,4 @@
-load = core, basic, info, log, http, irc, seen, identity, admin, auth, help, test, sources, config, bzr, network, factoid, roshambo, eval, crypto, tools, morse, dict, lookup, url, google, memo, karma, feeds, trac, buildbot, apt, misc, math, imdb
+load = core, basic, info, log, http, irc, seen, identity, admin, auth, help, test, sources, config, bzr, network, factoid, roshambo, eval, crypto, tools, morse, dict, lookup, url, google, memo, karma, feeds, trac, buildbot, apt, misc, math, imdb, delicious
 botname = Ibid
 logging = logging.ini
 
@@ -70,6 +70,9 @@
 source = atrum
 channel = "#ibid"
 server = localhost
+ [[delicious]]
+ delname = ibidtest
+ delpwd = a123456
 
 [databases]
 ibid = sqlite:///ibid.db
 
=== added file 'ibid/plugins/delicious.py'
--- ibid/plugins/delicious.py 1970-01-01 00:00:00 +0000
+++ ibid/plugins/delicious.py 2009-03-14 07:38:01 +0000
@@ -0,0 +1,99 @@
+from datetime import datetime
+from urllib import urlencode
+from BeautifulSoup import BeautifulSoup
+
+import urllib2
+import re
+import htmlentitydefs
+import logging
+
+import ibid
+from ibid.plugins import Processor, match, handler
+from ibid.config import Option
+
+help = {'delicious': u'Saves URLs seen in channel to configured delicious account'}
+log = logging.getLogger('plugins.delicious')
+
+class Grab(Processor):
+
+    addressed = False
+    processed = True
+    delname = Option('delname', 'delicious account name')
+    delpwd = Option('delpwd', 'delicious account password')
+
+    @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
+    def grab(self, event, url):
+        self._add_post(self.delname, self.delpwd, url, event.sender['connection'], event.sender['nick'], event.channel)
+
+    def _add_post(self, username, password, url=None, connection=None, nick=None, channel=None):
+        "Posts a URL to delicious.com"
+        if url == None:
+            return
+        if url.find('://') == -1:
+            if url.lower().startswith('ftp'):
+                url = 'ftp://%s' % url
+            else:
+                url = 'http://%s' % url
+
+        date = datetime.now()
+        title = self._get_title(url)
+
+        connection_body = re.split('!', connection)
+        if len(connection_body) == 1:
+            connection_body.append(connection)
+        obfusc = re.sub('@\S+?\.', '^', connection_body[1])
+        tags = nick + " " + obfusc
+
+        data = {
+            'url' : url,
+            'description' : title,
+            'tags' : tags,
+            'replace' : 'yes',
+            'dt' : date,
+        }
+
+        try:
+            self._set_auth(username, password)
+            posturl = "https://api.del.icio.us/v1/posts/add?" + urlencode(data)
+            resp = urllib2.urlopen(posturl).read()
+            if resp.find('done') > 0:
+                log.info(u"Successfully posted url %s seen in channel %s by nick %s at time %s", url, channel, nick, date)
+            else:
+                log.error(u"Error posting url %s: %s", url, resp)
+
+        except urllib2.URLError, e:
+            log.error(u"Error posting url %s: %s", url, e.message)
+        except Exception, e:
+            log.error(u"Error posting url %s: %s", url, e.message)
+
+    def _get_title(self, url):
+        "Gets the title of a page"
+        try:
+            soup = BeautifulSoup(urllib2.urlopen(url))
+            title = str(soup.title.string)
+            ## doing a de_entity results in > 'ascii' codec can't encode character u'\xab' etc.
+            ## leaving this code here in case someone works out how to get urllib2 to post unicode?
+            #final_title = self._de_entity(title)
+            return title
+        except Exception, e:
+            log.error(u"Error determining the title for url %s: %s", url, e.message)
+            return url
+
+    def _set_auth(self, username, password):
+        "Provides HTTP authentication on username and password"
+        auth_handler = urllib2.HTTPBasicAuthHandler()
+        auth_handler.add_password('del.icio.us API', 'https://api.del.icio.us', username, password)
+        opener = urllib2.build_opener(auth_handler)
+        urllib2.install_opener(opener)
+
+    def _de_entity(self, text):
+        "Remove HTML entities, and replace with their characters"
+        replace = lambda match: unichr(int(match.group(1)))
+        text = re.sub("&#(\d+);", replace, text)
+
+        replace = lambda match: unichr(htmlentitydefs.name2codepoint[match.group(1)])
+        text = re.sub("&(\w+);", replace, text)
+        return text
+
+
+# vi: set et sta sw=4 ts=4:
