Merge lp:~jonathanjgster/ibid/jjg into lp:~ibid-core/ibid/old-trunk-pack-0.92

Proposed by Jonathan Groll
Status: Superseded
Proposed branch: lp:~jonathanjgster/ibid/jjg
Merge into: lp:~ibid-core/ibid/old-trunk-pack-0.92
Diff against target: None
To merge this branch: bzr merge lp:~jonathanjgster/ibid/jjg
Reviewer: Stefano Rivera
Status: Needs Fixing
Review via email: mp+4759@code.launchpad.net

This proposal supersedes a proposal from 2009-03-14.

This proposal has been superseded by a proposal from 2009-03-25.

Revision history for this message
Jonathan Groll (jonathanjgster) wrote : Posted in a previous version of this proposal

Added new delicious plugin that grabs all URLs and posts to a delicious account defined in config. Plugin can coexist with url plugin.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

I haven't tested it yet, but just from a read of the code:
* I don't like "delname" and "delpwd"; rather make them "username" and "password".
* You don't need to pass the username and password to _add_post; they are in self.
* Instead of x.find(y) == -1 you can say y not in x
* Ibid favours pre-compiling regexes and saving them in the Processor object.
* You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.
* Be careful - your URL comes in as unicode, but you transform it into a string.

Otherwise, looks good.
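
For illustration, the find/in and pre-compiled regex points amount to this (a minimal sketch, not code from the branch):

    import re

    URL_RE = re.compile(r'https?://\S+')    # compiled once, reused for every event

    text = u'see http://example.com'
    if u'http://' in text:                  # rather than text.find(u'http://') != -1
        url = URL_RE.search(text).group()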

review: Needs Fixing
Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

How about putting the context in the description?

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

Very good first shot at an Ibid module :-) Just a couple of things though:
 * I prefer to put imports into three groups (separated with a newline):
stdlib modules, other modules, and then Ibid modules.
 * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
accept any URL. The current url plugin is very liberal in what it detects. It
might be an idea to capture anything which looks like a URL, but then
validate it in some way (check that the domain exists, for example; a rough
sketch follows this list).
 * Don't catch exceptions unless they're expected errors and you're going to
do something useful with them. Ibid catches and logs exceptions raised by
plugins.
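
A sketch of that capture-then-validate idea (names illustrative; the DNS check is debated below):

    import re
    import socket
    from urlparse import urlparse

    # Capture liberally: anything scheme-ish or www-ish
    CANDIDATE_RE = re.compile(r'\b(?:\w+://\S+|www\.\S+)', re.I)

    def domain_resolves(url):
        "Checks that the host part of a candidate URL resolves in DNS"
        if '://' not in url:
            url = 'http://' + url
        host = urlparse(url).netloc.split(':')[0]    # drop any port
        try:
            socket.gethostbyname(host)
            return True
        except socket.error:
            return False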
 review needs_fixing

review: Needs Fixing
Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

As I see was suggested on IRC, I think that this should be incorporated in the
url plugin, since it's doing the same thing as the current Grab processor,
just to a different location.

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

Adding the channel and source names as tags might also be useful.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Hi Michael (2009.03.14_23:36:11_+0200)
> * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
> accept any URL. The current url plugin is very liberal in what it detects. It
> might be an idea to capture anything which looks like a URL, but then
> validate it in some way (check that the domain exists, for example).

This looks like it uses the same approach as the current url module:
* Detect things with http:// - they must be URLs
* Detect things with that list of TLDs that look URL-ish. This should probably
  be made more general when it gets merged with url.py

SR

Revision history for this message
Jonathan Hitchcock (vhata) wrote : Posted in a previous version of this proposal

Not sure I like the idea of checking whether a domain exists, or checking for what the bot thinks is a "valid TLD" - if it looks like a URL, it should be logged, I think. Somebody might say "My site is going to be at http://foo.com/ when my DNS goes through", and the bot might not know about that TLD.

Anyhoo, that aside, it looks like a good module, but the changes above (especially the merging into the existing URL grabber) should be done before the final merge.

review: Needs Fixing
Revision history for this message
Jonathan Groll (jjg-groll) wrote : Posted in a previous version of this proposal

Hi Stefano,

On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> * You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.

Not sure I follow your thinking here - is the intention
to get rid of the BeautifulSoup dependency?

Personally, I find this code:
            soup = BeautifulSoup(urlopen(url))
            title = soup.title.string

easier to read, and simpler than calling get_html_parse_tree to
retrieve an etree (and the resulting iteration over it); also, what
would be the benefit of using get_html_parse_tree to get back a soup?

All that is required here is the (first) title.

Cheers,
Jonathan

P.S. of course, the following is also elegant:
import lxml.html
t = lxml.html.parse(url)
title = t.find(".//title").text

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

On Saturday 21 March 2009 16:03:10 Jonathan Groll wrote:
> On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> > * You should probably use get_soup from utils rather than creating your
> > own soup. It knows about encodings and compression.
>
> Not sure I follow your thinking here - is the intention
> to get rid of the BeautifulSoup dependency?

get_html_parse_tree() (which used to be called get_soup()) returns a
BeautifulSoup tree by default. So your line would simply be:

soup = get_html_parse_tree(url)
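
And if you'd rather have an ElementTree (as the branch later adopts), the same helper can return one; this assumes the positional signature (url, data, headers, treetype) used in the diff below:

    from ibid.utils import get_html_parse_tree

    etree = get_html_parse_tree(url, None, None, 'etree')
    title = etree.findtext('head/title')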

Revision history for this message
Jonathan Groll (jonathanjgster) wrote :

Code has been adjusted as requested - the code is now in url.py, and most of the cruft has been removed:
 * regexes are compiled;
 * delname and delpwd are now username and password;
 * the line context ('message_raw') is added as delicious extended information, and the source and channel are added to the tags;
 * most of the exception handling logic has been removed to let Ibid take care of it;
 * finally, the data posted to delicious is utf-8 urlencoded.

Existing code in url.py that wasn't touched, but was pointed out as needing fixes, was left alone until after this merge:
= The regex to match URLs could do with some love
= Instead of x.find(y) == -1 you can say y not in x

I've changed the URL title retrieval logic to use ElementTree rather than BeautifulSoup. At least it can do xe.com now.

Anyway, please merge. I'll leave PetitChouChou (my ibid instance) running in #ibid for a day or so; it'll log URLs to delicious.com/ibidtest.

Revision history for this message
Stefano Rivera (stefanor) wrote :

Ah. Now that it's in url.py, the username and password fields need to be renamed again. Sorry.
How about: delicious_username and delicious_password
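
In Option terms that would be something like this (assuming, as in the current diff, that Option's first argument is the config key):

    username = Option('delicious_username', 'delicious account name')
    password = Option('delicious_password', 'delicious account password')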

Line 72 - should have spaces around the +

The obfusc section looks like it's very IRC specific. Have you tested with other protocols?

I don't think the try-except logic in _get_title is necessary. It goes against ibid-style (I make the same mistake all the time).
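
That is, _get_title would shrink to something like this (a sketch based on the diff, leaving Ibid to catch and log anything that goes wrong):

    def _get_title(self, url):
        "Gets the title of a page"
        etree = get_html_parse_tree(url, None, None, 'etree')
        return etree.findtext('head/title')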

I'd put the delicious regexes inside the Delicious class.

Aside from that, no complaints. Will test it on hardy now.

Revision history for this message
Stefano Rivera (stefanor) wrote :

> Existing code in url.py that wasn't touched, but was pointed out as needing to be fixed was left
> alone, until after this merge

Feel free to fix that stuff - this branch looks about ready and we might as well merge it all in one.

Revision history for this message
Stefano Rivera (stefanor) wrote :

> Will test it on hardy now.

Works like a charm

Revision history for this message
Stefano Rivera (stefanor) wrote :

> I don't think the try-except logic in _get_title is necessary. It goes against ibid-style

Sorry, I misunderstood the reason for that. But you should change it to log.exception rather than log.error
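
That is just the traceback-logging variant from the stdlib logging module, with the same arguments:

    log.exception(u"Delicious logic - error determining the title for url %s", url)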

Revision history for this message
Stefano Rivera (stefanor) wrote :

> The obfusc section looks like it's very IRC specific. Have you tested with other protocols?

In my jabber test, this resulted in my JID being attached as a tag. My JID happens to be my email address, like many people.
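
One way to avoid that (an illustrative sketch, not tested against the branch) is to mask whole address-looking tokens rather than assuming an IRC-style nick!user@host connection string:

    import re

    ADDRESS_RE = re.compile(r'\S+@\S+')

    def obfuscate_connection(connection):
        # Replace anything that still looks like an email address outright
        return ADDRESS_RE.sub('masked', connection)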

Revision history for this message
Stefano Rivera (stefanor) wrote :

Almost there

review: Needs Fixing
lp:~jonathanjgster/ibid/jjg updated
581. By Jonathan Groll

Merged with trunk

582. By Jonathan Groll

delicious logging mark III

583. By Jonathan Groll

Delicious IV -> freenode regexs and no e.message

Preview Diff

1=== modified file 'ibid/config.ini'
2--- ibid/config.ini 2009-03-16 21:33:13 +0000
3+++ ibid/config.ini 2009-03-21 14:50:21 +0000
4@@ -70,6 +70,9 @@
5 source = atrum
6 channel = "#ibid"
7 server = localhost
8+ [[url]]
9+ username = ibidtest
10+ password = a123456
11
12 [databases]
13 ibid = sqlite:///ibid.db
14
15=== modified file 'ibid/plugins/url.py'
16--- ibid/plugins/url.py 2009-03-08 13:16:28 +0000
17+++ ibid/plugins/url.py 2009-03-21 21:05:09 +0000
18@@ -1,6 +1,7 @@
19 from datetime import datetime
20 from urllib import urlencode
21-from urllib2 import urlopen, HTTPRedirectHandler, build_opener, HTTPError
22+from urllib2 import urlopen, HTTPRedirectHandler, build_opener, HTTPError, HTTPBasicAuthHandler, install_opener
23+import logging
24 import re
25
26 from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table
27@@ -9,8 +10,13 @@
28 from ibid.plugins import Processor, match, handler
29 from ibid.config import Option
30 from ibid.models import Base
31-
32-help = {'url': u'Captures URLs seen in channel, and shortens and lengthens URLs'}
33+from ibid.utils import get_html_parse_tree
34+
35+help = {'url': u'Captures URLs seen in channel to database and/or to delicious, and shortens and lengthens URLs'}
36+
37+log = logging.getLogger('plugins.url')
38+at_re = re.compile('@\S+?\.')
39+exclamation_re = re.compile('!')
40
41 class URL(Base):
42 __table__ = Table('urls', Base.metadata,
43@@ -27,10 +33,64 @@
44 self.identity_id = identity_id
45 self.time = datetime.now()
46
47+class Delicious():
48+
49+ def add_post(self,username,password,event,url=None):
50+ "Posts a URL to delicious.com"
51+
52+ date = datetime.now()
53+ title = self._get_title(url)
54+
55+ connection_body = exclamation_re.split(event.sender['connection'])
56+ if len(connection_body) == 1:
57+ connection_body.append(event.sender['connection'])
58+ obfusc = at_re.sub('^', connection_body[1])
59+
60+ tags = event.sender['nick'] + " " + obfusc + " " + event.channel + " " + event.source
61+
62+ data = {
63+ 'url' : url,
64+ 'description' : title,
65+ 'tags' : tags,
66+ 'replace' : u"yes",
67+ 'dt' : date,
68+ 'extended' : event.message_raw
69+ }
70+
71+ self._set_auth(username,password)
72+ posturl = "https://api.del.icio.us/v1/posts/add?"+urlencode(data, 'utf-8')
73+ resp = urlopen(posturl).read()
74+ if 'done' in resp:
75+ log.info(u"Successfully posted url %s to delicious, posted in channel %s by nick %s at time %s", \
76+ url, event.channel, event.sender['nick'], date)
77+ else:
78+ log.error(u"Error posting url %s to delicious: %s", url, resp)
79+
80+ def _get_title(self,url):
81+ "Gets the title of a page"
82+ try:
83+ headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
84+ etree = get_html_parse_tree(url, None, headers, 'etree')
85+ title = etree.findtext("head/title")
86+ return title
87+ except Exception, e:
88+ log.error(u"Delicious logic - error determining the title for url %s: %s", url, e.message)
89+ return url
90+
91+ def _set_auth(self,username,password):
92+ "Provides HTTP authentication on username and password"
93+ auth_handler = HTTPBasicAuthHandler()
94+ auth_handler.add_password('del.icio.us API', 'https://api.del.icio.us', username, password)
95+ opener = build_opener(auth_handler)
96+ install_opener(opener)
97+
98 class Grab(Processor):
99
100 addressed = False
101 processed = True
102+ username = Option('username', 'delicious account name')
103+ password = Option('password', 'delicious account password')
104+ delicious = Delicious()
105
106 @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
107 def grab(self, event, url):
108@@ -46,6 +106,9 @@
109 session.flush()
110 session.close()
111
112+ if self.username != None:
113+ self.delicious.add_post(self.username, self.password, event, url)
114+
115 class Shorten(Processor):
116 u"""shorten <url>"""
117 feature = 'url'
118@@ -75,7 +138,7 @@
119
120 def setup(self):
121 self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I)
122-
123+
124 @handler
125 def lengthen(self, event, url):
126 opener = build_opener(NullRedirect())
127@@ -88,5 +151,5 @@
128
129 f.close()
130 event.addresponse(u"No redirect")
131-
132+
133 # vi: set et sta sw=4 ts=4:
