Merge lp:~jonathanjgster/ibid/jjg into lp:~ibid-core/ibid/old-trunk-pack-0.92

Proposed by Jonathan Groll
Status: Superseded
Proposed branch: lp:~jonathanjgster/ibid/jjg
Merge into: lp:~ibid-core/ibid/old-trunk-pack-0.92
Diff against target: None
To merge this branch: bzr merge lp:~jonathanjgster/ibid/jjg
Reviewer: Stefano Rivera
Status: Needs Fixing
Review via email: mp+4759@code.launchpad.net

This proposal supersedes a proposal from 2009-03-14.

This proposal has been superseded by a proposal from 2009-03-25.

Revision history for this message
Jonathan Groll (jonathanjgster) wrote : Posted in a previous version of this proposal

Added new delicious plugin that grabs all URLs and posts to a delicious account defined in config. Plugin can coexist with url plugin.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

I haven't tested it yet, but just from a read of the code:
* I don't like "delname" and "delpwd"; rather make them "username" and "password".
* You don't need to pass the username and password to _add_post; they are in self.
* Instead of x.find(y) == -1 you can say y not in x
* Ibid favours pre-compiling regexes and saving them in the Processor object.
* You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.
* Be careful - your URL comes in as unicode, but you transform it into a string.

Otherwise, looks good.
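
For illustration, the find/in and pre-compiled regex points amount to this (a minimal sketch, not code from the branch):

    import re

    URL_RE = re.compile(r'https?://\S+')    # compiled once, reused for every event

    text = u'see http://example.com'
    if u'http://' in text:                  # rather than text.find(u'http://') != -1
        url = URL_RE.search(text).group()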

review: Needs Fixing
Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

How about putting the context in the description?

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

Very good first shot at an Ibid module :-) Just a couple of things though:
 * I prefer to put imports into three groups (separated with a newline):
stdlib modules, other modules, and then Ibid modules.
 * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
accept any URL. The current url plugin is very liberal in what it detects. It
might be an idea to capture anything which looks like a URL, but then
validate it in some way (check that the domain exists, for example; a rough
sketch follows this list).
 * Don't catch exceptions unless they're expected errors and you're going to
do something useful with them. Ibid catches and logs exceptions raised by
plugins.
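
A sketch of that capture-then-validate idea (names illustrative; the DNS check is debated below):

    import re
    import socket
    from urlparse import urlparse

    # Capture liberally: anything scheme-ish or www-ish
    CANDIDATE_RE = re.compile(r'\b(?:\w+://\S+|www\.\S+)', re.I)

    def domain_resolves(url):
        "Checks that the host part of a candidate URL resolves in DNS"
        if '://' not in url:
            url = 'http://' + url
        host = urlparse(url).netloc.split(':')[0]    # drop any port
        try:
            socket.gethostbyname(host)
            return True
        except socket.error:
            return False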
 review needs_fixing

review: Needs Fixing
Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

As I see was suggested on IRC, I think that this should be incorporated in the
url plugin, since it's doing the same thing as the current Grab processor,
just to a different location.

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

Adding the channel and source names as tags might also be useful.

Revision history for this message
Stefano Rivera (stefanor) wrote : Posted in a previous version of this proposal

Hi Michael (2009.03.14_23:36:11_+0200)
> * Does the grab regex only allow (com|org|net|za) TLDs? If so, it should
> accept any URL. The current url plugin is very liberal in what it detects. It
> might be an idea to capture anything which looks like a URL, but then
> validate it in some way (check that the domain exists, for example).

This looks like it uses the same approach as the current url module:
* Detect things with http:// - they must be URLs
* Detect things with that list of TLDs that look URL-ish. This should probably
  be made more general when it gets merged with url.py

SR

Revision history for this message
Jonathan Hitchcock (vhata) wrote : Posted in a previous version of this proposal

Not sure I like the idea of checking whether a domain exists, or checking for what the bot thinks is a "valid TLD" - if it looks like a URL, it should be logged, I think. Somebody might say "My site is going to be at http://foo.com/ when my DNS goes through", and the bot might not know about that TLD.

Anyhoo, that aside, it looks like a good module, but the changes above (especially the merging into the existing URL grabber) should be done before the final merge.

review: Needs Fixing
Revision history for this message
Jonathan Groll (jjg-groll) wrote : Posted in a previous version of this proposal

Hi Stefano,

On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> * You should probably use get_soup from utils rather than creating your own soup. It knows about encodings and compression.

Not sure I follow your thinking here - is the intention
to get rid of the BeautifulSoup dependency?

Personally, I find this code:
            soup = BeautifulSoup(urlopen(url))
            title = soup.title.string

easier to read, and simpler than calling get_html_parse_tree to
retrieve an etree (and the resulting iteration over it); also, what
would be the benefit of using get_html_parse_tree to get back a soup?

All that is required here is the (first) title.

Cheers,
Jonathan

P.S. of course, the following is also elegant:
import lxml.html
t = lxml.html.parse(url)
title = t.find(".//title").text

Revision history for this message
Michael Gorven (mgorven) wrote : Posted in a previous version of this proposal

On Saturday 21 March 2009 16:03:10 Jonathan Groll wrote:
> On Sat, Mar 14, 2009 at 09:24:48AM -0000, Stefano Rivera wrote:
> > * You should probably use get_soup from utils rather than creating your
> > own soup. It knows about encodings and compression.
>
> Not sure I follow your thinking here - is the intention
> to get rid of the BeautifulSoup dependency?

get_html_parse_tree() (which used to be called get_soup()) returns a
BeautifulSoup tree by default. So your line would simply be:

soup = get_html_parse_tree(url)
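
And if you'd rather have an ElementTree (as the branch later adopts), the same helper can return one; this assumes the positional signature (url, data, headers, treetype) used in the diff below:

    from ibid.utils import get_html_parse_tree

    etree = get_html_parse_tree(url, None, None, 'etree')
    title = etree.findtext('head/title')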

Revision history for this message
Jonathan Groll (jonathanjgster) wrote :

Code has been adjusted as requested - the code is now in url.py, and most of the cruft has been removed:
 * regexes are compiled;
 * delname and delpwd are now username and password;
 * the line context ('message_raw') is added as delicious extended information, and the source and channel are added to the tags;
 * most of the exception handling logic has been removed to let Ibid take care of it;
 * finally, the data posted to delicious is utf-8 urlencoded.

Existing code in url.py that wasn't touched, but was pointed out as needing fixes, was left alone until after this merge:
= The regex to match URLs could do with some love
= Instead of x.find(y) == -1 you can say y not in x

I've changed the URL title retrieval logic to use ElementTree rather than BeautifulSoup. At least it can do xe.com now.

Anyway, please merge. I'll leave PetitChouChou (my ibid instance) running in #ibid for a day or so; it'll log URLs to delicious.com/ibidtest.

Revision history for this message
Stefano Rivera (stefanor) wrote :

Ah. Now that it's in url.py, the username and password fields need to be renamed again. Sorry.
How about: delicious_username and delicious_password
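
In Option terms that would be something like this (assuming, as in the current diff, that Option's first argument is the config key):

    username = Option('delicious_username', 'delicious account name')
    password = Option('delicious_password', 'delicious account password')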

Line 72 - should have spaces around the +

The obfusc section looks like it's very IRC specific. Have you tested with other protocols?

I don't think the try-except logic in _get_title is necessary. It goes against ibid-style (I make the same mistake all the time).
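
That is, _get_title would shrink to something like this (a sketch based on the diff, leaving Ibid to catch and log anything that goes wrong):

    def _get_title(self, url):
        "Gets the title of a page"
        etree = get_html_parse_tree(url, None, None, 'etree')
        return etree.findtext('head/title')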

I'd put the delicious regexes inside the Delicious class.

Aside from that, no complaints. Will test it on hardy now.

Revision history for this message
Stefano Rivera (stefanor) wrote :

> Existing code in url.py that wasn't touched, but was pointed out as needing to be fixed was left
> alone, until after this merge

Feel free to fix that stuff - this branch looks about ready and we might as well merge it all in one.

Revision history for this message
Stefano Rivera (stefanor) wrote :

> Will test it on hardy now.

Works like a charm

Revision history for this message
Stefano Rivera (stefanor) wrote :

> I don't think the try-except logic in _get_title is necessary. It goes against ibid-style

Sorry, I misunderstood the reason for that. But you should change it to log.exception rather than log.error
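
That is just the traceback-logging variant from the stdlib logging module, with the same arguments:

    log.exception(u"Delicious logic - error determining the title for url %s", url)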

Revision history for this message
Stefano Rivera (stefanor) wrote :

> The obfusc section looks like it's very IRC specific. Have you tested with other protocols?

In my jabber test, this resulted in my JID being attached as a tag. My JID happens to be my email address, like many people.
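
One way to avoid that (an illustrative sketch, not tested against the branch) is to mask whole address-looking tokens rather than assuming an IRC-style nick!user@host connection string:

    import re

    ADDRESS_RE = re.compile(r'\S+@\S+')

    def obfuscate_connection(connection):
        # Replace anything that still looks like an email address outright
        return ADDRESS_RE.sub('masked', connection)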

Revision history for this message
Stefano Rivera (stefanor) wrote :

Almost there

review: Needs Fixing
lp:~jonathanjgster/ibid/jjg updated
581. By Jonathan Groll

Merged with trunk

582. By Jonathan Groll

delicious logging mark III

583. By Jonathan Groll

Delicious IV -> freenode regexs and no e.message

Preview Diff

1=== modified file 'ibid/config.ini'
2--- ibid/config.ini 2009-03-16 21:33:13 +0000
3+++ ibid/config.ini 2009-03-21 14:50:21 +0000
4@@ -70,6 +70,9 @@
5 source = atrum
6 channel = "#ibid"
7 server = localhost
8+ [[url]]
9+ username = ibidtest
10+ password = a123456
11
12 [databases]
13 ibid = sqlite:///ibid.db
14
15=== modified file 'ibid/plugins/url.py'
16--- ibid/plugins/url.py 2009-03-08 13:16:28 +0000
17+++ ibid/plugins/url.py 2009-03-21 21:05:09 +0000
18@@ -1,6 +1,7 @@
19 from datetime import datetime
20 from urllib import urlencode
21-from urllib2 import urlopen, HTTPRedirectHandler, build_opener, HTTPError
22+from urllib2 import urlopen, HTTPRedirectHandler, build_opener, HTTPError, HTTPBasicAuthHandler, install_opener
23+import logging
24 import re
25
26 from sqlalchemy import Column, Integer, Unicode, DateTime, UnicodeText, ForeignKey, Table
27@@ -9,8 +10,13 @@
28 from ibid.plugins import Processor, match, handler
29 from ibid.config import Option
30 from ibid.models import Base
31-
32-help = {'url': u'Captures URLs seen in channel, and shortens and lengthens URLs'}
33+from ibid.utils import get_html_parse_tree
34+
35+help = {'url': u'Captures URLs seen in channel to database and/or to delicious, and shortens and lengthens URLs'}
36+
37+log = logging.getLogger('plugins.url')
38+at_re = re.compile('@\S+?\.')
39+exclamation_re = re.compile('!')
40
41 class URL(Base):
42 __table__ = Table('urls', Base.metadata,
43@@ -27,10 +33,64 @@
44 self.identity_id = identity_id
45 self.time = datetime.now()
46
47+class Delicious():
48+
49+ def add_post(self,username,password,event,url=None):
50+ "Posts a URL to delicious.com"
51+
52+ date = datetime.now()
53+ title = self._get_title(url)
54+
55+ connection_body = exclamation_re.split(event.sender['connection'])
56+ if len(connection_body) == 1:
57+ connection_body.append(event.sender['connection'])
58+ obfusc = at_re.sub('^', connection_body[1])
59+
60+ tags = event.sender['nick'] + " " + obfusc + " " + event.channel + " " + event.source
61+
62+ data = {
63+ 'url' : url,
64+ 'description' : title,
65+ 'tags' : tags,
66+ 'replace' : u"yes",
67+ 'dt' : date,
68+ 'extended' : event.message_raw
69+ }
70+
71+ self._set_auth(username,password)
72+ posturl = "https://api.del.icio.us/v1/posts/add?"+urlencode(data, 'utf-8')
73+ resp = urlopen(posturl).read()
74+ if 'done' in resp:
75+ log.info(u"Successfully posted url %s to delicious, posted in channel %s by nick %s at time %s", \
76+ url, event.channel, event.sender['nick'], date)
77+ else:
78+ log.error(u"Error posting url %s to delicious: %s", url, resp)
79+
80+ def _get_title(self,url):
81+ "Gets the title of a page"
82+ try:
83+ headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
84+ etree = get_html_parse_tree(url, None, headers, 'etree')
85+ title = etree.findtext("head/title")
86+ return title
87+ except Exception, e:
88+ log.error(u"Delicious logic - error determining the title for url %s: %s", url, e.message)
89+ return url
90+
91+ def _set_auth(self,username,password):
92+ "Provides HTTP authentication on username and password"
93+ auth_handler = HTTPBasicAuthHandler()
94+ auth_handler.add_password('del.icio.us API', 'https://api.del.icio.us', username, password)
95+ opener = build_opener(auth_handler)
96+ install_opener(opener)
97+
98 class Grab(Processor):
99
100 addressed = False
101 processed = True
102+ username = Option('username', 'delicious account name')
103+ password = Option('password', 'delicious account password')
104+ delicious = Delicious()
105
106 @match(r'((?:\S+://|(?:www|ftp)\.)\S+|\S+\.(?:com|org|net|za)\S*)')
107 def grab(self, event, url):
108@@ -46,6 +106,9 @@
109 session.flush()
110 session.close()
111
112+ if self.username != None:
113+ self.delicious.add_post(self.username, self.password, event, url)
114+
115 class Shorten(Processor):
116 u"""shorten <url>"""
117 feature = 'url'
118@@ -75,7 +138,7 @@
119
120 def setup(self):
121 self.lengthen.im_func.pattern = re.compile(r'^((?:%s)\S+)$' % '|'.join([re.escape(service) for service in self.services]), re.I)
122-
123+
124 @handler
125 def lengthen(self, event, url):
126 opener = build_opener(NullRedirect())
127@@ -88,5 +151,5 @@
128
129 f.close()
130 event.addresponse(u"No redirect")
131-
132+
133 # vi: set et sta sw=4 ts=4:
