Merge lp:~max-rabkin/ibid/google-translate into lp:~ibid-core/ibid/old-trunk-1.6

Proposed by Max Rabkin
Status: Merged
Approved by: Michael Gorven
Approved revision: 813
Merged at revision: 804
Proposed branch: lp:~max-rabkin/ibid/google-translate
Merge into: lp:~ibid-core/ibid/old-trunk-1.6
Diff against target: 192 lines (+135/-7)
2 files modified
ibid/plugins/google.py (+132/-7)
ibid/utils.py (+3/-0)
To merge this branch: bzr merge lp:~max-rabkin/ibid/google-translate
Reviewer             Review Type    Date Requested    Status
Michael Gorven                                        Approve
Jonathan Hitchcock                                    Approve
Stefano Rivera                                        Approve
Review via email: mp+16402@code.launchpad.net
lp:~max-rabkin/ibid/google-translate updated
799. By Max Rabkin

handle translation errors

800. By Max Rabkin

Set user-agent to Ibid/<version> by default in json_webservice

Michael Gorven (mgorven) wrote :

+ 'langpair': (src_lang or '') + '|' + (dest_lang or 'en')}

The default language should probably be configurable, but since Ibid is all
English at the moment it's probably not a big deal.

+ headers = {'referrer': self.referrer}

The header is spelt "Referer" (no idea why)...

 review needs_fixing

review: Needs Fixing
Stefano Rivera (stefanor) wrote :

> The header is spelt "Referer" (no idea why)...

Please fix that everywhere in google.py

This is getting to the point where I'm happy, but I'd still like to see the ability to use full language names (and three-letter codes?).

review: Needs Fixing
lp:~max-rabkin/ibid/google-translate updated
801. By Max Rabkin

Spell 'referer' header (in)correctly in google.py

802. By Max Rabkin

allow full names for languages in translation (searching ISO 639 registry, except for special cases)

Jonathan Hitchcock (vhata) wrote :

- Line 58 is redundant - it is already handled by line 48

- Lines 31, 32 and 59-61 are badly spaced - I think there's a style guide for that.

- I do think that the plugin's variable should be called "referer", if that's how HTTP thinks the word is spelled - keep it consistent everywhere.

- Your @match regular expression is still troublesome:

$ pcretest
PCRE version 7.8 2008-09-05
  re> /^translate\s+(.*?)(?:\s+from\s+(.+?))?(?:\s+to\s+(.+?))?$/
data> translate moving to the country from german to french
 0: translate moving to the country from german to french
 1: moving
 2: <unset>
 3: the country from german to french

You can't make the first catch-all non-greedy, because you want it to eat as much of the data as possible before you get to the "from X to Y" part, including any 'from's and 'to's in the phrase itself. But you can't make it greedy either, because the 'from' and 'to' parts are optional, so it will just eat everything and leave them unset. In other words, a single regex can't do what you want it to do.

Knab solved this by using "translate [ from <lang> ] [ to <lang> ] data..." instead. It's clunky, but it works. I'm not sure what the others think of this...

- Finally, your parsing of ISO-639-2 is a bit suspect. That file is strictly formatted - each line has five parts, separated by pipes - you don't need a complicated multi-line regex to parse it. You can just whip through each line and look for the language in question, then grab the code from that line, can't you? (You may be doing something extra in your regex that I didn't grasp, in which case, my apologies.)
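
For what it's worth, the line-by-line approach suggested in the last point might look roughly like the sketch below. It is written for Python 3 rather than the Python 2 the plugin targets. The function name `parse_iso639_2` and the sample lines are assumptions made for illustration, not code from the branch.

```python
def parse_iso639_2(lines):
    """Map language codes and lower-cased English names to one code each."""
    names = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Each line has five pipe-separated fields:
        # 639-2 (B) code | 639-2 (T) code | 639-1 code | English | French
        code2b, code2t, code1, english, french = line.split('|')
        # Prefer the two-letter code, falling back to the 639-2 (B) code.
        ident = code1 or code2b
        for code in (code2b, code2t, code1):
            if code:  # skip empty fields instead of deleting '' afterwards
                names[code] = ident
        for name in english.lower().split(';'):
            # Stripping whitespace around alternative names is an
            # assumption of this sketch.
            names[name.strip()] = ident
    return names


# Illustrative lines in the five-field format (not read from the real file).
sample = [
    "ger|deu|de|German|allemand",
    "fre|fra|fr|French|français",
    "ace|||Achinese|aceh",
    "nob||nb|Bokmål, Norwegian; Norwegian Bokmål|norvégien bokmål",
]
lang_names = parse_iso639_2(sample)
```

A lookup such as `lang_names['german']` then gives the two-letter code, and a language without one falls back to its three-letter code.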

review: Needs Fixing
lp:~max-rabkin/ibid/google-translate updated
803. By Max Rabkin

remove duplicate handling of defaults

804. By Max Rabkin

Simplify "help translate"

805. By Max Rabkin

Uniformly use the HTTP spelling of "referer" in google.py

806. By Max Rabkin

Use free-form @match regex and then extract parts in translation handler

807. By Max Rabkin

fixed multi-line dictionary style

808. By Max Rabkin

only download language lookup once per translation request

809. By Max Rabkin

added remembered group in language-code regex

810. By Max Rabkin

merged

811. By Max Rabkin

remove trailing whitespace

812. By Max Rabkin

use DOTALL when parsing translation requests
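
A rough sketch of the idea behind revisions 806 and 812, for anyone reading along: match the request free-form, then try the possible "from"/"to" layouts in order until one names languages that are actually known. This is Python 3, not the branch's code. The `KNOWN` table, the group names `src` and `dest`, and the `\s+` in front of the language clauses are assumptions of the sketch.

```python
import re

# The single regex from the review, kept to show why it falls short.
SINGLE = re.compile(
    r'^translate\s+(.*?)(?:\s+from\s+(.+?))?(?:\s+to\s+(.+?))?$')

# Hypothetical lookup table standing in for the ISO 639 dictionary.
KNOWN = {'german': 'de', 'french': 'fr', 'english': 'en'}

FROM_RE = r'from\s+(?P<src>[-()\s\w]+?)'
TO_RE = r'to\s+(?P<dest>[-()\s\w]+?)'
# Most specific layouts first; the empty tuple means "no languages given".
LAYOUTS = [(FROM_RE, TO_RE), (TO_RE, FROM_RE), (TO_RE,), (FROM_RE,), ()]


def split_request(data, default_dest='en'):
    """Return (text, src, dest), or None if no layout fits."""
    for parts in LAYOUTS:
        if parts:
            pattern = r'(?P<text>.*)\s+' + r'\s+'.join(parts) + '$'
        else:
            pattern = r'(?P<text>.*)$'
        # DOTALL lets the phrase itself contain newlines.
        m = re.match(pattern, data, re.IGNORECASE | re.DOTALL)
        if not m:
            continue
        groups = m.groupdict()
        src, dest = groups.get('src'), groups.get('dest')
        try:
            src = KNOWN[src.lower()] if src else ''
            dest = KNOWN[dest.lower()] if dest else default_dest
        except KeyError:
            # "to the jungle" is not a language: try the next layout.
            continue
        return m.group('text'), src, dest
    return None
```

The greedy `text` group takes everything up to the last clause that could be a language specification, and an unknown language name just means the next, less specific layout gets a turn.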

Stefano Rivera (stefanor) wrote :

Fixed everything I had problems with

review: Approve
Jonathan Hitchcock (vhata) wrote :

Approving on condition that _make_language_dict() is only called if self.lang_names is None.

Awesome plugin, ta.

review: Approve
lp:~max-rabkin/ibid/google-translate updated
813. By Max Rabkin

only download ISO 639 database once
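
The "only once" behaviour amounts to a lazy, cached load. A minimal Python 3 sketch of that pattern follows, with a hypothetical `LanguageNames` class and an injected `loader` callable standing in for the download and parse step. The plugin in the diff checks `hasattr(self, 'lang_names')` instead of comparing with `None`, which has the same effect.

```python
class LanguageNames(object):
    """Build the name-to-code table on first use and then reuse it."""

    def __init__(self, loader):
        self._loader = loader    # callable returning a dict; hypothetical
        self.lang_names = None   # not loaded yet

    def lookup(self, name):
        if self.lang_names is None:
            # First request: do the expensive download and parse once.
            self.lang_names = self._loader()
        return self.lang_names.get(name.lower())


calls = []

def fake_loader():
    # Stand-in for fetching and parsing the ISO 639 registry.
    calls.append(1)
    return {'german': 'de'}

names = LanguageNames(fake_loader)
first = names.lookup('German')
second = names.lookup('german')
```

Both lookups return the same code, and the loader runs only for the first one.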

Michael Gorven (mgorven) wrote :

 review approve
 status approved

review: Approve

Preview Diff

  1 === modified file 'ibid/plugins/google.py'
  2 --- ibid/plugins/google.py 2009-12-20 20:08:58 +0000
  3 +++ ibid/plugins/google.py 2009-12-22 16:59:12 +0000
  4 @@ -1,3 +1,4 @@
  5 +import codecs
  6  from httplib import BadStatusLine
  7  import re
  8  from urllib import quote
  9 @@ -7,12 +8,12 @@
 10
 11  from ibid.plugins import Processor, match
 12  from ibid.config import Option
 13 -from ibid.utils import decode_htmlentities, ibid_version, json_webservice
 14 +from ibid.utils import decode_htmlentities, ibid_version, json_webservice, cacheable_download
 15
 16  help = {'google': u'Retrieves results from Google and Google Calculator.'}
 17
 18  default_user_agent = 'Mozilla/5.0'
 19 -default_referrer = "http://ibid.omnia.za.net/"
 20 +default_referer = "http://ibid.omnia.za.net/"
 21
 22  class GoogleAPISearch(Processor):
 23      u"""google [for] <term>
 24 @@ -21,7 +22,7 @@
 25      feature = 'google'
 26
 27      api_key = Option('api_key', 'Your Google API Key (optional)', None)
 28 -    referrer = Option('referrer', 'The referrer string to use (API searches)', default_referrer)
 29 +    referer = Option('referer', 'The referer string to use (API searches)', default_referer)
 30
 31      def _google_api_search(self, query, resultsize="large", country=None):
 32          params = {
 33 @@ -34,10 +35,7 @@
 34          if self.api_key:
 35              params['key'] = self.api_key
 36
 37 -        headers={
 38 -            'user-agent': "Ibid/%s" % ibid_version() or "dev",
 39 -            'referrer': self.referrer,
 40 -        }
 41 +        headers = {'referer': self.referer}
 42          return json_webservice('http://ajax.googleapis.com/ajax/services/search/web', params, headers)
 43
 44      @match(r'^google(?:\.com?)?(?:\.([a-z]{2}))?\s+(?:for\s+)?(.+?)$')
 45 @@ -125,6 +123,133 @@
 46          else:
 47              event.addresponse(u'Are you making up words again?')
 48
 49 +class UnknownLanguageException (Exception): pass
 50 +
 51 +help['translate'] = u'''Translates a phrase using Google Translate.'''
 52 +class Translate(Processor):
 53 +    u"""translate <phrase> [from <language code>] [to <language code>]"""
 54 +
 55 +    feature = 'translate'
 56 +
 57 +    api_key = Option('api_key', 'Your Google API Key (optional)', None)
 58 +    referer = Option('referer', 'The referer string to use (API searches)', default_referer)
 59 +    dest_lang = Option('dest_lang', 'Destination language when none is specified', 'en')
 60 +
 61 +    @match(r'^translate\s+(.*)$')
 62 +    def translate (self, event, data):
 63 +        if not hasattr(self, 'lang_names'):
 64 +            self._make_language_dict()
 65 +
 66 +        from_re = r'from\s+(?P<from>(?:[-()]|\s|\w)+?)'
 67 +        to_re = r'to\s+(?P<to>(?:[-()]|\s|\w)+?)'
 68 +
 69 +        res = [(from_re, to_re), (to_re, from_re), (to_re,), (from_re,), ()]
 70 +
 71 +        # Try all possible specifications of source and target language until we
 72 +        # find a valid one.
 73 +        for pat in res:
 74 +            pat = '(?P<text>.*)' + '\s+'.join(pat) + '$'
 75 +            m = re.match(pat, data, re.IGNORECASE | re.UNICODE | re.DOTALL)
 76 +            if m:
 77 +                dest_lang = m.groupdict().get('to')
 78 +                src_lang = m.groupdict().get('from')
 79 +                try:
 80 +                    if dest_lang:
 81 +                        dest_lang = self.language_code(dest_lang)
 82 +                    else:
 83 +                        dest_lang = self.dest_lang
 84 +
 85 +                    if src_lang:
 86 +                        src_lang = self.language_code(src_lang)
 87 +                    else:
 88 +                        src_lang = ''
 89 +
 90 +                    self._translate(event, m.group('text'), src_lang, dest_lang)
 91 +                except UnknownLanguageException:
 92 +                    continue
 93 +                else:
 94 +                    break
 95 +        else:
 96 +            event.addresponse("I've never heard of that language.")
 97 +
 98 +    def _translate (self, event, phrase, src_lang, dest_lang):
 99 +        params = {
100 +            'v': '1.0',
101 +            'q': phrase,
102 +            'langpair': src_lang + '|' + dest_lang,
103 +        }
104 +        if self.api_key:
105 +            params['key'] = self.api_key
106 +
107 +        headers = {'referer': self.referer}
108 +
109 +        response = json_webservice(
110 +            'http://ajax.googleapis.com/ajax/services/language/translate',
111 +            params, headers)
112 +
113 +        if response['responseStatus'] == 200:
114 +            translated = decode_htmlentities(
115 +                response['responseData']['translatedText'])
116 +
117 +            event.addresponse(translated)
118 +        else:
119 +            errors = {
120 +                'invalid translation language pair':
121 +                    "I don't know that language",
122 +                'invalid text':
123 +                    "there's not much to go on",
124 +                'could not reliably detect source language':
125 +                    "I'm not sure what language that was",
126 +            }
127 +
128 +            msg = errors.get(response['responseDetails'],
129 +                             response['responseDetails'])
130 +
131 +            event.addresponse(u"I couldn't translate that: %s.", msg)
132 +
133 +    def _make_language_dict (self):
134 +        self.lang_names = d = {}
135 +
136 +        filename = cacheable_download('http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt',
137 +                                      'google/ISO-639-2_utf-8.txt')
138 +        f = codecs.open(filename, 'rU', 'utf-8')
139 +        for line in f:
140 +            code2B, code2T, code1, englishNames, frenchNames = line.split('|')
141 +
142 +            # Identify languages by ISO 639-1 code if it exists; otherwise use
143 +            # ISO 639-2 (B). Google currently only translates languages with -1
144 +            # codes, but may use -2 (B) codes in the future.
145 +            ident = code1 or code2B
146 +
147 +            d[code2B] = d[code2T] = d[code1] = ident
148 +            for name in englishNames.lower().split(';'):
149 +                d[name] = ident
150 +
151 +        del d['']
152 +
153 +    def language_code (self, name):
154 +        """Convert a name to a language code.
155 +
156 +        Caller must call _make_language_dict first."""
157 +
158 +        name = name.lower()
159 +
160 +        m = re.match('^([a-z]{2})(?:-[a-z]{2})?$', name)
161 +        if m and m.group(1) in self.lang_names:
162 +            return name
163 +        if 'simplified' in name:
164 +            return 'zh-CN'
165 +        if 'traditional' in name:
166 +            return 'zh-TW'
167 +        if re.search(u'bokm[a\N{LATIN SMALL LETTER A WITH RING ABOVE}]l', name):
168 +            # what Google calls Norwegian seems to be Bokmal
169 +            return 'no'
170 +
171 +        try:
172 +            return self.lang_names[name]
173 +        except KeyError:
174 +            raise UnknownLanguageException
175 +
176  # This Plugin uses code from youtube-dl
177  # Copyright (c) 2006-2008 Ricardo Garcia Gonzalez
178  # Released under MIT Licence
179
180 === modified file 'ibid/utils.py'
181 --- ibid/utils.py 2009-11-08 17:44:35 +0000
182 +++ ibid/utils.py 2009-12-22 16:59:12 +0000
183 @@ -217,6 +217,9 @@
184          url += '?' + urlencode(params)
185
186      req = urllib2.Request(url, headers=headers)
187 +    if not req.has_header('user-agent'):
188 +        req.add_header('User-Agent', 'Ibid/' + (ibid_version() or 'dev'))
189 +
190      f = urllib2.urlopen(req)
191      data = f.read()
192      f.close()
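
One thing that may be worth double-checking in the utils.py hunk: in Python 3's `urllib.request`, `Request.add_header()` stores header names via `str.capitalize()`, and `has_header()` does an exact key lookup, so a lower-case `'user-agent'` query does not find a header the caller supplied. I believe `urllib2` in Python 2 normalised names the same way, but that is an assumption to verify against the version Ibid runs on. The sketch below uses Python 3, and the helper name `ensure_user_agent` is made up for illustration.

```python
from urllib.request import Request


def ensure_user_agent(req, default):
    """Add a default User-Agent unless the caller already supplied one."""
    # Header names are stored capitalised ('User-agent'), whatever case
    # the caller used, so that is the key to test for.
    if not req.has_header('User-agent'):
        req.add_header('User-Agent', default)
    return req


# Building a Request does not touch the network.
custom = ensure_user_agent(
    Request('http://example.com/', headers={'user-agent': 'Mozilla/5.0'}),
    'Ibid/dev')
plain = ensure_user_agent(Request('http://example.com/'), 'Ibid/dev')
```

With the capitalised key, a caller-supplied agent string is kept and only requests without one get the default.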
