Merge lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops into lp:launchpad

Proposed by Edwin Grubbs
Status: Merged
Approved by: Aaron Bentley
Approved revision: no longer in the source branch.
Merged at revision: 11428
Proposed branch: lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops
Merge into: lp:launchpad
Diff against target: 204 lines (+71/-44)
3 files modified
lib/canonical/encoding.py (+45/-31)
lib/canonical/launchpad/xmlrpc/mailinglist.py (+11/-1)
lib/lp/registry/doc/message-holds-xmlrpc.txt (+15/-12)
To merge this branch: bzr merge lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops
Reviewer Review Type Date Requested Status
Aaron Bentley (community) Approve
Review via email: mp+33428@code.launchpad.net

Description of the change

Summary
-------

This branch fixes an oops caused by nonascii characters in an email
preventing a str from being converted to a unicode object. Normally,
this means the message is spam, but since we are not absolutely certain
that will be the case, we will just escape the offending characters and
let the mailing list manager review the email in Launchpad.

Tests
-----

./bin/test -vv -t canonical.encoding -t message-holds-xmlrpc.txt

To post a comment you must log in.
Revision history for this message
Aaron Bentley (abentley) wrote :

This needs some work, because 8-bit characters are only illegal in headers. (It is legal, though probably inadvisable, to use Content-transfer-encoding: 8bit or binary for message bodies.)

review: Needs Fixing
Revision history for this message
Edwin Grubbs (edwin-grubbs) wrote :

Hi Aaron,

I've fixed it to only escape the headers. I tried using message_from_string(), but the Message object requires that each header be deleted and then re-added, otherwise it would just create a duplicate header, and I started to worry about corner cases such as duplicate headers, which the method I was using didn't seem to take into account.

-Edwin

=== modified file 'lib/canonical/launchpad/xmlrpc/mailinglist.py'
--- lib/canonical/launchpad/xmlrpc/mailinglist.py 2010-08-23 17:43:51 +0000
+++ lib/canonical/launchpad/xmlrpc/mailinglist.py 2010-08-23 22:03:50 +0000
@@ -8,6 +8,7 @@
     'MailingListAPIView',
     ]

+import re
 import xmlrpclib

 from zope.component import getUtility
@@ -233,10 +234,15 @@
         # though it's much more convenient to just pass 8-bit strings.
         if isinstance(bytes, xmlrpclib.Binary):
             bytes = bytes.data
- # Although it is illegal for an email to have unencoded non-ascii
- # characters, it is better to let the list owner process the
- # message than to cause an oops.
- bytes = escape_nonascii_uniquely(bytes)
+ # Although it is illegal for an email header to have unencoded
+ # non-ascii characters, it is better to let the list owner
+ # process the message than to cause an oops.
+ header_body_separator = re.compile('\r\n\r\n|\r\r|\n\n')
+ match = header_body_separator.search(bytes)
+ header = bytes[:match.start()]
+ header = escape_nonascii_uniquely(header)
+ bytes = header + bytes[match.start():]
+
         mailing_list = getUtility(IMailingListSet).get(team_name)
         message = getUtility(IMessageSet).fromEmail(bytes)
         mailing_list.holdMessage(message)

=== modified file 'lib/lp/registry/doc/message-holds-xmlrpc.txt'
--- lib/lp/registry/doc/message-holds-xmlrpc.txt 2010-08-23 17:50:11 +0000
+++ lib/lp/registry/doc/message-holds-xmlrpc.txt 2010-08-23 22:06:19 +0000
@@ -226,7 +226,7 @@
 Non-ascii messages
 ==================

-Messages with non-ascii in their headers or bodies are not exactly legal
+Messages with non-ascii in their headers are not exactly legal
 (they should be encoded) but do occur especially in spam. These
 messages can be held for moderator approval too. To avoid blowing up
 later if the string is converted to a unicode object, the non-ascii
@@ -239,8 +239,7 @@
     ... Message-ID: <fifth-post\xa9>
     ... Date: Fri, 01 Aug 2000 01:08:59 -0000
     ...
- ... Watch out for badgers! \xa9
- ... Don't double quote characters: =E3=F6=FC
+ ... Don't escape non-ascii characters in the body! \xa9
     ... """)

     >>> import xmlrpclib
@@ -269,8 +268,7 @@
      'Message-ID: <fifth-post\\xa9>',
      'Date: Fri, 01 Aug 2000 01:08:59 -0000',
      '',
- 'Watch out for badgers! \\xa9',
- "Don't double quote characters: =E3=F6=FC"]
+ "Don't escape non-ascii characters in the body! \xa9"]

     >>> held_message_spam.status
      <DBItem PostedMessageStatus.NEW, (0) New status>

Revision history for this message
Aaron Bentley (abentley) wrote :

Thanks for your fixes.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'lib/canonical/encoding.py'
--- lib/canonical/encoding.py 2009-06-25 05:30:52 +0000
+++ lib/canonical/encoding.py 2010-08-24 16:07:48 +0000
@@ -4,14 +4,17 @@
4"""Character encoding utilities"""4"""Character encoding utilities"""
55
6__metaclass__ = type6__metaclass__ = type
7__all__ = [
8 'ascii_smash',
9 'escape_nonascii_uniquely',
10 'guess',
11 ]
12
7import re13import re
8import codecs14import codecs
9import unicodedata15import unicodedata
10from htmlentitydefs import codepoint2name
11from cStringIO import StringIO16from cStringIO import StringIO
1217
13__all__ = ['guess', 'ascii_smash']
14
15_boms = [18_boms = [
16 (codecs.BOM_UTF16_BE, 'utf_16_be'),19 (codecs.BOM_UTF16_BE, 'utf_16_be'),
17 (codecs.BOM_UTF16_LE, 'utf_16_le'),20 (codecs.BOM_UTF16_LE, 'utf_16_le'),
@@ -151,33 +154,6 @@
151 return unicode(s, 'ISO-8859-1', 'replace')154 return unicode(s, 'ISO-8859-1', 'replace')
152155
153156
154# def unicode_to_unaccented_str(text):
155# """Converts a unicode string into an ascii-only str, converting accented
156# characters to their plain equivalents.
157#
158# >>> unicode_to_unaccented_str(u'')
159# ''
160# >>> unicode_to_unaccented_str(u'foo bar 123')
161# 'foo bar 123'
162# >>> unicode_to_unaccented_str(u'viva S\xe3o Carlos!')
163# 'viva Sao Carlos!'
164# """
165# assert isinstance(text, unicode)
166# L = []
167# for char in text:
168# charnum = ord(char)
169# codepoint = codepoint2name.get(charnum)
170# if codepoint is not None:
171# strchar = codepoint[0]
172# else:
173# try:
174# strchar = char.encode('ascii')
175# except UnicodeEncodeError:
176# strchar = ''
177# L.append(strchar)
178# return ''.join(L)
179
180
181def ascii_smash(unicode_string):157def ascii_smash(unicode_string):
182 """Attempt to convert the Unicode string, possibly containing accents,158 """Attempt to convert the Unicode string, possibly containing accents,
183 to an ASCII string.159 to an ASCII string.
@@ -370,6 +346,44 @@
370 if match is not None:346 if match is not None:
371 return match.group(1)347 return match.group(1)
372348
373 # Something we can"t represent. Return empty string.349 # Something we can't represent. Return empty string.
374 return ""350 return ""
375351
352
353def escape_nonascii_uniquely(bogus_string):
354 """Replace non-ascii characters with a hex representation.
355
356 This is mainly for preventing emails with invalid characters from causing
357 oopses. The nonascii characters could have been removed or just converted
358 to "?", but this provides some insight into what the bogus data was, and
359 it prevents the message-id from two unrelated emails matching because
360 all the nonascii characters have been replaced with the same ascii
361 character.
362
363 Unfortunately, all the strings below are actually part of this
364 function's docstring, so python processes the backslash once before
365 doctest, and then python processes it again when doctest runs the
366 test. This makes it confusing, since four backslashes will get
367 converted into a single ascii character.
368
369 >>> print len('\xa9'), len('\\xa9'), len('\\\\xa9')
370 1 1 4
371 >>> print escape_nonascii_uniquely('hello \xa9')
372 hello \\xa9
373 >>> print escape_nonascii_uniquely('hello \\xa9')
374 hello \\xa9
375
376 This string only has ascii characters, so escape_nonascii_uniquely()
377 actually has no effect.
378
379 >>> print escape_nonascii_uniquely('hello \\\\xa9')
380 hello \\xa9
381 """
382 nonascii_regex = re.compile(r'[\200-\377]')
383 # By encoding the invalid ascii with a backslash, x, and then the
384 # hex value, it makes it easy to decode it by pasting into a python
385 # interpreter. quopri() is not used, since that could caused the
386 # decoding of an email to fail.
387 def quote(match):
388 return '\\x%x' % ord(match.group(0))
389 return nonascii_regex.sub(quote, bogus_string)
376390
=== modified file 'lib/canonical/launchpad/xmlrpc/mailinglist.py'
--- lib/canonical/launchpad/xmlrpc/mailinglist.py 2010-08-20 20:31:18 +0000
+++ lib/canonical/launchpad/xmlrpc/mailinglist.py 2010-08-24 16:07:48 +0000
@@ -8,7 +8,7 @@
8 'MailingListAPIView',8 'MailingListAPIView',
9 ]9 ]
1010
1111import re
12import xmlrpclib12import xmlrpclib
1313
14from zope.component import getUtility14from zope.component import getUtility
@@ -16,6 +16,7 @@
16from zope.security.proxy import removeSecurityProxy16from zope.security.proxy import removeSecurityProxy
1717
18from canonical.config import config18from canonical.config import config
19from canonical.encoding import escape_nonascii_uniquely
19from canonical.launchpad.interfaces import (20from canonical.launchpad.interfaces import (
20 EmailAddressStatus,21 EmailAddressStatus,
21 IEmailAddressSet,22 IEmailAddressSet,
@@ -240,6 +241,15 @@
240 # though it's much more convenient to just pass 8-bit strings.241 # though it's much more convenient to just pass 8-bit strings.
241 if isinstance(bytes, xmlrpclib.Binary):242 if isinstance(bytes, xmlrpclib.Binary):
242 bytes = bytes.data243 bytes = bytes.data
244 # Although it is illegal for an email header to have unencoded
245 # non-ascii characters, it is better to let the list owner
246 # process the message than to cause an oops.
247 header_body_separator = re.compile('\r\n\r\n|\r\r|\n\n')
248 match = header_body_separator.search(bytes)
249 header = bytes[:match.start()]
250 header = escape_nonascii_uniquely(header)
251 bytes = header + bytes[match.start():]
252
243 mailing_list = getUtility(IMailingListSet).get(team_name)253 mailing_list = getUtility(IMailingListSet).get(team_name)
244 message = getUtility(IMessageSet).fromEmail(bytes)254 message = getUtility(IMessageSet).fromEmail(bytes)
245 mailing_list.holdMessage(message)255 mailing_list.holdMessage(message)
246256
=== modified file 'lib/lp/registry/doc/message-holds-xmlrpc.txt'
--- lib/lp/registry/doc/message-holds-xmlrpc.txt 2010-07-13 20:15:26 +0000
+++ lib/lp/registry/doc/message-holds-xmlrpc.txt 2010-08-24 16:07:48 +0000
@@ -226,18 +226,20 @@
226Non-ascii messages226Non-ascii messages
227==================227==================
228228
229Messages with non-ascii in their headers or bodies are not exactly legal (they229Messages with non-ascii in their headers are not exactly legal
230should be encoded) but do occur especially in spam. These messages can be230(they should be encoded) but do occur especially in spam. These
231held for moderator approval too.231messages can be held for moderator approval too. To avoid blowing up
232later if the string is converted to a unicode object, the non-ascii
233characters are replaced.
232234
233 >>> spam_message = message_from_string("""\235 >>> spam_message = message_from_string("""\
234 ... From: Anne \xa9 Person <anne.person@example.com>236 ... From: Anne \xa9 Person <anne.person@example.com>
235 ... To: team-one@lists.launchpad.dev237 ... To: team-one@lists.launchpad.dev
236 ... Subject: \xa9 Badgers!238 ... Subject: \xa9 Badgers!
237 ... Message-ID: <fifth-post>239 ... Message-ID: <fifth-post\xa9>
238 ... Date: Fri, 01 Aug 2000 01:08:59 -0000240 ... Date: Fri, 01 Aug 2000 01:08:59 -0000
239 ...241 ...
240 ... Watch out for badgers! \xa9242 ... Don't escape non-ascii characters in the body! \xa9
241 ... """)243 ... """)
242244
243 >>> import xmlrpclib245 >>> import xmlrpclib
@@ -247,9 +249,10 @@
247 True249 True
248 >>> commit()250 >>> commit()
249251
250 >>> held_message_spam = message_set.getMessageByMessageID('<fifth-post>')252 >>> held_message_spam = message_set.getMessageByMessageID(
253 ... '<fifth-post\\xa9>')
251 >>> print held_message_spam.message_id254 >>> print held_message_spam.message_id
252 <fifth-post>255 <fifth-post\xa9>
253 >>> print held_message_spam.posted_by.displayname256 >>> print held_message_spam.posted_by.displayname
254 Anne Person257 Anne Person
255258
@@ -258,14 +261,14 @@
258 ... message_content = held_message_spam.posted_message.read()261 ... message_content = held_message_spam.posted_message.read()
259 ... finally:262 ... finally:
260 ... held_message_spam.posted_message.close()263 ... held_message_spam.posted_message.close()
261 >>> message_content.splitlines()264 >>> print pretty(message_content.splitlines())
262 ['From: Anne \xa9 Person <anne.person@example.com>',265 ['From: Anne \\xa9 Person <anne.person@example.com>',
263 'To: team-one@lists.launchpad.dev',266 'To: team-one@lists.launchpad.dev',
264 'Subject: \xa9 Badgers!',267 'Subject: \\xa9 Badgers!',
265 'Message-ID: <fifth-post>',268 'Message-ID: <fifth-post\\xa9>',
266 'Date: Fri, 01 Aug 2000 01:08:59 -0000',269 'Date: Fri, 01 Aug 2000 01:08:59 -0000',
267 '',270 '',
268 'Watch out for badgers! \xa9']271 "Don't escape non-ascii characters in the body! \xa9"]
269272
270 >>> held_message_spam.status273 >>> held_message_spam.status
271 <DBItem PostedMessageStatus.NEW, (0) New status>274 <DBItem PostedMessageStatus.NEW, (0) New status>