Merge lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops into lp:launchpad

Proposed by Edwin Grubbs
Status: Merged
Approved by: Aaron Bentley
Approved revision: no longer in the source branch.
Merged at revision: 11428
Proposed branch: lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops
Merge into: lp:launchpad
Diff against target: 204 lines (+71/-44)
3 files modified
lib/canonical/encoding.py (+45/-31)
lib/canonical/launchpad/xmlrpc/mailinglist.py (+11/-1)
lib/lp/registry/doc/message-holds-xmlrpc.txt (+15/-12)
To merge this branch: bzr merge lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops
Reviewer Review Type Date Requested Status
Aaron Bentley (community) Approve
Review via email: mp+33428@code.launchpad.net

Description of the change

Summary
-------

This branch fixes an oops caused by nonascii characters in an email
preventing a str from being converted to a unicode object. Normally,
this means the message is spam, but since we are not absolutely certain
that will be the case, we will just escape the offending characters and
let the mailing list manager review the email in Launchpad.

Tests
-----

./bin/test -vv -t canonical.encoding -t message-holds-xmlrpc.txt

To post a comment you must log in.
Revision history for this message
Aaron Bentley (abentley) wrote :

This needs some work, because 8-bit characters are only illegal in headers. (It is legal, though probably inadvisable, to use Content-transfer-encoding: 8bit or binary for message bodies.)

review: Needs Fixing
Revision history for this message
Edwin Grubbs (edwin-grubbs) wrote :

Hi Aaron,

I've fixed it to only escape the headers. I tried using message_from_string(), but the Message object requires that each header be deleted and then re-added, otherwise it would just create a duplicate header, and I started to worry about corner cases such as duplicate headers, which the method I was using didn't seem to take into account.

-Edwin

=== modified file 'lib/canonical/launchpad/xmlrpc/mailinglist.py'
--- lib/canonical/launchpad/xmlrpc/mailinglist.py 2010-08-23 17:43:51 +0000
+++ lib/canonical/launchpad/xmlrpc/mailinglist.py 2010-08-23 22:03:50 +0000
@@ -8,6 +8,7 @@
     'MailingListAPIView',
     ]

+import re
 import xmlrpclib

 from zope.component import getUtility
@@ -233,10 +234,15 @@
         # though it's much more convenient to just pass 8-bit strings.
         if isinstance(bytes, xmlrpclib.Binary):
             bytes = bytes.data
- # Although it is illegal for an email to have unencoded non-ascii
- # characters, it is better to let the list owner process the
- # message than to cause an oops.
- bytes = escape_nonascii_uniquely(bytes)
+ # Although it is illegal for an email header to have unencoded
+ # non-ascii characters, it is better to let the list owner
+ # process the message than to cause an oops.
+ header_body_separator = re.compile('\r\n\r\n|\r\r|\n\n')
+ match = header_body_separator.search(bytes)
+ header = bytes[:match.start()]
+ header = escape_nonascii_uniquely(header)
+ bytes = header + bytes[match.start():]
+
         mailing_list = getUtility(IMailingListSet).get(team_name)
         message = getUtility(IMessageSet).fromEmail(bytes)
         mailing_list.holdMessage(message)

=== modified file 'lib/lp/registry/doc/message-holds-xmlrpc.txt'
--- lib/lp/registry/doc/message-holds-xmlrpc.txt 2010-08-23 17:50:11 +0000
+++ lib/lp/registry/doc/message-holds-xmlrpc.txt 2010-08-23 22:06:19 +0000
@@ -226,7 +226,7 @@
 Non-ascii messages
 ==================

-Messages with non-ascii in their headers or bodies are not exactly legal
+Messages with non-ascii in their headers are not exactly legal
 (they should be encoded) but do occur especially in spam. These
 messages can be held for moderator approval too. To avoid blowing up
 later if the string is converted to a unicode object, the non-ascii
@@ -239,8 +239,7 @@
     ... Message-ID: <fifth-post\xa9>
     ... Date: Fri, 01 Aug 2000 01:08:59 -0000
     ...
- ... Watch out for badgers! \xa9
- ... Don't double quote characters: =E3=F6=FC
+ ... Don't escape non-ascii characters in the body! \xa9
     ... """)

     >>> import xmlrpclib
@@ -269,8 +268,7 @@
      'Message-ID: <fifth-post\\xa9>',
      'Date: Fri, 01 Aug 2000 01:08:59 -0000',
      '',
- 'Watch out for badgers! \\xa9',
- "Don't double quote characters: =E3=F6=FC"]
+ "Don't escape non-ascii characters in the body! \xa9"]

     >>> held_message_spam.status
      <DBItem PostedMessageStatus.NEW, (0) New status>

Revision history for this message
Aaron Bentley (abentley) wrote :

Thanks for your fixes.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/canonical/encoding.py'
2--- lib/canonical/encoding.py 2009-06-25 05:30:52 +0000
3+++ lib/canonical/encoding.py 2010-08-24 16:07:48 +0000
4@@ -4,14 +4,17 @@
5 """Character encoding utilities"""
6
7 __metaclass__ = type
8+__all__ = [
9+ 'ascii_smash',
10+ 'escape_nonascii_uniquely',
11+ 'guess',
12+ ]
13+
14 import re
15 import codecs
16 import unicodedata
17-from htmlentitydefs import codepoint2name
18 from cStringIO import StringIO
19
20-__all__ = ['guess', 'ascii_smash']
21-
22 _boms = [
23 (codecs.BOM_UTF16_BE, 'utf_16_be'),
24 (codecs.BOM_UTF16_LE, 'utf_16_le'),
25@@ -151,33 +154,6 @@
26 return unicode(s, 'ISO-8859-1', 'replace')
27
28
29-# def unicode_to_unaccented_str(text):
30-# """Converts a unicode string into an ascii-only str, converting accented
31-# characters to their plain equivalents.
32-#
33-# >>> unicode_to_unaccented_str(u'')
34-# ''
35-# >>> unicode_to_unaccented_str(u'foo bar 123')
36-# 'foo bar 123'
37-# >>> unicode_to_unaccented_str(u'viva S\xe3o Carlos!')
38-# 'viva Sao Carlos!'
39-# """
40-# assert isinstance(text, unicode)
41-# L = []
42-# for char in text:
43-# charnum = ord(char)
44-# codepoint = codepoint2name.get(charnum)
45-# if codepoint is not None:
46-# strchar = codepoint[0]
47-# else:
48-# try:
49-# strchar = char.encode('ascii')
50-# except UnicodeEncodeError:
51-# strchar = ''
52-# L.append(strchar)
53-# return ''.join(L)
54-
55-
56 def ascii_smash(unicode_string):
57 """Attempt to convert the Unicode string, possibly containing accents,
58 to an ASCII string.
59@@ -370,6 +346,44 @@
60 if match is not None:
61 return match.group(1)
62
63- # Something we can"t represent. Return empty string.
64+ # Something we can't represent. Return empty string.
65 return ""
66
67+
68+def escape_nonascii_uniquely(bogus_string):
69+ """Replace non-ascii characters with a hex representation.
70+
71+ This is mainly for preventing emails with invalid characters from causing
72+ oopses. The nonascii characters could have been removed or just converted
73+ to "?", but this provides some insight into what the bogus data was, and
74+ it prevents the message-id from two unrelated emails matching because
75+ all the nonascii characters have been replaced with the same ascii
76+ character.
77+
78+ Unfortunately, all the strings below are actually part of this
79+ function's docstring, so python processes the backslash once before
80+ doctest, and then python processes it again when doctest runs the
81+ test. This makes it confusing, since four backslashes will get
82+ converted into a single ascii character.
83+
84+ >>> print len('\xa9'), len('\\xa9'), len('\\\\xa9')
85+ 1 1 4
86+ >>> print escape_nonascii_uniquely('hello \xa9')
87+ hello \\xa9
88+ >>> print escape_nonascii_uniquely('hello \\xa9')
89+ hello \\xa9
90+
91+ This string only has ascii characters, so escape_nonascii_uniquely()
92+ actually has no effect.
93+
94+ >>> print escape_nonascii_uniquely('hello \\\\xa9')
95+ hello \\xa9
96+ """
97+ nonascii_regex = re.compile(r'[\200-\377]')
98+ # By encoding the invalid ascii with a backslash, x, and then the
99+ # hex value, it makes it easy to decode it by pasting into a python
100+ # interpreter. quopri() is not used, since that could caused the
101+ # decoding of an email to fail.
102+ def quote(match):
103+ return '\\x%x' % ord(match.group(0))
104+ return nonascii_regex.sub(quote, bogus_string)
105
106=== modified file 'lib/canonical/launchpad/xmlrpc/mailinglist.py'
107--- lib/canonical/launchpad/xmlrpc/mailinglist.py 2010-08-20 20:31:18 +0000
108+++ lib/canonical/launchpad/xmlrpc/mailinglist.py 2010-08-24 16:07:48 +0000
109@@ -8,7 +8,7 @@
110 'MailingListAPIView',
111 ]
112
113-
114+import re
115 import xmlrpclib
116
117 from zope.component import getUtility
118@@ -16,6 +16,7 @@
119 from zope.security.proxy import removeSecurityProxy
120
121 from canonical.config import config
122+from canonical.encoding import escape_nonascii_uniquely
123 from canonical.launchpad.interfaces import (
124 EmailAddressStatus,
125 IEmailAddressSet,
126@@ -240,6 +241,15 @@
127 # though it's much more convenient to just pass 8-bit strings.
128 if isinstance(bytes, xmlrpclib.Binary):
129 bytes = bytes.data
130+ # Although it is illegal for an email header to have unencoded
131+ # non-ascii characters, it is better to let the list owner
132+ # process the message than to cause an oops.
133+ header_body_separator = re.compile('\r\n\r\n|\r\r|\n\n')
134+ match = header_body_separator.search(bytes)
135+ header = bytes[:match.start()]
136+ header = escape_nonascii_uniquely(header)
137+ bytes = header + bytes[match.start():]
138+
139 mailing_list = getUtility(IMailingListSet).get(team_name)
140 message = getUtility(IMessageSet).fromEmail(bytes)
141 mailing_list.holdMessage(message)
142
143=== modified file 'lib/lp/registry/doc/message-holds-xmlrpc.txt'
144--- lib/lp/registry/doc/message-holds-xmlrpc.txt 2010-07-13 20:15:26 +0000
145+++ lib/lp/registry/doc/message-holds-xmlrpc.txt 2010-08-24 16:07:48 +0000
146@@ -226,18 +226,20 @@
147 Non-ascii messages
148 ==================
149
150-Messages with non-ascii in their headers or bodies are not exactly legal (they
151-should be encoded) but do occur especially in spam. These messages can be
152-held for moderator approval too.
153+Messages with non-ascii in their headers are not exactly legal
154+(they should be encoded) but do occur especially in spam. These
155+messages can be held for moderator approval too. To avoid blowing up
156+later if the string is converted to a unicode object, the non-ascii
157+characters are replaced.
158
159 >>> spam_message = message_from_string("""\
160 ... From: Anne \xa9 Person <anne.person@example.com>
161 ... To: team-one@lists.launchpad.dev
162 ... Subject: \xa9 Badgers!
163- ... Message-ID: <fifth-post>
164+ ... Message-ID: <fifth-post\xa9>
165 ... Date: Fri, 01 Aug 2000 01:08:59 -0000
166 ...
167- ... Watch out for badgers! \xa9
168+ ... Don't escape non-ascii characters in the body! \xa9
169 ... """)
170
171 >>> import xmlrpclib
172@@ -247,9 +249,10 @@
173 True
174 >>> commit()
175
176- >>> held_message_spam = message_set.getMessageByMessageID('<fifth-post>')
177+ >>> held_message_spam = message_set.getMessageByMessageID(
178+ ... '<fifth-post\\xa9>')
179 >>> print held_message_spam.message_id
180- <fifth-post>
181+ <fifth-post\xa9>
182 >>> print held_message_spam.posted_by.displayname
183 Anne Person
184
185@@ -258,14 +261,14 @@
186 ... message_content = held_message_spam.posted_message.read()
187 ... finally:
188 ... held_message_spam.posted_message.close()
189- >>> message_content.splitlines()
190- ['From: Anne \xa9 Person <anne.person@example.com>',
191+ >>> print pretty(message_content.splitlines())
192+ ['From: Anne \\xa9 Person <anne.person@example.com>',
193 'To: team-one@lists.launchpad.dev',
194- 'Subject: \xa9 Badgers!',
195- 'Message-ID: <fifth-post>',
196+ 'Subject: \\xa9 Badgers!',
197+ 'Message-ID: <fifth-post\\xa9>',
198 'Date: Fri, 01 Aug 2000 01:08:59 -0000',
199 '',
200- 'Watch out for badgers! \xa9']
201+ "Don't escape non-ascii characters in the body! \xa9"]
202
203 >>> held_message_spam.status
204 <DBItem PostedMessageStatus.NEW, (0) New status>