Merge into devel : bug-615655-unicode-oops : Code : Launchpad itself

Status:

Merged

Approved by:

Aaron Bentley on 2010-08-24

Approved revision:

no longer in the source branch.

Merged at revision:

11428

Proposed branch:

lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops

Merge into:

lp:launchpad

Diff against target:

204 lines (+71/-44)

3 files modified

lib/canonical/encoding.py (+45/-31)
lib/canonical/launchpad/xmlrpc/mailinglist.py (+11/-1)
lib/lp/registry/doc/message-holds-xmlrpc.txt (+15/-12)

To merge this branch:

bzr merge lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops

High

Fix Released

Reviewer	Review Type	Date Requested	Status
Aaron Bentley (community)		2010-08-23	Approve on 2010-08-24
Review via email: mp+33428@code.launchpad.net

Description of the change

Summary
-------

This branch fixes an oops caused by non-ASCII characters in an email
that prevent a str from being converted to a unicode object. Such
characters usually mean the message is spam, but since we cannot be
certain of that, we escape the offending characters and let the
mailing list manager review the email in Launchpad.
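
A minimal sketch of the failure and of the escaping idea, written for Python 3 `bytes` rather than the Python 2 `str` that this branch targets. The function name mirrors `escape_nonascii_uniquely` from the diff, but the body is an illustrative reimplementation under that assumption, not the merged code.

```python
import re

# Any byte outside the 7-bit ASCII range.
NONASCII = re.compile(rb'[\x80-\xff]')


def escape_nonascii_uniquely(raw):
    """Replace each non-ASCII byte with a literal backslash-x-hex escape."""
    def quote(match):
        return b'\\x%x' % ord(match.group(0))
    return NONASCII.sub(quote, raw)


header = b'Subject: \xa9 Badgers!'

try:
    header.decode('ascii')
except UnicodeDecodeError:
    # This is the kind of conversion failure that surfaced as an oops.
    decodable = False
else:
    decodable = True

escaped = escape_nonascii_uniquely(header)
text = escaped.decode('ascii')  # succeeds once the byte is escaped
```

Because each distinct byte maps to a distinct escape, two message-ids that differ only in their non-ASCII bytes stay different after escaping, which replacing every such byte with "?" would not guarantee.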

Tests
-----

./bin/test -vv -t canonical.encoding -t message-holds-xmlrpc.txt

Aaron Bentley (abentley) wrote on 2010-08-23:

This needs some work, because 8-bit characters are only illegal in headers. (It is legal, though probably inadvisable, to use Content-transfer-encoding: 8bit or binary for message bodies.)

review: Needs Fixing
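
To illustrate the reviewer's point, here is a small sketch using the Python 3 standard-library `email` package (an assumption; the branch itself is Python 2). The sample message is made up for the example. It declares `Content-Transfer-Encoding: 8bit` and carries a raw 0xA9 byte in its body, which the parser hands back unchanged.

```python
from email import message_from_bytes

# Hypothetical message: ASCII-only headers, 8-bit body in ISO-8859-1.
raw = (
    b"From: anne.person@example.com\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Copyright \xa9 2010\r\n"
)

msg = message_from_bytes(raw)

# decode=True returns the body as bytes; for 8bit there is nothing to undo.
body_bytes = msg.get_payload(decode=True)

# The declared charset tells us how to turn those bytes into text.
body_text = body_bytes.decode(msg.get_content_charset())
```

Escaping such a body would alter content that is valid as sent, which is why only the header block needs treatment.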

Edwin Grubbs (edwin-grubbs) wrote on 2010-08-23:

Hi Aaron,

I've fixed it to escape only the headers. I tried using message_from_string(), but the Message object requires each header to be deleted and then re-added; otherwise it just creates a duplicate header. I also started to worry about corner cases such as messages that already contain duplicate headers, which the method I was using didn't seem to take into account.

-Edwin
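
The duplicate-header behaviour described above can be reproduced with the standard-library `email` package. This is a sketch in Python 3 with a made-up message, not code from the branch.

```python
from email import message_from_string

msg = message_from_string("Subject: first\n\nbody\n")

# Assigning to an existing header name appends a second header
# instead of replacing the first one.
msg['Subject'] = 'second'
after_assign = msg.get_all('Subject')

# Deleting removes every occurrence of the header, so the value
# added afterwards is the only one left.
del msg['Subject']
msg['Subject'] = 'third'
after_readd = msg.get_all('Subject')
```

The delete step also shows the corner case: if a message arrives with the same header twice, deleting and re-adding one value collapses them into a single header.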

=== modified file 'lib/canonical/launchpad/xmlrpc/mailinglist.py'
--- lib/canonical/launchpad/xmlrpc/mailinglist.py	2010-08-23 17:43:51 +0000
+++ lib/canonical/launchpad/xmlrpc/mailinglist.py	2010-08-23 22:03:50 +0000
@@ -8,6 +8,7 @@
     'MailingListAPIView',
     ]
 
+import re
 import xmlrpclib
 
 from zope.component import getUtility
@@ -233,10 +234,15 @@
         # though it's much more convenient to just pass 8-bit strings.
         if isinstance(bytes, xmlrpclib.Binary):
             bytes = bytes.data
-        # Although it is illegal for an email to have unencoded non-ascii
-        # characters, it is better to let the list owner process the
-        # message than to cause an oops.
-        bytes = escape_nonascii_uniquely(bytes)
+        # Although it is illegal for an email header to have unencoded
+        # non-ascii characters, it is better to let the list owner
+        # process the message than to cause an oops.
+        header_body_separator = re.compile('\r\n\r\n|\r\r|\n\n')
+        match = header_body_separator.search(bytes)
+        header = bytes[:match.start()]
+        header = escape_nonascii_uniquely(header)
+        bytes = header + bytes[match.start():]
+
         mailing_list = getUtility(IMailingListSet).get(team_name)
         message = getUtility(IMessageSet).fromEmail(bytes)
         mailing_list.holdMessage(message)

=== modified file 'lib/lp/registry/doc/message-holds-xmlrpc.txt'
--- lib/lp/registry/doc/message-holds-xmlrpc.txt	2010-08-23 17:50:11 +0000
+++ lib/lp/registry/doc/message-holds-xmlrpc.txt	2010-08-23 22:06:19 +0000
@@ -226,7 +226,7 @@
 Non-ascii messages
 ==================
 
-Messages with non-ascii in their headers or bodies are not exactly legal
+Messages with non-ascii in their headers are not exactly legal
 (they should be encoded) but do occur especially in spam.  These
 messages can be held for moderator approval too. To avoid blowing up
 later if the string is converted to a unicode object, the non-ascii
@@ -239,8 +239,7 @@
     ... Message-ID: <fifth-post\xa9>
     ... Date: Fri, 01 Aug 2000 01:08:59 -0000
     ...
-    ... Watch out for badgers! \xa9
-    ... Don't double quote characters: =E3=F6=FC
+    ... Don't escape non-ascii characters in the body! \xa9
     ... """)
 
     >>> import xmlrpclib
@@ -269,8 +268,7 @@
      'Message-ID: <fifth-post\\xa9>',
      'Date: Fri, 01 Aug 2000 01:08:59 -0000',
      '',
-     'Watch out for badgers! \\xa9',
-     "Don't double quote characters: =E3=F6=FC"]
+     "Don't escape non-ascii characters in the body! \xa9"]
 
     >>> held_message_spam.status
      <DBItem PostedMessageStatus.NEW, (0) New status>
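
The mailinglist.py hunk above calls `match.start()` unconditionally, so a message with no blank line between headers and body would raise `AttributeError` on `None`. The sketch below shows one way the split could be guarded. It is written for Python 3 `bytes`, `escape_headers_only` is a hypothetical helper name, and treating a separator-less message as all headers is an assumption, not something the branch specifies.

```python
import re

HEADER_BODY_SEPARATOR = re.compile(rb'\r\n\r\n|\r\r|\n\n')
NONASCII = re.compile(rb'[\x80-\xff]')


def escape_nonascii_uniquely(raw):
    """Illustrative stand-in for the function added in canonical.encoding."""
    return NONASCII.sub(lambda m: b'\\x%x' % ord(m.group(0)), raw)


def escape_headers_only(raw):
    """Escape non-ASCII bytes in the header block; leave the body as is."""
    match = HEADER_BODY_SEPARATOR.search(raw)
    if match is None:
        # Assumption: without a blank line the whole message is headers.
        return escape_nonascii_uniquely(raw)
    header = raw[:match.start()]
    # The separator and body are appended untouched.
    return escape_nonascii_uniquely(header) + raw[match.start():]


message = b'Subject: \xa9 Badgers!\n\nBody \xa9\n'
held = escape_headers_only(message)
```

Slicing at `match.start()` keeps the original separator bytes, so the line-ending style of the incoming message is preserved.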

Aaron Bentley (abentley) wrote on 2010-08-24:

Thanks for your fixes.

review: Approve

Preview Diff

=== modified file 'lib/canonical/encoding.py'
--- lib/canonical/encoding.py	2009-06-25 05:30:52 +0000
+++ lib/canonical/encoding.py	2010-08-24 16:07:48 +0000
@@ -4,14 +4,17 @@
 """Character encoding utilities"""
 __metaclass__ = type
+__all__ = [
+    'ascii_smash',
+    'escape_nonascii_uniquely',
+    'guess',
+    ]
+
 import re
 import codecs
 import unicodedata
-from htmlentitydefs import codepoint2name
 from cStringIO import StringIO
-__all__ = ['guess', 'ascii_smash']
-
 _boms = [
     (codecs.BOM_UTF16_BE, 'utf_16_be'),
     (codecs.BOM_UTF16_LE, 'utf_16_le'),
@@ -151,33 +154,6 @@
     return unicode(s, 'ISO-8859-1', 'replace')
-# def unicode_to_unaccented_str(text):
-#     """Converts a unicode string into an ascii-only str, converting accented
-#     characters to their plain equivalents.
-#
-#     >>> unicode_to_unaccented_str(u'')
-#     ''
-#     >>> unicode_to_unaccented_str(u'foo bar 123')
-#     'foo bar 123'
-#     >>> unicode_to_unaccented_str(u'viva S\xe3o Carlos!')
-#     'viva Sao Carlos!'
-#     """
-#     assert isinstance(text, unicode)
-#     L = []
-#     for char in text:
-#         charnum = ord(char)
-#         codepoint = codepoint2name.get(charnum)
-#         if codepoint is not None:
-#             strchar = codepoint[0]
-#         else:
-#             try:
-#                 strchar = char.encode('ascii')
-#             except UnicodeEncodeError:
-#                 strchar = ''
-#         L.append(strchar)
-#     return ''.join(L)
-
-
 def ascii_smash(unicode_string):
     """Attempt to convert the Unicode string, possibly containing accents,
     to an ASCII string.
@@ -370,6 +346,44 @@
     if match is not None:
         return match.group(1)
-    # Something we can"t represent. Return empty string.
+    # Something we can't represent. Return empty string.
     return ""
+
+def escape_nonascii_uniquely(bogus_string):
+    """Replace non-ascii characters with a hex representation.
+
+    This is mainly for preventing emails with invalid characters from causing
+    oopses. The nonascii characters could have been removed or just converted
+    to "?", but this provides some insight into what the bogus data was, and
+    it prevents the message-id from two unrelated emails matching because
+    all the nonascii characters have been replaced with the same ascii
+    character.
+
+    Unfortunately, all the strings below are actually part of this
+    function's docstring, so python processes the backslash once before
+    doctest, and then python processes it again when doctest runs the
+    test. This makes it confusing, since four backslashes will get
+    converted into a single ascii character.
+
+    >>> print len('\xa9'), len('\\xa9'), len('\\\\xa9')
+    1 1 4
+    >>> print escape_nonascii_uniquely('hello \xa9')
+    hello \\xa9
+    >>> print escape_nonascii_uniquely('hello \\xa9')
+    hello \\xa9
+
+    This string only has ascii characters, so escape_nonascii_uniquely()
+    actually has no effect.
+
+    >>> print escape_nonascii_uniquely('hello \\\\xa9')
+    hello \\xa9
+    """
+    nonascii_regex = re.compile(r'[\200-\377]')
+    # By encoding the invalid ascii with a backslash, x, and then the
+    # hex value, it makes it easy to decode it by pasting into a python
+    # interpreter. quopri() is not used, since that could caused the
+    # decoding of an email to fail.
+    def quote(match):
+        return '\\x%x' % ord(match.group(0))
+    return nonascii_regex.sub(quote, bogus_string)
=== modified file 'lib/canonical/launchpad/xmlrpc/mailinglist.py'
--- lib/canonical/launchpad/xmlrpc/mailinglist.py	2010-08-20 20:31:18 +0000
+++ lib/canonical/launchpad/xmlrpc/mailinglist.py	2010-08-24 16:07:48 +0000
@@ -8,7 +8,7 @@
     'MailingListAPIView',
     ]
-
+import re
 import xmlrpclib
 from zope.component import getUtility
@@ -16,6 +16,7 @@
 from zope.security.proxy import removeSecurityProxy
 from canonical.config import config
+from canonical.encoding import escape_nonascii_uniquely
 from canonical.launchpad.interfaces import (
     EmailAddressStatus,
     IEmailAddressSet,
@@ -240,6 +241,15 @@
         # though it's much more convenient to just pass 8-bit strings.
         if isinstance(bytes, xmlrpclib.Binary):
             bytes = bytes.data
+        # Although it is illegal for an email header to have unencoded
+        # non-ascii characters, it is better to let the list owner
+        # process the message than to cause an oops.
+        header_body_separator = re.compile('\r\n\r\n|\r\r|\n\n')
+        match = header_body_separator.search(bytes)
+        header = bytes[:match.start()]
+        header = escape_nonascii_uniquely(header)
+        bytes = header + bytes[match.start():]
+
         mailing_list = getUtility(IMailingListSet).get(team_name)
         message = getUtility(IMessageSet).fromEmail(bytes)
         mailing_list.holdMessage(message)
=== modified file 'lib/lp/registry/doc/message-holds-xmlrpc.txt'
--- lib/lp/registry/doc/message-holds-xmlrpc.txt	2010-07-13 20:15:26 +0000
+++ lib/lp/registry/doc/message-holds-xmlrpc.txt	2010-08-24 16:07:48 +0000
@@ -226,18 +226,20 @@
 Non-ascii messages
 ==================
-Messages with non-ascii in their headers or bodies are not exactly legal (they
-should be encoded) but do occur especially in spam.  These messages can be
-held for moderator approval too.
+Messages with non-ascii in their headers are not exactly legal
+(they should be encoded) but do occur especially in spam.  These
+messages can be held for moderator approval too. To avoid blowing up
+later if the string is converted to a unicode object, the non-ascii
+characters are replaced.
     >>> spam_message = message_from_string("""\
     ... From: Anne \xa9 Person <anne.person@example.com>
     ... To: team-one@lists.launchpad.dev
     ... Subject: \xa9 Badgers!
-    ... Message-ID: <fifth-post>
+    ... Message-ID: <fifth-post\xa9>
     ... Date: Fri, 01 Aug 2000 01:08:59 -0000
     ...
-    ... Watch out for badgers! \xa9
+    ... Don't escape non-ascii characters in the body! \xa9
     ... """)
     >>> import xmlrpclib
@@ -247,9 +249,10 @@
     True
     >>> commit()
-    >>> held_message_spam = message_set.getMessageByMessageID('<fifth-post>')
+    >>> held_message_spam = message_set.getMessageByMessageID(
+    ...     '<fifth-post\\xa9>')
     >>> print held_message_spam.message_id
-    <fifth-post>
+    <fifth-post\xa9>
     >>> print held_message_spam.posted_by.displayname
     Anne Person
@@ -258,14 +261,14 @@
     ...     message_content = held_message_spam.posted_message.read()
     ... finally:
     ...     held_message_spam.posted_message.close()
-    >>> message_content.splitlines()
-    ['From: Anne \xa9 Person <anne.person@example.com>',
+    >>> print pretty(message_content.splitlines())
+    ['From: Anne \\xa9 Person <anne.person@example.com>',
      'To: team-one@lists.launchpad.dev',
-     'Subject: \xa9 Badgers!',
-     'Message-ID: <fifth-post>',
+     'Subject: \\xa9 Badgers!',
+     'Message-ID: <fifth-post\\xa9>',
      'Date: Fri, 01 Aug 2000 01:08:59 -0000',
      '',
-     'Watch out for badgers! \xa9']
+     "Don't escape non-ascii characters in the body! \xa9"]
     >>> held_message_spam.status
      <DBItem PostedMessageStatus.NEW, (0) New status>
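
The docstring of `escape_nonascii_uniquely()` in the diff above notes that backslashes inside a doctest are processed twice: once when Python parses the docstring and again when doctest evaluates the example. The sketch below, in Python 3 with hypothetical function names, shows the effect and how a raw docstring removes one layer of escaping. It illustrates the mechanism only and is not a change made by this branch.

```python
import doctest


def plain_docstring():
    """Four backslashes in the source reach doctest as two.

    >>> len('\\\\xa9')
    4
    """


def raw_docstring():
    r"""A raw docstring reaches doctest exactly as written.

    >>> len('\\xa9')
    4
    """


def run_doctest(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(func.__doc__, {}, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted
```

Both examples evaluate the same four-character string (backslash, "x", "a", "9"); the raw docstring simply needs half as many backslashes to say so.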