Merge lp:~jameinel/bzr-builddeb/unicode-author-508251 into lp:bzr-builddeb
Status: Merged
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr-builddeb/unicode-author-508251
Merge into: lp:bzr-builddeb
Diff against target: 195 lines (+90/-8), 3 files modified:
  import_dsc.py (+2/-1)
  tests/test_util.py (+63/-0)
  util.py (+25/-7)
To merge this branch: bzr merge lp:~jameinel/bzr-builddeb/unicode-author-508251
Related bugs: (none)
Reviewer: Bzr-builddeb-hackers (Pending)
Review via email: mp+19662@code.launchpad.net
Commit message
Description of the change
John A Meinel (jameinel) wrote:
James Westby (james-w) wrote:
On Thu, 18 Feb 2010 22:17:12 -0000, John A Meinel <email address hidden> wrote:
> This is a basic fix for bug #508251. Specifically it:
Thanks.
> 1) Tries to decode using utf-8, if that fails it falls back to iso-8859-1. For now it also mutters the string it failed to decode. (might get a bit noisy, but it would let you know if there are issues with a given import.)
It will still cause failures if it can't be decoded in
iso-8859-1 either, is that what we want at this stage?
> 4) I also made sure to run this locally against 'gnome-panel', which was one of the failing imports. It has certainly gotten a lot farther, and I've checked that it has run into a few of these mixed-encoding sections. Note that this assumes that each changelog block uses a constant encoding (for the purposes of the commit message), but that actually seems reasonable, as dapper/debian/changelog switches back and forth from iso-8859-1 in some blocks to utf-8 in other blocks.
Thanks, I'll apply this once you tell me that this test didn't discover
any problems with the change (it obviously isn't blocked on any other
issues that might be found.)
Thanks,
James
John A Meinel (jameinel) wrote:
James Westby wrote:
> On Thu, 18 Feb 2010 22:17:12 -0000, John A Meinel <email address hidden> wrote:
>> This is a basic fix for bug #508251. Specifically it:
>
> Thanks.
>
>> 1) Tries to decode using utf-8, if that fails it falls back to iso-8859-1. For now it also mutters the string it failed to decode. (might get a bit noisy, but it would let you know if there are issues with a given import.)
>
> It will still cause failures if it can't be decoded in
> iso-8859-1 either, is that what we want at this stage?
iso-8859-1 can decode all possible 8-bit sequences. Possibly
incorrectly, but every byte value maps to some Unicode code point under
iso-8859-1.
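This property is easy to verify directly. A small illustration (Python 3 syntax for convenience; the branch itself is Python 2):

```python
# ISO-8859-1 (latin-1) assigns a code point to every one of the 256
# possible byte values, so decoding with it can never raise
# UnicodeDecodeError -- which is what makes it a safe fallback.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")
assert len(decoded) == 256
# Each byte value N maps directly to code point U+00N.
assert all(ord(ch) == i for i, ch in enumerate(decoded))
```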
>
>> 4) I also made sure to run this locally against 'gnome-panel', which was one of the failing imports. It has certainly gotten a lot farther, and I've checked that it has run into a few of these mixed-encoding sections. Note that this assumes that each changelog block uses a constant encoding (for the purposes of the commit message), but that actually seems reasonable, as dapper/debian/changelog switches back and forth from iso-8859-1 in some blocks to utf-8 in other blocks.
>
> Thanks, I'll apply this once you tell me that this test didn't discover
> any problems with the change (it obviously isn't blocked on any other
> issues that might be found.)
>
> Thanks,
>
> James
>
The import succeeded. I don't have a way to tell the fidelity of the
result, etc.
I'm slightly concerned that a new import will give different results to
an old import (based on now finding an author that wasn't found before).
But I don't think the import system uses deterministic ids, so it should
be fine.
John
=:->
James Westby (james-w) wrote:
On Fri, 19 Feb 2010 08:40:47 -0600, John Arbash Meinel <email address hidden> wrote:
> iso-8859-1 can decode all possible 8-bit sequences. Possibly
> incorrectly, but every byte value maps to some Unicode code point under
> iso-8859-1.
Ok, so we'll have nonsense on occasion, but no failures. Sounds
reasonable to me.
> The import succeeded. I don't have a way to tell the fidelity of the
> result, etc.
> I'm slightly concerned that a new import will give different results to
> an old import (based on now finding an author that wasn't found before).
> But I don't think the import system uses deterministic ids, so it should
> be fine.
It doesn't use deterministic ids. I don't think it matters much that
there will be differences, there's not a lot we can do about that.
Thanks,
James
James Westby (james-w) wrote:
Oh, I'll merge this once I've finished the change I'm currently working
on.
Thanks,
James
Preview Diff
=== modified file 'import_dsc.py'
--- import_dsc.py	2010-02-12 19:58:29 +0000
+++ import_dsc.py	2010-02-18 22:17:11 +0000
@@ -76,6 +76,7 @@
     get_snapshot_revision,
     open_file_via_transport,
     open_transport,
+    safe_decode,
     subprocess_setup,
     )

@@ -1251,7 +1252,7 @@
         time_tuple = rfc822.parsedate_tz(raw_timestamp)
         if time_tuple is not None:
             timestamp = (time.mktime(time_tuple[:9]), time_tuple[9])
-        author = cl.author.decode("utf-8")
+        author = safe_decode(cl.author)
         versions = self._get_safe_versions_from_changelog(cl)
         assert not self.has_version(version), \
             "Trying to import version %s again" % str(version)

=== modified file 'tests/test_util.py'
--- tests/test_util.py	2010-02-12 16:41:15 +0000
+++ tests/test_util.py	2010-02-18 22:17:11 +0000
@@ -45,6 +45,7 @@
     move_file_if_different,
     get_parent_dir,
     recursive_copy,
+    safe_decode,
     strip_changelog_message,
     suite_to_distribution,
     tarball_name,
@@ -84,6 +85,19 @@
         self.failUnlessExists('a/f')


+class SafeDecodeTests(TestCase):
+
+    def assertSafeDecode(self, expected, val):
+        self.assertEqual(expected, safe_decode(val))
+
+    def test_utf8(self):
+        self.assertSafeDecode(u'ascii', 'ascii')
+        self.assertSafeDecode(u'\xe7', '\xc3\xa7')
+
+    def test_iso_8859_1(self):
+        self.assertSafeDecode(u'\xe7', '\xe7')
+
+
 cl_block1 = """\
 bzr-builddeb (0.17) unstable; urgency=low

@@ -467,6 +481,22 @@
         self.assertEqual([u"A. Hacker", u"B. Hacker"], authors)
         self.assertEqual([unicode]*len(authors), map(type, authors))

+    def test_find_extra_authors_utf8(self):
+        changes = [" * Do foo", "", " [ \xc3\xa1. Hacker ]", " * Do bar", "",
+                   " [ \xc3\xa7. Hacker ]", " [ A. Hacker}"]
+        authors = find_extra_authors(changes)
+        self.assertEqual([u"\xe1. Hacker", u"\xe7. Hacker"], authors)
+        self.assertEqual([unicode]*len(authors), map(type, authors))
+
+    def test_find_extra_authors_iso_8859_1(self):
+        # We try to treat lines as utf-8, but if that fails to decode, we fall
+        # back to iso-8859-1
+        changes = [" * Do foo", "", " [ \xe1. Hacker ]", " * Do bar", "",
+                   " [ \xe7. Hacker ]", " [ A. Hacker}"]
+        authors = find_extra_authors(changes)
+        self.assertEqual([u"\xe1. Hacker", u"\xe7. Hacker"], authors)
+        self.assertEqual([unicode]*len(authors), map(type, authors))
+
     def test_find_extra_authors_no_changes(self):
         authors = find_extra_authors([])
         self.assertEqual([], authors)
@@ -504,6 +534,8 @@
         self.assert_thanks_is(changes, [u"A. Hacker <ahacker@example.com>"])
         changes = [" * Thanks to Adeodato Sim\xc3\x83\xc2\xb3"]
         self.assert_thanks_is(changes, [u"Adeodato Sim\xc3\xb3"])
+        changes = [" * Thanks to \xc3\x81deodato Sim\xc3\x83\xc2\xb3"]
+        self.assert_thanks_is(changes, [u"\xc1deodato Sim\xc3\xb3"])

     def test_find_bugs_fixed_no_changes(self):
         self.assertEqual([], find_bugs_fixed([], None, _lplib=MockLaunchpad()))
@@ -582,6 +614,37 @@
         self.assertEqual(find_bugs_fixed(changes, wt.branch,
             _lplib=MockLaunchpad()), bugs)

+    def assertUnicodeCommitInfo(self, changes):
+        wt = self.make_branch_and_tree(".")
+        changelog = Changelog()
+        author = "J. Maintainer <maint@example.com>"
+        changelog.new_block(changes=changes, author=author)
+        message, authors, thanks, bugs = \
+            get_commit_info_from_changelog(changelog, wt.branch,
+                _lplib=MockLaunchpad())
+        self.assertEqual(u'[ \xc1. Hacker ]\n'
+                         u'* First ch\xe1nge, LP: #12345\n'
+                         u'* Second change, thanks to \xde. Hacker',
+                         message)
+        self.assertEqual([author, u'\xc1. Hacker'], authors)
+        self.assertEqual(unicode, type(authors[0]))
+        self.assertEqual([u'\xde. Hacker'], thanks)
+        self.assertEqual(['https://launchpad.net/bugs/12345 fixed'], bugs)
+
+    def test_get_commit_info_utf8(self):
+        changes = [" [ \xc3\x81. Hacker ]",
+                   " * First ch\xc3\xa1nge, LP: #12345",
+                   " * Second change, thanks to \xc3\x9e. Hacker"]
+        self.assertUnicodeCommitInfo(changes)
+
+    def test_get_commit_info_iso_8859_1(self):
+        # Changelogs aren't always well-formed UTF-8, so we fall back to
+        # iso-8859-1 if we fail to decode utf-8.
+        changes = [" [ \xc1. Hacker ]",
+                   " * First ch\xe1nge, LP: #12345",
+                   " * Second change, thanks to \xde. Hacker"]
+        self.assertUnicodeCommitInfo(changes)
+

 class MockLaunchpad(object):


=== modified file 'util.py'
--- util.py	2010-02-10 13:43:44 +0000
+++ util.py	2010-02-18 22:17:11 +0000
@@ -56,6 +56,24 @@
     )


+def safe_decode(s):
+    """Decode a string into a Unicode value."""
+    if isinstance(s, unicode):  # Already unicode
+        mutter('safe_decode() called on an already-unicode string: %r' % (s,))
+        return s
+    try:
+        return s.decode('utf-8')
+    except UnicodeDecodeError, e:
+        mutter('safe_decode(%r) falling back to iso-8859-1' % (s,))
+        # TODO: Looking at BeautifulSoup it seems to use 'chardet' to try to
+        #       guess the encoding of a given text stream. We might want to
+        #       take a closer look at that.
+        # TODO: Another possibility would be to make the fallback encoding
+        #       configurable, possibly exposed as a command-line flag, for now,
+        #       this seems 'good enough'.
+        return s.decode('iso-8859-1')
+
+
 def recursive_copy(fromdir, todir):
     """Copy the contents of fromdir to todir.

@@ -392,13 +410,13 @@


 def find_extra_authors(changes):
-    extra_author_re = re.compile(r"\s*\[([^\]]+)]\s*", re.UNICODE)
+    extra_author_re = re.compile(r"\s*\[([^\]]+)]\s*")
     authors = []
     for change in changes:
         # Parse out any extra authors.
-        match = extra_author_re.match(change.decode("utf-8"))
+        match = extra_author_re.match(change)
         if match is not None:
-            new_author = match.group(1).strip()
+            new_author = safe_decode(match.group(1).strip())
             already_included = False
             for author in authors:
                 if author.startswith(new_author):
@@ -411,11 +429,11 @@

 def find_thanks(changes):
     thanks_re = re.compile(r"[tT]hank(?:(?:s)|(?:you))(?:\s*to)?"
-            "((?:\s+(?:(?:[A-Z]\.)|(?:[A-Z]\w+(?:-[A-Z]\w+)*)))+"
+            "((?:\s+(?:(?:\w\.)|(?:\w+(?:-\w+)*)))+"
             "(?:\s+<[^@>]+@[^@>]+>)?)",
             re.UNICODE)
     thanks = []
-    changes_str = " ".join(changes).decode("utf-8")
+    changes_str = safe_decode(" ".join(changes))
     for match in thanks_re.finditer(changes_str):
         if thanks is None:
             thanks = []
@@ -446,12 +464,12 @@
     bugs = []
     if changelog._blocks:
         block = changelog._blocks[0]
-        authors = [block.author.decode("utf-8")]
+        authors = [safe_decode(block.author)]
         changes = strip_changelog_message(block.changes())
         authors += find_extra_authors(changes)
         bugs = find_bugs_fixed(changes, branch, _lplib=_lplib)
         thanks = find_thanks(changes)
-        message = "\n".join(changes).replace("\r", "")
+        message = safe_decode("\n".join(changes).replace("\r", ""))
     return (message, authors, thanks, bugs)
This is a basic fix for bug #508251. Specifically it:
1) Tries to decode using utf-8, if that fails it falls back to iso-8859-1. For now it also mutters the string it failed to decode. (might get a bit noisy, but it would let you know if there are issues with a given import.)
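The decode-with-fallback idea in point 1 can be sketched in a few lines. This is a Python 3 rendering for illustration only; the actual safe_decode() in the diff is Python 2 and additionally mutters the failing string via bzrlib's logging:

```python
def safe_decode(raw):
    """Decode raw bytes to text: try UTF-8 first, then fall back to
    ISO-8859-1, which accepts any byte sequence."""
    if isinstance(raw, str):
        return raw  # already decoded
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value is a valid ISO-8859-1 code point, so this
        # fallback cannot itself fail (though it may mis-map bytes that
        # were really in some other encoding).
        return raw.decode("iso-8859-1")

# Valid UTF-8 decodes as UTF-8; a lone 0xE7 byte is invalid UTF-8 and
# falls back, so both spellings of c-cedilla come out as U+00E7.
assert safe_decode(b"\xc3\xa7") == u"\xe7"
assert safe_decode(b"\xe7") == u"\xe7"
```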
2) Applies this to both author decoding *and* to the commit message. I think the author stuff hid the fact that the commit message was also broken. Basically, find_extra_authors decodes everything before bzr was going to get a chance at it. And bzr was always decoding 'message' as bzrlib.user_encoding, which I assume was always utf-8 for the import machine. Arguably it was succeeding 'by accident', rather than by design.
3) Changes 'find_thanks()' to allow names to start with a Unicode character, rather than requiring strictly A-Z. If you want, I can bring back "author[0].isupper()" or something like that. Looking at the regex, if I said "Thanks to my cat" it seems reasonable to have 'deb-thanks': ['my cat'] even though it wasn't "Mr Cat". The "Thanks to" and "thank you" seem to be a decent filter, without having to worry about the exact name. If you want this changed to something else, just let me know.
I can restore the original behavior and change the tests, but it seemed reasonable to allow non-ascii as the first letter of someone's name. Given this changelog entry:
- Translators: Vital Khilko (be), Vladimir Petkov (bg), Hendrik
Brandt (de), Kostas Papadimas (el), Adam Weinberger (en_CA), Francisco
Javier F. Serrador (es), Ilkka Tuohela (fi), Ignacio Casal Quinteiro
(gl), Ankit Patel (gu), Luca Ferretti (it), Takeshi AIHANA (ja),
Žygimantas Beručka (lt), Øivind Hoel (nb), Reinout van Schouwen (nl),
Øivind Hoel (no), Evandro Fernandes Giovanini (pt_BR), Слободан Д.
Средојевић (sr), Theppitak Karoonboonyanan (th), Clytie Siddall (vi),
Funda Wang (zh_CN)
At least 3 of those people have non-ascii first letters (Žygimantas, Øivind, etc)
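As a quick check of point 3, the relaxed pattern from the diff can be exercised against one of those names (Python 3 here, where \w matches Unicode word characters by default; under Python 2 the string being matched needs to be unicode for re.UNICODE to have the same effect):

```python
import re

# The relaxed pattern from the diff: \w instead of [A-Z], so a name may
# begin with any (Unicode) word character.
thanks_re = re.compile(
    r"[tT]hank(?:(?:s)|(?:you))(?:\s*to)?"
    r"((?:\s+(?:(?:\w\.)|(?:\w+(?:-\w+)*)))+"
    r"(?:\s+<[^@>]+@[^@>]+>)?)",
    re.UNICODE)

# A name starting with a non-ASCII capital now matches...
m = thanks_re.search(u"Thanks to \u017dygimantas Beru\u010dka")
assert m is not None
assert m.group(1).strip() == u"\u017dygimantas Beru\u010dka"

# ...and the plain ASCII case still works, email address included.
m = thanks_re.search(u"Thanks to A. Hacker <ahacker@example.com>")
assert m.group(1).strip() == u"A. Hacker <ahacker@example.com>"
```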
4) I also made sure to run this locally against 'gnome-panel', which was one of the failing imports. It has certainly gotten a lot farther, and I've checked that it has run into a few of these mixed-encoding sections. Note that this assumes that each changelog block uses a constant encoding (for the purposes of the commit message), but that actually seems reasonable, as dapper/debian/changelog switches back and forth from iso-8859-1 in some blocks to utf-8 in other blocks.