Merge lp:~jameinel/bzr-builddeb/unicode-author-508251 into lp:bzr-builddeb

Proposed by John A Meinel
Status: Merged
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr-builddeb/unicode-author-508251
Merge into: lp:bzr-builddeb
Diff against target: 195 lines (+90/-8)
3 files modified
import_dsc.py (+2/-1)
tests/test_util.py (+63/-0)
util.py (+25/-7)
To merge this branch: bzr merge lp:~jameinel/bzr-builddeb/unicode-author-508251
Reviewer                Review Type    Date Requested    Status
Bzr-builddeb-hackers                                     Pending
Review via email: mp+19662@code.launchpad.net
Revision history for this message
John A Meinel (jameinel) wrote :

This is a basic fix for bug #508251. Specifically it:

1) Tries to decode using utf-8; if that fails, it falls back to iso-8859-1 (sketched below). For now it also mutters the string it failed to decode. (This might get a bit noisy, but it would let you know if there are issues with a given import.)

2) Applies this to both author decoding *and* the commit message. I think the author handling hid the fact that the commit message was also broken. Basically, find_extra_authors decodes everything before bzr gets a chance at it, and bzr was always decoding 'message' as bzrlib.user_encoding, which I assume was always utf-8 on the import machine. Arguably it was succeeding 'by accident' rather than by design.

3) Changes 'find_thanks()' to allow names to start with a Unicode character, rather than requiring strictly A-Z (a small example follows below). If you want, I can bring back "author[0].isupper()" or something like that. Looking at the regex, if I said "Thanks to my cat" it seems reasonable to have 'deb-thanks': ['my cat'] even though it wasn't "Mr Cat". The "Thanks to" and "thank you" parts seem to be a decent filter without having to worry about the exact name. If you want this changed to something else, just let me know.
I can restore the original behavior and change the tests, but it seemed reasonable to allow non-ascii as the first letter of someone's name. Given this changelog entry:
    - Translators: Vital Khilko (be), Vladimir Petkov (bg), Hendrik
      Brandt (de), Kostas Papadimas (el), Adam Weinberger (en_CA), Francisco
      Javier F. Serrador (es), Ilkka Tuohela (fi), Ignacio Casal Quinteiro
      (gl), Ankit Patel (gu), Luca Ferretti (it), Takeshi AIHANA (ja),
      Žygimantas Beručka (lt), Øivind Hoel (nb), Reinout van Schouwen (nl),
      Øivind Hoel (no), Evandro Fernandes Giovanini (pt_BR), Слободан Д.
      Средојевић (sr), Theppitak Karoonboonyanan (th), Clytie Siddall (vi),
      Funda Wang (zh_CN)

At least 3 of those people have non-ascii first letters (Žygimantas, Øivind, etc.).

4) I also made sure to run this locally against 'gnome-panel', which was one of the failing imports. It has certainly gotten a lot farther, and I've checked that it has run into a few of these mixed-encoding sections. Note that this assumes that each changelog block uses a consistent encoding (for the purposes of the commit message), but that actually seems reasonable, as dapper/debian/changelog switches back and forth between iso-8859-1 in some blocks and utf-8 in others.
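
In short, the fallback in (1) boils down to the helper below. This is a trimmed sketch of the safe_decode() added in the diff at the end of this proposal; the real version also mutters whenever it is handed an already-unicode string or has to fall back:

    def safe_decode(s):
        """Decode a byte string, trying utf-8 first, then iso-8859-1."""
        if isinstance(s, unicode):
            # Nothing to do, the caller already gave us unicode.
            return s
        try:
            return s.decode('utf-8')
        except UnicodeDecodeError:
            # Any byte sequence decodes under iso-8859-1 (possibly to the
            # wrong characters), so this cannot raise.
            return s.decode('iso-8859-1')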
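
And as a small, self-contained illustration of (3), here is the relaxed find_thanks() pattern from the diff picking up a name whose first letter is non-ascii (Python 2, utf-8 source file):

    # -*- coding: utf-8 -*-
    import re

    # The pattern as changed in util.py: \w with re.UNICODE instead of [A-Z].
    thanks_re = re.compile(r"[tT]hank(?:(?:s)|(?:you))(?:\s*to)?"
                           "((?:\s+(?:(?:\w\.)|(?:\w+(?:-\w+)*)))+"
                           "(?:\s+<[^@>]+@[^@>]+>)?)",
                           re.UNICODE)

    changes_str = u" * Thanks to Žygimantas Beručka"
    match = thanks_re.search(changes_str)
    assert match.group(1).strip() == u"Žygimantas Beručka"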

Revision history for this message
James Westby (james-w) wrote :

On Thu, 18 Feb 2010 22:17:12 -0000, John A Meinel <email address hidden> wrote:
> This is a basic fix for bug #508251. Specifically it:

Thanks.

> 1) Tries to decode using utf-8, if that fails it falls back to iso-8859-1. For now it also mutters the string it failed to decode. (might get a bit noisy, but it would let you know if there are issues with a given import.)

It will still cause failures if it can't be decoded as
iso-8859-1 either; is that what we want at this stage?

> 4) I also made sure to run this locally against 'gnome-panel', which was one of the failing imports. It has certainly gotten a lot farther, and I've checked that it has run into a few of these mixed-encoding sections. Note that this assumes that each changelog block uses a consistent encoding (for the purposes of the commit message), but that actually seems reasonable, as dapper/debian/changelog switches back and forth between iso-8859-1 in some blocks and utf-8 in others.

Thanks. I'll apply this once you tell me that this test didn't discover
any problems with the change (it obviously isn't blocked on any other
issues that might be found).

Thanks,

James

Revision history for this message
John A Meinel (jameinel) wrote :

James Westby wrote:
> On Thu, 18 Feb 2010 22:17:12 -0000, John A Meinel <email address hidden> wrote:
>> This is a basic fix for bug #508251. Specifically it:
>
> Thanks.
>
>> 1) Tries to decode using utf-8, if that fails it falls back to iso-8859-1. For now it also mutters the string it failed to decode. (might get a bit noisy, but it would let you know if there are issues with a given import.)
>
> It will still cause failures if it can't be decoded in
> iso-8859-1 either, is that what we want at this stage?

iso-8859-1 can decode all possible 8-bit sequences. Possibly
incorrectly, but every byte value maps to a Unicode code point in iso-8859-1.
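
A quick illustration (Python 2, not part of the change itself):

    # Every byte value 0x00-0xFF maps straight to code point U+0000-U+00FF,
    # so decoding as iso-8859-1 can never raise UnicodeDecodeError.
    all_bytes = ''.join(chr(i) for i in range(256))
    assert all_bytes.decode('iso-8859-1') == u''.join(unichr(i) for i in range(256))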

>
> 4) I also made sure to run this locally against 'gnome-panel', which was one of the failing imports. It has certainly gotten a lot farther, and I've checked that it has run into a few of these mixed-encoding sections. Note that this assumes that each changelog block uses a consistent encoding (for the purposes of the commit message), but that actually seems reasonable, as dapper/debian/changelog switches back and forth between iso-8859-1 in some blocks and utf-8 in others.
>
> Thanks, I'll apply this once you tell me that this test didn't discover
> any problems with the change (it obviously isn't blocked on any other
> issues that might be found.)
>
> Thanks,
>
> James
>

The import succeeded. I don't have a way to tell the fidelity of the
result, etc.

I'm slightly concerned that a new import will give different results to
an old import (based on now finding an author that wasn't found before).
But I don't think the import system uses deterministic ids, so it should
be fine.

John
=:->

Revision history for this message
James Westby (james-w) wrote :

On Fri, 19 Feb 2010 08:40:47 -0600, John Arbash Meinel <email address hidden> wrote:
> iso-8859-1 can decode all possible 8-bit sequences. Possibly
> incorrectly, but every byte value maps to a Unicode code point in iso-8859-1.

Ok, so we'll have nonsense on occasion, but no failures. Sounds
reasonable to me.

> The import succeeded. I don't have a way to tell the fidelity of the
> result, etc.

> I'm slightly concerned that a new import will give different results to
> an old import (based on now finding an author that wasn't found before).
> But I don't think the import system uses deterministic ids, so it should
> be fine.

It doesn't use deterministic ids. I don't think it matters much that
there will be differences; there's not a lot we can do about that.

Thanks,

James

Revision history for this message
James Westby (james-w) wrote :

Oh, I'll merge this once I've finished the change I'm currently working
on.

Thanks,

James

Preview Diff

=== modified file 'import_dsc.py'
--- import_dsc.py 2010-02-12 19:58:29 +0000
+++ import_dsc.py 2010-02-18 22:17:11 +0000
@@ -76,6 +76,7 @@
     get_snapshot_revision,
     open_file_via_transport,
     open_transport,
+    safe_decode,
     subprocess_setup,
     )
 
@@ -1251,7 +1252,7 @@
         time_tuple = rfc822.parsedate_tz(raw_timestamp)
         if time_tuple is not None:
             timestamp = (time.mktime(time_tuple[:9]), time_tuple[9])
-        author = cl.author.decode("utf-8")
+        author = safe_decode(cl.author)
         versions = self._get_safe_versions_from_changelog(cl)
         assert not self.has_version(version), \
             "Trying to import version %s again" % str(version)
 
=== modified file 'tests/test_util.py'
--- tests/test_util.py 2010-02-12 16:41:15 +0000
+++ tests/test_util.py 2010-02-18 22:17:11 +0000
@@ -45,6 +45,7 @@
     move_file_if_different,
     get_parent_dir,
     recursive_copy,
+    safe_decode,
     strip_changelog_message,
     suite_to_distribution,
     tarball_name,
@@ -84,6 +85,19 @@
         self.failUnlessExists('a/f')
 
 
+class SafeDecodeTests(TestCase):
+
+    def assertSafeDecode(self, expected, val):
+        self.assertEqual(expected, safe_decode(val))
+
+    def test_utf8(self):
+        self.assertSafeDecode(u'ascii', 'ascii')
+        self.assertSafeDecode(u'\xe7', '\xc3\xa7')
+
+    def test_iso_8859_1(self):
+        self.assertSafeDecode(u'\xe7', '\xe7')
+
+
 cl_block1 = """\
 bzr-builddeb (0.17) unstable; urgency=low
 
@@ -467,6 +481,22 @@
         self.assertEqual([u"A. Hacker", u"B. Hacker"], authors)
         self.assertEqual([unicode]*len(authors), map(type, authors))
 
+    def test_find_extra_authors_utf8(self):
+        changes = [" * Do foo", "", " [ \xc3\xa1. Hacker ]", " * Do bar", "",
+                   " [ \xc3\xa7. Hacker ]", " [ A. Hacker}"]
+        authors = find_extra_authors(changes)
+        self.assertEqual([u"\xe1. Hacker", u"\xe7. Hacker"], authors)
+        self.assertEqual([unicode]*len(authors), map(type, authors))
+
+    def test_find_extra_authors_iso_8859_1(self):
+        # We try to treat lines as utf-8, but if that fails to decode, we fall
+        # back to iso-8859-1
+        changes = [" * Do foo", "", " [ \xe1. Hacker ]", " * Do bar", "",
+                   " [ \xe7. Hacker ]", " [ A. Hacker}"]
+        authors = find_extra_authors(changes)
+        self.assertEqual([u"\xe1. Hacker", u"\xe7. Hacker"], authors)
+        self.assertEqual([unicode]*len(authors), map(type, authors))
+
     def test_find_extra_authors_no_changes(self):
         authors = find_extra_authors([])
         self.assertEqual([], authors)
@@ -504,6 +534,8 @@
         self.assert_thanks_is(changes, [u"A. Hacker <ahacker@example.com>"])
         changes = [" * Thanks to Adeodato Sim\xc3\x83\xc2\xb3"]
         self.assert_thanks_is(changes, [u"Adeodato Sim\xc3\xb3"])
+        changes = [" * Thanks to \xc3\x81deodato Sim\xc3\x83\xc2\xb3"]
+        self.assert_thanks_is(changes, [u"\xc1deodato Sim\xc3\xb3"])
 
     def test_find_bugs_fixed_no_changes(self):
         self.assertEqual([], find_bugs_fixed([], None, _lplib=MockLaunchpad()))
@@ -582,6 +614,37 @@
         self.assertEqual(find_bugs_fixed(changes, wt.branch,
             _lplib=MockLaunchpad()), bugs)
 
+    def assertUnicodeCommitInfo(self, changes):
+        wt = self.make_branch_and_tree(".")
+        changelog = Changelog()
+        author = "J. Maintainer <maint@example.com>"
+        changelog.new_block(changes=changes, author=author)
+        message, authors, thanks, bugs = \
+            get_commit_info_from_changelog(changelog, wt.branch,
+                _lplib=MockLaunchpad())
+        self.assertEqual(u'[ \xc1. Hacker ]\n'
+                         u'* First ch\xe1nge, LP: #12345\n'
+                         u'* Second change, thanks to \xde. Hacker',
+                         message)
+        self.assertEqual([author, u'\xc1. Hacker'], authors)
+        self.assertEqual(unicode, type(authors[0]))
+        self.assertEqual([u'\xde. Hacker'], thanks)
+        self.assertEqual(['https://launchpad.net/bugs/12345 fixed'], bugs)
+
+    def test_get_commit_info_utf8(self):
+        changes = [" [ \xc3\x81. Hacker ]",
+                   " * First ch\xc3\xa1nge, LP: #12345",
+                   " * Second change, thanks to \xc3\x9e. Hacker"]
+        self.assertUnicodeCommitInfo(changes)
+
+    def test_get_commit_info_iso_8859_1(self):
+        # Changelogs aren't always well-formed UTF-8, so we fall back to
+        # iso-8859-1 if we fail to decode utf-8.
+        changes = [" [ \xc1. Hacker ]",
+                   " * First ch\xe1nge, LP: #12345",
+                   " * Second change, thanks to \xde. Hacker"]
+        self.assertUnicodeCommitInfo(changes)
+
 
 class MockLaunchpad(object):
 
=== modified file 'util.py'
--- util.py 2010-02-10 13:43:44 +0000
+++ util.py 2010-02-18 22:17:11 +0000
@@ -56,6 +56,24 @@
     )
 
 
+def safe_decode(s):
+    """Decode a string into a Unicode value."""
+    if isinstance(s, unicode): # Already unicode
+        mutter('safe_decode() called on an already-unicode string: %r' % (s,))
+        return s
+    try:
+        return s.decode('utf-8')
+    except UnicodeDecodeError, e:
+        mutter('safe_decode(%r) falling back to iso-8859-1' % (s,))
+        # TODO: Looking at BeautifulSoup it seems to use 'chardet' to try to
+        #       guess the encoding of a given text stream. We might want to
+        #       take a closer look at that.
+        # TODO: Another possibility would be to make the fallback encoding
+        #       configurable, possibly exposed as a command-line flag, for now,
+        #       this seems 'good enough'.
+        return s.decode('iso-8859-1')
+
+
 def recursive_copy(fromdir, todir):
     """Copy the contents of fromdir to todir.
 
@@ -392,13 +410,13 @@
 
 
 def find_extra_authors(changes):
-    extra_author_re = re.compile(r"\s*\[([^\]]+)]\s*", re.UNICODE)
+    extra_author_re = re.compile(r"\s*\[([^\]]+)]\s*")
     authors = []
     for change in changes:
         # Parse out any extra authors.
-        match = extra_author_re.match(change.decode("utf-8"))
+        match = extra_author_re.match(change)
         if match is not None:
-            new_author = match.group(1).strip()
+            new_author = safe_decode(match.group(1).strip())
             already_included = False
             for author in authors:
                 if author.startswith(new_author):
@@ -411,11 +429,11 @@
 
 def find_thanks(changes):
     thanks_re = re.compile(r"[tT]hank(?:(?:s)|(?:you))(?:\s*to)?"
-            "((?:\s+(?:(?:[A-Z]\.)|(?:[A-Z]\w+(?:-[A-Z]\w+)*)))+"
+            "((?:\s+(?:(?:\w\.)|(?:\w+(?:-\w+)*)))+"
             "(?:\s+<[^@>]+@[^@>]+>)?)",
             re.UNICODE)
     thanks = []
-    changes_str = " ".join(changes).decode("utf-8")
+    changes_str = safe_decode(" ".join(changes))
     for match in thanks_re.finditer(changes_str):
         if thanks is None:
             thanks = []
@@ -446,12 +464,12 @@
     bugs = []
     if changelog._blocks:
         block = changelog._blocks[0]
-        authors = [block.author.decode("utf-8")]
+        authors = [safe_decode(block.author)]
         changes = strip_changelog_message(block.changes())
         authors += find_extra_authors(changes)
         bugs = find_bugs_fixed(changes, branch, _lplib=_lplib)
         thanks = find_thanks(changes)
-        message = "\n".join(changes).replace("\r", "")
+        message = safe_decode("\n".join(changes).replace("\r", ""))
     return (message, authors, thanks, bugs)
 
 
