Launchpad itself

Merge lp:~henninge/launchpad/bug-506925-oops-export into lp:launchpad

bug-506925-oops-export
Merge into devel

Proposed by Henning Eggers on 2010-01-13

Status: Merged
Approved by: Данило Шеган on 2010-01-13
Approved revision: not available
Merged at revision: not available
Proposed branch: lp:~henninge/launchpad/bug-506925-oops-export
Merge into: lp:launchpad
Diff against target: 272 lines (+109/-81), 2 files modified

lib/lp/translations/utilities/gettext_po_exporter.py (+50/-54)
lib/lp/translations/utilities/tests/test_gettext_po_exporter.py (+59/-27)

To merge this branch:

bzr merge lp:~henninge/launchpad/bug-506925-oops-export

High

Fix Released

Reviewer	Review Type	Date Requested	Status
Данило Шеган (community)	code	2010-01-13	Approve on 2010-01-13
Review via email: mp+17304@code.launchpad.net

Commit message

Fixed oops in translations export code by properly encoding the whole file as utf-8 if needed.

Revision history for this message

Henning Eggers (henninge) wrote on 2010-01-13:

= Bug 506925 =

Each PO file has a standard header with information about the file. One of the fields in that header is "Last-Translator", which contains the name and email address of the last person who made a translation in this file. Upon export from Launchpad, this field is filled with the display_name (or title? who cares) and email address of the last translator's IPerson object. The name may contain non-ascii characters, as in https://launchpad.net/~danilo. The header also has a "Content-Type" field that declares the character encoding of the file. This bug surfaced when that field declared a charset that could not represent the translator's name (ideally the charset is utf-8).

This problem is not new: such clashes have happened with translations, too, which may also contain non-ascii characters. The export routine already knows how to deal with this and simply changes the encoding of the file to UTF-8. This behaviour simply needed to be extended to the header, i.e. the whole file.
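The failure mode can be reproduced outside Launchpad in a few lines. This is a minimal Python 3 sketch, not the exporter code: the header text, the email address and the `encode_header` helper are made up for illustration.

```python
# Hypothetical header text; only the non-ascii Last-Translator name matters.
HEADER = (
    'Last-Translator: Данило Шеган <danilo@example.com>\n'
    'Content-Type: text/plain; charset=%(charset)s\n'
)


def encode_header(charset):
    """Encode the header using the charset that the header itself declares."""
    return (HEADER % {'charset': charset}).encode(charset)


try:
    encode_header('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    # The declared charset cannot represent the translator's name.
    ascii_failed = True

# Declaring and using UTF-8 works for the same header.
utf8_header = encode_header('UTF-8')
```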

== Pre-imp notes ==

Talked with danilo about it and we agreed that it is ok to convert the whole file to UTF-8 if necessary. Most files are UTF-8 already.

== Implementation details ==

lib/lp/translations/utilities/gettext_po_exporter.py

* Encode the whole file at the end instead of each chunk separately and in different places.
* Did away with the need to re-encode all chunks if the encoding changes.
* The header is prepended at the end of the loop so that the encoding can still be changed at the last minute.
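The three points above can be sketched roughly as follows. This is Python 3 and only an illustration of the approach, not the code in the branch (which is Python 2); `make_header` and `encode_file` are hypothetical stand-ins for `_makeExportedHeader` and the export loop.

```python
import logging


def make_header(charset):
    """Hypothetical stand-in for _makeExportedHeader: returns text, not bytes."""
    return (
        'msgid ""\n'
        'msgstr ""\n'
        '"Last-Translator: Данило Шеган\\n"\n'
        '"Content-Type: text/plain; charset=%s\\n"' % charset
    )


def encode_file(charset, chunks):
    """Join the text chunks, prepend the header, and encode once at the end.

    Returns a tuple (encoded_bytes, charset_actually_used).
    """
    # Gettext .po files are supposed to end with a new line.
    body = '\n\n'.join(chunks) + '\n'
    try:
        return (make_header(charset) + '\n\n' + body).encode(charset), charset
    except UnicodeEncodeError:
        if charset.upper() == 'UTF-8':
            # Already UTF-8, nothing left to fall back to.
            raise
        logging.info(
            "Can't represent file as %s; using UTF-8 instead.", charset)
        # Rebuild the header so that it declares the new charset.
        return (make_header('UTF-8') + '\n\n' + body).encode('UTF-8'), 'UTF-8'


data, used = encode_file('ascii', ['msgid "a"\nmsgstr "b"'])
```

Because everything stays text until the single `encode` call, a charset change only means building the header again; no already-encoded chunk has to be decoded and re-encoded.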

lib/lp/translations/utilities/tests/test_gettext_po_exporter.py

* Added test for encoding of non-ascii characters in the header.
* Combined common code into a private method.
* Use named format specifiers because their order changes.
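A small sketch of why named specifiers help here (Python 3; the two shortened templates are illustrative, not the ones from the test file). In one template the charset placeholder comes first, in the other the special character does, yet the same mapping fills both.

```python
# Charset placeholder first, special character second.
BODY_TEMPLATE = (
    'charset=%(charset)s\n'
    'msgstr "%(special)s"\n'
)
# Special character first (inside the name), charset second.
HEADER_TEMPLATE = (
    'Last-Translator: Kubla K%(special)shn\n'
    'charset=%(charset)s\n'
)

values = {'charset': 'UTF-8', 'special': 'á'}
# One mapping serves both templates regardless of placeholder order.
filled = [template % values for template in (BODY_TEMPLATE, HEADER_TEMPLATE)]
```

With positional `%s` specifiers, each template would need its own argument tuple in the matching order.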

== Test ==

bin/test -vvct GettextPOExporterTestCase

== Demo/QA ==

* Import a file that has "ascii" as its Content-Type charset.
* Have Danilo make a translation using ascii characters.
* Export the file.
* The exported file should be UTF-8 now and Данило Шеган should appear as the Last-Translator.

= Launchpad lint =

Checking for conflicts and issues in doctests and templates.
Running jslint, xmllint, pyflakes, and pylint.
Using normal rules.

Linting changed files:
lib/lp/translations/utilities/gettext_po_exporter.py
lib/lp/translations/utilities/tests/test_gettext_po_exporter.py

Данило Шеган (danilo) wrote on 2010-01-13:

As discussed on IRC, I have some concerns with the _try_encode_file_content method: it has very weird return semantics. Basically, it returns the content encoded according to the translation_file header charset; if that fails, it returns None when the header charset wasn't UTF-8, and otherwise fails with an exception. Not a very reasonable method, I'd say :)

We've looked at several solutions. One was to switch the call site to do the exception handling and not handle exceptions in this method at all (the second attempt might still fail, and that exception would naturally propagate). However, this approach loses some debugging information that the existing code provides (the file name in the exception message).

We can also do just this "exception error message improvement" in _try_encode_file_content, i.e. catch the exception there and re-raise it unconditionally, instead of re-raising it only when the requested encoding was UTF-8 as we do now.
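For illustration, the second option could look roughly like this in Python 3. The helper name `encode_or_explain` and the path `po/sr.po` are hypothetical. Note that `UnicodeEncodeError` expects five constructor arguments, so the sketch passes the original ones through and only extends the reason.

```python
def encode_or_explain(text, charset, file_path):
    """Encode text; if that fails, re-raise with the file path in the reason."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError as error:
        # Reuse the original arguments and prepend the file path to the
        # reason, so the debugging information is kept unconditionally.
        raise UnicodeEncodeError(
            error.encoding, error.object, error.start, error.end,
            '%s: %s' % (file_path, error.reason))


try:
    encode_or_explain('Данило', 'ascii', 'po/sr.po')
    message = None
except UnicodeEncodeError as error:
    message = str(error)
```

The call site can then catch the exception, switch the header charset to UTF-8 and retry, while the helper itself either returns bytes or raises.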

review: Approve (code)

Preview Diff

 === modified file 'lib/lp/translations/utilities/gettext_po_exporter.py'
 --- lib/lp/translations/utilities/gettext_po_exporter.py	2009-11-18 13:26:03 +0000
 +++ lib/lp/translations/utilities/gettext_po_exporter.py	2010-01-15 06:56:16 +0000
@@ -277,13 +277,23 @@
          """See `ITranslationFormatExporter`."""
          return export_translation_message(translation_message)
--    def _makeExportedHeader(self, translation_file, charset=None):
++    def _makeExportedHeader(self, translation_file):
          """Transform the header information into a format suitable for export.
          :return: Unicode string containing the header.
          """
          raise NotImplementedError
++    def _encode_file_content(self, translation_file, exported_content):
++        """Try to encode the file using the charset given in the header."""
++        file_content = (
++            self._makeExportedHeader(translation_file) +
++            u'\n\n' +
++            exported_content)
++        encoded_file_content = file_content.encode(
++            translation_file.header.charset)
++        return encoded_file_content
++
      def exportTranslationFiles(self, translation_files, ignore_obsolete=False,
                                 force_utf8=False):
          """See `ITranslationFormatExporter`."""
@@ -310,10 +320,7 @@
                          translation_file.language_code,
                          file_extension))
--            if force_utf8:
--                translation_file.header.charset = 'UTF-8'
--            chunks = [self._makeExportedHeader(translation_file)]
--
++            chunks = []
              seen_keys = {}
              for message in translation_file.messages:
@@ -335,49 +342,42 @@
                  if (message.is_obsolete and
                      (ignore_obsolete or len(message.translations) == 0)):
                      continue
--                exported_message = self.exportTranslationMessageData(message)
--                try:
--                    encoded_text = exported_message.encode(
--                        translation_file.header.charset)
--                except UnicodeEncodeError, error:
--                    if translation_file.header.charset.upper() == 'UTF-8':
--                        # It's already UTF-8, we cannot do anything.
--                        raise UnicodeEncodeError(
--                            '%s:\n%s' % (file_path, str(error)))
--
--                    # This message cannot be represented in the current
--                    # encoding.
--                    if translation_file.path:
--                        file_description = translation_file.path
--                    elif translation_file.language_code:
--                        file_description = (
--                            "%s translation" % translation_file.language_code)
--                    else:
--                        file_description = "template"
--                    logging.info(
--                        "Can't represent %s as %s; using UTF-8 instead." % (
--                            file_description,
--                            translation_file.header.charset.upper()))
--
--                    old_charset = translation_file.header.charset
--                    translation_file.header.charset = 'UTF-8'
--                    # We need to update the header too.
--                    chunks[0] = self._makeExportedHeader(
--                        translation_file, old_charset)
--                    # Update already exported entries.
--                    for index, chunk in enumerate(chunks):
--                        chunks[index] = chunk.decode(
--                            old_charset).encode('UTF-8')
--                    encoded_text = exported_message.encode('UTF-8')
--
--                chunks.append(encoded_text)
--
--            exported_file_content = '\n\n'.join(chunks)
++                chunks.append(self.exportTranslationMessageData(message))
++
              # Gettext .po files are supposed to end with a new line.
--            exported_file_content += '\n'
--
--            storage.addFile(file_path, file_extension, exported_file_content)
++            exported_file_content = u'\n\n'.join(chunks) + u'\n'
++
++            # Try to encode the file
++            if force_utf8:
++                translation_file.header.charset = 'UTF-8'
++            try:
++                encoded_file_content = self._encode_file_content(
++                    translation_file, exported_file_content)
++            except UnicodeEncodeError:
++                if translation_file.header.charset.upper() == 'UTF-8':
++                    # It's already UTF-8, we cannot do anything.
++                    raise
++                # This file content cannot be represented in the current
++                # encoding.
++                if translation_file.path:
++                    file_description = translation_file.path
++                elif translation_file.language_code:
++                    file_description = (
++                        "%s translation" % translation_file.language_code)
++                else:
++                    file_description = "template"
++                logging.info(
++                    "Can't represent %s as %s; using UTF-8 instead." % (
++                        file_description,
++                        translation_file.header.charset.upper()))
++                # Use UTF-8 instead.
++                translation_file.header.charset = 'UTF-8'
++                # This either succeeds or raises UnicodeError.
++                encoded_file_content = self._encode_file_content(
++                    translation_file, exported_file_content)
++
++            storage.addFile(file_path, file_extension, encoded_file_content)
          return storage.export()
@@ -410,13 +410,11 @@
              TranslationFileFormat.PO,
              TranslationFileFormat.KDEPO]
--    def _makeExportedHeader(self, translation_file, charset=None):
++    def _makeExportedHeader(self, translation_file):
          """Create a standard gettext PO header, encoded as a message.
          :return: The header message as a unicode string.
          """
--        if charset is None:
--            charset = translation_file.header.charset
          header_translation_message = TranslationMessageData()
          header_translation_message.addTranslation(
              TranslationConstants.SINGULAR_FORM,
@@ -427,7 +425,7 @@
              header_translation_message.flags.update(['fuzzy'])
          exported_header = self.exportTranslationMessageData(
              header_translation_message)
--        return exported_header.encode(charset)
++        return exported_header
  class GettextPOChangedExporter(GettextPOExporterBase):
@@ -447,15 +445,13 @@
          self.format = TranslationFileFormat.POCHANGED
          self.supported_source_formats = []
--    def _makeExportedHeader(self, translation_file, charset=None):
++    def _makeExportedHeader(self, translation_file):
          """Create a header for changed PO files.
          This is a reduced header containing a warning that this is an
          icomplete gettext PO file.
          :return: The header as a unicode string.
          """
--        if charset is None:
--            charset = translation_file.header.charset
--        return self.exported_header.encode(charset)
++        return self.exported_header
      def acceptSingularClash(self, previous_message, current_message):
          """See `GettextPOExporterBase`."""
 === modified file 'lib/lp/translations/utilities/tests/test_gettext_po_exporter.py'
 --- lib/lp/translations/utilities/tests/test_gettext_po_exporter.py	2009-11-01 22:50:17 +0000
 +++ lib/lp/translations/utilities/tests/test_gettext_po_exporter.py	2010-01-15 06:56:16 +0000
@@ -263,33 +263,9 @@
          for pofile in pofiles:
              compare(self, pofile)
--
--    def testBrokenEncodingExport(self):
--        """Test what happens when the content and the encoding don't agree.
--
--        If a pofile fails to encode using the character set specified in the
--        header, the header should be changed to specify to UTF-8 and the
--        pofile exported accordingly.
--        """
--
--        pofile = dedent('''
--            msgid ""
--            msgstr ""
--            "Project-Id-Version: foo\\n"
--            "Report-Msgid-Bugs-To: \\n"
--            "POT-Creation-Date: 2007-07-09 03:39+0100\\n"
--            "PO-Revision-Date: 2001-09-09 01:46+0000\\n"
--            "Last-Translator: Kubla Kahn <kk@pleasure-dome.com>\\n"
--            "Language-Team: LANGUAGE <LL@li.org>\\n"
--            "MIME-Version: 1.0\\n"
--            "Content-Type: text/plain; charset=%s\\n"
--            "Content-Transfer-Encoding: 8bit\\n"
--
--            msgid "a"
--            msgstr "%s"
--            ''')
++    def _testBrokenEncoding(self, pofile_content):
          translation_file = self.parser.parse(
--            pofile % ('ISO-8859-15', '\xe1'))
++            pofile_content % {'charset': 'ISO-8859-15', 'special': '\xe1'})
          translation_file.is_template = False
          translation_file.language_code = 'es'
          translation_file.path = 'po/es.po'
@@ -302,9 +278,65 @@
              [translation_file])
          self._compareImportAndExport(
--            pofile.strip() % ('UTF-8', '\xc3\xa1'),
++            pofile_content.strip() % {
++                'charset': 'UTF-8', 'special': '\xc3\xa1'},
              exported_file.read().strip())
++    def testBrokenEncodingExport(self):
++        """Test what happens when the content and the encoding don't agree.
++
++        If a pofile fails to encode using the character set specified in the
++        header, the header should be changed to specify to UTF-8 and the
++        pofile exported accordingly.
++        """
++
++        pofile_content = dedent('''
++            msgid ""
++            msgstr ""
++            "Project-Id-Version: foo\\n"
++            "Report-Msgid-Bugs-To: \\n"
++            "POT-Creation-Date: 2007-07-09 03:39+0100\\n"
++            "PO-Revision-Date: 2001-09-09 01:46+0000\\n"
++            "Last-Translator: Kubla Kahn <kk@pleasure-dome.com>\\n"
++            "Language-Team: LANGUAGE <LL@li.org>\\n"
++            "MIME-Version: 1.0\\n"
++            "Content-Type: text/plain; charset=%(charset)s\\n"
++            "Content-Transfer-Encoding: 8bit\\n"
++
++            msgid "a"
++            msgstr "%(special)s"
++            ''')
++        self._testBrokenEncoding(pofile_content)
++
++    def testBrokenEncodingHeader(self):
++        """A header field might require a different encoding, too.
++
++        This usually happens if the Last-Translator name contains non-ascii
++        characters.
++
++        If a pofile fails to encode using the character set specified in the
++        header, the header should be changed to specify to UTF-8 and the
++        pofile exported accordingly.
++        """
++
++        pofile_content = dedent('''
++            msgid ""
++            msgstr ""
++            "Project-Id-Version: foo\\n"
++            "Report-Msgid-Bugs-To: \\n"
++            "POT-Creation-Date: 2007-07-09 03:39+0100\\n"
++            "PO-Revision-Date: 2001-09-09 01:46+0000\\n"
++            "Last-Translator: Kubla K%(special)shn <kk@pleasure-dome.com>\\n"
++            "Language-Team: LANGUAGE <LL@li.org>\\n"
++            "MIME-Version: 1.0\\n"
++            "Content-Type: text/plain; charset=%(charset)s\\n"
++            "Content-Transfer-Encoding: 8bit\\n"
++
++            msgid "a"
++            msgstr "b"
++            ''')
++        self._testBrokenEncoding(pofile_content)
++
      def testIncompletePluralMessage(self):
          """Test export correctness for partial plural messages."""
