Merge lp:~henninge/launchpad/bug-506925-oops-export into lp:launchpad

Proposed by Henning Eggers
Status: Merged
Approved by: Данило Шеган
Approved revision: not available
Merged at revision: not available
Proposed branch: lp:~henninge/launchpad/bug-506925-oops-export
Merge into: lp:launchpad
Diff against target: 272 lines (+109/-81)
2 files modified
lib/lp/translations/utilities/gettext_po_exporter.py (+50/-54)
lib/lp/translations/utilities/tests/test_gettext_po_exporter.py (+59/-27)
To merge this branch: bzr merge lp:~henninge/launchpad/bug-506925-oops-export
Reviewer Review Type Date Requested Status
Данило Шеган (community) code Approve
Review via email: mp+17304@code.launchpad.net

Commit message

Fixed oops in translations export code by properly encoding the whole file as utf-8 if needed.

To post a comment you must log in.
Revision history for this message
Henning Eggers (henninge) wrote :

= Bug 506925 =

Each PO file has a standard header with information about the file. One of the fields in that header is "Last-Translator" which contains the name and email address of the last person that did a translation in this file. Upon export from Launchpad, this field is filled with the display_name (or title? who cares) of the last translators IPerson object (and email address). This may contain non-ascii characters as in https://launchpad.net/~danilo. The header also has a "Content-Type" field that declares the character encoding of the file. This bug used to surface when that field did not contain a charset that could represent the translators name (ideally utf-8).

This problem is not new as such clashes have happened with translations, too, which also may contain non-ascii characters. The export routine alredy knows how to deal with this and simply changes the encoding of the file to UTF-8. This behaviour needed to simply be extended to the header, i.e. the whole file.

== Pre-imp notes ==

Talked with danilo about it and we agreed that it is ok to convert the whole file to UTF-8 if necessary. Most files are UTF-8 already.

== Implmentation details ==

lib/lp/translations/utilities/gettext_po_exporter.py

 * Encode the whole file at the end instead of each chunk seperately and in different places.
 * Did away with the need of having to re-encode all chunks if the encoding changes.
 * The header is prepended at the end or the loop so that the encoding can be changed at last-minute warning.

lib/lp/translations/utilities/tests/test_gettext_po_exporter.py

 * Added test for encoding of non-ascii characters in the header.
 * Combined common code into a private method.
 * Use named format specifiers because their order changes.

== Test ==

bin/test -vvct GettextPOExporterTestCase

== Demo/QA ==

 * Import a file that has "ascii" as it's Content-type.
 * Have Danilo make a translation using ascii characters.
 * Export the file.
 * The exported file should be UTF-8 now and Данило Шеган should appear as the Last-Translator.

= Launchpad lint =

Checking for conflicts. and issues in doctests and templates.
Running jslint, xmllint, pyflakes, and pylint.
Using normal rules.

Linting changed files:
  lib/lp/translations/utilities/gettext_po_exporter.py
  lib/lp/translations/utilities/tests/test_gettext_po_exporter.py

Revision history for this message
Данило Шеган (danilo) wrote :

As discussed on IRC, I have some concerns with _try_encode_file_content method: it has very weird return semantics. Basically, they are: return encoded content according to translation_file header charset, and if it fails, return None if translation_file header wasn't UTF-8, otherwise fail with an exception. Not a very reasonable method, I'd say :)

We've looked at several solutions to that, one was to just switch the callsite to do exception handling, and not handle exceptions in this method (where the second one might fail, and would naturally propagate), though this approach loses some debugging information existing code has (file name in the exception error).

We can also just do this "exception error message improvement" in the _try_encode_file_content (i.e. catch the exception there and re-raise it unconditionally, compared to what we do now where we re-raise it only if the requested encoding was UTF-8).

review: Approve (code)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'lib/lp/translations/utilities/gettext_po_exporter.py'
--- lib/lp/translations/utilities/gettext_po_exporter.py 2009-11-18 13:26:03 +0000
+++ lib/lp/translations/utilities/gettext_po_exporter.py 2010-01-15 06:56:16 +0000
@@ -277,13 +277,23 @@
277 """See `ITranslationFormatExporter`."""277 """See `ITranslationFormatExporter`."""
278 return export_translation_message(translation_message)278 return export_translation_message(translation_message)
279279
280 def _makeExportedHeader(self, translation_file, charset=None):280 def _makeExportedHeader(self, translation_file):
281 """Transform the header information into a format suitable for export.281 """Transform the header information into a format suitable for export.
282282
283 :return: Unicode string containing the header.283 :return: Unicode string containing the header.
284 """284 """
285 raise NotImplementedError285 raise NotImplementedError
286286
287 def _encode_file_content(self, translation_file, exported_content):
288 """Try to encode the file using the charset given in the header."""
289 file_content = (
290 self._makeExportedHeader(translation_file) +
291 u'\n\n' +
292 exported_content)
293 encoded_file_content = file_content.encode(
294 translation_file.header.charset)
295 return encoded_file_content
296
287 def exportTranslationFiles(self, translation_files, ignore_obsolete=False,297 def exportTranslationFiles(self, translation_files, ignore_obsolete=False,
288 force_utf8=False):298 force_utf8=False):
289 """See `ITranslationFormatExporter`."""299 """See `ITranslationFormatExporter`."""
@@ -310,10 +320,7 @@
310 translation_file.language_code,320 translation_file.language_code,
311 file_extension))321 file_extension))
312322
313 if force_utf8:323 chunks = []
314 translation_file.header.charset = 'UTF-8'
315 chunks = [self._makeExportedHeader(translation_file)]
316
317 seen_keys = {}324 seen_keys = {}
318325
319 for message in translation_file.messages:326 for message in translation_file.messages:
@@ -335,49 +342,42 @@
335 if (message.is_obsolete and342 if (message.is_obsolete and
336 (ignore_obsolete or len(message.translations) == 0)):343 (ignore_obsolete or len(message.translations) == 0)):
337 continue344 continue
338 exported_message = self.exportTranslationMessageData(message)345 chunks.append(self.exportTranslationMessageData(message))
339 try:346
340 encoded_text = exported_message.encode(
341 translation_file.header.charset)
342 except UnicodeEncodeError, error:
343 if translation_file.header.charset.upper() == 'UTF-8':
344 # It's already UTF-8, we cannot do anything.
345 raise UnicodeEncodeError(
346 '%s:\n%s' % (file_path, str(error)))
347
348 # This message cannot be represented in the current
349 # encoding.
350 if translation_file.path:
351 file_description = translation_file.path
352 elif translation_file.language_code:
353 file_description = (
354 "%s translation" % translation_file.language_code)
355 else:
356 file_description = "template"
357 logging.info(
358 "Can't represent %s as %s; using UTF-8 instead." % (
359 file_description,
360 translation_file.header.charset.upper()))
361
362 old_charset = translation_file.header.charset
363 translation_file.header.charset = 'UTF-8'
364 # We need to update the header too.
365 chunks[0] = self._makeExportedHeader(
366 translation_file, old_charset)
367 # Update already exported entries.
368 for index, chunk in enumerate(chunks):
369 chunks[index] = chunk.decode(
370 old_charset).encode('UTF-8')
371 encoded_text = exported_message.encode('UTF-8')
372
373 chunks.append(encoded_text)
374
375 exported_file_content = '\n\n'.join(chunks)
376347
377 # Gettext .po files are supposed to end with a new line.348 # Gettext .po files are supposed to end with a new line.
378 exported_file_content += '\n'349 exported_file_content = u'\n\n'.join(chunks) + u'\n'
379350
380 storage.addFile(file_path, file_extension, exported_file_content)351 # Try to encode the file
352 if force_utf8:
353 translation_file.header.charset = 'UTF-8'
354 try:
355 encoded_file_content = self._encode_file_content(
356 translation_file, exported_file_content)
357 except UnicodeEncodeError:
358 if translation_file.header.charset.upper() == 'UTF-8':
359 # It's already UTF-8, we cannot do anything.
360 raise
361 # This file content cannot be represented in the current
362 # encoding.
363 if translation_file.path:
364 file_description = translation_file.path
365 elif translation_file.language_code:
366 file_description = (
367 "%s translation" % translation_file.language_code)
368 else:
369 file_description = "template"
370 logging.info(
371 "Can't represent %s as %s; using UTF-8 instead." % (
372 file_description,
373 translation_file.header.charset.upper()))
374 # Use UTF-8 instead.
375 translation_file.header.charset = 'UTF-8'
376 # This either succeeds or raises UnicodeError.
377 encoded_file_content = self._encode_file_content(
378 translation_file, exported_file_content)
379
380 storage.addFile(file_path, file_extension, encoded_file_content)
381381
382 return storage.export()382 return storage.export()
383383
@@ -410,13 +410,11 @@
410 TranslationFileFormat.PO,410 TranslationFileFormat.PO,
411 TranslationFileFormat.KDEPO]411 TranslationFileFormat.KDEPO]
412412
413 def _makeExportedHeader(self, translation_file, charset=None):413 def _makeExportedHeader(self, translation_file):
414 """Create a standard gettext PO header, encoded as a message.414 """Create a standard gettext PO header, encoded as a message.
415415
416 :return: The header message as a unicode string.416 :return: The header message as a unicode string.
417 """417 """
418 if charset is None:
419 charset = translation_file.header.charset
420 header_translation_message = TranslationMessageData()418 header_translation_message = TranslationMessageData()
421 header_translation_message.addTranslation(419 header_translation_message.addTranslation(
422 TranslationConstants.SINGULAR_FORM,420 TranslationConstants.SINGULAR_FORM,
@@ -427,7 +425,7 @@
427 header_translation_message.flags.update(['fuzzy'])425 header_translation_message.flags.update(['fuzzy'])
428 exported_header = self.exportTranslationMessageData(426 exported_header = self.exportTranslationMessageData(
429 header_translation_message)427 header_translation_message)
430 return exported_header.encode(charset)428 return exported_header
431429
432430
433class GettextPOChangedExporter(GettextPOExporterBase):431class GettextPOChangedExporter(GettextPOExporterBase):
@@ -447,15 +445,13 @@
447 self.format = TranslationFileFormat.POCHANGED445 self.format = TranslationFileFormat.POCHANGED
448 self.supported_source_formats = []446 self.supported_source_formats = []
449447
450 def _makeExportedHeader(self, translation_file, charset=None):448 def _makeExportedHeader(self, translation_file):
451 """Create a header for changed PO files.449 """Create a header for changed PO files.
452 This is a reduced header containing a warning that this is an450 This is a reduced header containing a warning that this is an
453 icomplete gettext PO file.451 icomplete gettext PO file.
454 :return: The header as a unicode string.452 :return: The header as a unicode string.
455 """453 """
456 if charset is None:454 return self.exported_header
457 charset = translation_file.header.charset
458 return self.exported_header.encode(charset)
459455
460 def acceptSingularClash(self, previous_message, current_message):456 def acceptSingularClash(self, previous_message, current_message):
461 """See `GettextPOExporterBase`."""457 """See `GettextPOExporterBase`."""
462458
=== modified file 'lib/lp/translations/utilities/tests/test_gettext_po_exporter.py'
--- lib/lp/translations/utilities/tests/test_gettext_po_exporter.py 2009-11-01 22:50:17 +0000
+++ lib/lp/translations/utilities/tests/test_gettext_po_exporter.py 2010-01-15 06:56:16 +0000
@@ -263,33 +263,9 @@
263 for pofile in pofiles:263 for pofile in pofiles:
264 compare(self, pofile)264 compare(self, pofile)
265265
266266 def _testBrokenEncoding(self, pofile_content):
267 def testBrokenEncodingExport(self):
268 """Test what happens when the content and the encoding don't agree.
269
270 If a pofile fails to encode using the character set specified in the
271 header, the header should be changed to specify to UTF-8 and the
272 pofile exported accordingly.
273 """
274
275 pofile = dedent('''
276 msgid ""
277 msgstr ""
278 "Project-Id-Version: foo\\n"
279 "Report-Msgid-Bugs-To: \\n"
280 "POT-Creation-Date: 2007-07-09 03:39+0100\\n"
281 "PO-Revision-Date: 2001-09-09 01:46+0000\\n"
282 "Last-Translator: Kubla Kahn <kk@pleasure-dome.com>\\n"
283 "Language-Team: LANGUAGE <LL@li.org>\\n"
284 "MIME-Version: 1.0\\n"
285 "Content-Type: text/plain; charset=%s\\n"
286 "Content-Transfer-Encoding: 8bit\\n"
287
288 msgid "a"
289 msgstr "%s"
290 ''')
291 translation_file = self.parser.parse(267 translation_file = self.parser.parse(
292 pofile % ('ISO-8859-15', '\xe1'))268 pofile_content % {'charset': 'ISO-8859-15', 'special': '\xe1'})
293 translation_file.is_template = False269 translation_file.is_template = False
294 translation_file.language_code = 'es'270 translation_file.language_code = 'es'
295 translation_file.path = 'po/es.po'271 translation_file.path = 'po/es.po'
@@ -302,9 +278,65 @@
302 [translation_file])278 [translation_file])
303279
304 self._compareImportAndExport(280 self._compareImportAndExport(
305 pofile.strip() % ('UTF-8', '\xc3\xa1'),281 pofile_content.strip() % {
282 'charset': 'UTF-8', 'special': '\xc3\xa1'},
306 exported_file.read().strip())283 exported_file.read().strip())
307284
285 def testBrokenEncodingExport(self):
286 """Test what happens when the content and the encoding don't agree.
287
288 If a pofile fails to encode using the character set specified in the
289 header, the header should be changed to specify to UTF-8 and the
290 pofile exported accordingly.
291 """
292
293 pofile_content = dedent('''
294 msgid ""
295 msgstr ""
296 "Project-Id-Version: foo\\n"
297 "Report-Msgid-Bugs-To: \\n"
298 "POT-Creation-Date: 2007-07-09 03:39+0100\\n"
299 "PO-Revision-Date: 2001-09-09 01:46+0000\\n"
300 "Last-Translator: Kubla Kahn <kk@pleasure-dome.com>\\n"
301 "Language-Team: LANGUAGE <LL@li.org>\\n"
302 "MIME-Version: 1.0\\n"
303 "Content-Type: text/plain; charset=%(charset)s\\n"
304 "Content-Transfer-Encoding: 8bit\\n"
305
306 msgid "a"
307 msgstr "%(special)s"
308 ''')
309 self._testBrokenEncoding(pofile_content)
310
311 def testBrokenEncodingHeader(self):
312 """A header field might require a different encoding, too.
313
314 This usually happens if the Last-Translator name contains non-ascii
315 characters.
316
317 If a pofile fails to encode using the character set specified in the
318 header, the header should be changed to specify to UTF-8 and the
319 pofile exported accordingly.
320 """
321
322 pofile_content = dedent('''
323 msgid ""
324 msgstr ""
325 "Project-Id-Version: foo\\n"
326 "Report-Msgid-Bugs-To: \\n"
327 "POT-Creation-Date: 2007-07-09 03:39+0100\\n"
328 "PO-Revision-Date: 2001-09-09 01:46+0000\\n"
329 "Last-Translator: Kubla K%(special)shn <kk@pleasure-dome.com>\\n"
330 "Language-Team: LANGUAGE <LL@li.org>\\n"
331 "MIME-Version: 1.0\\n"
332 "Content-Type: text/plain; charset=%(charset)s\\n"
333 "Content-Transfer-Encoding: 8bit\\n"
334
335 msgid "a"
336 msgstr "b"
337 ''')
338 self._testBrokenEncoding(pofile_content)
339
308 def testIncompletePluralMessage(self):340 def testIncompletePluralMessage(self):
309 """Test export correctness for partial plural messages."""341 """Test export correctness for partial plural messages."""
310342