Merge lp:~henninge/launchpad/bug-506925-oops-export into lp:launchpad

Proposed by Henning Eggers
Status: Merged
Approved by: Данило Шеган
Approved revision: not available
Merged at revision: not available
Proposed branch: lp:~henninge/launchpad/bug-506925-oops-export
Merge into: lp:launchpad
Diff against target: 272 lines (+109/-81)
2 files modified
lib/lp/translations/utilities/gettext_po_exporter.py (+50/-54)
lib/lp/translations/utilities/tests/test_gettext_po_exporter.py (+59/-27)
To merge this branch: bzr merge lp:~henninge/launchpad/bug-506925-oops-export
Reviewer Review Type Date Requested Status
Данило Шеган (community) code Approve
Review via email: mp+17304@code.launchpad.net

Commit message

Fixed oops in translations export code by properly encoding the whole file as utf-8 if needed.

To post a comment you must log in.
Revision history for this message
Henning Eggers (henninge) wrote :

= Bug 506925 =

Each PO file has a standard header with information about the file. One of the fields in that header is "Last-Translator" which contains the name and email address of the last person that did a translation in this file. Upon export from Launchpad, this field is filled with the display_name (or title? who cares) of the last translators IPerson object (and email address). This may contain non-ascii characters as in https://launchpad.net/~danilo. The header also has a "Content-Type" field that declares the character encoding of the file. This bug used to surface when that field did not contain a charset that could represent the translators name (ideally utf-8).

This problem is not new as such clashes have happened with translations, too, which also may contain non-ascii characters. The export routine alredy knows how to deal with this and simply changes the encoding of the file to UTF-8. This behaviour needed to simply be extended to the header, i.e. the whole file.

== Pre-imp notes ==

Talked with danilo about it and we agreed that it is ok to convert the whole file to UTF-8 if necessary. Most files are UTF-8 already.

== Implmentation details ==

lib/lp/translations/utilities/gettext_po_exporter.py

 * Encode the whole file at the end instead of each chunk seperately and in different places.
 * Did away with the need of having to re-encode all chunks if the encoding changes.
 * The header is prepended at the end or the loop so that the encoding can be changed at last-minute warning.

lib/lp/translations/utilities/tests/test_gettext_po_exporter.py

 * Added test for encoding of non-ascii characters in the header.
 * Combined common code into a private method.
 * Use named format specifiers because their order changes.

== Test ==

bin/test -vvct GettextPOExporterTestCase

== Demo/QA ==

 * Import a file that has "ascii" as it's Content-type.
 * Have Danilo make a translation using ascii characters.
 * Export the file.
 * The exported file should be UTF-8 now and Данило Шеган should appear as the Last-Translator.

= Launchpad lint =

Checking for conflicts. and issues in doctests and templates.
Running jslint, xmllint, pyflakes, and pylint.
Using normal rules.

Linting changed files:
  lib/lp/translations/utilities/gettext_po_exporter.py
  lib/lp/translations/utilities/tests/test_gettext_po_exporter.py

Revision history for this message
Данило Шеган (danilo) wrote :

As discussed on IRC, I have some concerns with _try_encode_file_content method: it has very weird return semantics. Basically, they are: return encoded content according to translation_file header charset, and if it fails, return None if translation_file header wasn't UTF-8, otherwise fail with an exception. Not a very reasonable method, I'd say :)

We've looked at several solutions to that, one was to just switch the callsite to do exception handling, and not handle exceptions in this method (where the second one might fail, and would naturally propagate), though this approach loses some debugging information existing code has (file name in the exception error).

We can also just do this "exception error message improvement" in the _try_encode_file_content (i.e. catch the exception there and re-raise it unconditionally, compared to what we do now where we re-raise it only if the requested encoding was UTF-8).

review: Approve (code)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/lp/translations/utilities/gettext_po_exporter.py'
2--- lib/lp/translations/utilities/gettext_po_exporter.py 2009-11-18 13:26:03 +0000
3+++ lib/lp/translations/utilities/gettext_po_exporter.py 2010-01-15 06:56:16 +0000
4@@ -277,13 +277,23 @@
5 """See `ITranslationFormatExporter`."""
6 return export_translation_message(translation_message)
7
8- def _makeExportedHeader(self, translation_file, charset=None):
9+ def _makeExportedHeader(self, translation_file):
10 """Transform the header information into a format suitable for export.
11
12 :return: Unicode string containing the header.
13 """
14 raise NotImplementedError
15
16+ def _encode_file_content(self, translation_file, exported_content):
17+ """Try to encode the file using the charset given in the header."""
18+ file_content = (
19+ self._makeExportedHeader(translation_file) +
20+ u'\n\n' +
21+ exported_content)
22+ encoded_file_content = file_content.encode(
23+ translation_file.header.charset)
24+ return encoded_file_content
25+
26 def exportTranslationFiles(self, translation_files, ignore_obsolete=False,
27 force_utf8=False):
28 """See `ITranslationFormatExporter`."""
29@@ -310,10 +320,7 @@
30 translation_file.language_code,
31 file_extension))
32
33- if force_utf8:
34- translation_file.header.charset = 'UTF-8'
35- chunks = [self._makeExportedHeader(translation_file)]
36-
37+ chunks = []
38 seen_keys = {}
39
40 for message in translation_file.messages:
41@@ -335,49 +342,42 @@
42 if (message.is_obsolete and
43 (ignore_obsolete or len(message.translations) == 0)):
44 continue
45- exported_message = self.exportTranslationMessageData(message)
46- try:
47- encoded_text = exported_message.encode(
48- translation_file.header.charset)
49- except UnicodeEncodeError, error:
50- if translation_file.header.charset.upper() == 'UTF-8':
51- # It's already UTF-8, we cannot do anything.
52- raise UnicodeEncodeError(
53- '%s:\n%s' % (file_path, str(error)))
54-
55- # This message cannot be represented in the current
56- # encoding.
57- if translation_file.path:
58- file_description = translation_file.path
59- elif translation_file.language_code:
60- file_description = (
61- "%s translation" % translation_file.language_code)
62- else:
63- file_description = "template"
64- logging.info(
65- "Can't represent %s as %s; using UTF-8 instead." % (
66- file_description,
67- translation_file.header.charset.upper()))
68-
69- old_charset = translation_file.header.charset
70- translation_file.header.charset = 'UTF-8'
71- # We need to update the header too.
72- chunks[0] = self._makeExportedHeader(
73- translation_file, old_charset)
74- # Update already exported entries.
75- for index, chunk in enumerate(chunks):
76- chunks[index] = chunk.decode(
77- old_charset).encode('UTF-8')
78- encoded_text = exported_message.encode('UTF-8')
79-
80- chunks.append(encoded_text)
81-
82- exported_file_content = '\n\n'.join(chunks)
83+ chunks.append(self.exportTranslationMessageData(message))
84+
85
86 # Gettext .po files are supposed to end with a new line.
87- exported_file_content += '\n'
88-
89- storage.addFile(file_path, file_extension, exported_file_content)
90+ exported_file_content = u'\n\n'.join(chunks) + u'\n'
91+
92+ # Try to encode the file
93+ if force_utf8:
94+ translation_file.header.charset = 'UTF-8'
95+ try:
96+ encoded_file_content = self._encode_file_content(
97+ translation_file, exported_file_content)
98+ except UnicodeEncodeError:
99+ if translation_file.header.charset.upper() == 'UTF-8':
100+ # It's already UTF-8, we cannot do anything.
101+ raise
102+ # This file content cannot be represented in the current
103+ # encoding.
104+ if translation_file.path:
105+ file_description = translation_file.path
106+ elif translation_file.language_code:
107+ file_description = (
108+ "%s translation" % translation_file.language_code)
109+ else:
110+ file_description = "template"
111+ logging.info(
112+ "Can't represent %s as %s; using UTF-8 instead." % (
113+ file_description,
114+ translation_file.header.charset.upper()))
115+ # Use UTF-8 instead.
116+ translation_file.header.charset = 'UTF-8'
117+ # This either succeeds or raises UnicodeError.
118+ encoded_file_content = self._encode_file_content(
119+ translation_file, exported_file_content)
120+
121+ storage.addFile(file_path, file_extension, encoded_file_content)
122
123 return storage.export()
124
125@@ -410,13 +410,11 @@
126 TranslationFileFormat.PO,
127 TranslationFileFormat.KDEPO]
128
129- def _makeExportedHeader(self, translation_file, charset=None):
130+ def _makeExportedHeader(self, translation_file):
131 """Create a standard gettext PO header, encoded as a message.
132
133 :return: The header message as a unicode string.
134 """
135- if charset is None:
136- charset = translation_file.header.charset
137 header_translation_message = TranslationMessageData()
138 header_translation_message.addTranslation(
139 TranslationConstants.SINGULAR_FORM,
140@@ -427,7 +425,7 @@
141 header_translation_message.flags.update(['fuzzy'])
142 exported_header = self.exportTranslationMessageData(
143 header_translation_message)
144- return exported_header.encode(charset)
145+ return exported_header
146
147
148 class GettextPOChangedExporter(GettextPOExporterBase):
149@@ -447,15 +445,13 @@
150 self.format = TranslationFileFormat.POCHANGED
151 self.supported_source_formats = []
152
153- def _makeExportedHeader(self, translation_file, charset=None):
154+ def _makeExportedHeader(self, translation_file):
155 """Create a header for changed PO files.
156 This is a reduced header containing a warning that this is an
157 icomplete gettext PO file.
158 :return: The header as a unicode string.
159 """
160- if charset is None:
161- charset = translation_file.header.charset
162- return self.exported_header.encode(charset)
163+ return self.exported_header
164
165 def acceptSingularClash(self, previous_message, current_message):
166 """See `GettextPOExporterBase`."""
167
168=== modified file 'lib/lp/translations/utilities/tests/test_gettext_po_exporter.py'
169--- lib/lp/translations/utilities/tests/test_gettext_po_exporter.py 2009-11-01 22:50:17 +0000
170+++ lib/lp/translations/utilities/tests/test_gettext_po_exporter.py 2010-01-15 06:56:16 +0000
171@@ -263,33 +263,9 @@
172 for pofile in pofiles:
173 compare(self, pofile)
174
175-
176- def testBrokenEncodingExport(self):
177- """Test what happens when the content and the encoding don't agree.
178-
179- If a pofile fails to encode using the character set specified in the
180- header, the header should be changed to specify to UTF-8 and the
181- pofile exported accordingly.
182- """
183-
184- pofile = dedent('''
185- msgid ""
186- msgstr ""
187- "Project-Id-Version: foo\\n"
188- "Report-Msgid-Bugs-To: \\n"
189- "POT-Creation-Date: 2007-07-09 03:39+0100\\n"
190- "PO-Revision-Date: 2001-09-09 01:46+0000\\n"
191- "Last-Translator: Kubla Kahn <kk@pleasure-dome.com>\\n"
192- "Language-Team: LANGUAGE <LL@li.org>\\n"
193- "MIME-Version: 1.0\\n"
194- "Content-Type: text/plain; charset=%s\\n"
195- "Content-Transfer-Encoding: 8bit\\n"
196-
197- msgid "a"
198- msgstr "%s"
199- ''')
200+ def _testBrokenEncoding(self, pofile_content):
201 translation_file = self.parser.parse(
202- pofile % ('ISO-8859-15', '\xe1'))
203+ pofile_content % {'charset': 'ISO-8859-15', 'special': '\xe1'})
204 translation_file.is_template = False
205 translation_file.language_code = 'es'
206 translation_file.path = 'po/es.po'
207@@ -302,9 +278,65 @@
208 [translation_file])
209
210 self._compareImportAndExport(
211- pofile.strip() % ('UTF-8', '\xc3\xa1'),
212+ pofile_content.strip() % {
213+ 'charset': 'UTF-8', 'special': '\xc3\xa1'},
214 exported_file.read().strip())
215
216+ def testBrokenEncodingExport(self):
217+ """Test what happens when the content and the encoding don't agree.
218+
219+ If a pofile fails to encode using the character set specified in the
220+ header, the header should be changed to specify to UTF-8 and the
221+ pofile exported accordingly.
222+ """
223+
224+ pofile_content = dedent('''
225+ msgid ""
226+ msgstr ""
227+ "Project-Id-Version: foo\\n"
228+ "Report-Msgid-Bugs-To: \\n"
229+ "POT-Creation-Date: 2007-07-09 03:39+0100\\n"
230+ "PO-Revision-Date: 2001-09-09 01:46+0000\\n"
231+ "Last-Translator: Kubla Kahn <kk@pleasure-dome.com>\\n"
232+ "Language-Team: LANGUAGE <LL@li.org>\\n"
233+ "MIME-Version: 1.0\\n"
234+ "Content-Type: text/plain; charset=%(charset)s\\n"
235+ "Content-Transfer-Encoding: 8bit\\n"
236+
237+ msgid "a"
238+ msgstr "%(special)s"
239+ ''')
240+ self._testBrokenEncoding(pofile_content)
241+
242+ def testBrokenEncodingHeader(self):
243+ """A header field might require a different encoding, too.
244+
245+ This usually happens if the Last-Translator name contains non-ascii
246+ characters.
247+
248+ If a pofile fails to encode using the character set specified in the
249+ header, the header should be changed to specify to UTF-8 and the
250+ pofile exported accordingly.
251+ """
252+
253+ pofile_content = dedent('''
254+ msgid ""
255+ msgstr ""
256+ "Project-Id-Version: foo\\n"
257+ "Report-Msgid-Bugs-To: \\n"
258+ "POT-Creation-Date: 2007-07-09 03:39+0100\\n"
259+ "PO-Revision-Date: 2001-09-09 01:46+0000\\n"
260+ "Last-Translator: Kubla K%(special)shn <kk@pleasure-dome.com>\\n"
261+ "Language-Team: LANGUAGE <LL@li.org>\\n"
262+ "MIME-Version: 1.0\\n"
263+ "Content-Type: text/plain; charset=%(charset)s\\n"
264+ "Content-Transfer-Encoding: 8bit\\n"
265+
266+ msgid "a"
267+ msgstr "b"
268+ ''')
269+ self._testBrokenEncoding(pofile_content)
270+
271 def testIncompletePluralMessage(self):
272 """Test export correctness for partial plural messages."""
273