Merge lp:~garyvdm/bzr/unicode_bom_detect into lp:bzr

Proposed by Gary van der Merwe
Status: Work in progress
Proposed branch: lp:~garyvdm/bzr/unicode_bom_detect
Merge into: lp:bzr
Diff against target: 229 lines (+91/-17)
7 files modified
NEWS (+7/-0)
bzrlib/diff.py (+2/-2)
bzrlib/merge.py (+1/-1)
bzrlib/merge3.py (+4/-4)
bzrlib/shelf_ui.py (+2/-2)
bzrlib/tests/test_textfile.py (+27/-6)
bzrlib/textfile.py (+48/-2)
To merge this branch: bzr merge lp:~garyvdm/bzr/unicode_bom_detect
Reviewer Review Type Date Requested Status
Martin Packman (community) Needs Information
bzr-core Pending
Review via email: mp+17076@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Gary van der Merwe (garyvdm) wrote :

This makes files with a Unicode BOM correctly decoded and displayed in diff, merge, and shelve.

Revision history for this message
Martin Packman (gz) wrote :

Is this partly inspired by the current thread on python-dev?
<http://mail.python.org/pipermail/python-dev/2010-January/094828.html>

Can you clearly document what the actual heuristic is? It's not clear to me from reading the diff, and it's not the same as the notepad one. Why, for instance, are you not decoding UTF-8 text?

What happens if a UnicodeDecodeError is raised from one of these functions?

review: Needs Information
Revision history for this message
Aaron Bentley (abentley) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

review: resubmit

This branch introduces confusion between unicode characters and bytes
into our core commands.

Merge writes bytes, not unicode characters, to disk. Decoding the bytes
into unicode characters will mean that characters will be written to
disk. This only works for unicode characters in the ascii range because
python automatically converts unicode strings to ascii. Even though it
appears to work with the characters you tested, the merged version will
not be in utf-16, and that's not a reasonable result.

Similarly, patches are bytestreams, not unicode streams. They have no
defined encoding. It should be possible to apply a patch using
/bin/patch. But again, by decoding the bytes to unicode, this branch
makes that impossible.

A branch like this should, at minimum, test that the operations work
correctly with high-bit characters like u'\u1234', and that the input
encoding is preserved in the output.

Gary van der Merwe wrote:
> Gary van der Merwe has proposed merging lp:~garyvdm/bzr/unicode_bom_detect into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
> Related bugs:
> #267296 utf16 file detected as binary file
> https://bugs.launchpad.net/bugs/267296
>
>
> This makes files with a Unicode BOM correctly decoded and displayed in diff, merge, and shelve.
>
>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktLaaAACgkQ0F+nu1YWqI2O+wCfcrD7i2BXd9g8a2jaWFff67ni
mSoAnjchuECApiMB0+yTgBjRbSaoGfST
=ERY4
-----END PGP SIGNATURE-----

Revision history for this message
Vincent Ladeuil (vila) wrote :

Gary, I'm marking this proposal as 'Work In progress', keep us informed of your progress and feel free to ask for help.

Revision history for this message
Alexander Belchenko (bialix) wrote :

> Is this partly inspired by the current thread on python-dev?
> <http://mail.python.org/pipermail/python-dev/2010-January/094828.html>

This is wrong URL. Do you have better one?

Revision history for this message
Martin Packman (gz) wrote :

> > Is this partly inspired by the current thread on python-dev?
> > <http://mail.python.org/pipermail/python-dev/2010-January/094828.html>
>
> This is wrong URL. Do you have better one?

Something bad seems to have happened to their mailman, there's a bunch of junk in January. Try this instead, but if it breaks, just look through the archives for the thread "Improve open() to support reading file starting with an unicode BOM":
<http://mail.python.org/pipermail/python-dev/2010-January/095550.html>

Revision history for this message
Alexander Belchenko (bialix) wrote :

Martin [gz] пишет:
>>> Is this partly inspired by the current thread on python-dev?
>>> <http://mail.python.org/pipermail/python-dev/2010-January/094828.html>
>> This is wrong URL. Do you have better one?
>
> Something bad seems to have happened to their mailman, there's a bunch of junk in January. Try this instead, but if it breaks, just look through the archives for the thread "Improve open() to support reading file starting with an unicode BOM":
> <http://mail.python.org/pipermail/python-dev/2010-January/095550.html>

Thanks.

Unmerged revisions

4952. By Gary van der Merwe

Update NEWS, and doc strings.

4951. By Gary van der Merwe

Use text_lines rather than check_text_lines.

4950. By Gary van der Merwe

Add text_file, which replaces check_text_lines. If a BOM encoding is detected, the lines are decoded, and returned.

4949. By Gary van der Merwe

Make text_file handle files with a BOM encoding.

4948. By Gary van der Merwe

Test text_file for files with a BOM encoding.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'NEWS'
2--- NEWS 2010-01-08 09:27:39 +0000
3+++ NEWS 2010-01-09 17:10:28 +0000
4@@ -61,6 +61,9 @@
5 returns ``EINTR`` by calling ``PyErr_CheckSignals``. This affected the
6 optional ``_readdir_pyx`` extension. (Andrew Bennetts, #495023)
7
8+* Files with a Unicode BOM are now correctly decoded and displayed in diff,
9+ merge, and shelve. (Gary van der Merwe, #267296)
10+
11 * Fixed a side effect mutation of ``RemoteBzrDirFormat._network_name``
12 that caused some tests to fail when run in a non-default order.
13 Probably no user impact. (Martin Pool, #504102)
14@@ -120,6 +123,10 @@
15 CamelCase. For the features that were more likely to be used, we added a
16 deprecation thunk, but not all. (John Arbash Meinel)
17
18+* ``bzrlib.textfile.check_text_lines`` has been deprecated; use
19+ ``bzrlib.textfile.text_lines`` instead, which returns the lines, decoded if
20+ a unicode BOM was detected. (Gary van der Merwe)
21+
22 * The Branch hooks pre_change_branch_tip no longer masks exceptions raised
23 by plugins - the original exceptions are now preserved. (Robert Collins)
24
25
26=== modified file 'bzrlib/diff.py'
27--- bzrlib/diff.py 2009-12-03 07:36:17 +0000
28+++ bzrlib/diff.py 2010-01-09 17:10:28 +0000
29@@ -89,8 +89,8 @@
30 return
31
32 if allow_binary is False:
33- textfile.check_text_lines(oldlines)
34- textfile.check_text_lines(newlines)
35+ oldlines = textfile.text_lines(oldlines)
36+ newlines = textfile.text_lines(newlines)
37
38 if sequence_matcher is None:
39 sequence_matcher = patiencediff.PatienceSequenceMatcher
40
41=== modified file 'bzrlib/merge.py'
42--- bzrlib/merge.py 2009-12-18 08:22:42 +0000
43+++ bzrlib/merge.py 2010-01-09 17:10:28 +0000
44@@ -1444,7 +1444,7 @@
45 lines = list(lines)
46 # Note we're checking whether the OUTPUT is binary in this case,
47 # because we don't want to get into weave merge guts.
48- textfile.check_text_lines(lines)
49+ lines = textfile.text_lines(lines)
50 self.tt.create_file(lines, trans_id)
51 if base_lines is not None:
52 # Conflict
53
54=== modified file 'bzrlib/merge3.py'
55--- bzrlib/merge3.py 2009-03-23 14:59:43 +0000
56+++ bzrlib/merge3.py 2010-01-09 17:10:28 +0000
57@@ -21,7 +21,7 @@
58
59 from bzrlib.errors import CantReprocessAndShowBase
60 import bzrlib.patiencediff
61-from bzrlib.textfile import check_text_lines
62+from bzrlib.textfile import text_lines
63
64
65 def intersect(ra, rb):
66@@ -67,9 +67,9 @@
67 incorporating the changes from both BASE->OTHER and BASE->THIS.
68 All three will typically be sequences of lines."""
69 def __init__(self, base, a, b, is_cherrypick=False):
70- check_text_lines(base)
71- check_text_lines(a)
72- check_text_lines(b)
73+ base = text_lines(base)
74+ a = text_lines(a)
75+ b = text_lines(b)
76 self.base = base
77 self.a = a
78 self.b = b
79
80=== modified file 'bzrlib/shelf_ui.py'
81--- bzrlib/shelf_ui.py 2009-12-15 17:57:26 +0000
82+++ bzrlib/shelf_ui.py 2010-01-09 17:10:28 +0000
83@@ -331,8 +331,8 @@
84 target_lines = work_tree_lines
85 else:
86 target_lines = self.target_tree.get_file_lines(file_id)
87- textfile.check_text_lines(work_tree_lines)
88- textfile.check_text_lines(target_lines)
89+ work_tree_lines = textfile.text_lines(work_tree_lines)
90+ target_lines = textfile.text_lines(target_lines)
91 parsed = self.get_parsed_patch(file_id, self.reporter.invert_diff)
92 final_hunks = []
93 if not self.auto:
94
95=== modified file 'bzrlib/tests/test_textfile.py'
96--- bzrlib/tests/test_textfile.py 2009-03-23 14:59:43 +0000
97+++ bzrlib/tests/test_textfile.py 2010-01-09 17:10:28 +0000
98@@ -18,7 +18,8 @@
99
100 from bzrlib.errors import BinaryFile
101 from bzrlib.tests import TestCase, TestCaseInTempDir
102-from bzrlib.textfile import text_file, check_text_lines, check_text_path
103+from bzrlib.textfile import (
104+ text_file, text_lines, check_text_lines, check_text_path, bom_encodings)
105
106
107 class TextFile(TestCase):
108@@ -30,13 +31,33 @@
109 self.assertRaises(BinaryFile, text_file, s)
110 s = StringIO('a' * 1024 + '\x00')
111 self.assertEqual(text_file(s).read(), s.getvalue())
112-
113- def test_check_text_lines(self):
114+
115+ def test_text_file_with_bom_encoding(self):
116+ u = u'ab' * 2048
117+ for bom, encoding in bom_encodings:
118+ if encoding:
119+ s = StringIO(bom + u.encode(encoding))
120+ else:
121+ s = StringIO(bom + u.encode('utf_8'))
122+
123+ self.assertEqual(text_file(s).read(), u)
124+
125+ def test_text_lines(self):
126 lines = ['ab' * 2048]
127- check_text_lines(lines)
128+ self.assertEqual(text_lines(lines), lines)
129 lines = ['a' * 1023 + '\x00']
130- self.assertRaises(BinaryFile, check_text_lines, lines)
131-
132+ self.assertRaises(BinaryFile, text_lines, lines)
133+
134+
135+ def test_text_lines_with_bom_encoding(self):
136+ u = u'hello'
137+ for bom, encoding in bom_encodings:
138+ if encoding:
139+ lines = [bom + u.encode(encoding)]
140+ else:
141+ lines = [bom + u.encode('utf_8')]
142+
143+ self.assertEqual(''.join(text_lines(lines)), u)
144
145 class TextPath(TestCaseInTempDir):
146
147
148=== modified file 'bzrlib/textfile.py'
149--- bzrlib/textfile.py 2009-03-23 14:59:43 +0000
150+++ bzrlib/textfile.py 2010-01-09 17:10:28 +0000
151@@ -17,30 +17,76 @@
152 """Utilities for distinguishing binary files from text files"""
153
154 from itertools import chain
155+import codecs
156
157 from bzrlib.errors import BinaryFile
158 from bzrlib.iterablefile import IterableFile
159 from bzrlib.osutils import file_iterator
160-
161+from bzrlib.symbol_versioning import deprecated_function, deprecated_in
162+
163+
164+# Note that the order of this is important. utf_32_le must be tested for before
165+# utf_16_le, otherwise utf_16_le will give a false positive for utf_32_le.
166+bom_encodings = [
167+ (codecs.BOM_UTF32_BE, 'utf_32_be'),
168+ (codecs.BOM_UTF32_LE, 'utf_32_le'),
169+ (codecs.BOM_UTF16_BE, 'utf_16_be'),
170+ (codecs.BOM_UTF16_LE, 'utf_16_le'),
171+ (codecs.BOM_UTF8, None), # we don't want to re-decode
172+ ]
173
174 def text_file(input):
175 """Produce a file iterator that is guaranteed to be text, without seeking.
176 BinaryFile is raised if the file contains a NUL in the first 1024 bytes.
177+ If a unicode BOM is detected, the returned file will be decoded.
178 """
179 first_chunk = input.read(1024)
180+ for bom, encoding in bom_encodings:
181+ if first_chunk.startswith(bom):
182+ first_chunk = first_chunk[len(bom):]
183+ if encoding:
184+ first_chunk = first_chunk.decode(encoding)
185+ break
186+ else:
187+ encoding = None
188+
189 if '\x00' in first_chunk:
190 raise BinaryFile()
191+
192+ if encoding:
193+ return IterableFile(chain((first_chunk,),
194+ decode_iterator(file_iterator(input), encoding)))
195 return IterableFile(chain((first_chunk,), file_iterator(input)))
196
197+def decode_iterator(iterator, encoding):
198+ for s in iterator:
199+ yield s.decode(encoding)
200
201+@deprecated_function(deprecated_in((2, 1, 0)))
202 def check_text_lines(lines):
203 """Raise BinaryFile if the supplied lines contain NULs.
204 Only the first 1024 characters are checked.
205 """
206+ text_lines(lines)
207+
208+def text_lines(lines):
209+ """Raise BinaryFile if the supplied lines contain NULs.
210+ Only the first 1024 characters are checked.
211+ If a unicode BOM is detected, the returned lines will be decoded.
212+ """
213+ if lines:
214+ for bom, encoding in bom_encodings:
215+ if lines[0].startswith(bom):
216+ lines[0] = lines[0][len(bom):]
217+ if encoding:
218+ lines = ''.join(lines).decode(encoding).splitlines()
219+ break
220+
221 f = IterableFile(lines)
222 if '\x00' in f.read(1024):
223 raise BinaryFile()
224-
225+
226+ return lines
227
228 def check_text_path(path):
229 """Check whether the supplied path is a text, not binary file.