Merge lp:~garyvdm/bzr/unicode_bom_detect into lp:bzr

Proposed by Gary van der Merwe
Status: Work in progress
Proposed branch: lp:~garyvdm/bzr/unicode_bom_detect
Merge into: lp:bzr
Diff against target: 229 lines (+91/-17)
7 files modified
NEWS (+7/-0)
bzrlib/diff.py (+2/-2)
bzrlib/merge.py (+1/-1)
bzrlib/merge3.py (+4/-4)
bzrlib/shelf_ui.py (+2/-2)
bzrlib/tests/test_textfile.py (+27/-6)
bzrlib/textfile.py (+48/-2)
To merge this branch: bzr merge lp:~garyvdm/bzr/unicode_bom_detect
Reviewer Review Type Date Requested Status
Martin Packman (community) Needs Information
bzr-core Pending
Review via email: mp+17076@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Gary van der Merwe (garyvdm) wrote :

This makes files with a Unicode BOM correctly decoded and displayed in diff, merge, and shelve.

Revision history for this message
Martin Packman (gz) wrote :

Is this partly inspired by the current thread on python-dev?
<http://mail.python.org/pipermail/python-dev/2010-January/094828.html>

Can you clearly document what the actual heuristic is? It's not clear to me from reading the diff, and it's not the same as the notepad one. Why, for instance, are you not decoding UTF-8 text?

What happens if a UnicodeDecodeError is raised from one of these functions?

review: Needs Information
Revision history for this message
Aaron Bentley (abentley) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

review: resubmit

This branch introduces confusion between unicode characters and bytes
into our core commands.

Merge writes bytes, not unicode characters, to disk. Decoding the bytes
into unicode characters will mean that characters will be written to
disk. This only works for unicode characters in the ascii range because
python automatically converts unicode strings to ascii. Even though it
appears to work with the characters you tested, the merged version will
not be in utf-16, and that's not a reasonable result.

Similarly, patches are bytestreams, not unicode streams. They have no
defined encoding. It should be possible to apply a patch using
/bin/patch. But again, by decoding the bytes to unicode, this branch
makes that impossible.

A branch like this should, at minimum, test that the operations work
correctly with high-bit characters like u'\u1234', and that the input
encoding is preserved in the output.

Gary van der Merwe wrote:
> Gary van der Merwe has proposed merging lp:~garyvdm/bzr/unicode_bom_detect into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
> Related bugs:
> #267296 utf16 file detected as binary file
> https://bugs.launchpad.net/bugs/267296
>
>
> This makes files with a Unicode BOM correctly decoded and displayed in diff, merge, and shelve.
>
>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktLaaAACgkQ0F+nu1YWqI2O+wCfcrD7i2BXd9g8a2jaWFff67ni
mSoAnjchuECApiMB0+yTgBjRbSaoGfST
=ERY4
-----END PGP SIGNATURE-----

Revision history for this message
Vincent Ladeuil (vila) wrote :

Gary, I'm marking this proposal as 'Work In progress', keep us informed of your progress and feel free to ask for help.

Revision history for this message
Alexander Belchenko (bialix) wrote :

> Is this partly inspired by the current thread on python-dev?
> <http://mail.python.org/pipermail/python-dev/2010-January/094828.html>

This is wrong URL. Do you have better one?

Revision history for this message
Martin Packman (gz) wrote :

> > Is this partly inspired by the current thread on python-dev?
> > <http://mail.python.org/pipermail/python-dev/2010-January/094828.html>
>
> This is wrong URL. Do you have better one?

Something bad seems to have happened to their mailman, there's a bunch of junk in January. Try this instead, but if it breaks, just look through the archives for the thread "Improve open() to support reading file starting with an unicode BOM":
<http://mail.python.org/pipermail/python-dev/2010-January/095550.html>

Revision history for this message
Alexander Belchenko (bialix) wrote :

Martin [gz] пишет:
>>> Is this partly inspired by the current thread on python-dev?
>>> <http://mail.python.org/pipermail/python-dev/2010-January/094828.html>
>> This is wrong URL. Do you have better one?
>
> Something bad seems to have happened to their mailman, there's a bunch of junk in January. Try this instead, but if it breaks, just look through the archives for the thread "Improve open() to support reading file starting with an unicode BOM":
> <http://mail.python.org/pipermail/python-dev/2010-January/095550.html>

Thanks.

Unmerged revisions

4952. By Gary van der Merwe

Update NEWS, and doc strings.

4951. By Gary van der Merwe

Use text_lines rather than check_text_lines.

4950. By Gary van der Merwe

Add text_file, which replaces check_text_lines. If a BOM encoding is detected, the lines are decoded, and returned.

4949. By Gary van der Merwe

Make text_file handle files with a BOM encoding.

4948. By Gary van der Merwe

Test text_file for files with a BOM encoding.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'NEWS'
--- NEWS 2010-01-08 09:27:39 +0000
+++ NEWS 2010-01-09 17:10:28 +0000
@@ -61,6 +61,9 @@
61 returns ``EINTR`` by calling ``PyErr_CheckSignals``. This affected the61 returns ``EINTR`` by calling ``PyErr_CheckSignals``. This affected the
62 optional ``_readdir_pyx`` extension. (Andrew Bennetts, #495023)62 optional ``_readdir_pyx`` extension. (Andrew Bennetts, #495023)
6363
64* Files with a Unicode BOM are now correctly decoded and displayed in diff,
65 merge, and shelve. (Gary van der Merwe, #267296)
66
64* Fixed a side effect mutation of ``RemoteBzrDirFormat._network_name``67* Fixed a side effect mutation of ``RemoteBzrDirFormat._network_name``
65 that caused some tests to fail when run in a non-default order.68 that caused some tests to fail when run in a non-default order.
66 Probably no user impact. (Martin Pool, #504102)69 Probably no user impact. (Martin Pool, #504102)
@@ -120,6 +123,10 @@
120 CamelCase. For the features that were more likely to be used, we added a123 CamelCase. For the features that were more likely to be used, we added a
121 deprecation thunk, but not all. (John Arbash Meinel)124 deprecation thunk, but not all. (John Arbash Meinel)
122125
126* ``bzrlib.textfile.check_text_lines`` has been deprecated; use
127 ``bzrlib.textfile.text_lines`` instead, which returns the lines, decoded if
128 a unicode BOM was detected. (Gary van der Merwe)
129
123* The Branch hooks pre_change_branch_tip no longer masks exceptions raised130* The Branch hooks pre_change_branch_tip no longer masks exceptions raised
124 by plugins - the original exceptions are now preserved. (Robert Collins)131 by plugins - the original exceptions are now preserved. (Robert Collins)
125132
126133
=== modified file 'bzrlib/diff.py'
--- bzrlib/diff.py 2009-12-03 07:36:17 +0000
+++ bzrlib/diff.py 2010-01-09 17:10:28 +0000
@@ -89,8 +89,8 @@
89 return89 return
9090
91 if allow_binary is False:91 if allow_binary is False:
92 textfile.check_text_lines(oldlines)92 oldlines = textfile.text_lines(oldlines)
93 textfile.check_text_lines(newlines)93 newlines = textfile.text_lines(newlines)
9494
95 if sequence_matcher is None:95 if sequence_matcher is None:
96 sequence_matcher = patiencediff.PatienceSequenceMatcher96 sequence_matcher = patiencediff.PatienceSequenceMatcher
9797
=== modified file 'bzrlib/merge.py'
--- bzrlib/merge.py 2009-12-18 08:22:42 +0000
+++ bzrlib/merge.py 2010-01-09 17:10:28 +0000
@@ -1444,7 +1444,7 @@
1444 lines = list(lines)1444 lines = list(lines)
1445 # Note we're checking whether the OUTPUT is binary in this case,1445 # Note we're checking whether the OUTPUT is binary in this case,
1446 # because we don't want to get into weave merge guts.1446 # because we don't want to get into weave merge guts.
1447 textfile.check_text_lines(lines)1447 lines = textfile.text_lines(lines)
1448 self.tt.create_file(lines, trans_id)1448 self.tt.create_file(lines, trans_id)
1449 if base_lines is not None:1449 if base_lines is not None:
1450 # Conflict1450 # Conflict
14511451
=== modified file 'bzrlib/merge3.py'
--- bzrlib/merge3.py 2009-03-23 14:59:43 +0000
+++ bzrlib/merge3.py 2010-01-09 17:10:28 +0000
@@ -21,7 +21,7 @@
2121
22from bzrlib.errors import CantReprocessAndShowBase22from bzrlib.errors import CantReprocessAndShowBase
23import bzrlib.patiencediff23import bzrlib.patiencediff
24from bzrlib.textfile import check_text_lines24from bzrlib.textfile import text_lines
2525
2626
27def intersect(ra, rb):27def intersect(ra, rb):
@@ -67,9 +67,9 @@
67 incorporating the changes from both BASE->OTHER and BASE->THIS.67 incorporating the changes from both BASE->OTHER and BASE->THIS.
68 All three will typically be sequences of lines."""68 All three will typically be sequences of lines."""
69 def __init__(self, base, a, b, is_cherrypick=False):69 def __init__(self, base, a, b, is_cherrypick=False):
70 check_text_lines(base)70 base = text_lines(base)
71 check_text_lines(a)71 a = text_lines(a)
72 check_text_lines(b)72 b = text_lines(b)
73 self.base = base73 self.base = base
74 self.a = a74 self.a = a
75 self.b = b75 self.b = b
7676
=== modified file 'bzrlib/shelf_ui.py'
--- bzrlib/shelf_ui.py 2009-12-15 17:57:26 +0000
+++ bzrlib/shelf_ui.py 2010-01-09 17:10:28 +0000
@@ -331,8 +331,8 @@
331 target_lines = work_tree_lines331 target_lines = work_tree_lines
332 else:332 else:
333 target_lines = self.target_tree.get_file_lines(file_id)333 target_lines = self.target_tree.get_file_lines(file_id)
334 textfile.check_text_lines(work_tree_lines)334 work_tree_lines = textfile.text_lines(work_tree_lines)
335 textfile.check_text_lines(target_lines)335 target_lines = textfile.text_lines(target_lines)
336 parsed = self.get_parsed_patch(file_id, self.reporter.invert_diff)336 parsed = self.get_parsed_patch(file_id, self.reporter.invert_diff)
337 final_hunks = []337 final_hunks = []
338 if not self.auto:338 if not self.auto:
339339
=== modified file 'bzrlib/tests/test_textfile.py'
--- bzrlib/tests/test_textfile.py 2009-03-23 14:59:43 +0000
+++ bzrlib/tests/test_textfile.py 2010-01-09 17:10:28 +0000
@@ -18,7 +18,8 @@
1818
19from bzrlib.errors import BinaryFile19from bzrlib.errors import BinaryFile
20from bzrlib.tests import TestCase, TestCaseInTempDir20from bzrlib.tests import TestCase, TestCaseInTempDir
21from bzrlib.textfile import text_file, check_text_lines, check_text_path21from bzrlib.textfile import (
22 text_file, text_lines, check_text_lines, check_text_path, bom_encodings)
2223
2324
24class TextFile(TestCase):25class TextFile(TestCase):
@@ -30,13 +31,33 @@
30 self.assertRaises(BinaryFile, text_file, s)31 self.assertRaises(BinaryFile, text_file, s)
31 s = StringIO('a' * 1024 + '\x00')32 s = StringIO('a' * 1024 + '\x00')
32 self.assertEqual(text_file(s).read(), s.getvalue())33 self.assertEqual(text_file(s).read(), s.getvalue())
3334
34 def test_check_text_lines(self):35 def test_text_file_with_bom_encoding(self):
36 u = u'ab' * 2048
37 for bom, encoding in bom_encodings:
38 if encoding:
39 s = StringIO(bom + u.encode(encoding))
40 else:
41 s = StringIO(bom + u.encode('utf_8'))
42
43 self.assertEqual(text_file(s).read(), u)
44
45 def test_text_lines(self):
35 lines = ['ab' * 2048]46 lines = ['ab' * 2048]
36 check_text_lines(lines)47 self.assertEqual(text_lines(lines), lines)
37 lines = ['a' * 1023 + '\x00']48 lines = ['a' * 1023 + '\x00']
38 self.assertRaises(BinaryFile, check_text_lines, lines)49 self.assertRaises(BinaryFile, text_lines, lines)
3950
51
52 def test_text_lines_with_bom_encoding(self):
53 u = u'hello'
54 for bom, encoding in bom_encodings:
55 if encoding:
56 lines = [bom + u.encode(encoding)]
57 else:
58 lines = [bom + u.encode('utf_8')]
59
60 self.assertEqual(''.join(text_lines(lines)), u)
4061
41class TextPath(TestCaseInTempDir):62class TextPath(TestCaseInTempDir):
4263
4364
=== modified file 'bzrlib/textfile.py'
--- bzrlib/textfile.py 2009-03-23 14:59:43 +0000
+++ bzrlib/textfile.py 2010-01-09 17:10:28 +0000
@@ -17,30 +17,76 @@
17"""Utilities for distinguishing binary files from text files"""17"""Utilities for distinguishing binary files from text files"""
1818
19from itertools import chain19from itertools import chain
20import codecs
2021
21from bzrlib.errors import BinaryFile22from bzrlib.errors import BinaryFile
22from bzrlib.iterablefile import IterableFile23from bzrlib.iterablefile import IterableFile
23from bzrlib.osutils import file_iterator24from bzrlib.osutils import file_iterator
2425from bzrlib.symbol_versioning import deprecated_function, deprecated_in
26
27
28# Note that the order of this is important. utf_32_le must be tested for before
29# utf_16_le, otherwise utf_16_le will give a false positive for utf_32_le.
30bom_encodings = [
31 (codecs.BOM_UTF32_BE, 'utf_32_be'),
32 (codecs.BOM_UTF32_LE, 'utf_32_le'),
33 (codecs.BOM_UTF16_BE, 'utf_16_be'),
34 (codecs.BOM_UTF16_LE, 'utf_16_le'),
35 (codecs.BOM_UTF8, None), # we don't want to re-decode
36 ]
2537
26def text_file(input):38def text_file(input):
27 """Produce a file iterator that is guaranteed to be text, without seeking.39 """Produce a file iterator that is guaranteed to be text, without seeking.
28 BinaryFile is raised if the file contains a NUL in the first 1024 bytes.40 BinaryFile is raised if the file contains a NUL in the first 1024 bytes.
41 If a unicode BOM is detected, the returned file will be decoded.
29 """42 """
30 first_chunk = input.read(1024)43 first_chunk = input.read(1024)
44 for bom, encoding in bom_encodings:
45 if first_chunk.startswith(bom):
46 first_chunk = first_chunk[len(bom):]
47 if encoding:
48 first_chunk = first_chunk.decode(encoding)
49 break
50 else:
51 encoding = None
52
31 if '\x00' in first_chunk:53 if '\x00' in first_chunk:
32 raise BinaryFile()54 raise BinaryFile()
55
56 if encoding:
57 return IterableFile(chain((first_chunk,),
58 decode_iterator(file_iterator(input), encoding)))
33 return IterableFile(chain((first_chunk,), file_iterator(input)))59 return IterableFile(chain((first_chunk,), file_iterator(input)))
3460
61def decode_iterator(iterator, encoding):
62 for s in iterator:
63 yield s.decode(encoding)
3564
65@deprecated_function(deprecated_in((2, 1, 0)))
36def check_text_lines(lines):66def check_text_lines(lines):
37 """Raise BinaryFile if the supplied lines contain NULs.67 """Raise BinaryFile if the supplied lines contain NULs.
38 Only the first 1024 characters are checked.68 Only the first 1024 characters are checked.
39 """69 """
70 text_lines(lines)
71
72def text_lines(lines):
73 """Raise BinaryFile if the supplied lines contain NULs.
74 Only the first 1024 characters are checked.
75 If a unicode BOM is detected, the returned lines will be decoded.
76 """
77 if lines:
78 for bom, encoding in bom_encodings:
79 if lines[0].startswith(bom):
80 lines[0] = lines[0][len(bom):]
81 if encoding:
82 lines = ''.join(lines).decode(encoding).splitlines()
83 break
84
40 f = IterableFile(lines)85 f = IterableFile(lines)
41 if '\x00' in f.read(1024):86 if '\x00' in f.read(1024):
42 raise BinaryFile()87 raise BinaryFile()
4388
89 return lines
4490
45def check_text_path(path):91def check_text_path(path):
46 """Check whether the supplied path is a text, not binary file.92 """Check whether the supplied path is a text, not binary file.