Merge lp:~garyvdm/bzr/unicode_bom_detect into lp:bzr

unicode_bom_detect
Merge into bzr.dev

Proposed by Gary van der Merwe on 2010-01-09

Status: Work in progress

Proposed branch: lp:~garyvdm/bzr/unicode_bom_detect

Merge into: lp:bzr

Diff against target: 229 lines (+91/-17), 7 files modified

NEWS (+7/-0)
bzrlib/diff.py (+2/-2)
bzrlib/merge.py (+1/-1)
bzrlib/merge3.py (+4/-4)
bzrlib/shelf_ui.py (+2/-2)
bzrlib/tests/test_textfile.py (+27/-6)
bzrlib/textfile.py (+48/-2)

To merge this branch:

bzr merge lp:~garyvdm/bzr/unicode_bom_detect

Related bugs:

#267296 utf16 file detected as binary file (importance: Medium, status: Confirmed)

Reviewer	Date Requested	Status
Martin Packman (community)		Needs Information on 2010-01-11
bzr-core	2010-01-09	Pending
Review via email: mp+17076@code.launchpad.net

Gary van der Merwe (garyvdm) wrote on 2010-01-09:

This makes files with a Unicode BOM correctly decoded and displayed in diff, merge, and shelve.

Martin Packman (gz) wrote on 2010-01-11:

Is this partly inspired by the current thread on python-dev?
<http://mail.python.org/pipermail/python-dev/2010-January/094828.html>

Can you clearly document what the actual heuristic is? It's not clear to me from reading the diff, and it's not the same as the one Notepad uses. Why, for instance, are you not decoding UTF-8 text?

What happens if a UnicodeDecodeError is raised from one of these functions?

review: Needs Information
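
For readers following the two questions above, here is a minimal sketch of the heuristic as it reads from the diff in this proposal: test the first chunk against a fixed, ordered list of BOM prefixes, strip the BOM, and decode only for UTF-16 and UTF-32 (a UTF-8 BOM is stripped but the bytes are left undecoded). It is written for Python 3, whereas bzrlib at the time was Python 2; `sniff` and `decode_or_report` are hypothetical names, not part of bzrlib.

```python
import codecs

# Same order as the table in the proposed bzrlib/textfile.py.
BOM_ENCODINGS = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
    (codecs.BOM_UTF8, None),  # BOM is stripped, bytes are not decoded
]


def sniff(first_chunk):
    """Return (payload, encoding); encoding is None when nothing is decoded."""
    for bom, encoding in BOM_ENCODINGS:
        if first_chunk.startswith(bom):
            return first_chunk[len(bom):], encoding
    return first_chunk, None


def decode_or_report(first_chunk):
    """Decode the chunk if a UTF-16/32 BOM was found, reporting decode errors."""
    payload, encoding = sniff(first_chunk)
    if encoding is None:
        return payload
    try:
        return payload.decode(encoding)
    except UnicodeDecodeError as exc:
        # The diff has no handler like this one; the error would propagate.
        return "undecodable: %s" % exc.reason
```

As far as the diff shows, nothing catches `UnicodeDecodeError`, so input such as a UTF-16 BOM followed by an odd number of bytes would raise out of `text_file` or `text_lines` rather than being reported as `BinaryFile`.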

Aaron Bentley (abentley) wrote on 2010-01-11:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

review: resubmit

This branch introduces confusion between unicode characters and bytes
into our core commands.

Merge writes bytes, not unicode characters, to disk. Decoding the bytes
into unicode characters will mean that characters will be written to
disk. This only works for unicode characters in the ascii range because
python automatically converts unicode strings to ascii. Even though it
appears to work with the characters you tested, the merged version will
not be in utf-16, and that's not a reasonable result.

Similarly, patches are bytestreams, not unicode streams. They have no
defined encoding. It should be possible to apply a patch using
/bin/patch. But again, by decoding the bytes to unicode, this branch
makes that impossible.

A branch like this should, at minimum, test that the operations work
correctly with high-bit characters like u'\u1234', and that the input
encoding is preserved in the output.

Gary van der Merwe wrote:
> Gary van der Merwe has proposed merging lp:~garyvdm/bzr/unicode_bom_detect into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
> Related bugs:
> #267296 utf16 file detected as binary file
> https://bugs.launchpad.net/bugs/267296
>
>
> This makes files with a Unicode BOM correctly decoded and displayed in diff, merge, and shelve.
>
>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktLaaAACgkQ0F+nu1YWqI2O+wCfcrD7i2BXd9g8a2jaWFff67ni
mSoAnjchuECApiMB0+yTgBjRbSaoGfST
=ERY4
-----END PGP SIGNATURE-----
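
The byte-versus-character point in the review above can be illustrated with a short sketch. It assumes Python 3, which has no implicit unicode-to-ASCII conversion, so an explicit `encode("ascii")` stands in for the implicit conversion that Python 2 performed. `reencode` is a hypothetical helper, not something in bzrlib; it shows the kind of round trip a test for "input encoding is preserved in the output" would need to check.

```python
import codecs

# A UTF-16 LE file containing a character outside the ASCII range.
original = codecs.BOM_UTF16_LE + "caf\u1234\n".encode("utf_16_le")

# What the branch does on input: strip the BOM and decode to characters.
text = original[len(codecs.BOM_UTF16_LE):].decode("utf_16_le")

# Writing those characters out via ASCII only works for ASCII-range text.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False


def reencode(text, bom, encoding):
    """Hypothetical helper: restore the BOM and the original encoding."""
    return bom + text.encode(encoding)


round_tripped = reencode(text, codecs.BOM_UTF16_LE, "utf_16_le")
```

Here `ascii_ok` ends up false, while `round_tripped` is byte-for-byte identical to `original`, which is the property the review asks the branch to test.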

Vincent Ladeuil (vila) wrote on 2010-01-19:

Gary, I'm marking this proposal as 'Work in progress'. Keep us informed of your progress, and feel free to ask for help.

Alexander Belchenko (bialix) wrote on 2010-01-19:

> Is this partly inspired by the current thread on python-dev?
> <http://mail.python.org/pipermail/python-dev/2010-January/094828.html>

This is the wrong URL. Do you have a better one?

Martin Packman (gz) wrote on 2010-01-19:

> > Is this partly inspired by the current thread on python-dev?
> > <http://mail.python.org/pipermail/python-dev/2010-January/094828.html>
>
> This is the wrong URL. Do you have a better one?

Something bad seems to have happened to their mailman, there's a bunch of junk in January. Try this instead, but if it breaks, just look through the archives for the thread "Improve open() to support reading file starting with an unicode BOM":
<http://mail.python.org/pipermail/python-dev/2010-January/095550.html>

Alexander Belchenko (bialix) wrote on 2010-01-19:

Martin [gz] wrote:
>>> Is this partly inspired by the current thread on python-dev?
>>> <http://mail.python.org/pipermail/python-dev/2010-January/094828.html>
>> This is the wrong URL. Do you have a better one?
>
> Something bad seems to have happened to their mailman, there's a bunch of junk in January. Try this instead, but if it breaks, just look through the archives for the thread "Improve open() to support reading file starting with an unicode BOM":
> <http://mail.python.org/pipermail/python-dev/2010-January/095550.html>

Thanks.

Unmerged revisions

4952. By Gary van der Merwe on 2010-01-09: Update NEWS, and doc strings.
4951. By Gary van der Merwe on 2010-01-09: Use text_lines rather than check_text_lines.
4950. By Gary van der Merwe on 2010-01-09: Add text_file, which replaces check_text_lines. If a BOM encoding is detected, the lines are decoded, and returned.
4949. By Gary van der Merwe on 2010-01-09: Make text_file handle files with a BOM encoding.
4948. By Gary van der Merwe on 2010-01-09: Test text_file for files with a BOM encoding.
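
As an illustration of what "the lines are decoded, and returned" involves, the sketch below decodes a list of byte lines that begins with a BOM. It assumes Python 3, and `decode_lines` is a hypothetical helper rather than the function in the branch. It differs from the proposed `text_lines` in two ways that are choices of this sketch: it keeps line terminators by calling `splitlines(True)` (the diff calls `splitlines()` with no argument, which drops them), and it does not modify the caller's list.

```python
import codecs


def decode_lines(lines, bom, encoding):
    """Decode byte lines that start with `bom`, keeping line endings.

    Lines split on b"\n" are not individually decodable in UTF-16 or
    UTF-32, so the bytes are joined before decoding and re-split after.
    """
    if not lines or not lines[0].startswith(bom):
        return lines
    data = b"".join(lines)[len(bom):]
    return data.decode(encoding).splitlines(True)


# Byte-level line splitting cuts UTF-16 code units in half.
raw = codecs.BOM_UTF16_LE + "a\nb\n".encode("utf_16_le")
byte_lines = raw.splitlines(True)
decoded = decode_lines(byte_lines, codecs.BOM_UTF16_LE, "utf_16_le")
```

One caveat: `str.splitlines` recognises more line boundaries than a byte-level split on newlines does, so the decoded line list may not line up one-to-one with the input lines.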

Preview Diff

=== modified file 'NEWS'
--- NEWS	2010-01-08 09:27:39 +0000
+++ NEWS	2010-01-09 17:10:28 +0000
@@ -61,6 +61,9 @@
   returns ``EINTR`` by calling ``PyErr_CheckSignals``.  This affected the
   optional ``_readdir_pyx`` extension.  (Andrew Bennetts, #495023)
+* Files with a Unicode BOM are now correctly decoded and displayed in diff,
+  merge, and shelve. (Gary van der Merwe, #267296)
+
 * Fixed a side effect mutation of ``RemoteBzrDirFormat._network_name``
   that caused some tests to fail when run in a non-default order.
   Probably no user impact.  (Martin Pool, #504102)
@@ -120,6 +123,10 @@
   CamelCase. For the features that were more likely to be used, we added a
   deprecation thunk, but not all. (John Arbash Meinel)
+* ``bzrlib.textfile.check_text_lines`` has been deprecated; use
+  ``bzrlib.textfile.text_lines`` instead, which returns the lines, decoded if
+  a unicode BOM was detected.  (Gary van der Merwe)
+
 * The Branch hooks pre_change_branch_tip no longer masks exceptions raised
   by plugins - the original exceptions are now preserved. (Robert Collins)
=== modified file 'bzrlib/diff.py'
--- bzrlib/diff.py	2009-12-03 07:36:17 +0000
+++ bzrlib/diff.py	2010-01-09 17:10:28 +0000
@@ -89,8 +89,8 @@
         return
     if allow_binary is False:
-        textfile.check_text_lines(oldlines)
-        textfile.check_text_lines(newlines)
+        oldlines = textfile.text_lines(oldlines)
+        newlines = textfile.text_lines(newlines)
     if sequence_matcher is None:
         sequence_matcher = patiencediff.PatienceSequenceMatcher
=== modified file 'bzrlib/merge.py'
--- bzrlib/merge.py	2009-12-18 08:22:42 +0000
+++ bzrlib/merge.py	2010-01-09 17:10:28 +0000
@@ -1444,7 +1444,7 @@
         lines = list(lines)
         # Note we're checking whether the OUTPUT is binary in this case,
         # because we don't want to get into weave merge guts.
-        textfile.check_text_lines(lines)
+        lines = textfile.text_lines(lines)
         self.tt.create_file(lines, trans_id)
         if base_lines is not None:
             # Conflict
=== modified file 'bzrlib/merge3.py'
--- bzrlib/merge3.py	2009-03-23 14:59:43 +0000
+++ bzrlib/merge3.py	2010-01-09 17:10:28 +0000
@@ -21,7 +21,7 @@
 from bzrlib.errors import CantReprocessAndShowBase
 import bzrlib.patiencediff
-from bzrlib.textfile import check_text_lines
+from bzrlib.textfile import text_lines
 def intersect(ra, rb):
@@ -67,9 +67,9 @@
     incorporating the changes from both BASE->OTHER and BASE->THIS.
     All three will typically be sequences of lines."""
     def __init__(self, base, a, b, is_cherrypick=False):
-        check_text_lines(base)
-        check_text_lines(a)
-        check_text_lines(b)
+        base = text_lines(base)
+        a = text_lines(a)
+        b = text_lines(b)
         self.base = base
         self.a = a
         self.b = b
=== modified file 'bzrlib/shelf_ui.py'
--- bzrlib/shelf_ui.py	2009-12-15 17:57:26 +0000
+++ bzrlib/shelf_ui.py	2010-01-09 17:10:28 +0000
@@ -331,8 +331,8 @@
             target_lines = work_tree_lines
         else:
             target_lines = self.target_tree.get_file_lines(file_id)
-        textfile.check_text_lines(work_tree_lines)
-        textfile.check_text_lines(target_lines)
+        work_tree_lines = textfile.text_lines(work_tree_lines)
+        target_lines = textfile.text_lines(target_lines)
         parsed = self.get_parsed_patch(file_id, self.reporter.invert_diff)
         final_hunks = []
         if not self.auto:
=== modified file 'bzrlib/tests/test_textfile.py'
--- bzrlib/tests/test_textfile.py	2009-03-23 14:59:43 +0000
+++ bzrlib/tests/test_textfile.py	2010-01-09 17:10:28 +0000
@@ -18,7 +18,8 @@
 from bzrlib.errors import BinaryFile
 from bzrlib.tests import TestCase, TestCaseInTempDir
-from bzrlib.textfile import text_file, check_text_lines, check_text_path
+from bzrlib.textfile import (
+    text_file, text_lines, check_text_lines, check_text_path, bom_encodings)
 class TextFile(TestCase):
@@ -30,13 +31,33 @@
         self.assertRaises(BinaryFile, text_file, s)
         s = StringIO('a' * 1024 + '\x00')
         self.assertEqual(text_file(s).read(), s.getvalue())
-
-    def test_check_text_lines(self):
+
+    def test_text_file_with_bom_encoding(self):
+        u = u'ab' * 2048
+        for bom, encoding in bom_encodings:
+            if encoding:
+                s = StringIO(bom + u.encode(encoding))
+            else:
+                s = StringIO(bom + u.encode('utf_8'))
+
+            self.assertEqual(text_file(s).read(), u)
+
+    def test_text_lines(self):
         lines = ['ab' * 2048]
-        check_text_lines(lines)
+        self.assertEqual(text_lines(lines), lines)
         lines = ['a' * 1023 + '\x00']
-        self.assertRaises(BinaryFile, check_text_lines, lines)
-
+        self.assertRaises(BinaryFile, text_lines, lines)
+
+
+    def test_text_lines_with_bom_encoding(self):
+        u = u'hello'
+        for bom, encoding in bom_encodings:
+            if encoding:
+                lines = [bom + u.encode(encoding)]
+            else:
+                lines = [bom + u.encode('utf_8')]
+
+            self.assertEqual(''.join(text_lines(lines)), u)
 class TextPath(TestCaseInTempDir):
=== modified file 'bzrlib/textfile.py'
--- bzrlib/textfile.py	2009-03-23 14:59:43 +0000
+++ bzrlib/textfile.py	2010-01-09 17:10:28 +0000
@@ -17,30 +17,76 @@
 """Utilities for distinguishing binary files from text files"""
 from itertools import chain
+import codecs
 from bzrlib.errors import BinaryFile
 from bzrlib.iterablefile import IterableFile
 from bzrlib.osutils import file_iterator
-
+from bzrlib.symbol_versioning import deprecated_function, deprecated_in
+
+
+# Note that the order of this is important. utf_32_le must be tested for before
+# utf_16_le, otherwise utf_16_le will give a false positive for utf_32_le.
+bom_encodings = [
+        (codecs.BOM_UTF32_BE, 'utf_32_be'),
+        (codecs.BOM_UTF32_LE, 'utf_32_le'),
+        (codecs.BOM_UTF16_BE, 'utf_16_be'),
+        (codecs.BOM_UTF16_LE, 'utf_16_le'),
+        (codecs.BOM_UTF8, None), # we don't want to re-decode
+    ]
 def text_file(input):
     """Produce a file iterator that is guaranteed to be text, without seeking.
     BinaryFile is raised if the file contains a NUL in the first 1024 bytes.
+    If a unicode BOM is detected, the returned file will be decoded.
     """
     first_chunk = input.read(1024)
+    for bom, encoding in bom_encodings:
+        if first_chunk.startswith(bom):
+            first_chunk = first_chunk[len(bom):]
+            if encoding:
+                first_chunk = first_chunk.decode(encoding)
+            break
+    else:
+        encoding = None
+
     if '\x00' in first_chunk:
         raise BinaryFile()
+
+    if encoding:
+        return IterableFile(chain((first_chunk,),
+            decode_iterator(file_iterator(input), encoding)))
     return IterableFile(chain((first_chunk,), file_iterator(input)))
+def decode_iterator(iterator, encoding):
+    for s in iterator:
+        yield s.decode(encoding)
+@deprecated_function(deprecated_in((2, 1, 0)))
 def check_text_lines(lines):
     """Raise BinaryFile if the supplied lines contain NULs.
     Only the first 1024 characters are checked.
     """
+    text_lines(lines)
+
+def text_lines(lines):
+    """Raise BinaryFile if the supplied lines contain NULs.
+    Only the first 1024 characters are checked.
+    If a unicode BOM is detected, the returned lines will be decoded.
+    """
+    if lines:
+        for bom, encoding in bom_encodings:
+            if lines[0].startswith(bom):
+                lines[0] = lines[0][len(bom):]
+                if encoding:
+                    lines = ''.join(lines).decode(encoding).splitlines()
+                break
+
     f = IterableFile(lines)
     if '\x00' in f.read(1024):
         raise BinaryFile()
-
+
+    return lines
 def check_text_path(path):
     """Check whether the supplied path is a text, not binary file.