Merge lp:~songofacandy/bzr/fix-523746-dev into lp:bzr

Proposed by methane
Status: Merged
Approved by: Martin Pool
Approved revision: no longer in the source branch.
Merged at revision: 5892
Proposed branch: lp:~songofacandy/bzr/fix-523746-dev
Merge into: lp:bzr
Diff against target: 156 lines (+98/-12)
2 files modified
bzrlib/diff.py (+51/-11)
bzrlib/tests/test_diff.py (+47/-1)
To merge this branch: bzr merge lp:~songofacandy/bzr/fix-523746-dev
Reviewer Review Type Date Requested Status
John A Meinel Approve
Vincent Ladeuil Abstain
Alexander Belchenko Pending
Gary van der Merwe Pending
Martin Packman Pending
bzr-core Pending
Review via email: mp+20537@code.launchpad.net

Commit message

better handling of subprocesses with non-ascii encodings and filenames (bug 523746)

To post a comment you must log in.
Revision history for this message
methane (songofacandy) wrote :
Revision history for this message
Vincent Ladeuil (vila) wrote :

@Naoki, can you please use more meaningful branch names
(see Making a Merge Proposal in doc/developers/HACKING.txt) to ease Patch Pilot work ? :-)

Also, it helps a lot if you put an initial comment (we sometimes call it the cover letter)
when you start the merge proposal explaining the intent of the patch.

If you've fixed a bug, you desserve credit for that, so please add a NEWS entry too :)

review: Needs Fixing
Revision history for this message
Vincent Ladeuil (vila) wrote :

Oh, silly me, I mis-read your initial comment as pointing to itself,
where in fact, you just did what I asked in the *other* mp.

Anyway, this illustrates why I find sparse branch names unhelpful, I just didn't
recognize it :)

In fact you made the comments I was waiting for on the other mp:

> This problem depends on OS, locale settings and other environment.
> So I agree with you that this patch should land to bzr.dev and be tested on
> vary environment while bzr 2.2 is released.

Fine, I'll mark the other mp rejected.

So, test against variations on locale settings exist in bzrlib/tests/test_non_ascii.py
and bzrlib/tests/blackbox/test_non_ascii.py

Ask for help if you don't find the existing tests enough to write your own.

review: Needs Fixing
Revision history for this message
Vincent Ladeuil (vila) wrote :

@Naoki, I've added tests in lp:~vila/bzr/523746-difftool-file-names and doing so I think I found a simpler fix, could you test it and give feedback ?

review: Needs Information
Revision history for this message
methane (songofacandy) wrote :

This problem contains many situation.
I think your fix doesn't work sometimes because:

1. osutils.get_user_encoding() is not good in rare case.
user_encoding is not equals to filesystem encoding.

2. path.encode(enc) can raise UnicodeEncodeError.
So bzr should use "encode(enc, 'replace')"

3. _write_file() writes file with unicode path.
But if bzr can't encode the path collectly, external command can't
open the file.
So bzr should encode the path before _write_file().

To test these situation:

* Use fixed encoding in the test.
* Use filenames that contains chars that can't encode with the encoding.
* Replace subprocess.call and check arguments.

--
INADA Naoki <email address hidden>

Revision history for this message
Vincent Ladeuil (vila) wrote :

> This problem contains many situation.
> I think your fix doesn't work

That's why I wanted you to test it :) Did you ? If yes, did it fail and how ?

>sometimes because:
>
> 1. osutils.get_user_encoding() is not good in rare case.
> user_encoding is not equals to filesystem encoding.

My assumption was that subprocess.Popen should receive its arguments as if
they were coming from the command line: in user encoding.

>
> 2. path.encode(enc) can raise UnicodeEncodeError.
> So bzr should use "encode(enc, 'replace')"

That will mean some path can't be encoded in user encoding which in turn means
it could not be user as a parameter to bzr no ?

>
> 3. _write_file() writes file with unicode path.
> But if bzr can't encode the path collectly, external command can't
> open the file.

We rely on python to do the right thing from unicode paths there.

> So bzr should encode the path before _write_file().
>
> To test these situation:
>
> * Use fixed encoding in the test.

That's what the tests are doing for the user encoding.

> * Use filenames that contains chars that can't encode with the encoding.

Hmm, AFAIR, when we encounter problems there it's because some setting is wrong in the environment. As mentioned below, I couldn't automate the tests enough
to reach that point, hence my call for real-life tests :)

> * Replace subprocess.call and check arguments.

I don't want to do that, if the problem is in subprocess I will not be able to detect it.

That's the main point I dislike with the current tests, they don't exercise
subprocess.Popen, but I don't have an external diff tool I can call available
on all platforms.

Revision history for this message
Vincent Ladeuil (vila) wrote :

> We rely on python to do the right thing from unicode paths there.
i.e. encode using 'mbcs' as returned by getsystemfilecoding() on windows
and whatever is appropriate on other systems.

Revision history for this message
methane (songofacandy) wrote :

>> 1. osutils.get_user_encoding() is not good in rare case.
>> user_encoding is not equals to filesystem encoding.
>
>My assumption was that subprocess.Popen should receive its arguments as if
>they were coming from the command line: in user encoding.

OK, I don't know about when get_user_encoding() returns encoding that is not
suitable for commandline arguments.

>> 2. path.encode(enc) can raise UnicodeEncodeError.
>> So bzr should use "encode(enc, 'replace')"
>
>That will mean some path can't be encoded in user encoding which in turn means
>it could not be user as a parameter to bzr no ?

No. bzr uses unicode arguments in Windows. So users can pass unicode filename to bzr.
And "bzr diff --using=winmergeu.exe" doesn't requires file path but shows diff on
such a file.

>> 3. _write_file() writes file with unicode path.
>> But if bzr can't encode the path collectly, external command can't
>> open the file.
>
> We rely on python to do the right thing from unicode paths there.

On windows, unicode path is too perfect.
If unicode path contains chars that is not be able to encode,
we can't use the path in encoded string.
But we can create a file with such a name when using unicode.
So we should replace such a char before _write_file().

Revision history for this message
Vincent Ladeuil (vila) wrote :

>>>>> "Naoki" == INADA Naoki <email address hidden> writes:

    >>> 1. osutils.get_user_encoding() is not good in rare case.
    >>> user_encoding is not equals to filesystem encoding.
    >>
    >> My assumption was that subprocess.Popen should receive its arguments as if
    >> they were coming from the command line: in user encoding.

    Naoki> OK, I don't know about when get_user_encoding() returns encoding that is not
    Naoki> suitable for commandline arguments.

Good.

    >>> 2. path.encode(enc) can raise UnicodeEncodeError.
    >>> So bzr should use "encode(enc, 'replace')"
    >>
    >> That will mean some path can't be encoded in user encoding which
    >> in turn means it could not be user as a parameter to bzr no ?

    Naoki> No. bzr uses unicode arguments in Windows.

No. bzr uses unicode paths internally, they are coming from the command
line and decoded with user encoding or from the bzr db where they are
encoding into utf-8.

    Naoki> So users can pass unicode filename to bzr.

Encoded appropriately yes. They are decoded into unicode and encoded by
bzr or by python when access to the filesystem is needed at which point
the file system encoding is used (which may be different from the user
encoding).

    Naoki> And "bzr diff --using=winmergeu.exe" doesn't requires file
    Naoki> path but shows diff on such a file.

Sure and it may decode them using the *terminal* encoding for the
content of the diff but DiffFromTool has a parameter 'path_encoding' to
control that.

Note that this is still not the encoding we need to use to create any
needed file though.

    >>> 3. _write_file() writes file with unicode path.
    >>> But if bzr can't encode the path collectly, external command can't
    >>> open the file.
    >>
    >> We rely on python to do the right thing from unicode paths there.

    Naoki> On windows, unicode path is too perfect.
    Naoki> If unicode path contains chars that is not be able to encode,
    Naoki> we can't use the path in encoded string.

If we can't use it here, we can't use it either when creating the
working tree, so I'll tend to ignore the problem here.

    Naoki> But we can create a file with such a name when using unicode.
    Naoki> So we should replace such a char before _write_file().

I don't think so.

Can we chat about that on IRC or can you provide me a file name that
breaks my patch (preferably as a python string using the \xnnnn
notation) ?

Revision history for this message
Vincent Ladeuil (vila) wrote :

So it appears that I lack the windows knowledge (and the ability to tests)
to effectively help Naoki here.

The consensus seems to be that the problem is now to either:
- make subprocess.Popen accept unicode arguments (Naoki as a monkey-patch for that
  but I find it a bit scary),
- bypass subprocess.Popen and use a direct win32 API (I'm out of my league on
  this one)

John ? Alexander ? Can you give it a try ?

review: Abstain
Revision history for this message
Vincent Ladeuil (vila) wrote :

> John ? Alexander ? Can you give it a try ?

Or may be Gary since you're working in this area right now and have some way of testing it on windows ?

Revision history for this message
Alexander Belchenko (bialix) wrote :

> So it appears that I lack the windows knowledge (and the ability to tests)
> to effectively help Naoki here.
>
> The consensus seems to be that the problem is now to either:
> - make subprocess.Popen accept unicode arguments (Naoki as a monkey-patch for
> that
> but I find it a bit scary),
> - bypass subprocess.Popen and use a direct win32 API (I'm out of my league on
> this one)
>
> John ? Alexander ? Can you give it a try ?

Sorry, I'm a bit busy these days, have lost this ping. I can't promise I will look into it soon. But I'll try.

Revision history for this message
Gary van der Merwe (garyvdm) wrote :

Vincent Ladeuil wrote on 2010-04-21 :
> That's the main point I dislike with the current tests, they don't exercise
> subprocess.Popen, but I don't have an external diff tool I can call available
> on all platforms.

I've fixed that in this branch: lp:~garyvdm/bzr/523746-difftool-file-names

Those tests fail in wine for both Vincent's and INADA's branches.

- new_abs_path = self._prepare_files(file_id, old_path, new_path,
- allow_write_new=True,
- force_temp=True)[1]
- command = self._get_command(osutils.pathjoin('old', old_path),
- osutils.pathjoin('new', new_path))
+ old_abs_path, new_abs_path = self._prepare_files(
+ file_id, old_path, new_path,
+ allow_write_new=True,
+ force_temp=True)
+ command = self._get_command(old_abs_path, new_abs_path)

This is a nice simple change. Maybe we should merge this straight away, separately from the other changes.

- new_file = open(new_abs_path, 'r')
+ new_file = open(new_abs_path, 'rb')

Why is this changed?

Revision history for this message
methane (songofacandy) wrote :

> Vincent Ladeuil wrote on 2010-04-21 :
> > That's the main point I dislike with the current tests, they don't exercise
> > subprocess.Popen, but I don't have an external diff tool I can call
> available
> > on all platforms.
>
> I've fixed that in this branch: lp:~garyvdm/bzr/523746-difftool-file-names
>
> Those tests fail in wine for both Vincent's and INADA's branches.

I'll see the branch later.

> - new_abs_path = self._prepare_files(file_id, old_path, new_path,
> - allow_write_new=True,
> - force_temp=True)[1]
> - command = self._get_command(osutils.pathjoin('old', old_path),
> - osutils.pathjoin('new', new_path))
> + old_abs_path, new_abs_path = self._prepare_files(
> + file_id, old_path, new_path,
> + allow_write_new=True,
> + force_temp=True)
> + command = self._get_command(old_abs_path, new_abs_path)
>
> This is a nice simple change. Maybe we should merge this straight away,
> separately from the other changes.

I think so.

> - new_file = open(new_abs_path, 'r')
> + new_file = open(new_abs_path, 'rb')
>
> Why is this changed?

This change is made because I think bzr should use 'binary' mode in this point.
External diff tool may have edit feature and user may fix line ending. So bzr should keep
EOL bytes.

But, this change is not relating to #523746.

Revision history for this message
methane (songofacandy) wrote :

I'm ashamed to leave this so long.

> > Vincent Ladeuil wrote on 2010-04-21 :
> > > That's the main point I dislike with the current tests, they don't
> exercise
> > > subprocess.Popen, but I don't have an external diff tool I can call
> > available
> > > on all platforms.
> >
> > I've fixed that in this branch: lp:~garyvdm/bzr/523746-difftool-file-names
> >
> > Those tests fail in wine for both Vincent's and INADA's branches.
>
> I'll see the branch later.

Your test fails because my branch changes filename for it safely write
to the filename and pass the filename to subprocess.

I split filename creation from DiffFromTool._write_file and wrote test for it.
Please see it.

I've wrote test executing DiffFromTool but it doesn't tests real problem
because overriding sys.getfilesystemencoding() doesn't changes
Py_FileSystemDefaultEncoding in CPython core.

Revision history for this message
Vincent Ladeuil (vila) wrote :

Thanks for coming back songofacandy (If you explain me one last time which of INADA or Naoki I should use, I swear I won't forget anymore :-}) !

Once again, I'd like to defer to people with a better windows knowledge than myself in this specific area, I'll nominate explicit reviewers, please, one of you, review ! :) Or if you can't just abstain ;)

Revision history for this message
Vincent Ladeuil (vila) wrote :

My understanding was that we couldn't arbitrarily try to encode/decode both paths and other arguments. I may be right in general and wrong in this particular case where mbcs may be the silver bullet. That's why I'm deferring to people knowing better than me.

Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (3.2 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 05/06/2011 03:29 AM, INADA Naoki wrote:
> INADA Naoki has proposed merging lp:~songofacandy/bzr/fix-523746-dev into lp:bzr.
>
> Requested reviews:
> Vincent Ladeuil (vila)
> bzr-core (bzr-core)
> Related bugs:
> Bug #523746 in Bazaar: "crash with winmerge in cp932 japanese character"
> https://bugs.launchpad.net/bzr/+bug/523746
>
> For more details, see:
> https://code.launchpad.net/~songofacandy/bzr/fix-523746-dev/+merge/20537

+ def _safe_filename(self, prefix, relpath):
+ if sys.platform == 'win32':
+ fenc = 'mbcs'
+ else:
+ fenc = sys.getfilesystemencoding() or 'ascii'
+ # encoded_str.replace('?', '_') may break multibyte char.
+ # So we should encode, decode, then replace(u'?', u'_')
+ relpath_tmp = relpath.encode(fenc, 'replace').decode(fenc,
'replace')
+ relpath_tmp = relpath_tmp.replace(u'?', u'_')
+ return osutils.pathjoin(self._root, prefix, relpath_tmp)

Why are we using "encode('replace')" rather than "encode('ignore')". It
doesn't seem to buy us much to be putting "_" into the filename, rather
than just writing out whatever letters we can.

I also don't understand why you do the encode/decode dance and the put a
munged Unicode string back into the path. I think I sort of understand,
but a docstring would probably help a lot at understanding why we need
to use this function.

...

+ command = [AtTemplate(t).substitute(my_map) for t in
+ self.command_template]
+ if sys.platform == 'win32': # Popen doesn't accept unicode on win32
+ command_encoded = []
+ for c in command:
+ if isinstance(c, unicode):
+ command_encoded.append(c.encode('mbcs'))
+ else:
+ command_encoded.append(c)
+ return command_encoded
+ else:
+ return command

^- I'm not sure whether c.encode('mbcs') can fail, or whether it will
return '?' for unusable characters. For example, I can create Arabic or
Japanese filenames (since the filesystem is UTF-16) even though 'mbcs'
is only able to handle ~latin-1. Any thoughts on how to handle that
cleanly? We don't have to change Popen to support Unicode (though that
would be ideal), but at least something that would give users helpful
feedback, rather than just tracebacks.

+ def test_encodable_filename(self):
+ import sys
+ diffobj = diff.DiffFromTool(['python', '@old_path', '@new_path'],
+ None, None, None)

On *my* system, there is no 'python' (I never add it to sys.path, as I
happen to have 2.4/5/6/7 all installed.) There is 'sys.executable', though.

diffobj = diff.DiffFromTool([sys.executable, ...])

Also, we probably want to skip this test if "sys.frozen" is set, since I
don't think we can run it from 'bzr.exe'. I think we already had a
feature for that.

John
=:->

 review: needsfixing

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk3EA3oACgkQJdeBCYSNAAMYNgCeJIIEWRF1Dg/OLWQ5kRDCI6EP
W0cAoID...

Read more...

review: Needs Fixing
Revision history for this message
methane (songofacandy) wrote :
Download full text (3.6 KiB)

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 05/06/2011 03:29 AM, INADA Naoki wrote:
> > INADA Naoki has proposed merging lp:~songofacandy/bzr/fix-523746-dev into
> lp:bzr.
> >
> > Requested reviews:
> > Vincent Ladeuil (vila)
> > bzr-core (bzr-core)
> > Related bugs:
> > Bug #523746 in Bazaar: "crash with winmerge in cp932 japanese character"
> > https://bugs.launchpad.net/bzr/+bug/523746
> >
> > For more details, see:
> > https://code.launchpad.net/~songofacandy/bzr/fix-523746-dev/+merge/20537
>
>
> + def _safe_filename(self, prefix, relpath):
> + if sys.platform == 'win32':
> + fenc = 'mbcs'
> + else:
> + fenc = sys.getfilesystemencoding() or 'ascii'
> + # encoded_str.replace('?', '_') may break multibyte char.
> + # So we should encode, decode, then replace(u'?', u'_')
> + relpath_tmp = relpath.encode(fenc, 'replace').decode(fenc,
> 'replace')
> + relpath_tmp = relpath_tmp.replace(u'?', u'_')
> + return osutils.pathjoin(self._root, prefix, relpath_tmp)
>
> Why are we using "encode('replace')" rather than "encode('ignore')". It
> doesn't seem to buy us much to be putting "_" into the filename, rather
> than just writing out whatever letters we can.

Because "encode('ignore')" may makes null string.
Fallback to other string is also OK, but I don't have nice name for fallback.

> I also don't understand why you do the encode/decode dance and the put a
> munged Unicode string back into the path. I think I sort of understand,
> but a docstring would probably help a lot at understanding why we need
> to use this function.

OK, I'll add docstrings.

>
> ...
>
> + command = [AtTemplate(t).substitute(my_map) for t in
> + self.command_template]
> + if sys.platform == 'win32': # Popen doesn't accept unicode on win32
> + command_encoded = []
> + for c in command:
> + if isinstance(c, unicode):
> + command_encoded.append(c.encode('mbcs'))
> + else:
> + command_encoded.append(c)
> + return command_encoded
> + else:
> + return command
>
> ^- I'm not sure whether c.encode('mbcs') can fail, or whether it will
> return '?' for unusable characters. For example, I can create Arabic or
> Japanese filenames (since the filesystem is UTF-16) even though 'mbcs'
> is only able to handle ~latin-1. Any thoughts on how to handle that
> cleanly? We don't have to change Popen to support Unicode (though that
> would be ideal), but at least something that would give users helpful
> feedback, rather than just tracebacks.

c.encode('mbcs') may fail on very rare circumstance. For example, path of
diff tool or temporary directory is not be able to encode with 'mbcs'.
I don't have any idea what should we do on that circumstance.

But filepath on commandline should not be fail decoding unlsess
decode('mbcs').encode('mbcs') can be fail.

> + def test_encodable_filename(self):
> + import sys
> + diffobj = diff.DiffFromTool(['python', '@old_path', '@new_path'],
> + None, None, No...

Read more...

Revision history for this message
methane (songofacandy) wrote :

Done

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 5/6/2011 8:28 PM, INADA Naoki wrote:
> INADA Naoki has proposed merging lp:~songofacandy/bzr/fix-523746-dev into lp:bzr.
>
> Requested reviews:
> John A Meinel (jameinel)
> Alexander Belchenko (bialix)
> Gary van der Merwe (garyvdm)
> Martin [gz] (gz)
> Vincent Ladeuil (vila)
> bzr-core (bzr-core)
> Related bugs:
> Bug #523746 in Bazaar: "crash with winmerge in cp932 japanese character"
> https://bugs.launchpad.net/bzr/+bug/523746
>
> For more details, see:
> https://code.launchpad.net/~songofacandy/bzr/fix-523746-dev/+merge/20537

Can you give a little bit more background in your merge proposal? I know
there is a bug linked, but describing what you are doing makes it easier
to understand the diff.

 review: needsinfo

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk3Liw0ACgkQJdeBCYSNAAOxzACguDKS9ztLjW6NJ54ii0fWIffL
dgkAoJ8kAv8ws8Atxw+O4d4o3ke54TMa
=u/ie
-----END PGP SIGNATURE-----

review: Needs Information
Revision history for this message
methane (songofacandy) wrote :

OK.

1. On Windows.
Some diff application doesn't receive command line argument as a UTF-16 string
but as a string encoded with current codepage.
In this situation, passing Unicode argument to child process may fail
when the file path
contains character that is not be able to encode with current codepage.
So, we should use only chars those are able to encode with current codepage as a
temporary file.

2. On Unix.
All application may receive commandline argument as just bytes. So we can pass
any path as a argument.
But, GUI diff tool may be not able to show such filepath. So we should
use string
that is able to encode with current locale as a temporary file, too.

My fix removes chars those are not able to encode with current codepage/locale.
And force to use temporary file when checkouted file contains such chars as a
path.

On Thu, May 12, 2011 at 4:23 PM, John Arbash Meinel
<email address hidden> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 5/6/2011 8:28 PM, INADA Naoki wrote:
>> INADA Naoki has proposed merging lp:~songofacandy/bzr/fix-523746-dev into lp:bzr.
>>
>> Requested reviews:
>>   John A Meinel (jameinel)
>>   Alexander Belchenko (bialix)
>>   Gary van der Merwe (garyvdm)
>>   Martin [gz] (gz)
>>   Vincent Ladeuil (vila)
>>   bzr-core (bzr-core)
>> Related bugs:
>>   Bug #523746 in Bazaar: "crash with winmerge in cp932 japanese character"
>>   https://bugs.launchpad.net/bzr/+bug/523746
>>
>> For more details, see:
>> https://code.launchpad.net/~songofacandy/bzr/fix-523746-dev/+merge/20537
>
> Can you give a little bit more background in your merge proposal? I know
> there is a bug linked, but describing what you are doing makes it easier
> to understand the diff.
>
>  review: needsinfo
>
> John
> =:->
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.9 (Cygwin)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk3Liw0ACgkQJdeBCYSNAAOxzACguDKS9ztLjW6NJ54ii0fWIffL
> dgkAoJ8kAv8ws8Atxw+O4d4o3ke54TMa
> =u/ie
> -----END PGP SIGNATURE-----
>

--
INADA Naoki  <email address hidden>

Revision history for this message
John A Meinel (jameinel) wrote :

Seems fine to me. I don't really like the encode().decode() stuff, but I'll live with it.

review: Approve
Revision history for this message
Martin Pool (mbp) wrote :

sent to pqm by email

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrlib/diff.py'
2--- bzrlib/diff.py 2011-04-07 10:36:24 +0000
3+++ bzrlib/diff.py 2011-05-06 17:43:35 +0000
4@@ -745,8 +745,18 @@
5
6 def _get_command(self, old_path, new_path):
7 my_map = {'old_path': old_path, 'new_path': new_path}
8- return [AtTemplate(t).substitute(my_map) for t in
9- self.command_template]
10+ command = [AtTemplate(t).substitute(my_map) for t in
11+ self.command_template]
12+ if sys.platform == 'win32': # Popen doesn't accept unicode on win32
13+ command_encoded = []
14+ for c in command:
15+ if isinstance(c, unicode):
16+ command_encoded.append(c.encode('mbcs'))
17+ else:
18+ command_encoded.append(c)
19+ return command_encoded
20+ else:
21+ return command
22
23 def _execute(self, old_path, new_path):
24 command = self._get_command(old_path, new_path)
25@@ -772,12 +782,42 @@
26 raise
27 return True
28
29+ @staticmethod
30+ def _fenc():
31+ """Returns safe encoding for passing file path to diff tool"""
32+ if sys.platform == 'win32':
33+ return 'mbcs'
34+ else:
35+ # Don't fallback to 'utf-8' because subprocess may not be able to
36+ # handle utf-8 correctly when locale is not utf-8.
37+ return sys.getfilesystemencoding() or 'ascii'
38+
39+ def _is_safepath(self, path):
40+ """Return true if `path` may be able to pass to subprocess."""
41+ fenc = self._fenc()
42+ try:
43+ return path == path.encode(fenc).decode(fenc)
44+ except UnicodeError:
45+ return False
46+
47+ def _safe_filename(self, prefix, relpath):
48+ """Replace unsafe character in `relpath` then join `self._root`,
49+ `prefix` and `relpath`."""
50+ fenc = self._fenc()
51+ # encoded_str.replace('?', '_') may break multibyte char.
52+ # So we should encode, decode, then replace(u'?', u'_')
53+ relpath_tmp = relpath.encode(fenc, 'replace').decode(fenc, 'replace')
54+ relpath_tmp = relpath_tmp.replace(u'?', u'_')
55+ return osutils.pathjoin(self._root, prefix, relpath_tmp)
56+
57 def _write_file(self, file_id, tree, prefix, relpath, force_temp=False,
58 allow_write=False):
59 if not force_temp and isinstance(tree, WorkingTree):
60- return tree.abspath(tree.id2path(file_id))
61-
62- full_path = osutils.pathjoin(self._root, prefix, relpath)
63+ full_path = tree.abspath(tree.id2path(file_id))
64+ if self._is_safepath(full_path):
65+ return full_path
66+
67+ full_path = self._safe_filename(prefix, relpath)
68 if not force_temp and self._try_symlink_root(tree, prefix):
69 return full_path
70 parent_dir = osutils.dirname(full_path)
71@@ -840,13 +880,13 @@
72 """
73 old_path = self.old_tree.id2path(file_id)
74 new_path = self.new_tree.id2path(file_id)
75- new_abs_path = self._prepare_files(file_id, old_path, new_path,
76- allow_write_new=True,
77- force_temp=True)[1]
78- command = self._get_command(osutils.pathjoin('old', old_path),
79- osutils.pathjoin('new', new_path))
80+ old_abs_path, new_abs_path = self._prepare_files(
81+ file_id, old_path, new_path,
82+ allow_write_new=True,
83+ force_temp=True)
84+ command = self._get_command(old_abs_path, new_abs_path)
85 subprocess.call(command, cwd=self._root)
86- new_file = open(new_abs_path, 'r')
87+ new_file = open(new_abs_path, 'rb')
88 try:
89 return new_file.read()
90 finally:
91
92=== modified file 'bzrlib/tests/test_diff.py'
93--- bzrlib/tests/test_diff.py 2011-04-18 23:19:16 +0000
94+++ bzrlib/tests/test_diff.py 2011-05-06 17:43:35 +0000
95@@ -33,7 +33,7 @@
96 transform,
97 )
98 from bzrlib.symbol_versioning import deprecated_in
99-from bzrlib.tests import features
100+from bzrlib.tests import features, EncodingAdapter
101 from bzrlib.tests.blackbox.test_diff import subst_dates
102
103
104@@ -1421,6 +1421,52 @@
105 diff_obj._prepare_files('file2-id', 'oldname2', 'newname2')
106
107
108+class TestDiffFromToolEncodedFilename(tests.TestCaseWithTransport):
109+
110+ def test_encodable_filename(self):
111+ # Just checks file path for external diff tool.
112+ # We cannot change CPython's internal encoding used by os.exec*.
113+ import sys
114+ diffobj = diff.DiffFromTool(['dummy', '@old_path', '@new_path'],
115+ None, None, None)
116+ for _, scenario in EncodingAdapter.encoding_scenarios:
117+ encoding = scenario['encoding']
118+ dirname = scenario['info']['directory']
119+ filename = scenario['info']['filename']
120+
121+ self.overrideAttr(diffobj, '_fenc', lambda: encoding)
122+ relpath = dirname + u'/' + filename
123+ fullpath = diffobj._safe_filename('safe', relpath)
124+ self.assertEqual(
125+ fullpath,
126+ fullpath.encode(encoding).decode(encoding)
127+ )
128+ self.assert_(fullpath.startswith(diffobj._root + '/safe'))
129+
130+ def test_unencodable_filename(self):
131+ import sys
132+ diffobj = diff.DiffFromTool(['dummy', '@old_path', '@new_path'],
133+ None, None, None)
134+ for _, scenario in EncodingAdapter.encoding_scenarios:
135+ encoding = scenario['encoding']
136+ dirname = scenario['info']['directory']
137+ filename = scenario['info']['filename']
138+
139+ if encoding == 'iso-8859-1':
140+ encoding = 'iso-8859-2'
141+ else:
142+ encoding = 'iso-8859-1'
143+
144+ self.overrideAttr(diffobj, '_fenc', lambda: encoding)
145+ relpath = dirname + u'/' + filename
146+ fullpath = diffobj._safe_filename('safe', relpath)
147+ self.assertEqual(
148+ fullpath,
149+ fullpath.encode(encoding).decode(encoding)
150+ )
151+ self.assert_(fullpath.startswith(diffobj._root + '/safe'))
152+
153+
154 class TestGetTreesAndBranchesToDiffLocked(tests.TestCaseWithTransport):
155
156 def call_gtabtd(self, path_list, revision_specs, old_url, new_url):