Merge lp:~gz/bzr/require_unicode_committer_614593 into lp:bzr

Proposed by Martin Packman
Status: Merged
Approved by: John A Meinel
Approved revision: no longer in the source branch.
Merged at revision: 5510
Proposed branch: lp:~gz/bzr/require_unicode_committer_614593
Merge into: lp:bzr
Diff against target: 46 lines (+13/-1)
3 files modified
bzrlib/repository.py (+2/-0)
bzrlib/tests/per_repository/test_commit_builder.py (+10/-0)
bzrlib/tests/test_testament.py (+1/-1)
To merge this branch: bzr merge lp:~gz/bzr/require_unicode_committer_614593
Reviewer Review Type Date Requested Status
Vincent Ladeuil Approve
John A Meinel Pending
Review via email: mp+38334@code.launchpad.net

Commit message

Check committer values are ascii or unicode and fix a test where it was not

Description of the change

Currently commit can be passed a non-ascii str value for committer and the value makes it all the way through to the repository serialisation code where it potentially outputs bogus data. As generally the input will be decoded to unicode already from the command line or config this isn't generally fatal, but is laying a bit of a trap for test and plugin authors to do the wrong thing without realising. With Python 2.7 it happens that this is now caught by the xml escaping function, and a misspelt test started failing.

As well as fixing that test, I've added a guard in commit that raises UnicodeDecodeError for all non-ascii str values. I think that's better than decoding with user_encoding or similar as we don't actually have a good basis for believing a str passed is one encoding or another, so it's better to leave that up to the caller. We could trap the decode error and raise some other flavour of exception instead, but as this will mostly be for bzrlib coders rather than users that's probably no more informative.

To post a comment you must log in.
Revision history for this message
Vincent Ladeuil (vila) wrote :

This sounds fine to me, but I'd like John to confirm that we are indeed expecting Unicode there.

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

If I were doing this, I would probably have done it deeper. Specifically "committer" is just passed into Repository.get_commit_builder(), and the commit builder code has:

        if committer is None:
            self._committer = self._config.username()
        else:
            self._committer = committer

That is probably the point where I would assert that it is properly Unicode. That may be a bit later than you would like (since the commit is pretty much finished then), but it should help prevent injection via some other path.

You could also potentially check in the Revision constructor.

Either way, we are expecting a unicode string here.

Revision history for this message
Martin Packman (gz) wrote :

Thanks John, that's just the kind of feedback I wanted. I'll move the guard further in as you suggest.

Doing that, just realised the test I added doesn't actually fail unless I add a format="dirstate-with-subtree" like the testament tests have. This suggests it should be in... per_repository.test_commit_builder instead maybe?

Revision history for this message
Martin Packman (gz) wrote :

Implemented those changes, but am now worrying over some new things.

Using per_repository seems wasteful as the guard triggers before the repo actually gets touched, so the 22 variations aren't interestingly different.

Also, CommitBuilder has an existing _validate_unicode_text used with the message and rev props I could have used for the committer check... but it doesn't actually do what it says. Which suggests they also could introduce bad bytestrings to the lower levels.

Revision history for this message
John A Meinel (jameinel) wrote :

I'm fine with it being per-repository, since each repository can have a custom CommitBuilder. Even if 99% of them are the same, catching the failure on a new repo format is good.

I would be ok if you wanted to add something like "if not isinstance(text, unicode)" to the _validate check. The code exists because revision property values (vs keys) are meant to be unicode content. And the set of unicode strings has some additional restrictions (not having '\r' in them).

(Just because it doesn't check everything it could *today* doesn't mean we can't add more checks :)

Revision history for this message
Martin Packman (gz) wrote :

Thanks John, I'll land this then and file a followup bug on tightening up the existing _validate methods.

Revision history for this message
Martin Packman (gz) wrote :

sent to pqm by email

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrlib/repository.py'
2--- bzrlib/repository.py 2010-10-15 11:20:45 +0000
3+++ bzrlib/repository.py 2010-10-15 16:41:54 +0000
4@@ -115,6 +115,8 @@
5
6 if committer is None:
7 self._committer = self._config.username()
8+ elif not isinstance(committer, unicode):
9+ self._committer = committer.decode() # throw if non-ascii
10 else:
11 self._committer = committer
12
13
14=== modified file 'bzrlib/tests/per_repository/test_commit_builder.py'
15--- bzrlib/tests/per_repository/test_commit_builder.py 2010-08-10 19:17:49 +0000
16+++ bzrlib/tests/per_repository/test_commit_builder.py 2010-10-15 16:41:54 +0000
17@@ -1288,6 +1288,16 @@
18 self.assertRaises(ValueError, builder.commit,
19 u'Invalid\r\ncommit message\r\n')
20
21+ def test_non_ascii_str_committer_rejected(self):
22+ """Ensure an error is raised on a non-ascii byte string committer"""
23+ branch = self.make_branch('.')
24+ branch.repository.lock_write()
25+ self.addCleanup(branch.repository.unlock)
26+ self.assertRaises(UnicodeDecodeError,
27+ branch.repository.get_commit_builder,
28+ branch, [], branch.get_config(),
29+ committer="Erik B\xe5gfors <erik@example.com>")
30+
31 def test_stacked_repositories_reject_commit_builder(self):
32 # As per bug 375013, committing to stacked repositories is currently
33 # broken, so repositories with fallbacks refuse to hand out a commit
34
35=== modified file 'bzrlib/tests/test_testament.py'
36--- bzrlib/tests/test_testament.py 2009-03-23 14:59:43 +0000
37+++ bzrlib/tests/test_testament.py 2010-10-15 16:41:54 +0000
38@@ -129,7 +129,7 @@
39 timestamp=1129025493,
40 timezone=36000,
41 rev_id='test@user-3',
42- committer='Erik B\xe5gfors <test@user>',
43+ committer=u'Erik B\xe5gfors <test@user>',
44 revprops={'uni':u'\xb5'}
45 )
46 t = self.from_revision(self.b.repository, 'test@user-3')