Merge lp:~spiv/bzr/insert-stream-check-chk-root into lp:bzr/2.0

Proposed by Andrew Bennetts
Status: Merged
Merged at revision: not available
Proposed branch: lp:~spiv/bzr/insert-stream-check-chk-root
Merge into: lp:bzr/2.0
Diff against target: 408 lines
To merge this branch: bzr merge lp:~spiv/bzr/insert-stream-check-chk-root
Reviewer: Robert Collins (community)
Review status: Approve
Review via email: mp+11033@code.launchpad.net
Andrew Bennetts (spiv) wrote:

This is a partial fix for bug 406687 and an incremental step towards a full fix. It adds a check to RepositoryPackCollection._commit_write_group that verifies that every revision added by the write group has a corresponding inventory, and that the chk root entries are present for each of those inventories and for any of their parent inventories that are present (presence in a fallback repository does not count). Ghost parent inventories are still allowed. This check is sufficient to catch at least some bugs we have had in the past where, in some cases, no chk records were transferred at all.

(The full fix, which I am working on, will involve checking that all relevant chk records are present, and that the text versions named by those relevant records are also present.)

I experimented with not requiring chk root records in a stacked repository for inventories that have not changed, because in principle they aren't required (comparing the key names is enough to show that the delta between those revisions is empty). However, bzr currently fails on such a repository, so allowing this would permit data incompatible with e.g. 2.0rc1, which I think is undesirable.

I am a little worried about the possible performance impact of this change. I don't think the code is particularly wasteful (except perhaps in holding collections of keys that are larger or longer-lived than strictly necessary), but it is fundamentally more work to perform on every 2a repository write.
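The key-tracking idea behind this branch can be sketched as follows. This is a simplified, hypothetical stand-in, not bzrlib's actual _KeyRefs (the class name KeyRefs and the exact method set here are assumptions): a tracker that records which referenced keys are still unsatisfied and, when asked, also remembers every new key it has seen, mirroring the track_new_keys flag added in the diff below.

```python
class KeyRefs(object):
    """Simplified sketch of reference tracking with optional new-key
    tracking.  Hypothetical illustration, not the real bzrlib class."""

    def __init__(self, track_new_keys=False):
        # Maps a referenced key -> set of keys that refer to it and whose
        # reference has not yet been satisfied by the key's arrival.
        self.refs = {}
        # Optionally remember every key added during this write group.
        self.new_keys = set() if track_new_keys else None

    def add_references(self, key, refs):
        # Record that `key` refers to each of `refs`...
        for referenced in refs:
            self.refs.setdefault(referenced, set()).add(key)
        # ...and that `key` itself is now present.
        self.add_key(key)

    def add_key(self, key):
        # A present key satisfies any outstanding references to it.
        self.refs.pop(key, None)
        if self.new_keys is not None:
            self.new_keys.add(key)

    def get_new_keys(self):
        return self.new_keys

    def get_unsatisfied_refs(self):
        return self.refs.keys()
```

A write-group check can then ask get_new_keys() for the revisions added in this group, and get_unsatisfied_refs() for parents that are still missing.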

Robert Collins (lifeless) wrote:

tweak

I think it would be clearer to use method overriding to handle pack vs. chk rather than a check on self in the pack code: the bulk of the method is chk-specific.

[though pack repos want their inventories checked too, so that suggests either duplicated or missing coverage]

review: Approve
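Robert's suggestion, dispatching via method overriding instead of an attribute check on self, could look roughly like this. All class names and helpers here are hypothetical illustrations (not bzrlib code): the base collection's hook does nothing, and only the chk-capable subclass overrides it with the real validation.

```python
class PackCollection(object):
    """Hypothetical sketch of the suggested structure: the base class
    performs no chk-specific validation in its commit hook."""

    def _check_new_inventories(self):
        # Non-chk pack repositories: nothing extra to verify here.
        return set()

    def _commit_write_group(self):
        # Commit fails if the subclass-specific check reports missing keys.
        missing = self._check_new_inventories()
        if missing:
            raise ValueError('missing keys: %r' % sorted(missing))
        return 'committed'


class CHKPackCollection(PackCollection):
    """chk-capable collection: overrides the hook with the real check."""

    def __init__(self, missing_chk_roots=()):
        # Stand-in for the real inventory/chk_bytes index lookups.
        self._missing_chk_roots = set(missing_chk_roots)

    def _check_new_inventories(self):
        return set(('chk_bytes',) + key for key in self._missing_chk_roots)
```

The benefit of this shape is that the pack code never needs to know whether chk_bytes exists; each format subclass supplies exactly the checks it needs.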

Preview Diff

=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py 2009-08-26 16:47:51 +0000
+++ bzrlib/groupcompress.py 2009-09-02 03:35:25 +0000
@@ -1714,7 +1714,7 @@
 
     def __init__(self, graph_index, is_locked, parents=True,
                  add_callback=None, track_external_parent_refs=False,
-                 inconsistency_fatal=True):
+                 inconsistency_fatal=True, track_new_keys=False):
         """Construct a _GCGraphIndex on a graph_index.
 
         :param graph_index: An implementation of bzrlib.index.GraphIndex.
@@ -1740,7 +1740,8 @@
         self._is_locked = is_locked
         self._inconsistency_fatal = inconsistency_fatal
         if track_external_parent_refs:
-            self._key_dependencies = knit._KeyRefs()
+            self._key_dependencies = knit._KeyRefs(
+                track_new_keys=track_new_keys)
         else:
             self._key_dependencies = None
 
@@ -1800,10 +1801,14 @@
             result.append((key, value))
             records = result
         key_dependencies = self._key_dependencies
-        if key_dependencies is not None and self._parents:
-            for key, value, refs in records:
-                parents = refs[0]
-                key_dependencies.add_references(key, parents)
+        if key_dependencies is not None:
+            if self._parents:
+                for key, value, refs in records:
+                    parents = refs[0]
+                    key_dependencies.add_references(key, parents)
+            else:
+                for key, value, refs in records:
+                    key_dependencies.add_key(key)
         self._add_callback(records)
 
     def _check_read(self):
@@ -1866,7 +1871,7 @@
         """Return the keys of missing parents."""
         # Copied from _KnitGraphIndex.get_missing_parents
         # We may have false positives, so filter those out.
-        self._key_dependencies.add_keys(
+        self._key_dependencies.satisfy_refs_for_keys(
            self.get_parent_map(self._key_dependencies.get_unsatisfied_refs()))
         return frozenset(self._key_dependencies.get_unsatisfied_refs())
 
@@ -1926,17 +1931,17 @@
 
         This allows this _GCGraphIndex to keep track of any missing
         compression parents we may want to have filled in to make those
-        indices valid.
+        indices valid. It also allows _GCGraphIndex to track any new keys.
 
         :param graph_index: A GraphIndex
         """
-        if self._key_dependencies is not None:
-            # Add parent refs from graph_index (and discard parent refs that
-            # the graph_index has).
-            add_refs = self._key_dependencies.add_references
-            for node in graph_index.iter_all_entries():
-                add_refs(node[1], node[3][0])
+        key_dependencies = self._key_dependencies
+        if key_dependencies is None:
+            return
+        for node in graph_index.iter_all_entries():
+            # Add parent refs from graph_index (and discard parent refs
+            # that the graph_index has).
+            key_dependencies.add_references(node[1], node[3][0])
 
 
 from bzrlib._groupcompress_py import (
=== modified file 'bzrlib/knit.py'
--- bzrlib/knit.py 2009-08-26 16:44:27 +0000
+++ bzrlib/knit.py 2009-09-02 03:35:25 +0000
@@ -2777,9 +2777,19 @@
 
 class _KeyRefs(object):
 
-    def __init__(self):
+    def __init__(self, track_new_keys=False):
         # dict mapping 'key' to 'set of keys referring to that key'
         self.refs = {}
+        if track_new_keys:
+            self.new_keys = set()
+        else:
+            self.new_keys = None
+
+    def clear(self):
+        if self.refs:
+            self.refs.clear()
+        if self.new_keys:
+            self.new_keys.clear()
 
     def add_references(self, key, refs):
         # Record the new references
@@ -2792,19 +2802,28 @@
         # Discard references satisfied by the new key
         self.add_key(key)
 
+    def get_new_keys(self):
+        return self.new_keys
+
     def get_unsatisfied_refs(self):
         return self.refs.iterkeys()
 
-    def add_key(self, key):
+    def _satisfy_refs_for_key(self, key):
         try:
             del self.refs[key]
         except KeyError:
             # No keys depended on this key. That's ok.
             pass
 
-    def add_keys(self, keys):
+    def add_key(self, key):
+        # satisfy refs for key, and remember that we've seen this key.
+        self._satisfy_refs_for_key(key)
+        if self.new_keys is not None:
+            self.new_keys.add(key)
+
+    def satisfy_refs_for_keys(self, keys):
         for key in keys:
-            self.add_key(key)
+            self._satisfy_refs_for_key(key)
 
     def get_referrers(self):
         result = set()
@@ -2972,7 +2991,7 @@
         # If updating this, you should also update
         # groupcompress._GCGraphIndex.get_missing_parents
         # We may have false positives, so filter those out.
-        self._key_dependencies.add_keys(
+        self._key_dependencies.satisfy_refs_for_keys(
            self.get_parent_map(self._key_dependencies.get_unsatisfied_refs()))
         return frozenset(self._key_dependencies.get_unsatisfied_refs())
 
 
=== modified file 'bzrlib/repofmt/groupcompress_repo.py'
--- bzrlib/repofmt/groupcompress_repo.py 2009-08-24 19:34:13 +0000
+++ bzrlib/repofmt/groupcompress_repo.py 2009-09-02 03:35:25 +0000
@@ -651,7 +651,7 @@
             _GCGraphIndex(self._pack_collection.revision_index.combined_index,
                 add_callback=self._pack_collection.revision_index.add_callback,
                 parents=True, is_locked=self.is_locked,
-                track_external_parent_refs=True),
+                track_external_parent_refs=True, track_new_keys=True),
             access=self._pack_collection.revision_index.data_access,
             delta=False)
         self.signatures = GroupCompressVersionedFiles(
 
=== modified file 'bzrlib/repofmt/pack_repo.py'
--- bzrlib/repofmt/pack_repo.py 2009-08-14 11:11:29 +0000
+++ bzrlib/repofmt/pack_repo.py 2009-09-02 03:35:25 +0000
@@ -2063,6 +2063,53 @@
             self._remove_pack_indices(resumed_pack)
         del self._resumed_packs[:]
 
+    def _check_new_inventories(self):
+        """Detect missing inventories or chk root entries for the new
+        revisions in this write group.
+
+        :returns: set of missing keys. Note that not every missing key is
+            guaranteed to be reported.
+        """
+        if getattr(self.repo, 'chk_bytes', None) is None:
+            return set()
+        # Ensure that all revisions added in this write group have:
+        #   - corresponding inventories,
+        #   - chk root entries for those inventories,
+        #   - and any present parent inventories have their chk root
+        #     entries too.
+        # And all this should be independent of any fallback repository.
+        key_deps = self.repo.revisions._index._key_dependencies
+        new_revisions_keys = key_deps.get_new_keys()
+        no_fallback_inv_index = self.repo.inventories._index
+        no_fallback_chk_bytes_index = self.repo.chk_bytes._index
+        inv_parent_map = no_fallback_inv_index.get_parent_map(
+            new_revisions_keys)
+        # Are any inventories corresponding to the new revisions missing?
+        corresponding_invs = set(inv_parent_map)
+        missing_corresponding = set(new_revisions_keys)
+        missing_corresponding.difference_update(corresponding_invs)
+        if missing_corresponding:
+            return [('inventories', key) for key in missing_corresponding]
+        # Are any chk root entries missing for any inventories? This includes
+        # any present parent inventories, which may be used when calculating
+        # deltas for streaming.
+        all_inv_keys = set(corresponding_invs)
+        for parent_inv_keys in inv_parent_map.itervalues():
+            all_inv_keys.update(parent_inv_keys)
+        # Filter out ghost parents.
+        all_inv_keys.intersection_update(
+            no_fallback_inv_index.get_parent_map(all_inv_keys))
+        all_missing = set()
+        inv_ids = [key[-1] for key in all_inv_keys]
+        for inv in self.repo.iter_inventories(inv_ids, 'unordered'):
+            root_keys = set([inv.id_to_entry.key()])
+            if inv.parent_id_basename_to_file_id is not None:
+                root_keys.add(inv.parent_id_basename_to_file_id.key())
+            present = no_fallback_chk_bytes_index.get_parent_map(root_keys)
+            missing = root_keys.difference(present)
+            all_missing.update([('chk_bytes',) + key for key in missing])
+        return all_missing
+
     def _commit_write_group(self):
         all_missing = set()
         for prefix, versioned_file in (
@@ -2073,6 +2120,7 @@
             ):
             missing = versioned_file.get_missing_compression_parent_keys()
             all_missing.update([(prefix,) + key for key in missing])
+        all_missing.update(self._check_new_inventories())
         if all_missing:
             raise errors.BzrCheckError(
                 "Repository %s has missing compression parent(s) %r "
@@ -2222,7 +2270,7 @@
                 % (self._format, self.bzrdir.transport.base))
 
     def _abort_write_group(self):
-        self.revisions._index._key_dependencies.refs.clear()
+        self.revisions._index._key_dependencies.clear()
         self._pack_collection._abort_write_group()
 
     def _get_source(self, to_format):
@@ -2242,13 +2290,14 @@
         self._pack_collection._start_write_group()
 
     def _commit_write_group(self):
-        self.revisions._index._key_dependencies.refs.clear()
-        return self._pack_collection._commit_write_group()
+        hint = self._pack_collection._commit_write_group()
+        self.revisions._index._key_dependencies.clear()
+        return hint
 
     def suspend_write_group(self):
         # XXX check self._write_group is self.get_transaction()?
         tokens = self._pack_collection._suspend_write_group()
-        self.revisions._index._key_dependencies.refs.clear()
+        self.revisions._index._key_dependencies.clear()
         self._write_group = None
         return tokens
 
22552304
=== modified file 'bzrlib/repository.py'
--- bzrlib/repository.py 2009-08-30 22:02:45 +0000
+++ bzrlib/repository.py 2009-09-02 03:35:25 +0000
@@ -1604,7 +1604,7 @@
1604 # but at the moment we're only checking for texts referenced by1604 # but at the moment we're only checking for texts referenced by
1605 # inventories at the graph's edge.1605 # inventories at the graph's edge.
1606 key_deps = self.revisions._index._key_dependencies1606 key_deps = self.revisions._index._key_dependencies
1607 key_deps.add_keys(present_inventories)1607 key_deps.satisfy_refs_for_keys(present_inventories)
1608 referrers = frozenset(r[0] for r in key_deps.get_referrers())1608 referrers = frozenset(r[0] for r in key_deps.get_referrers())
1609 file_ids = self.fileids_altered_by_revision_ids(referrers)1609 file_ids = self.fileids_altered_by_revision_ids(referrers)
1610 missing_texts = set()1610 missing_texts = set()
16111611
=== modified file 'bzrlib/tests/per_repository/test_write_group.py'
--- bzrlib/tests/per_repository/test_write_group.py 2009-08-17 04:18:57 +0000
+++ bzrlib/tests/per_repository/test_write_group.py 2009-09-02 03:35:25 +0000
@@ -361,6 +361,143 @@
             sink.insert_stream((), repo._format, tokens)
         self.assertEqual([True], call_log)
 
+    def test_missing_chk_root_for_inventory(self):
+        """commit_write_group fails with BzrCheckError when the chk root record
+        for a new inventory is missing.
+        """
+        builder = self.make_branch_builder('simple-branch')
+        builder.build_snapshot('A-id', None, [
+            ('add', ('', 'root-id', 'directory', None)),
+            ('add', ('file', 'file-id', 'file', 'content\n'))])
+        b = builder.get_branch()
+        if not b.repository._format.supports_chks:
+            raise TestNotApplicable('requires repository with chk_bytes')
+        b.lock_read()
+        self.addCleanup(b.unlock)
+        repo = self.make_repository('damaged-repo')
+        repo.lock_write()
+        repo.start_write_group()
+        # Now, add the objects manually
+        text_keys = [('file-id', 'A-id'), ('root-id', 'A-id')]
+        # Directly add the texts, inventory, and revision object for 'A-id' --
+        # but don't add the chk_bytes.
+        src_repo = b.repository
+        repo.texts.insert_record_stream(src_repo.texts.get_record_stream(
+            text_keys, 'unordered', True))
+        repo.inventories.insert_record_stream(
+            src_repo.inventories.get_record_stream(
+                [('A-id',)], 'unordered', True))
+        repo.revisions.insert_record_stream(
+            src_repo.revisions.get_record_stream(
+                [('A-id',)], 'unordered', True))
+        # Make sure the presence of the missing data in a fallback does not
+        # avoid the error.
+        repo.add_fallback_repository(b.repository)
+        self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
+        reopened_repo = self.reopen_repo_and_resume_write_group(repo)
+        self.assertRaises(
+            errors.BzrCheckError, reopened_repo.commit_write_group)
+        reopened_repo.abort_write_group()
+
+    def test_missing_chk_root_for_unchanged_inventory(self):
+        """commit_write_group fails with BzrCheckError when the chk root record
+        for a new inventory is missing, even if the parent inventory is present
+        and has identical content (i.e. the same chk root).
+
+        A stacked repository containing only a revision with an identical
+        inventory to its parent will still have the chk root records for those
+        inventories.
+
+        (In principle the chk records are unnecessary in this case, but in
+        practice bzr 2.0rc1 (at least) expects to find them.)
+        """
+        # Make a branch where the last two revisions have identical
+        # inventories.
+        builder = self.make_branch_builder('simple-branch')
+        builder.build_snapshot('A-id', None, [
+            ('add', ('', 'root-id', 'directory', None)),
+            ('add', ('file', 'file-id', 'file', 'content\n'))])
+        builder.build_snapshot('B-id', None, [])
+        builder.build_snapshot('C-id', None, [])
+        b = builder.get_branch()
+        if not b.repository._format.supports_chks:
+            raise TestNotApplicable('requires repository with chk_bytes')
+        b.lock_read()
+        self.addCleanup(b.unlock)
+        # check our setup: B-id and C-id should have identical chk root keys.
+        inv_b = b.repository.get_inventory('B-id')
+        inv_c = b.repository.get_inventory('C-id')
+        self.assertEqual(inv_b.id_to_entry.key(), inv_c.id_to_entry.key())
+        # Now, manually insert objects for a stacked repo with only revision
+        # C-id:
+        # We need ('revisions', 'C-id'), ('inventories', 'C-id'),
+        # ('inventories', 'B-id'), and the corresponding chk roots for those
+        # inventories.
+        repo = self.make_repository('damaged-repo')
+        repo.lock_write()
+        repo.start_write_group()
+        src_repo = b.repository
+        repo.inventories.insert_record_stream(
+            src_repo.inventories.get_record_stream(
+                [('B-id',), ('C-id',)], 'unordered', True))
+        repo.revisions.insert_record_stream(
+            src_repo.revisions.get_record_stream(
+                [('C-id',)], 'unordered', True))
+        # Make sure the presence of the missing data in a fallback does not
+        # avoid the error.
+        repo.add_fallback_repository(b.repository)
+        self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
+        reopened_repo = self.reopen_repo_and_resume_write_group(repo)
+        self.assertRaises(
+            errors.BzrCheckError, reopened_repo.commit_write_group)
+        reopened_repo.abort_write_group()
+
+    def test_missing_chk_root_for_parent_inventory(self):
+        """commit_write_group fails with BzrCheckError when the chk root record
+        for a parent inventory of a new revision is missing.
+        """
+        builder = self.make_branch_builder('simple-branch')
+        builder.build_snapshot('A-id', None, [
+            ('add', ('', 'root-id', 'directory', None)),
+            ('add', ('file', 'file-id', 'file', 'content\n'))])
+        builder.build_snapshot('B-id', None, [])
+        builder.build_snapshot('C-id', None, [
+            ('modify', ('file-id', 'new-content'))])
+        b = builder.get_branch()
+        if not b.repository._format.supports_chks:
+            raise TestNotApplicable('requires repository with chk_bytes')
+        b.lock_read()
+        self.addCleanup(b.unlock)
+        # Now, manually insert objects for a stacked repo with only revision
+        # C-id, *except* the chk root entry for the parent inventory.
+        # We need ('revisions', 'C-id'), ('inventories', 'C-id'),
+        # ('inventories', 'B-id'), and the corresponding chk roots for those
+        # inventories.
+        inv_c = b.repository.get_inventory('C-id')
+        chk_keys_for_c_only = [
+            inv_c.id_to_entry.key(), inv_c.parent_id_basename_to_file_id.key()]
+        repo = self.make_repository('damaged-repo')
+        repo.lock_write()
+        repo.start_write_group()
+        src_repo = b.repository
+        repo.chk_bytes.insert_record_stream(
+            src_repo.chk_bytes.get_record_stream(
+                chk_keys_for_c_only, 'unordered', True))
+        repo.inventories.insert_record_stream(
+            src_repo.inventories.get_record_stream(
+                [('B-id',), ('C-id',)], 'unordered', True))
+        repo.revisions.insert_record_stream(
+            src_repo.revisions.get_record_stream(
+                [('C-id',)], 'unordered', True))
+        # Make sure the presence of the missing data in a fallback does not
+        # avoid the error.
+        repo.add_fallback_repository(b.repository)
+        self.assertRaises(errors.BzrCheckError, repo.commit_write_group)
+        reopened_repo = self.reopen_repo_and_resume_write_group(repo)
+        self.assertRaises(
+            errors.BzrCheckError, reopened_repo.commit_write_group)
+        reopened_repo.abort_write_group()
+
 
 class TestResumeableWriteGroup(TestCaseWithRepository):
 