Merge lp:~jameinel/bzr/2.1-peak-mem-tweak into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: Ian Clatworthy
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/2.1-peak-mem-tweak
Merge into: lp:bzr
Diff against target: 279 lines
9 files modified
bzrlib/btree_index.py (+13/-0)
bzrlib/groupcompress.py (+5/-0)
bzrlib/index.py (+20/-0)
bzrlib/repofmt/groupcompress_repo.py (+5/-0)
bzrlib/tests/per_versionedfile.py (+4/-0)
bzrlib/tests/test_btree_index.py (+27/-0)
bzrlib/tests/test_groupcompress.py (+13/-1)
bzrlib/tests/test_index.py (+35/-0)
bzrlib/versionedfile.py (+7/-0)
To merge this branch: bzr merge lp:~jameinel/bzr/2.1-peak-mem-tweak
Reviewer: Ian Clatworthy (Approve)
Review via email: mp+13582@code.launchpad.net
John A Meinel (jameinel) wrote:

This patch probably gives the greatest peak memory reduction of the patches I've proposed so far.

The basic work is just to provide "clear_cache()" methods, and then have them called at appropriate times.

Namely, VersionedFiles.clear_cache() and BTreeGraphIndex.clear_cache() are the actual points that matter. I then updated the GroupCompressStreamSource() so that caches are cleared once we have streamed the content out of that VF.

(So we stream revisions, and then clear the cache, then inventories, etc.)
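Condensed, the pieces chain together like this (a sketch distilled from the diff below, not the full implementations):

    class BTreeGraphIndex(object):

        def clear_cache(self):
            # Leaf nodes are the bulk of the cached memory; _root_node and
            # _internal_node_cache are small and save future round-trips,
            # so they are deliberately left alone.
            self._leaf_node_cache.clear()

    class GroupCompressVersionedFiles(object):

        def clear_cache(self):
            # Drop the cached GroupCompressBlocks, then tell the backing
            # index to drop its cached leaf pages as well.
            self._group_cache.clear()
            self._index._graph_index.clear_cache()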

The basic observation is that all of the GCVF instances have a '50MB' LRU cache for their GroupCompressBlocks. In reality, that 50MB is often 160MB due to 'counting errors'. I haven't tracked down the details there yet, but it doesn't really matter that the cache is oversized. What specifically matters is that once we are done reading all the inventories, we never go back and read them again, yet we keep the cached state around.
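To illustrate the 'counting error' point (a toy sketch only, not bzrlib's actual LRUSizeCache): if the per-entry size estimate undercounts what an entry really pins in memory, a '50MB' budget quietly admits much more than 50MB, and either way the cache keeps everything it holds after the last read until someone clears it.

    from collections import OrderedDict

    class ToySizeLRU(object):
        """Toy byte-budgeted LRU cache; illustrative only."""

        def __init__(self, max_size, compute_size):
            self._cache = OrderedDict()
            self._max_size = max_size
            # If this estimate misses memory an entry pins (say, a block's
            # decompressed content alongside its compressed bytes), the
            # real footprint exceeds the nominal budget.
            self._compute_size = compute_size
            self._current_size = 0

        def add(self, key, value):
            self._cache[key] = value
            self._current_size += self._compute_size(value)
            while self._current_size > self._max_size and self._cache:
                _, old = self._cache.popitem(last=False)  # evict oldest
                self._current_size -= self._compute_size(old)

        def clear(self):
            self._cache.clear()
            self._current_size = 0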

As for the effect: branching 'launchpad' locally, I see no change in time (8m37s both times), but I do see
  760MB bzr.dev@4756
  551MB clear from_repository versioned files after 'get_record_stream()'

One thing this could affect is get_stream_for_missing_keys(), since it probably triggers the indexes to re-read some pages they would otherwise have had in cache.

I *could* change the code to clear only the GroupCompressBlock cache. However, my earlier numbers put that at:
  652MB clear _group_cache
  551MB clear _group_cache + index LeafNode cache

So clearing _group_cache saves 110MB, and clearing the index cache saves another 100MB. (Give or take 10MB on each measurement, as they were taken at different times.)

Note that I also tried adding 'self.to_repo.X.clear_cache()' to the StreamSink code; however, that only got us to
  549MB clear source and target

And 2MB wasn't worth it.

The numbers for branching bzr.dev
  345MB bzr.dev@4756
  227MB clear both
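
(A sketch of one way to capture peak-memory figures like these on Linux, not necessarily how these numbers were gathered:

    import resource

    def peak_rss_mb():
        # On Linux, ru_maxrss reports the peak resident set size in KiB.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
)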

Ian Clatworthy (ian-clatworthy) wrote:

In terms of code quality, this looks good. You've commented out an assert on the _internal_node_cache size after calling clear_cache(), though. Maybe you want to grab the size beforehand and check it's unchanged afterwards? Or maybe that line can go altogether.
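A minimal sketch of that suggestion, reusing the names from test_clear_cache() in the diff below:

    # Instead of the commented-out assertion, record the size up front
    # and check it is unchanged after clearing:
    internal_nodes_before = len(index._internal_node_cache)
    index.clear_cache()
    self.assertEqual(internal_nodes_before, len(index._internal_node_cache))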

I'll run some usertest benchmarks in the coming days once you're ready for that. As we measure more, we may tweak things further, I guess. Even so, I can't see any faults in this patch, and it's a step forward on resource usage, so I'm happy to see it land.

review: Approve

Preview Diff

=== modified file 'bzrlib/btree_index.py'
--- bzrlib/btree_index.py 2009-10-15 18:18:44 +0000
+++ bzrlib/btree_index.py 2009-10-20 04:27:13 +0000
@@ -853,6 +853,19 @@
                 new_tips = next_tips
         return final_offsets
 
+    def clear_cache(self):
+        """Clear out any cached/memoized values.
+
+        This can be called at any time, but generally it is used when we have
+        extracted some information, but don't expect to be requesting any more
+        from this index.
+        """
+        # Note that we don't touch self._root_node or self._internal_node_cache
+        # We don't expect either of those to be big, and it can save
+        # round-trips in the future. We may re-evaluate this if InternalNode
+        # memory starts to be an issue.
+        self._leaf_node_cache.clear()
+
     def external_references(self, ref_list_num):
         if self._root_node is None:
             self._get_root_node()
=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py 2009-10-17 04:43:14 +0000
+++ bzrlib/groupcompress.py 2009-10-20 04:27:13 +0000
@@ -1265,6 +1265,11 @@
         else:
             return self.get_record_stream(keys, 'unordered', True)
 
+    def clear_cache(self):
+        """See VersionedFiles.clear_cache()"""
+        self._group_cache.clear()
+        self._index._graph_index.clear_cache()
+
     def _check_add(self, key, lines, random_id, check_content):
         """check that version_id and lines are safe to add."""
         version_id = key[-1]
=== modified file 'bzrlib/index.py'
--- bzrlib/index.py 2009-10-14 13:54:09 +0000
+++ bzrlib/index.py 2009-10-20 04:27:13 +0000
@@ -232,6 +232,13 @@
         if self._nodes_by_key is not None and self._key_length > 1:
             self._update_nodes_by_key(key, value, node_refs)
 
+    def clear_cache(self):
+        """See GraphIndex.clear_cache()
+
+        This is a no-op, but we need the api to conform to a generic 'Index'
+        abstraction.
+        """
+
     def finish(self):
         lines = [_SIGNATURE]
         lines.append(_OPTION_NODE_REFS + str(self.reference_lists) + '\n')
@@ -461,6 +468,14 @@
             # there must be one line - the empty trailer line.
             raise errors.BadIndexData(self)
 
+    def clear_cache(self):
+        """Clear out any cached/memoized values.
+
+        This can be called at any time, but generally it is used when we have
+        extracted some information, but don't expect to be requesting any more
+        from this index.
+        """
+
     def external_references(self, ref_list_num):
         """Return references that are not present in this index.
         """
@@ -1226,6 +1241,11 @@
             self.__class__.__name__,
             ', '.join(map(repr, self._indices)))
 
+    def clear_cache(self):
+        """See GraphIndex.clear_cache()"""
+        for index in self._indices:
+            index.clear_cache()
+
     def get_parent_map(self, keys):
         """See graph.StackedParentsProvider.get_parent_map"""
         search_keys = set(keys)
=== modified file 'bzrlib/repofmt/groupcompress_repo.py'
--- bzrlib/repofmt/groupcompress_repo.py 2009-09-25 21:24:21 +0000
+++ bzrlib/repofmt/groupcompress_repo.py 2009-10-20 04:27:13 +0000
@@ -1105,7 +1105,10 @@
         for stream_info in self._fetch_revision_texts(revision_ids):
             yield stream_info
         self._revision_keys = [(rev_id,) for rev_id in revision_ids]
+        self.from_repository.revisions.clear_cache()
+        self.from_repository.signatures.clear_cache()
         yield self._get_inventory_stream(self._revision_keys)
+        self.from_repository.inventories.clear_cache()
         # TODO: The keys to exclude might be part of the search recipe
         # For now, exclude all parents that are at the edge of ancestry, for
         # which we have inventories
@@ -1114,7 +1117,9 @@
             self._revision_keys)
         for stream_info in self._get_filtered_chk_streams(parent_keys):
             yield stream_info
+        self.from_repository.chk_bytes.clear_cache()
         yield self._get_text_stream()
+        self.from_repository.texts.clear_cache()
 
     def get_stream_for_missing_keys(self, missing_keys):
         # missing keys can only occur when we are byte copying and not
=== modified file 'bzrlib/tests/per_versionedfile.py'
--- bzrlib/tests/per_versionedfile.py 2009-08-26 16:44:27 +0000
+++ bzrlib/tests/per_versionedfile.py 2009-10-20 04:27:13 +0000
@@ -1581,6 +1581,10 @@
         # All texts should be output.
         self.assertEqual(set(keys), seen)
 
+    def test_clear_cache(self):
+        files = self.get_versionedfiles()
+        files.clear_cache()
+
     def test_construct(self):
         """Each parameterised test can be constructed on a transport."""
         files = self.get_versionedfiles()
=== modified file 'bzrlib/tests/test_btree_index.py'
--- bzrlib/tests/test_btree_index.py 2009-10-09 15:02:19 +0000
+++ bzrlib/tests/test_btree_index.py 2009-10-20 04:27:13 +0000
@@ -124,6 +124,12 @@
 
 class TestBTreeBuilder(BTreeTestCase):
 
+    def test_clear_cache(self):
+        builder = btree_index.BTreeBuilder(reference_lists=0, key_elements=1)
+        # This is a no-op, but we need the api to be consistent with other
+        # BTreeGraphIndex apis.
+        builder.clear_cache()
+
     def test_empty_1_0(self):
         builder = btree_index.BTreeBuilder(key_elements=1, reference_lists=0)
         # NamedTemporaryFile dies on builder.finish().read(). weird.
@@ -639,6 +645,27 @@
         size = trans.put_file('index', stream)
         return btree_index.BTreeGraphIndex(trans, 'index', size)
 
+    def test_clear_cache(self):
+        nodes = self.make_nodes(160, 2, 2)
+        index = self.make_index(ref_lists=2, key_elements=2, nodes=nodes)
+        self.assertEqual(1, len(list(index.iter_entries([nodes[30][0]]))))
+        self.assertEqual([1, 4], index._row_lengths)
+        self.assertIsNot(None, index._root_node)
+        internal_node_pre_clear = index._internal_node_cache.keys()
+        self.assertTrue(len(index._leaf_node_cache) > 0)
+        index.clear_cache()
+        # We don't touch _root_node or _internal_node_cache, both should be
+        # small, and can save a round trip or two
+        self.assertIsNot(None, index._root_node)
+        # NOTE: We don't want to affect the _internal_node_cache, as we expect
+        #       it will be small, and if we ever do touch this index again, it
+        #       will save round-trips. This assertion isn't very strong,
+        #       because without a 3-level index, we don't have any internal
+        #       nodes cached.
+        self.assertEqual(internal_node_pre_clear,
+                         index._internal_node_cache.keys())
+        self.assertEqual(0, len(index._leaf_node_cache))
+
     def test_trivial_constructor(self):
         transport = get_transport('trace+' + self.get_url(''))
         index = btree_index.BTreeGraphIndex(transport, 'index', None)
=== modified file 'bzrlib/tests/test_groupcompress.py'
--- bzrlib/tests/test_groupcompress.py 2009-10-17 04:43:14 +0000
+++ bzrlib/tests/test_groupcompress.py 2009-10-20 04:27:13 +0000
@@ -459,7 +459,8 @@
             ], block._dump())
 
 
-class TestCaseWithGroupCompressVersionedFiles(tests.TestCaseWithTransport):
+class TestCaseWithGroupCompressVersionedFiles(
+        tests.TestCaseWithMemoryTransport):
 
     def make_test_vf(self, create_graph, keylength=1, do_cleanup=True,
                      dir='.', inconsistency_fatal=True):
@@ -732,6 +733,17 @@
732 " \('b',\) \('42 32 0 8', \(\(\),\)\) \('74 32"733 " \('b',\) \('42 32 0 8', \(\(\),\)\) \('74 32"
733 " 0 8', \(\(\('a',\),\),\)\)")734 " 0 8', \(\(\('a',\),\),\)\)")
734735
736 def test_clear_cache(self):
737 vf = self.make_source_with_b(True, 'source')
738 vf.writer.end()
739 for record in vf.get_record_stream([('a',), ('b',)], 'unordered',
740 True):
741 pass
742 self.assertTrue(len(vf._group_cache) > 0)
743 vf.clear_cache()
744 self.assertEqual(0, len(vf._group_cache))
745
746
735747
736class StubGCVF(object):748class StubGCVF(object):
737 def __init__(self, canned_get_blocks=None):749 def __init__(self, canned_get_blocks=None):
738750
=== modified file 'bzrlib/tests/test_index.py'
--- bzrlib/tests/test_index.py 2009-09-09 18:52:56 +0000
+++ bzrlib/tests/test_index.py 2009-10-20 04:27:13 +0000
@@ -173,6 +173,11 @@
173 "key\x00\x00\t\x00data\n"173 "key\x00\x00\t\x00data\n"
174 "\n", contents)174 "\n", contents)
175175
176 def test_clear_cache(self):
177 builder = GraphIndexBuilder(reference_lists=2)
178 # This is a no-op, but the api should exist
179 builder.clear_cache()
180
176 def test_node_references_are_byte_offsets(self):181 def test_node_references_are_byte_offsets(self):
177 builder = GraphIndexBuilder(reference_lists=1)182 builder = GraphIndexBuilder(reference_lists=1)
178 builder.add_node(('reference', ), 'data', ([], ))183 builder.add_node(('reference', ), 'data', ([], ))
@@ -383,6 +388,12 @@
         size = trans.put_file('index', stream)
         return GraphIndex(trans, 'index', size)
 
+    def test_clear_cache(self):
+        index = self.make_index()
+        # For now, we just want to make sure the api is available. As this is
+        # old code, we don't really worry if it *does* anything.
+        index.clear_cache()
+
     def test_open_bad_index_no_error(self):
         trans = self.get_transport()
         trans.put_bytes('name', "not an index\n")
@@ -1071,6 +1082,30 @@
         index.insert_index(0, index1)
         self.assertEqual([(index1, ('key', ), '')], list(index.iter_all_entries()))
 
+    def test_clear_cache(self):
+        log = []
+
+        class ClearCacheProxy(object):
+
+            def __init__(self, index):
+                self._index = index
+
+            def __getattr__(self, name):
+                return getattr(self._index, name)
+
+            def clear_cache(self):
+                log.append(self._index)
+                return self._index.clear_cache()
+
+        index = CombinedGraphIndex([])
+        index1 = self.make_index('name', 0, nodes=[(('key', ), '', ())])
+        index.insert_index(0, ClearCacheProxy(index1))
+        index2 = self.make_index('name', 0, nodes=[(('key', ), '', ())])
+        index.insert_index(1, ClearCacheProxy(index2))
+        # CombinedGraphIndex should call 'clear_cache()' on all children
+        index.clear_cache()
+        self.assertEqual(sorted([index1, index2]), sorted(log))
+
     def test_iter_all_entries_empty(self):
         index = CombinedGraphIndex([])
         self.assertEqual([], list(index.iter_all_entries()))
=== modified file 'bzrlib/versionedfile.py'
--- bzrlib/versionedfile.py 2009-08-17 22:08:21 +0000
+++ bzrlib/versionedfile.py 2009-10-20 04:27:13 +0000
@@ -930,6 +930,13 @@
     def check_not_reserved_id(version_id):
         revision.check_not_reserved_id(version_id)
 
+    def clear_cache(self):
+        """Clear whatever caches this VersionedFile holds.
+
+        This is generally called after an operation has been performed, when we
+        don't expect to be using this versioned file again soon.
+        """
+
     def _check_lines_not_unicode(self, lines):
         """Check that lines being added to a versioned file are not unicode."""
         for line in lines: