Merge lp:~jameinel/bzr/2.1-peak-mem-tweak into lp:bzr

Proposed by John A Meinel
Status: Merged
Approved by: Ian Clatworthy
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/2.1-peak-mem-tweak
Merge into: lp:bzr
Diff against target: 279 lines
9 files modified
bzrlib/btree_index.py (+13/-0)
bzrlib/groupcompress.py (+5/-0)
bzrlib/index.py (+20/-0)
bzrlib/repofmt/groupcompress_repo.py (+5/-0)
bzrlib/tests/per_versionedfile.py (+4/-0)
bzrlib/tests/test_btree_index.py (+27/-0)
bzrlib/tests/test_groupcompress.py (+13/-1)
bzrlib/tests/test_index.py (+35/-0)
bzrlib/versionedfile.py (+7/-0)
To merge this branch: bzr merge lp:~jameinel/bzr/2.1-peak-mem-tweak
Reviewer Review Type Date Requested Status
Ian Clatworthy Approve
Review via email: mp+13582@code.launchpad.net
Revision history for this message
John A Meinel (jameinel) wrote :

This patch probably gives the greatest peak memory reduction of the patches I've proposed so far.

The basic work is simply to provide "clear_cache()" functions and have them called at appropriate times.

Namely, VersionedFiles.clear_cache() and BTreeGraphIndex.clear_cache() are the actual points that matter. I then updated the GroupCompressStreamSource() so that caches are cleared once we have streamed the content out of that VF.

(So we stream revisions, and then clear the cache, then inventories, etc.)
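
In sketch form, the resulting call pattern in GroupCompressStreamSource.get_stream() is (paraphrasing the groupcompress_repo.py hunk in the diff below):

  for stream_info in self._fetch_revision_texts(revision_ids):
      yield stream_info
  # done with revisions and signatures, drop their cached state
  self.from_repository.revisions.clear_cache()
  self.from_repository.signatures.clear_cache()
  yield self._get_inventory_stream(self._revision_keys)
  # likewise once all the inventories have been streamed
  self.from_repository.inventories.clear_cache()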

The basic observation is that all of the GCVF instances have a '50MB' LRU cache for their GroupCompressBlocks. In reality, that 50MB is often 160MB due to 'counting errors'. I haven't tracked down the details there yet, but it doesn't really matter that the cache is oversized. What specifically matters is that once we are done reading all the inventories, we aren't going back to read them again, yet we keep the cached state around.

As for the effect: branching 'launchpad' locally, I don't see a change in time (8m37s both times), but peak memory drops:
  760MB bzr.dev@4756
  551MB clear from_repository versioned files after 'get_record_stream()'

One thing this could affect is "get_stream_for_missing_keys()", as it probably triggers the indexes to re-read some pages they would otherwise still have in cache.

I *could* change the code to clear only the GroupCompressBlock cache. However, my earlier numbers show that as:
  652MB clear _group_cache
  551MB clear _group_cache + index LeafNode cache

So clearing _group_cache is 110MB saved, clearing the index cache is another 100MB saved. (Give or take 10MB on each measurement, as they were done at different times.)
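
For reference, both of those are covered by GroupCompressVersionedFiles.clear_cache() in the diff below, which is simply:

  def clear_cache(self):
      """See VersionedFiles.clear_cache()"""
      self._group_cache.clear()
      self._index._graph_index.clear_cache()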

Note that I also tried adding 'self.to_repo.X.clear_cache()' to the StreamSink code; however, that gave
  549MB clear source and target

And 2MB wasn't worth it.

The numbers for branching bzr.dev:
  345MB bzr.dev@4756
  227MB clear both

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

In terms of code quality, this looks good. You've commented out an assert on the internal_node_cache size after calling clear_cache(), though. Maybe you want to grab the size beforehand and check that it's unchanged afterwards? Or maybe that line can go altogether.
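
For instance, a minimal sketch of the check I have in mind (attribute names as in the diff; untested):

  internal_nodes_before = len(index._internal_node_cache)
  index.clear_cache()
  # clear_cache() deliberately leaves the internal node cache alone
  self.assertEqual(internal_nodes_before, len(index._internal_node_cache))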

I'll run some usertest benchmarks in the coming days once you're ready for that. As we measure more, we may tweak things further, I guess. Even so, I can't see any faults in this patch, and it's a step forward in resource usage, so I'm happy to see this land.

review: Approve

Preview Diff

=== modified file 'bzrlib/btree_index.py'
--- bzrlib/btree_index.py 2009-10-15 18:18:44 +0000
+++ bzrlib/btree_index.py 2009-10-20 04:27:13 +0000
@@ -853,6 +853,19 @@
             new_tips = next_tips
         return final_offsets
 
+    def clear_cache(self):
+        """Clear out any cached/memoized values.
+
+        This can be called at any time, but generally it is used when we have
+        extracted some information, but don't expect to be requesting any more
+        from this index.
+        """
+        # Note that we don't touch self._root_node or self._internal_node_cache
+        # We don't expect either of those to be big, and it can save
+        # round-trips in the future. We may re-evaluate this if InternalNode
+        # memory starts to be an issue.
+        self._leaf_node_cache.clear()
+
     def external_references(self, ref_list_num):
         if self._root_node is None:
             self._get_root_node()

=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py 2009-10-17 04:43:14 +0000
+++ bzrlib/groupcompress.py 2009-10-20 04:27:13 +0000
@@ -1265,6 +1265,11 @@
         else:
             return self.get_record_stream(keys, 'unordered', True)
 
+    def clear_cache(self):
+        """See VersionedFiles.clear_cache()"""
+        self._group_cache.clear()
+        self._index._graph_index.clear_cache()
+
     def _check_add(self, key, lines, random_id, check_content):
         """check that version_id and lines are safe to add."""
         version_id = key[-1]

=== modified file 'bzrlib/index.py'
--- bzrlib/index.py 2009-10-14 13:54:09 +0000
+++ bzrlib/index.py 2009-10-20 04:27:13 +0000
@@ -232,6 +232,13 @@
         if self._nodes_by_key is not None and self._key_length > 1:
             self._update_nodes_by_key(key, value, node_refs)
 
+    def clear_cache(self):
+        """See GraphIndex.clear_cache()
+
+        This is a no-op, but we need the api to conform to a generic 'Index'
+        abstraction.
+        """
+
     def finish(self):
         lines = [_SIGNATURE]
         lines.append(_OPTION_NODE_REFS + str(self.reference_lists) + '\n')
@@ -461,6 +468,14 @@
             # there must be one line - the empty trailer line.
             raise errors.BadIndexData(self)
 
+    def clear_cache(self):
+        """Clear out any cached/memoized values.
+
+        This can be called at any time, but generally it is used when we have
+        extracted some information, but don't expect to be requesting any more
+        from this index.
+        """
+
     def external_references(self, ref_list_num):
         """Return references that are not present in this index.
         """
@@ -1226,6 +1241,11 @@
                 self.__class__.__name__,
                 ', '.join(map(repr, self._indices)))
 
+    def clear_cache(self):
+        """See GraphIndex.clear_cache()"""
+        for index in self._indices:
+            index.clear_cache()
+
     def get_parent_map(self, keys):
         """See graph.StackedParentsProvider.get_parent_map"""
         search_keys = set(keys)

=== modified file 'bzrlib/repofmt/groupcompress_repo.py'
--- bzrlib/repofmt/groupcompress_repo.py 2009-09-25 21:24:21 +0000
+++ bzrlib/repofmt/groupcompress_repo.py 2009-10-20 04:27:13 +0000
@@ -1105,7 +1105,10 @@
         for stream_info in self._fetch_revision_texts(revision_ids):
             yield stream_info
         self._revision_keys = [(rev_id,) for rev_id in revision_ids]
+        self.from_repository.revisions.clear_cache()
+        self.from_repository.signatures.clear_cache()
         yield self._get_inventory_stream(self._revision_keys)
+        self.from_repository.inventories.clear_cache()
         # TODO: The keys to exclude might be part of the search recipe
         # For now, exclude all parents that are at the edge of ancestry, for
         # which we have inventories
@@ -1114,7 +1117,9 @@
             self._revision_keys)
         for stream_info in self._get_filtered_chk_streams(parent_keys):
             yield stream_info
+        self.from_repository.chk_bytes.clear_cache()
         yield self._get_text_stream()
+        self.from_repository.texts.clear_cache()
 
     def get_stream_for_missing_keys(self, missing_keys):
         # missing keys can only occur when we are byte copying and not

=== modified file 'bzrlib/tests/per_versionedfile.py'
--- bzrlib/tests/per_versionedfile.py 2009-08-26 16:44:27 +0000
+++ bzrlib/tests/per_versionedfile.py 2009-10-20 04:27:13 +0000
@@ -1581,6 +1581,10 @@
         # All texts should be output.
         self.assertEqual(set(keys), seen)
 
+    def test_clear_cache(self):
+        files = self.get_versionedfiles()
+        files.clear_cache()
+
     def test_construct(self):
         """Each parameterised test can be constructed on a transport."""
         files = self.get_versionedfiles()

=== modified file 'bzrlib/tests/test_btree_index.py'
--- bzrlib/tests/test_btree_index.py 2009-10-09 15:02:19 +0000
+++ bzrlib/tests/test_btree_index.py 2009-10-20 04:27:13 +0000
@@ -124,6 +124,12 @@
 
 class TestBTreeBuilder(BTreeTestCase):
 
+    def test_clear_cache(self):
+        builder = btree_index.BTreeBuilder(reference_lists=0, key_elements=1)
+        # This is a no-op, but we need the api to be consistent with other
+        # BTreeGraphIndex apis.
+        builder.clear_cache()
+
     def test_empty_1_0(self):
         builder = btree_index.BTreeBuilder(key_elements=1, reference_lists=0)
         # NamedTemporaryFile dies on builder.finish().read(). weird.
@@ -639,6 +645,27 @@
         size = trans.put_file('index', stream)
         return btree_index.BTreeGraphIndex(trans, 'index', size)
 
+    def test_clear_cache(self):
+        nodes = self.make_nodes(160, 2, 2)
+        index = self.make_index(ref_lists=2, key_elements=2, nodes=nodes)
+        self.assertEqual(1, len(list(index.iter_entries([nodes[30][0]]))))
+        self.assertEqual([1, 4], index._row_lengths)
+        self.assertIsNot(None, index._root_node)
+        internal_node_pre_clear = index._internal_node_cache.keys()
+        self.assertTrue(len(index._leaf_node_cache) > 0)
+        index.clear_cache()
+        # We don't touch _root_node or _internal_node_cache, both should be
+        # small, and can save a round trip or two
+        self.assertIsNot(None, index._root_node)
+        # NOTE: We don't want to affect the _internal_node_cache, as we expect
+        #       it will be small, and if we ever do touch this index again, it
+        #       will save round-trips. This assertion isn't very strong,
+        #       because without a 3-level index, we don't have any internal
+        #       nodes cached.
+        self.assertEqual(internal_node_pre_clear,
+                         index._internal_node_cache.keys())
+        self.assertEqual(0, len(index._leaf_node_cache))
+
     def test_trivial_constructor(self):
         transport = get_transport('trace+' + self.get_url(''))
         index = btree_index.BTreeGraphIndex(transport, 'index', None)

=== modified file 'bzrlib/tests/test_groupcompress.py'
--- bzrlib/tests/test_groupcompress.py 2009-10-17 04:43:14 +0000
+++ bzrlib/tests/test_groupcompress.py 2009-10-20 04:27:13 +0000
@@ -459,7 +459,8 @@
             ], block._dump())
 
 
-class TestCaseWithGroupCompressVersionedFiles(tests.TestCaseWithTransport):
+class TestCaseWithGroupCompressVersionedFiles(
+        tests.TestCaseWithMemoryTransport):
 
     def make_test_vf(self, create_graph, keylength=1, do_cleanup=True,
                      dir='.', inconsistency_fatal=True):
@@ -732,6 +733,17 @@
                        " \('b',\) \('42 32 0 8', \(\(\),\)\) \('74 32"
                        " 0 8', \(\(\('a',\),\),\)\)")
 
+    def test_clear_cache(self):
+        vf = self.make_source_with_b(True, 'source')
+        vf.writer.end()
+        for record in vf.get_record_stream([('a',), ('b',)], 'unordered',
+                                           True):
+            pass
+        self.assertTrue(len(vf._group_cache) > 0)
+        vf.clear_cache()
+        self.assertEqual(0, len(vf._group_cache))
+
+
 class StubGCVF(object):
     def __init__(self, canned_get_blocks=None):

=== modified file 'bzrlib/tests/test_index.py'
--- bzrlib/tests/test_index.py 2009-09-09 18:52:56 +0000
+++ bzrlib/tests/test_index.py 2009-10-20 04:27:13 +0000
@@ -173,6 +173,11 @@
             "key\x00\x00\t\x00data\n"
             "\n", contents)
 
+    def test_clear_cache(self):
+        builder = GraphIndexBuilder(reference_lists=2)
+        # This is a no-op, but the api should exist
+        builder.clear_cache()
+
     def test_node_references_are_byte_offsets(self):
         builder = GraphIndexBuilder(reference_lists=1)
         builder.add_node(('reference', ), 'data', ([], ))
@@ -383,6 +388,12 @@
         size = trans.put_file('index', stream)
         return GraphIndex(trans, 'index', size)
 
+    def test_clear_cache(self):
+        index = self.make_index()
+        # For now, we just want to make sure the api is available. As this is
+        # old code, we don't really worry if it *does* anything.
+        index.clear_cache()
+
     def test_open_bad_index_no_error(self):
         trans = self.get_transport()
         trans.put_bytes('name', "not an index\n")
@@ -1071,6 +1082,30 @@
         index.insert_index(0, index1)
         self.assertEqual([(index1, ('key', ), '')], list(index.iter_all_entries()))
 
+    def test_clear_cache(self):
+        log = []
+
+        class ClearCacheProxy(object):
+
+            def __init__(self, index):
+                self._index = index
+
+            def __getattr__(self, name):
+                return getattr(self._index, name)
+
+            def clear_cache(self):
+                log.append(self._index)
+                return self._index.clear_cache()
+
+        index = CombinedGraphIndex([])
+        index1 = self.make_index('name', 0, nodes=[(('key', ), '', ())])
+        index.insert_index(0, ClearCacheProxy(index1))
+        index2 = self.make_index('name', 0, nodes=[(('key', ), '', ())])
+        index.insert_index(1, ClearCacheProxy(index2))
+        # CombinedGraphIndex should call 'clear_cache()' on all children
+        index.clear_cache()
+        self.assertEqual(sorted([index1, index2]), sorted(log))
+
     def test_iter_all_entries_empty(self):
         index = CombinedGraphIndex([])
         self.assertEqual([], list(index.iter_all_entries()))

=== modified file 'bzrlib/versionedfile.py'
--- bzrlib/versionedfile.py 2009-08-17 22:08:21 +0000
+++ bzrlib/versionedfile.py 2009-10-20 04:27:13 +0000
@@ -930,6 +930,13 @@
     def check_not_reserved_id(version_id):
         revision.check_not_reserved_id(version_id)
 
+    def clear_cache(self):
+        """Clear whatever caches this VersionedFile holds.
+
+        This is generally called after an operation has been performed, when we
+        don't expect to be using this versioned file again soon.
+        """
+
     def _check_lines_not_unicode(self, lines):
         """Check that lines being added to a versioned file are not unicode."""
         for line in lines: