Merge lp:~jameinel/bzr/2.0-cache-cix into lp:bzr/2.0
Status: Merged
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/2.0-cache-cix
Merge into: lp:bzr/2.0
Diff against target: 300 lines, 7 files modified
  NEWS (+10/-2)
  bzrlib/btree_index.py (+15/-9)
  bzrlib/index.py (+2/-2)
  bzrlib/repofmt/pack_repo.py (+11/-5)
  bzrlib/tests/per_repository_chk/test_supported.py (+35/-0)
  bzrlib/tests/test_btree_index.py (+39/-0)
  bzrlib/tests/test_index.py (+10/-1)
To merge this branch: bzr merge lp:~jameinel/bzr/2.0-cache-cix
Related bugs:
Reviewer: Vincent Ladeuil (status: Approve)
Review via email: mp+11465@code.launchpad.net
Commit message
Description of the change
John A Meinel (jameinel) wrote:
There is a minor update in the branch:
=== modified file 'bzrlib/btree_index.py'
--- bzrlib/btree_index.py
+++ bzrlib/btree_index.py
@@ -696,9 +696,9 @@
             if start_of_leaves is None:
                 start_of_leaves = self._row_offsets[-2]
             if node_pos < start_of_leaves:
-                self._internal_node_cache.add(node_pos, node)
+                self._internal_node_cache[node_pos] = node
             else:
-                self._leaf_node_cache.add(node_pos, node)
+                self._leaf_node_cache[node_pos] = node
             found[node_pos] = node
         return found

=== modified file 'bzrlib/tests/test_btree_index.py'
--- bzrlib/tests/test_btree_index.py
+++ bzrlib/tests/test_btree_index.py
@@ -1119,6 +1119,9 @@
     def test_supports_unlimited_cache(self):
         builder = btree_index.BTreeBuilder(reference_lists=0, key_elements=1)
+        nodes = self.make_nodes(500, 1, 0)
+        for node in nodes:
+            builder.add_node(*node)
         stream = builder.finish()
         trans = get_transport(self.get_url())
         size = trans.put_file('index', stream)
@@ -1140,6 +1143,7 @@
+        entries = set(index.
Basically, the test suite didn't notice that the caching functions weren't correct, because the test data wasn't creating enough pages to actually trigger caching... I, of course, noticed while trying to download Launchpad, which has 3k+ pages.
This at least tests the 2-level case (one root, lots of leaves). It doesn't get to the 3-level case that would exercise the internal_node_cache path, but those make for slow tests...
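As a rough illustration of why 500 keys is enough for the 2-level case while the 3-level case needs far more (this is not bzrlib code; the ~100-entries-per-page fanout is an assumed round number, since real page capacity depends on key size and zlib compression):

```python
# Hypothetical sketch: estimate btree depth for a given key count, assuming
# a fixed fanout of ~100 entries per page (an assumption, not bzrlib's
# actual page capacity).
FANOUT = 100

def estimated_levels(n_keys):
    """Return how many rows (root + internal + leaf) the tree needs."""
    levels, capacity = 1, FANOUT
    while n_keys > capacity:
        levels += 1
        capacity *= FANOUT
    return levels

print(estimated_levels(500))     # 2 levels: one root plus several leaves
print(estimated_levels(50000))   # 3 levels: root, internal row, leaves
```

Under this model, anything over ~100 keys splits into two levels, but forcing a third level means tens of thousands of keys, which is why that test would be slow.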
Vincent Ladeuil (vila) wrote:
Sounds fine to me and low risk enough to land for 2.0.1.
A comment is needed in test_supports_unlimited_cache regarding the values 500 and 300,
and also what the last line is testing (presence in cache? Why not checking for *all* nodes?)
+ entries = set(index.
John A Meinel (jameinel) wrote:
> Sounds fine to me and low risk enough to land for 2.0.1.
>
> A comment is needed in test_supports_unlimited_cache regarding the
> values 500 and 300,
> and also what the last line is testing (presence in cache ? Why not checking
> for *all* nodes ?)
>
> + entries = set(index.
I added a comment about 500, and added a couple tests that check the size of the rows, etc, to validate the results of the magic 500.
I switched to just iter_entries([n[0] for n in nodes]), and asserting that it returns all nodes.
Preview Diff
=== modified file 'NEWS'
--- NEWS	2009-10-08 04:05:25 +0000
+++ NEWS	2009-10-09 15:09:13 +0000
@@ -2,8 +2,11 @@
 Bazaar Release Notes
 ####################
 
-2.0.1 (not released)
-####################
+.. contents:: List of Releases
+   :depth: 1
+
+bzr 2.0.1 (not released)
+########################
 
 Compatibility Breaks
 ********************
@@ -55,6 +58,11 @@
 * Retrieving file text or mtime from a _PreviewTree has good performance when
   there are many changes. (Aaron Bentley)
 
+* The CHK index pages now use an unlimited cache size. With a limited
+  cache and a large project, the random access of chk pages could cause us
+  to download the entire cix file many times.
+  (John Arbash Meinel, #402623)
+
 * When a file kind becomes unversionable after being added, a sensible
   error will be shown instead of a traceback. (Robert Collins, #438569)
 

=== modified file 'bzrlib/btree_index.py'
--- bzrlib/btree_index.py	2009-08-17 22:11:06 +0000
+++ bzrlib/btree_index.py	2009-10-09 15:09:13 +0000
@@ -628,7 +628,7 @@
         memory except when very large walks are done.
     """
 
-    def __init__(self, transport, name, size):
+    def __init__(self, transport, name, size, unlimited_cache=False):
         """Create a B+Tree index object on the index name.
 
         :param transport: The transport to read data for the index from.
@@ -638,6 +638,9 @@
             the initial read (to read the root node header) can be done
             without over-reading even on empty indices, and on small indices
             allows single-IO to read the entire index.
+        :param unlimited_cache: If set to True, then instead of using an
+            LRUCache with size _NODE_CACHE_SIZE, we will use a dict and always
+            cache all leaf nodes.
         """
         self._transport = transport
         self._name = name
@@ -647,12 +650,15 @@
         self._root_node = None
         # Default max size is 100,000 leave values
         self._leaf_value_cache = None # lru_cache.LRUCache(100*1000)
-        self._leaf_node_cache = lru_cache.LRUCache(_NODE_CACHE_SIZE)
-        # We could limit this, but even a 300k record btree has only 3k leaf
-        # nodes, and only 20 internal nodes. So the default of 100 nodes in an
-        # LRU would mean we always cache everything anyway, no need to pay the
-        # overhead of LRU
-        self._internal_node_cache = fifo_cache.FIFOCache(100)
+        if unlimited_cache:
+            self._leaf_node_cache = {}
+            self._internal_node_cache = {}
+        else:
+            self._leaf_node_cache = lru_cache.LRUCache(_NODE_CACHE_SIZE)
+            # We use a FIFO here just to prevent possible blowout. However, a
+            # 300k record btree has only 3k leaf nodes, and only 20 internal
+            # nodes. A value of 100 scales to ~100*100*100 = 1M records.
+            self._internal_node_cache = fifo_cache.FIFOCache(100)
         self._key_count = None
         self._row_lengths = None
         self._row_offsets = None # Start of each row, [-1] is the end
@@ -690,9 +696,9 @@
             if start_of_leaves is None:
                 start_of_leaves = self._row_offsets[-2]
             if node_pos < start_of_leaves:
-                self._internal_node_cache.add(node_pos, node)
+                self._internal_node_cache[node_pos] = node
             else:
-                self._leaf_node_cache.add(node_pos, node)
+                self._leaf_node_cache[node_pos] = node
             found[node_pos] = node
         return found
 

=== modified file 'bzrlib/index.py'
--- bzrlib/index.py	2009-08-17 22:11:06 +0000
+++ bzrlib/index.py	2009-10-09 15:09:13 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2007, 2008 Canonical Ltd
+# Copyright (C) 2007, 2008, 2009 Canonical Ltd
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -368,7 +368,7 @@
     suitable for production use. :XXX
     """
 
-    def __init__(self, transport, name, size):
+    def __init__(self, transport, name, size, unlimited_cache=False):
         """Open an index called name on transport.
 
         :param transport: A bzrlib.transport.Transport.

=== modified file 'bzrlib/repofmt/pack_repo.py'
--- bzrlib/repofmt/pack_repo.py	2009-09-08 05:51:36 +0000
+++ bzrlib/repofmt/pack_repo.py	2009-10-09 15:09:13 +0000
@@ -224,10 +224,14 @@
         return self.index_name('text', name)
 
     def _replace_index_with_readonly(self, index_type):
+        unlimited_cache = False
+        if index_type == 'chk':
+            unlimited_cache = True
         setattr(self, index_type + '_index',
             self.index_class(self.index_transport,
                 self.index_name(index_type, self.name),
-                self.index_sizes[self.index_offset(index_type)]))
+                self.index_sizes[self.index_offset(index_type)],
+                unlimited_cache=unlimited_cache))
 
 
 class ExistingPack(Pack):
@@ -1674,7 +1678,7 @@
         txt_index = self._make_index(name, '.tix')
         sig_index = self._make_index(name, '.six')
         if self.chk_index is not None:
-            chk_index = self._make_index(name, '.cix')
+            chk_index = self._make_index(name, '.cix', unlimited_cache=True)
         else:
             chk_index = None
         result = ExistingPack(self._pack_transport, name, rev_index,
@@ -1699,7 +1703,8 @@
         txt_index = self._make_index(name, '.tix', resume=True)
         sig_index = self._make_index(name, '.six', resume=True)
         if self.chk_index is not None:
-            chk_index = self._make_index(name, '.cix', resume=True)
+            chk_index = self._make_index(name, '.cix', resume=True,
+                                         unlimited_cache=True)
         else:
             chk_index = None
         result = self.resumed_pack_factory(name, rev_index, inv_index,
@@ -1735,7 +1740,7 @@
         return self._index_class(self.transport, 'pack-names', None
                 ).iter_all_entries()
 
-    def _make_index(self, name, suffix, resume=False):
+    def _make_index(self, name, suffix, resume=False, unlimited_cache=False):
         size_offset = self._suffix_offsets[suffix]
         index_name = name + suffix
         if resume:
@@ -1744,7 +1749,8 @@
         else:
             transport = self._index_transport
         index_size = self._names[name][size_offset]
-        return self._index_class(transport, index_name, index_size)
+        return self._index_class(transport, index_name, index_size,
+                                 unlimited_cache=unlimited_cache)
 
     def _max_pack_count(self, total_revisions):
         """Return the maximum number of packs to use for total revisions.
 

=== modified file 'bzrlib/tests/per_repository_chk/test_supported.py'
--- bzrlib/tests/per_repository_chk/test_supported.py	2009-09-08 06:25:26 +0000
+++ bzrlib/tests/per_repository_chk/test_supported.py	2009-10-09 15:09:13 +0000
@@ -17,8 +17,10 @@
 """Tests for repositories that support CHK indices."""
 
 from bzrlib import (
+    btree_index,
     errors,
     osutils,
+    repository,
     )
 from bzrlib.versionedfile import VersionedFiles
 from bzrlib.tests.per_repository_chk import TestCaseWithRepositoryCHK
@@ -108,6 +110,39 @@
         finally:
             repo.unlock()
 
+    def test_chk_bytes_are_fully_buffered(self):
+        repo = self.make_repository('.')
+        repo.lock_write()
+        self.addCleanup(repo.unlock)
+        repo.start_write_group()
+        try:
+            sha1, len, _ = repo.chk_bytes.add_lines((None,),
+                None, ["foo\n", "bar\n"], random_id=True)
+            self.assertEqual('4e48e2c9a3d2ca8a708cb0cc545700544efb5021',
+                             sha1)
+            self.assertEqual(
+                set([('sha1:4e48e2c9a3d2ca8a708cb0cc545700544efb5021',)]),
+                repo.chk_bytes.keys())
+        except:
+            repo.abort_write_group()
+            raise
+        else:
+            repo.commit_write_group()
+        # This may not always be correct if we change away from BTreeGraphIndex
+        # in the future. But for now, lets check that chk_bytes are fully
+        # buffered
+        index = repo.chk_bytes._index._graph_index._indices[0]
+        self.assertIsInstance(index, btree_index.BTreeGraphIndex)
+        self.assertIs(type(index._leaf_node_cache), dict)
+        # Re-opening the repository should also have a repo with everything
+        # fully buffered
+        repo2 = repository.Repository.open(self.get_url())
+        repo2.lock_read()
+        self.addCleanup(repo2.unlock)
+        index = repo2.chk_bytes._index._graph_index._indices[0]
+        self.assertIsInstance(index, btree_index.BTreeGraphIndex)
+        self.assertIs(type(index._leaf_node_cache), dict)
+
 
 class TestCommitWriteGroupIntegrityCheck(TestCaseWithRepositoryCHK):
     """Tests that commit_write_group prevents various kinds of invalid data
 

=== modified file 'bzrlib/tests/test_btree_index.py'
--- bzrlib/tests/test_btree_index.py	2009-08-13 19:56:26 +0000
+++ bzrlib/tests/test_btree_index.py	2009-10-09 15:09:13 +0000
@@ -23,6 +23,8 @@
 from bzrlib import (
     btree_index,
     errors,
+    fifo_cache,
+    lru_cache,
     osutils,
     tests,
     )
@@ -1115,6 +1117,43 @@
         self.assertEqual({}, parent_map)
         self.assertEqual(set([('one',), ('two',)]), missing_keys)
 
+    def test_supports_unlimited_cache(self):
+        builder = btree_index.BTreeBuilder(reference_lists=0, key_elements=1)
+        # We need enough nodes to cause a page split (so we have both an
+        # internal node and a couple leaf nodes. 500 seems to be enough.)
+        nodes = self.make_nodes(500, 1, 0)
+        for node in nodes:
+            builder.add_node(*node)
+        stream = builder.finish()
+        trans = get_transport(self.get_url())
+        size = trans.put_file('index', stream)
+        index = btree_index.BTreeGraphIndex(trans, 'index', size)
+        self.assertEqual(500, index.key_count())
+        # We have an internal node
+        self.assertEqual(2, len(index._row_lengths))
+        # We have at least 2 leaf nodes
+        self.assertTrue(index._row_lengths[-1] >= 2)
+        self.assertIsInstance(index._leaf_node_cache, lru_cache.LRUCache)
+        self.assertEqual(btree_index._NODE_CACHE_SIZE,
+                         index._leaf_node_cache._max_cache)
+        self.assertIsInstance(index._internal_node_cache, fifo_cache.FIFOCache)
+        self.assertEqual(100, index._internal_node_cache._max_cache)
+        # No change if unlimited_cache=False is passed
+        index = btree_index.BTreeGraphIndex(trans, 'index', size,
+                                            unlimited_cache=False)
+        self.assertIsInstance(index._leaf_node_cache, lru_cache.LRUCache)
+        self.assertEqual(btree_index._NODE_CACHE_SIZE,
+                         index._leaf_node_cache._max_cache)
+        self.assertIsInstance(index._internal_node_cache, fifo_cache.FIFOCache)
+        self.assertEqual(100, index._internal_node_cache._max_cache)
+        index = btree_index.BTreeGraphIndex(trans, 'index', size,
+                                            unlimited_cache=True)
+        self.assertIsInstance(index._leaf_node_cache, dict)
+        self.assertIs(type(index._internal_node_cache), dict)
+        # Exercise the lookup code
+        entries = set(index.iter_entries([n[0] for n in nodes]))
+        self.assertEqual(500, len(entries))
+
 
 class TestBTreeNodes(BTreeTestCase):
 

=== modified file 'bzrlib/tests/test_index.py'
--- bzrlib/tests/test_index.py	2009-08-13 19:56:26 +0000
+++ bzrlib/tests/test_index.py	2009-10-09 15:09:13 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2007 Canonical Ltd
+# Copyright (C) 2007, 2009 Canonical Ltd
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -1006,6 +1006,15 @@
         self.assertEqual(set(), missing_keys)
         self.assertEqual(set(), search_keys)
 
+    def test_supports_unlimited_cache(self):
+        builder = GraphIndexBuilder(0, key_elements=1)
+        stream = builder.finish()
+        trans = get_transport(self.get_url())
+        size = trans.put_file('index', stream)
+        # It doesn't matter what unlimited_cache does here, just that it can be
+        # passed
+        index = GraphIndex(trans, 'index', size, unlimited_cache=True)
+
 
 class TestCombinedGraphIndex(TestCaseWithMemoryTransport):
This is meant to be part of the 2.0 stable series. Though my guess is that I missed 2.0rc2 and thus it shouldn't actually land until 2.0.1.
Which brings up a small point... Do we need to have a separate release branch for 2.0 releases, separate from the '2.0.NEXT' branch?
Anyway, this is a workaround for the problem identified in: https://bugs.edge.launchpad.net/bzr/+bug/402623
Because we access the cix pages in 'random order', once you have more than 1000 nodes (100k keys), you can start thrashing the cache. And any cache misses result in re-downloading that page and reparsing it. The net effect I observed originally was that we ended up downloading the whole .cix file about 5 times, which was 50MB out of the 150MB transfer of Launchpad. (So quite significant overall.)
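To make that failure mode concrete, here is a small standalone simulation (not bzrlib code; the page counts, cache sizes, and random access pattern are illustrative stand-ins) comparing a size-limited LRU cache against an unlimited dict, counting each cache miss as a page re-download:

```python
import random
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache, an illustrative stand-in for bzrlib's lru_cache."""
    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()
        self.misses = 0

    def get(self, key, fetch):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        self.misses += 1                 # cache miss: page must be downloaded
        value = fetch(key)
        self._data[key] = value
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used
        return value

random.seed(42)
n_pages = 3000   # e.g. Launchpad's .cix had 3k+ pages
accesses = [random.randrange(n_pages) for _ in range(15000)]  # random-order lookups

lru = LRUCache(max_size=1000)    # limited cache, smaller than the index
for page in accesses:
    lru.get(page, fetch=lambda k: object())

unlimited, dict_misses = {}, 0   # unlimited dict cache: fetch each page once
for page in accesses:
    if page not in unlimited:
        dict_misses += 1
        unlimited[page] = object()

# The LRU re-fetches pages many times over; the dict never fetches a page twice.
print(lru.misses, dict_misses)
```

With the working set (3000 pages) larger than the cache (1000 entries) and no access locality, most LRU lookups miss, which is exactly the repeated-download behavior described above; the dict caps total fetches at the number of distinct pages.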
(I should also note that my local bzr+plugins repo just hit 1016 nodes in the .cix, so even bzr is close to being affected by this, and it isn't *that* big. Of course, we'd have 25% more room if we didn't include new-root-texts in the conversion... :)