Merge lp:~jameinel/bzr/2.0-cache-cix into lp:bzr/2.0

Proposed by John A Meinel
Status: Merged
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/2.0-cache-cix
Merge into: lp:bzr/2.0
Diff against target: 300 lines
7 files modified
NEWS (+10/-2)
bzrlib/btree_index.py (+15/-9)
bzrlib/index.py (+2/-2)
bzrlib/repofmt/pack_repo.py (+11/-5)
bzrlib/tests/per_repository_chk/test_supported.py (+35/-0)
bzrlib/tests/test_btree_index.py (+39/-0)
bzrlib/tests/test_index.py (+10/-1)
To merge this branch: bzr merge lp:~jameinel/bzr/2.0-cache-cix
Reviewer: Vincent Ladeuil
Status: Approve
Review via email: mp+11465@code.launchpad.net
Revision history for this message
John A Meinel (jameinel) wrote :

This is meant to be part of the 2.0 stable series. Though my guess is that I missed 2.0rc2 and thus it shouldn't actually land until 2.0.1.

Which brings up a small point... Do we need to have a separate release branch for 2.0 releases, separate from the '2.0.NEXT' branch?

Anyway, this is a workaround for the problem identified in:
  https://bugs.edge.launchpad.net/bzr/+bug/402623

Because we access the cix pages in 'random order', once you have more than 1000 nodes (100k keys), you can start thrashing the cache. And any cache misses result in re-downloading that page and reparsing it. The net effect I observed originally was that we ended up downloading the whole .cix file about 5 times, which was 50MB out of the 150MB transfer of Launchpad. (So quite significant overall.)

(I should also note that my local bzr+plugins repo just hit 1016 nodes in the .cix, so even bzr is close to being affected by this, and it isn't *that* big. Of course, we'd have 25% more room if we didn't include new-root-texts in the conversion... :)
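The thrashing described here is easy to reproduce in miniature: with more distinct pages than cache slots and a random access pattern, most lookups miss, and each miss stands for a fresh download and reparse of that page. The sketch below uses a toy LRU class (illustrative only, not bzrlib's lru_cache API) to count the misses:

```python
import random
from collections import OrderedDict

class CountingLRU:
    """Toy LRU page cache that counts misses (a stand-in for an index
    page cache; not bzrlib's lru_cache.LRUCache API)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._pages = OrderedDict()
        self.misses = 0

    def get_page(self, page_no):
        if page_no in self._pages:
            self._pages.move_to_end(page_no)  # mark as recently used
            return self._pages[page_no]
        # Miss: in the real bug, this is a re-download plus a re-parse.
        self.misses += 1
        self._pages[page_no] = 'page-%d' % page_no
        if len(self._pages) > self.max_size:
            self._pages.popitem(last=False)   # evict least recently used
        return self._pages[page_no]

random.seed(42)
cache = CountingLRU(max_size=1000)  # roughly _NODE_CACHE_SIZE
for _ in range(20000):
    # 3000 distinct pages hit in random order, like chk lookups
    cache.get_page(random.randrange(3000))
# Only ~1/3 of the pages fit, so roughly 2/3 of lookups miss:
print(cache.misses)
```

With a plain dict instead (unlimited_cache=True), each page is fetched at most once, capping misses at the number of distinct pages.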

Revision history for this message
John A Meinel (jameinel) wrote :

There is a minor update in the branch:
=== modified file 'bzrlib/btree_index.py'
--- bzrlib/btree_index.py 2009-09-09 18:52:56 +0000
+++ bzrlib/btree_index.py 2009-09-09 19:31:54 +0000
@@ -696,9 +696,9 @@
                 if start_of_leaves is None:
                     start_of_leaves = self._row_offsets[-2]
                 if node_pos < start_of_leaves:
-                    self._internal_node_cache.add(node_pos, node)
+                    self._internal_node_cache[node_pos] = node
                 else:
-                    self._leaf_node_cache.add(node_pos, node)
+                    self._leaf_node_cache[node_pos] = node
             found[node_pos] = node
         return found

=== modified file 'bzrlib/tests/test_btree_index.py'
--- bzrlib/tests/test_btree_index.py 2009-09-09 18:52:56 +0000
+++ bzrlib/tests/test_btree_index.py 2009-09-09 19:31:26 +0000
@@ -1119,6 +1119,9 @@

     def test_supports_unlimited_cache(self):
         builder = btree_index.BTreeBuilder(reference_lists=0, key_elements=1)
+        nodes = self.make_nodes(500, 1, 0)
+        for node in nodes:
+            builder.add_node(*node)
         stream = builder.finish()
         trans = get_transport(self.get_url())
         size = trans.put_file('index', stream)
@@ -1140,6 +1143,7 @@
                                             unlimited_cache=True)
         self.assertIsInstance(index._leaf_node_cache, dict)
         self.assertIs(type(index._internal_node_cache), dict)
+        entries = set(index.iter_entries([nodes[0][1], nodes[300][0]]))

Basically, the test suite didn't notice that the caching functions weren't correct, because none of the test data created enough pages to actually trigger caching... I, of course, noticed while trying to download launchpad, which has 3k+ pages.
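The switched lines work because a plain dict has no .add() method, so with unlimited_cache=True the old code would blow up on first use; item assignment is the one interface the dict and the LRU cache share. A minimal sketch of that duck typing (BoundedCache here is illustrative, not bzrlib's LRUCache):

```python
from collections import OrderedDict

class BoundedCache(OrderedDict):
    """Toy size-limited cache supporting plain item assignment, so call
    sites need not care whether they hold this or a bare dict."""

    def __init__(self, max_size):
        super().__init__()
        self.max_size = max_size

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.move_to_end(key)
        if len(self) > self.max_size:
            self.popitem(last=False)  # evict the oldest entry

def cache_node(cache, node_pos, node):
    # Works for both a dict (unlimited) and a BoundedCache (limited);
    # a cache-only .add(key, value) method would break the dict case.
    cache[node_pos] = node

unlimited = {}
bounded = BoundedCache(max_size=2)
for pos in range(4):
    cache_node(unlimited, pos, 'node-%d' % pos)
    cache_node(bounded, pos, 'node-%d' % pos)
print(len(unlimited), len(bounded))  # 4 2
```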

This at least tests the 2-level case (one root, lots of leaves); it doesn't get to the 3-level case that would exercise the internal_node_cache path, but those make for slow tests...
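As a rough sanity check on why 500 keys suffice for the 2-level case (assumed numbers, not bzr's exact serialization: the page size is 4096 bytes, and the per-row byte count below is hypothetical):

```python
# Back-of-envelope: how many key rows fit in one btree page?
PAGE_SIZE = 4096            # bzr btree page size
BYTES_PER_ROW = 35          # hypothetical average serialized row
rows_per_page = PAGE_SIZE // BYTES_PER_ROW
print(rows_per_page)        # 117
# Even if zlib compression tripled that, 500 keys still need more than
# one leaf page, which forces a root node above them (the 2-level case).
print(500 > 3 * rows_per_page)  # True
```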

Revision history for this message
Vincent Ladeuil (vila) wrote :

Sounds fine to me and low risk enough to land for 2.0.1.

A comment is needed in test_supports_unlimited_cache to explain the magic values 500 and 300,
and also what the last line is testing (presence in the cache? Why not check for *all* nodes?)

+        entries = set(index.iter_entries([nodes[0][1], nodes[300][0]]))

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

> Sounds fine to me and low risk enough to land for 2.0.1.
>
> A comment is needed in test_supports_unlimited_cache to explain the magic
> values 500 and 300,
> and also what the last line is testing (presence in cache ? Why not checking
> for *all* nodes ?)
>
> +        entries = set(index.iter_entries([nodes[0][1], nodes[300][0]]))

I added a comment about 500, and added a couple tests that check the size of the rows, etc, to validate the results of the magic 500.

I switched to just iter_entries([n[0] for n in nodes]), asserting that it returns all nodes.

Preview Diff

=== modified file 'NEWS'
--- NEWS	2009-10-08 04:05:25 +0000
+++ NEWS	2009-10-09 15:09:13 +0000
@@ -2,8 +2,11 @@
 Bazaar Release Notes
 ####################
 
-2.0.1 (not released)
-####################
+.. contents:: List of Releases
+   :depth: 1
+
+bzr 2.0.1 (not released)
+########################
 
 Compatibility Breaks
 ********************
@@ -55,6 +58,11 @@
 * Retrieving file text or mtime from a _PreviewTree has good performance when
   there are many changes. (Aaron Bentley)
 
+* The CHK index pages now use an unlimited cache size. With a limited
+  cache and a large project, the random access of chk pages could cause us
+  to download the entire cix file many times.
+  (John Arbash Meinel, #402623)
+
 * When a file kind becomes unversionable after being added, a sensible
   error will be shown instead of a traceback. (Robert Collins, #438569)
 
=== modified file 'bzrlib/btree_index.py'
--- bzrlib/btree_index.py	2009-08-17 22:11:06 +0000
+++ bzrlib/btree_index.py	2009-10-09 15:09:13 +0000
@@ -628,7 +628,7 @@
     memory except when very large walks are done.
     """
 
-    def __init__(self, transport, name, size):
+    def __init__(self, transport, name, size, unlimited_cache=False):
         """Create a B+Tree index object on the index name.
 
         :param transport: The transport to read data for the index from.
@@ -638,6 +638,9 @@
             the initial read (to read the root node header) can be done
             without over-reading even on empty indices, and on small indices
             allows single-IO to read the entire index.
+        :param unlimited_cache: If set to True, then instead of using an
+            LRUCache with size _NODE_CACHE_SIZE, we will use a dict and always
+            cache all leaf nodes.
         """
         self._transport = transport
         self._name = name
@@ -647,12 +650,15 @@
         self._root_node = None
         # Default max size is 100,000 leave values
         self._leaf_value_cache = None # lru_cache.LRUCache(100*1000)
-        self._leaf_node_cache = lru_cache.LRUCache(_NODE_CACHE_SIZE)
-        # We could limit this, but even a 300k record btree has only 3k leaf
-        # nodes, and only 20 internal nodes. So the default of 100 nodes in an
-        # LRU would mean we always cache everything anyway, no need to pay the
-        # overhead of LRU
-        self._internal_node_cache = fifo_cache.FIFOCache(100)
+        if unlimited_cache:
+            self._leaf_node_cache = {}
+            self._internal_node_cache = {}
+        else:
+            self._leaf_node_cache = lru_cache.LRUCache(_NODE_CACHE_SIZE)
+            # We use a FIFO here just to prevent possible blowout. However, a
+            # 300k record btree has only 3k leaf nodes, and only 20 internal
+            # nodes. A value of 100 scales to ~100*100*100 = 1M records.
+            self._internal_node_cache = fifo_cache.FIFOCache(100)
         self._key_count = None
         self._row_lengths = None
         self._row_offsets = None # Start of each row, [-1] is the end
@@ -690,9 +696,9 @@
                 if start_of_leaves is None:
                     start_of_leaves = self._row_offsets[-2]
                 if node_pos < start_of_leaves:
-                    self._internal_node_cache.add(node_pos, node)
+                    self._internal_node_cache[node_pos] = node
                 else:
-                    self._leaf_node_cache.add(node_pos, node)
+                    self._leaf_node_cache[node_pos] = node
             found[node_pos] = node
         return found
 
=== modified file 'bzrlib/index.py'
--- bzrlib/index.py	2009-08-17 22:11:06 +0000
+++ bzrlib/index.py	2009-10-09 15:09:13 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2007, 2008 Canonical Ltd
+# Copyright (C) 2007, 2008, 2009 Canonical Ltd
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -368,7 +368,7 @@
     suitable for production use. :XXX
     """
 
-    def __init__(self, transport, name, size):
+    def __init__(self, transport, name, size, unlimited_cache=False):
         """Open an index called name on transport.
 
         :param transport: A bzrlib.transport.Transport.
=== modified file 'bzrlib/repofmt/pack_repo.py'
--- bzrlib/repofmt/pack_repo.py	2009-09-08 05:51:36 +0000
+++ bzrlib/repofmt/pack_repo.py	2009-10-09 15:09:13 +0000
@@ -224,10 +224,14 @@
         return self.index_name('text', name)
 
     def _replace_index_with_readonly(self, index_type):
+        unlimited_cache = False
+        if index_type == 'chk':
+            unlimited_cache = True
         setattr(self, index_type + '_index',
             self.index_class(self.index_transport,
                 self.index_name(index_type, self.name),
-                self.index_sizes[self.index_offset(index_type)]))
+                self.index_sizes[self.index_offset(index_type)],
+                unlimited_cache=unlimited_cache))
 
 
 class ExistingPack(Pack):
@@ -1674,7 +1678,7 @@
         txt_index = self._make_index(name, '.tix')
         sig_index = self._make_index(name, '.six')
         if self.chk_index is not None:
-            chk_index = self._make_index(name, '.cix')
+            chk_index = self._make_index(name, '.cix', unlimited_cache=True)
         else:
             chk_index = None
         result = ExistingPack(self._pack_transport, name, rev_index,
@@ -1699,7 +1703,8 @@
         txt_index = self._make_index(name, '.tix', resume=True)
         sig_index = self._make_index(name, '.six', resume=True)
         if self.chk_index is not None:
-            chk_index = self._make_index(name, '.cix', resume=True)
+            chk_index = self._make_index(name, '.cix', resume=True,
+                                         unlimited_cache=True)
         else:
             chk_index = None
         result = self.resumed_pack_factory(name, rev_index, inv_index,
@@ -1735,7 +1740,7 @@
         return self._index_class(self.transport, 'pack-names', None
             ).iter_all_entries()
 
-    def _make_index(self, name, suffix, resume=False):
+    def _make_index(self, name, suffix, resume=False, unlimited_cache=False):
         size_offset = self._suffix_offsets[suffix]
         index_name = name + suffix
         if resume:
@@ -1744,7 +1749,8 @@
         else:
             transport = self._index_transport
             index_size = self._names[name][size_offset]
-        return self._index_class(transport, index_name, index_size)
+        return self._index_class(transport, index_name, index_size,
+                                 unlimited_cache=unlimited_cache)
 
     def _max_pack_count(self, total_revisions):
         """Return the maximum number of packs to use for total revisions.
=== modified file 'bzrlib/tests/per_repository_chk/test_supported.py'
--- bzrlib/tests/per_repository_chk/test_supported.py	2009-09-08 06:25:26 +0000
+++ bzrlib/tests/per_repository_chk/test_supported.py	2009-10-09 15:09:13 +0000
@@ -17,8 +17,10 @@
 """Tests for repositories that support CHK indices."""
 
 from bzrlib import (
+    btree_index,
     errors,
     osutils,
+    repository,
     )
 from bzrlib.versionedfile import VersionedFiles
 from bzrlib.tests.per_repository_chk import TestCaseWithRepositoryCHK
@@ -108,6 +110,39 @@
         finally:
             repo.unlock()
 
+    def test_chk_bytes_are_fully_buffered(self):
+        repo = self.make_repository('.')
+        repo.lock_write()
+        self.addCleanup(repo.unlock)
+        repo.start_write_group()
+        try:
+            sha1, len, _ = repo.chk_bytes.add_lines((None,),
+                None, ["foo\n", "bar\n"], random_id=True)
+            self.assertEqual('4e48e2c9a3d2ca8a708cb0cc545700544efb5021',
+                             sha1)
+            self.assertEqual(
+                set([('sha1:4e48e2c9a3d2ca8a708cb0cc545700544efb5021',)]),
+                repo.chk_bytes.keys())
+        except:
+            repo.abort_write_group()
+            raise
+        else:
+            repo.commit_write_group()
+        # This may not always be correct if we change away from BTreeGraphIndex
+        # in the future. But for now, lets check that chk_bytes are fully
+        # buffered
+        index = repo.chk_bytes._index._graph_index._indices[0]
+        self.assertIsInstance(index, btree_index.BTreeGraphIndex)
+        self.assertIs(type(index._leaf_node_cache), dict)
+        # Re-opening the repository should also have a repo with everything
+        # fully buffered
+        repo2 = repository.Repository.open(self.get_url())
+        repo2.lock_read()
+        self.addCleanup(repo2.unlock)
+        index = repo2.chk_bytes._index._graph_index._indices[0]
+        self.assertIsInstance(index, btree_index.BTreeGraphIndex)
+        self.assertIs(type(index._leaf_node_cache), dict)
+
 
 class TestCommitWriteGroupIntegrityCheck(TestCaseWithRepositoryCHK):
     """Tests that commit_write_group prevents various kinds of invalid data
=== modified file 'bzrlib/tests/test_btree_index.py'
--- bzrlib/tests/test_btree_index.py	2009-08-13 19:56:26 +0000
+++ bzrlib/tests/test_btree_index.py	2009-10-09 15:09:13 +0000
@@ -23,6 +23,8 @@
 from bzrlib import (
     btree_index,
     errors,
+    fifo_cache,
+    lru_cache,
     osutils,
     tests,
     )
@@ -1115,6 +1117,43 @@
         self.assertEqual({}, parent_map)
         self.assertEqual(set([('one',), ('two',)]), missing_keys)
 
+    def test_supports_unlimited_cache(self):
+        builder = btree_index.BTreeBuilder(reference_lists=0, key_elements=1)
+        # We need enough nodes to cause a page split (so we have both an
+        # internal node and a couple leaf nodes. 500 seems to be enough.)
+        nodes = self.make_nodes(500, 1, 0)
+        for node in nodes:
+            builder.add_node(*node)
+        stream = builder.finish()
+        trans = get_transport(self.get_url())
+        size = trans.put_file('index', stream)
+        index = btree_index.BTreeGraphIndex(trans, 'index', size)
+        self.assertEqual(500, index.key_count())
+        # We have an internal node
+        self.assertEqual(2, len(index._row_lengths))
+        # We have at least 2 leaf nodes
+        self.assertTrue(index._row_lengths[-1] >= 2)
+        self.assertIsInstance(index._leaf_node_cache, lru_cache.LRUCache)
+        self.assertEqual(btree_index._NODE_CACHE_SIZE,
+                         index._leaf_node_cache._max_cache)
+        self.assertIsInstance(index._internal_node_cache, fifo_cache.FIFOCache)
+        self.assertEqual(100, index._internal_node_cache._max_cache)
+        # No change if unlimited_cache=False is passed
+        index = btree_index.BTreeGraphIndex(trans, 'index', size,
+                                            unlimited_cache=False)
+        self.assertIsInstance(index._leaf_node_cache, lru_cache.LRUCache)
+        self.assertEqual(btree_index._NODE_CACHE_SIZE,
+                         index._leaf_node_cache._max_cache)
+        self.assertIsInstance(index._internal_node_cache, fifo_cache.FIFOCache)
+        self.assertEqual(100, index._internal_node_cache._max_cache)
+        index = btree_index.BTreeGraphIndex(trans, 'index', size,
+                                            unlimited_cache=True)
+        self.assertIsInstance(index._leaf_node_cache, dict)
+        self.assertIs(type(index._internal_node_cache), dict)
+        # Exercise the lookup code
+        entries = set(index.iter_entries([n[0] for n in nodes]))
+        self.assertEqual(500, len(entries))
+
 
 class TestBTreeNodes(BTreeTestCase):
 
=== modified file 'bzrlib/tests/test_index.py'
--- bzrlib/tests/test_index.py	2009-08-13 19:56:26 +0000
+++ bzrlib/tests/test_index.py	2009-10-09 15:09:13 +0000
@@ -1,4 +1,4 @@
-# Copyright (C) 2007 Canonical Ltd
+# Copyright (C) 2007, 2009 Canonical Ltd
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -1006,6 +1006,15 @@
         self.assertEqual(set(), missing_keys)
         self.assertEqual(set(), search_keys)
 
+    def test_supports_unlimited_cache(self):
+        builder = GraphIndexBuilder(0, key_elements=1)
+        stream = builder.finish()
+        trans = get_transport(self.get_url())
+        size = trans.put_file('index', stream)
+        # It doesn't matter what unlimited_cache does here, just that it can be
+        # passed
+        index = GraphIndex(trans, 'index', size, unlimited_cache=True)
+
 
 class TestCombinedGraphIndex(TestCaseWithMemoryTransport):
