Merge lp:~spiv/bzr/smarter-index-search into lp:bzr

Proposed by Andrew Bennetts
Status: Merged
Approved by: Andrew Bennetts
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~spiv/bzr/smarter-index-search
Merge into: lp:bzr
Diff against target: 365 lines (+176/-29)
5 files modified
NEWS (+6/-0)
bzrlib/index.py (+95/-7)
bzrlib/repofmt/pack_repo.py (+14/-22)
bzrlib/tests/per_pack_repository.py (+17/-0)
bzrlib/tests/test_index.py (+44/-0)
To merge this branch: bzr merge lp:~spiv/bzr/smarter-index-search
Reviewer Review Type Date Requested Status
Robert Collins (community) Approve
Martin Packman (community) Approve
John A Meinel Pending
Review via email: mp+21615@code.launchpad.net

This proposal supersedes a proposal from 2010-03-12.

Description of the change

Optimise index lookups in repositories with many pack files.

First, the headline: this greatly improves "bzr pull" of one new revision of grub from savannah by HTTP (as reported on the mailing list, which has further analysis):

bzr.dev: 2424kB (50.2kB/s r:2395kB w:30kB)
this patch: 1034kB (43.3kB/s r:1022kB w:12kB)

Given that the pack data transferred is 701266 bytes (which itself seems quite large for such a small change...), that brings the index-searching overhead from 2.42x to 0.45x of the bytes read. It also halves the wall-clock time :)
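
A quick sanity check of those ratios (my reading, not spelled out in the post: overhead = (bytes read − pack bytes) / pack bytes, taking the r: figures as total bytes read and 1 kB as 1000 bytes):

```python
# Rough reproduction of the 2.42x / 0.45x overhead figures quoted above.
pack_bytes = 701266  # pack data actually needed for the fetch

overhead_before = (2395 * 1000 - pack_bytes) / float(pack_bytes)
overhead_after = (1022 * 1000 - pack_bytes) / float(pack_bytes)
# overhead_before comes out around 2.42; overhead_after around 0.46
# (the 0.45 in the text presumably reflects slightly different rounding).
```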

That repo has, I think, 14 packs, and bzr.dev tries 11 indices for each of rix, iix, etc. before finding the data it needs for that fetch.

There are two parts to this change:

 1) When a CombinedGraphIndex performs a lookup, it shuffles the index or indices that contained the records to the front of self._indices, on the assumption that future lookups should try those first.
 2) It propagates that reordering to the other CombinedGraphIndex objects from the same pack collection. This is done by: a) associating a name (the pack name) with the elements of CombinedGraphIndex; b) linking the revisions/inventories/etc. CombinedGraphIndex objects belonging to a single pack collection by setting a _sibling_indices attribute on them; and c) using those links and names to apply the same reordering to those sibling indices.
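
Part 1 can be sketched roughly like this (an illustrative standalone helper, not bzrlib's actual code; the real logic is in the diff's _move_to_front_by_index):

```python
# Sketch of the stable move-to-front reordering: indices that produced
# results move to the front; everything else keeps its relative order.

def move_to_front(indices, names, hit_indices):
    hit_info, unhit_info, hit_names = [], [], []
    for name, idx in zip(names, indices):
        if idx in hit_indices:
            hit_info.append((name, idx))
            hit_names.append(name)  # names let siblings mirror the reorder
        else:
            unhit_info.append((name, idx))
    final = hit_info + unhit_info
    return [i for _, i in final], [n for n, _ in final], hit_names

# Stand-in strings for GraphIndex objects and pack names:
indices = ['rix-1', 'rix-2', 'rix-3']
names = ['pack-1', 'pack-2', 'pack-3']
new_indices, new_names, hits = move_to_front(indices, names, ['rix-3'])
# new_indices == ['rix-3', 'rix-1', 'rix-2']; hits == ['pack-3']
```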

I've been pretty conservative with API changes: the new behaviour is only activated by optional keyword arguments, so existing uses of CombinedGraphIndex should see no change of behaviour (including no improvement). This is to make it as easy as possible to backport this change to 2.1 and 2.0 if we choose to.

I think this change needs some tests before it's truly ready to merge, but it's getting to the end of my work week and I think this code is ready for feedback, so here it is!
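
The sibling propagation in part 2 can be sketched as follows (simplified illustrative classes, not bzrlib's actual API; the real code uses set_sibling_indices and _move_to_front_by_name):

```python
# Sketch: a reorder in one combined index is mirrored, by pack name,
# in its sibling combined indices from the same pack collection.

class SketchIndex(object):
    def __init__(self, names):
        self.names = list(names)
        self.siblings = []

    def _move_to_front_by_name(self, hit_names):
        # Stable partition: hit names first, relative order preserved.
        self.names = ([n for n in self.names if n in hit_names] +
                      [n for n in self.names if n not in hit_names])

    def record_hits(self, hit_names):
        self._move_to_front_by_name(hit_names)
        for sib in self.siblings:
            sib._move_to_front_by_name(hit_names)

rix = SketchIndex(['pack-1', 'pack-2', 'pack-3'])
iix = SketchIndex(['pack-1', 'pack-2', 'pack-3'])
rix.siblings = [iix]
rix.record_hits(['pack-3'])
# Both rix and iix now try pack-3 first.
```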

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/smarter-index-search into lp:bzr.
> [...]

Thanks for doing this. I did the work in Packer, but it is nice to have
it here.

I would tend to set "auto_reorder" to always on. I don't see it ever
making things worse, and there should only be a small overhead for the
reordering. As such, I would also get rid of the constructor flag.

Doing it by names is ok, I think Index already has _name for most
indices. You could also order by "access_tuple()" which I believe is a
public API and returns (transport, name) to the original .pack file. The
*nice* thing about doing that is that you don't have to introduce any
new apis here. But if it isn't as clean, then don't worry too much about it.

It also avoids having to rebuild the mapping for every move call:
    +        indices_info = zip(self._index_names, self._indices)
    +        hit_indices_info = []
    +        hit_names = []
    +        unhit_indices_info = []
    +        for name, idx in indices_info:
    +            if idx in hit_indices:
    +                info = hit_indices_info
    +                hit_names.append(name)
    +            else:
    +                info = unhit_indices_info
    +        ...


Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal


Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/smarter-index-search into lp:bzr.
> [...]

I forgot to vote.

  review: needs_fixing

Because it certainly should have some amount of testing. If only to
prevent future regressions (like we already have :). Though you don't
have to test it at the "get_stream()" level.

John
=:->

review: Needs Fixing
Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

John A Meinel wrote:
[...]
> Thanks for doing this. I did the work in Packer, but it is nice to have
> it here.

Ah! I thought we had something like this. I'll see what code and ideas
I can reuse.

> I would tend to set "auto_reorder" to always on. I don't see it ever
> making things worse, and there should only be a small overhead for the
> reordering. As such, I would also get rid of the constructor flag.

Hmm, ok. If I was just targeting bzr.dev I would have done this
already, but if you're confident it won't cause unexpected issues I'm
happy to make it unconditional (including in backports).

> Doing it by names is ok, I think Index already has _name for most
> indices. You could also order by "access_tuple()" which I believe is a
> public API and returns (transport, name) to the original .pack file. The
> *nice* thing about doing that is that you don't have to introduce any
> new apis here. But if it isn't as clean, then don't worry too much about it.

_name is really a path for _transport, and as such includes the suffix.
So I don't think it's as clean either code-wise or conceptually to rely
on that. And as you noticed later access_tuple is on AggregateIndex, so
it's not readily available either.

> If it isn't too much overhead, you might consider:
>
>     keys_remove = keys.remove
>     # Be careful, if keys starts empty, we would break, and get a name error
>     index_hit_count = 0
>     for index in self._indices:
>         if not keys:
>             break
>         index_hit_count = 0
>         for node in index.iter_entries(keys):
>             keys_remove(node[1])
>             yield node
>             index_hit_count += 1
>
>         if index_hit_count:
>             hit_indices.append((index_hit_count, index))
>
>     hit_indices.sort(reverse=True)
>
>     self._move_index_order(hit_indices)
>
>
> The nice thing about it, is that it is adaptive. In that if you start a
> search, and hit it in index 1, then you keep looking there, but if the
> keys start showing up in index 2, then you'll switch. The only other bit
> of logic that I could think of is:

I don't quite follow you here. Wouldn't my logic switch to preferring
index 2 anyway?

It's not clear to me that ordering the hit_indices by hit count is
better.

> if index_hit_count == index.key_count():
> index_hit_count = 0 (or maybe 1)
>
> I realize to do this correctly, you'd need to track hit count between
> calls, which isn't realistic. However, that basic check will work for
> pack files that only have 1 or 2 keys that are found quickly. (And
> commit creates indexes with only 1 entry, so it is fairly common.) Just
> a thought, though.

This is an interesting idea. Although perhaps those lookups will
already fail very quickly, because by that point the entire index has
already been read and cached? I don't think I'll do this in this patch,
as I think it'd need more investigation and measurement to justify, but
it might make a good incremental improvement later.

> +        # Tell all the CombinedGraphIndex objects about each other, so they
> +        # can share hints about which pack names to search first.
> +        all_combined = [agg_idx.combined_index for agg_idx in all_indices]
> +        for combined_idx in all_combined:
> +        ...


Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

Andrew Bennetts wrote:
> John A Meinel wrote:
[...]
> > The nice thing about it, is that it is adaptive. In that if you start a
> > search, and hit it in index 1, then you keep looking there, but if the
> > keys start showing up in index 2, then you'll switch. The only other bit
> > of logic that I could think of is:
>
> I don't quite follow you here. Wouldn't my logic switch to preferring
> index 2 anyway?

On reflection, I *think* what you mean is if a call to iter_entries (via
get_parent_map or otherwise) queries for [key_from_1, key_from_2,
key_from_2], my logic will keep the index order as [1, 2], but it might
be better to switch them to [2, 1]. (I initially thought that you were
talking about a search conducted over multiple calls to iter_entries,
not just one.)

That could be true, but I'm not sure it would really make a big
difference in practice. How about writing a patch for my patch and
doing some measurements? I'll happily review it :)

Or do you already have some analysis that would support this from when
you changed the Packer code?

-Andrew.

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

Andrew Bennetts wrote:
> John A Meinel wrote:
[...]
>> The nice thing about it, is that it is adaptive. In that if you start a
>> search, and hit it in index 1, then you keep looking there, but if the
>> keys start showing up in index 2, then you'll switch.
>
> I don't quite follow you here. Wouldn't my logic switch to preferring
> index 2 anyway?
>
> It's not clear to me that ordering the hit_indices by hit count is
> better.
>

The point is that if the first index in the list gets 1 key hit, but the
second gets 10, and the third gets 5, then it will get sorted into
(index2, index3, index1) rather than (index1, index2, index3). At least,
what I saw with your code was that it is just 'preserve existing order,
but move the ones hit to the front'. Put another way, say the hits are:

5 index1
10 index2
0 index3
4 index5
0 index6

your version would sort to

index1
index2
index5
index3
index6

mine would sort to

index2
index1
index5
index3
index6

I think the latter is slightly preferred. I don't have any numbers to
back that up, just a feeling. (Most likely in this case index2 is a
'bigger' index, and thus has more to *be* hit.)
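
The two orderings being contrasted can be reproduced directly from the hypothetical hit counts above:

```python
# Hit counts from the example above (illustrative data only).
hits = {'index1': 5, 'index2': 10, 'index3': 0, 'index5': 4, 'index6': 0}
order = ['index1', 'index2', 'index3', 'index5', 'index6']

# Andrew's patch: move hit indices to the front, preserving their order.
mtf = ([i for i in order if hits[i]] +
       [i for i in order if not hits[i]])
# -> ['index1', 'index2', 'index5', 'index3', 'index6']

# John's suggestion: sort the hit indices by descending hit count.
by_count = (sorted((i for i in order if hits[i]), key=lambda i: -hits[i]) +
            [i for i in order if not hits[i]])
# -> ['index2', 'index1', 'index5', 'index3', 'index6']
```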

As for small indices getting rejected quickly, it is true they won't
encounter any I/O overhead, which may be sufficient. It is a function
call, but th...


Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

John A Meinel wrote:
[...]
> I think the latter is slightly preferred. I don't have any numbers to
> back that up, just a feeling. (Most likely in this case index2 is a
> 'bigger' index, and thus has more to *be* hit.)

I agree it probably helps, a little, but I'm not going to do it without
some measurements to confirm it does, because of the risk of
counter-intuitive results from untested optimisations. I know you've done
more exploring in this area, so I trust your intuitions a bit more than
mine, but I'm still reluctant, as the bang for buck seems low.

Also, even my simple logic will tend to keep the most frequently hit
indices closer to the front than others after more and more queries,
just not quite as quickly as it would with hit counts. And this way we
don't need to worry about whether we should be keeping and aging hit
counts between queries, etc.

> As for small indices getting rejected quickly, it is true they won't
> encounter any I/O overhead, which may be sufficient. It is a function
> call, but that is probably tiny overhead, and they'll get sorted late
> quickly anyway.

Right. (FWIW, it's I/O overhead I'm most worried about.)

Revision history for this message
Andrew Bennetts (spiv) wrote :

I think all review comments have been addressed, but I'm sure John will tell me if I've missed something :)

Revision history for this message
Martin Packman (gz) wrote :

Don't know the code well enough to do a thorough review, but looks sane to me, tests pass, and it'll save me bandwidth.

review: Approve
Revision history for this message
Robert Collins (lifeless) wrote :

Andrew, when you and John are happy, I think you should just land this.

On the 'always reorder' side: if you do that, please check that the end-of-transaction code that pops off the in-progress indices won't be affected; back in the dim past it used [0] to 'know' that the first index was the one being written.

review: Approve
Revision history for this message
Andrew Bennetts (spiv) wrote :

Robert: Thanks for suggesting that. I've re-read a bunch of code and grepped, and as far as I can see, nothing is cheating by assuming [0] is the in-progress index that is being written. The test suite seems to pass too :)

So no changes needed there. I'm going to land this now before it goes stale.

Revision history for this message
John A Meinel (jameinel) wrote :


Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/smarter-index-search into lp:bzr.
> [...]

This doesn't seem to be as cheap as I expected it to be. I'm not 100%
sure who is at fault here, but I'm doing some profiling of loggerhead +
bzr-history-db, and I came across this:

       25765 3.4509 1.8229 bzrlib.index:1425(_move_to_front_by_index)
     +489715 0.8904 0.8904 +bzrlib.btree_index:689(__eq__)
     +541075 0.6721 0.6721 +<method 'append' of 'list' objects>
      +25765 0.0654 0.0654 +<zip>
      489809 0.8906 0.8906 bzrlib.btree_index:689(__eq__)
      590334 0.7374 0.7374 <method 'append' of 'list' objects>
         153 0.2031 0.2031 <method 'write' of 'file' objects>
       20612 3.0278 0.1986 bzrlib.index:1451(_move_to_front_by_name)
      +20612 2.7483 1.4505 +bzrlib.index:1425(_move_to_front_by_index)

I'm guessing the issue is something walking over the ancestry (like bzr log).

5153 3.8017 0.0712 bzrlib.index:1410(_move_to_front)

Note that only 5k calls to _move_to_front gets increased to 25k calls to
_move_to_front_by_index. (probably because there are
rix/iix/tix/cix/six, so a 5:1 factor).

However, we must be missing something for move to be that expensive.

I'm guessing this is:
https://bugs.launchpad.net/bugs/562429
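
One plausible reading of this profile (my guess; not confirmed in this thread): _move_to_front_by_index tests `idx in hit_indices` against a plain list, so every test calls __eq__ once per element, which would match the ~490k BTreeGraphIndex.__eq__ calls above. A hypothetical variant keyed on id() makes each membership test a hash lookup:

```python
# Hypothetical cheaper variant of the move-to-front partition:
# `idx in hit_indices` on a list is a linear scan invoking __eq__;
# a set of id()s gives O(1) identity-based membership tests instead.

def move_to_front_by_index(names, indices, hit_indices):
    hit_ids = set(id(idx) for idx in hit_indices)
    hit_info, unhit_info = [], []
    for name, idx in zip(names, indices):
        if id(idx) in hit_ids:  # no __eq__ calls here
            hit_info.append((name, idx))
        else:
            unhit_info.append((name, idx))
    final = hit_info + unhit_info
    return [n for n, _ in final], [i for _, i in final]
```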

John
=:->

Preview Diff

=== modified file 'NEWS'
--- NEWS 2010-04-08 04:34:03 +0000
+++ NEWS 2010-04-08 07:11:22 +0000
@@ -38,6 +38,12 @@
38 generated by a template and not edited by the user.38 generated by a template and not edited by the user.
39 (Robert Collins, #530265)39 (Robert Collins, #530265)
4040
41* Index lookups in pack repositories search recently hit pack files first.
42 In repositories with many pack files this can greatly reduce the
43 number of files accessed, the number of bytes read, and the number of
44 read calls. An incremental pull via plain HTTP takes half the time and
45 bytes for a moderately large repository. (Andrew Bennetts)
46
41* Less code is loaded at startup. (Cold-cache start time is about 10-20%47* Less code is loaded at startup. (Cold-cache start time is about 10-20%
42 less.)48 less.)
43 (Martin Pool, #553017)49 (Martin Pool, #553017)
4450
=== modified file 'bzrlib/index.py'
--- bzrlib/index.py 2010-03-05 17:56:55 +0000
+++ bzrlib/index.py 2010-04-08 07:11:22 +0000
@@ -1245,10 +1245,15 @@
1245 static data.1245 static data.
12461246
1247 Queries against the combined index will be made against the first index,1247 Queries against the combined index will be made against the first index,
1248 and then the second and so on. The order of index's can thus influence1248 and then the second and so on. The order of indices can thus influence
1249 performance significantly. For example, if one index is on local disk and a1249 performance significantly. For example, if one index is on local disk and a
1250 second on a remote server, the local disk index should be before the other1250 second on a remote server, the local disk index should be before the other
1251 in the index list.1251 in the index list.
1252
1253 Also, queries tend to need results from the same indices as previous
1254 queries. So the indices will be reordered after every query to put the
1255 indices that had the result(s) of that query first (while otherwise
1256 preserving the relative ordering).
1252 """1257 """
12531258
1254 def __init__(self, indices, reload_func=None):1259 def __init__(self, indices, reload_func=None):
@@ -1261,6 +1266,13 @@
1261 """1266 """
1262 self._indices = indices1267 self._indices = indices
1263 self._reload_func = reload_func1268 self._reload_func = reload_func
1269 # Sibling indices are other CombinedGraphIndex that we should call
1270 # _move_to_front_by_name on when we auto-reorder ourself.
1271 self._sibling_indices = []
1272 # A list of names that corresponds to the instances in self._indices,
1273 # so _index_names[0] is always the name for _indices[0], etc. Sibling
1274 # indices must all use the same set of names as each other.
1275 self._index_names = [None] * len(self._indices)
12641276
1265 def __repr__(self):1277 def __repr__(self):
1266 return "%s(%s)" % (1278 return "%s(%s)" % (
@@ -1289,13 +1301,17 @@
12891301
1290 has_key = _has_key_from_parent_map1302 has_key = _has_key_from_parent_map
12911303
1292 def insert_index(self, pos, index):1304 def insert_index(self, pos, index, name=None):
1293 """Insert a new index in the list of indices to query.1305 """Insert a new index in the list of indices to query.
12941306
1295 :param pos: The position to insert the index.1307 :param pos: The position to insert the index.
1296 :param index: The index to insert.1308 :param index: The index to insert.
1309 :param name: a name for this index, e.g. a pack name. These names can
1310 be used to reflect index reorderings to related CombinedGraphIndex
1311 instances that use the same names. (see set_sibling_indices)
1297 """1312 """
1298 self._indices.insert(pos, index)1313 self._indices.insert(pos, index)
1314 self._index_names.insert(pos, name)
12991315
1300 def iter_all_entries(self):1316 def iter_all_entries(self):
1301 """Iterate over all keys within the index1317 """Iterate over all keys within the index
@@ -1326,22 +1342,28 @@
1326 value and are only reported once.1342 value and are only reported once.
13271343
1328 :param keys: An iterable providing the keys to be retrieved.1344 :param keys: An iterable providing the keys to be retrieved.
1329 :return: An iterable of (index, key, reference_lists, value). There is no1345 :return: An iterable of (index, key, reference_lists, value). There is
1330 defined order for the result iteration - it will be in the most1346 no defined order for the result iteration - it will be in the most
1331 efficient order for the index.1347 efficient order for the index.
1332 """1348 """
1333 keys = set(keys)1349 keys = set(keys)
1350 hit_indices = []
1334 while True:1351 while True:
1335 try:1352 try:
1336 for index in self._indices:1353 for index in self._indices:
1337 if not keys:1354 if not keys:
1338 return1355 break
1356 index_hit = False
1339 for node in index.iter_entries(keys):1357 for node in index.iter_entries(keys):
1340 keys.remove(node[1])1358 keys.remove(node[1])
1341 yield node1359 yield node
1342 return1360 index_hit = True
1361 if index_hit:
1362 hit_indices.append(index)
1363 break
1343 except errors.NoSuchFile:1364 except errors.NoSuchFile:
1344 self._reload_or_raise()1365 self._reload_or_raise()
1366 self._move_to_front(hit_indices)
13451367
1346 def iter_entries_prefix(self, keys):1368 def iter_entries_prefix(self, keys):
1347 """Iterate over keys within the index using prefix matching.1369 """Iterate over keys within the index using prefix matching.
@@ -1367,17 +1389,77 @@
1367 if not keys:1389 if not keys:
1368 return1390 return
1369 seen_keys = set()1391 seen_keys = set()
1392 hit_indices = []
1370 while True:1393 while True:
1371 try:1394 try:
1372 for index in self._indices:1395 for index in self._indices:
1396 index_hit = False
1373 for node in index.iter_entries_prefix(keys):1397 for node in index.iter_entries_prefix(keys):
1374 if node[1] in seen_keys:1398 if node[1] in seen_keys:
1375 continue1399 continue
1376 seen_keys.add(node[1])1400 seen_keys.add(node[1])
1377 yield node1401 yield node
1378 return1402 index_hit = True
1403 if index_hit:
1404 hit_indices.append(index)
1405 break
1379 except errors.NoSuchFile:1406 except errors.NoSuchFile:
1380 self._reload_or_raise()1407 self._reload_or_raise()
1408 self._move_to_front(hit_indices)
1409
1410 def _move_to_front(self, hit_indices):
1411 """Rearrange self._indices so that hit_indices are first.
1412
1413 Order is maintained as much as possible, e.g. the first unhit index
1414 will be the first index in _indices after the hit_indices, and the
1415 hit_indices will be present in exactly the order they are passed to
1416 _move_to_front.
1417
1418 _move_to_front propagates to all objects in self._sibling_indices by
1419 calling _move_to_front_by_name.
1420 """
1421 hit_names = self._move_to_front_by_index(hit_indices)
1422 for sibling_idx in self._sibling_indices:
1423 sibling_idx._move_to_front_by_name(hit_names)
1424
1425 def _move_to_front_by_index(self, hit_indices):
1426 """Core logic for _move_to_front.
1427
1428 Returns a list of names corresponding to the hit_indices param.
1429 """
1430 indices_info = zip(self._index_names, self._indices)
1431 if 'index' in debug.debug_flags:
1432 mutter('CombinedGraphIndex reordering: currently %r, promoting %r',
1433 indices_info, hit_indices)
1434 hit_indices_info = []
1435 hit_names = []
1436 unhit_indices_info = []
1437 for name, idx in indices_info:
1438 if idx in hit_indices:
1439 info = hit_indices_info
1440 hit_names.append(name)
1441 else:
1442 info = unhit_indices_info
1443 info.append((name, idx))
1444 final_info = hit_indices_info + unhit_indices_info
1445 self._indices = [idx for (name, idx) in final_info]
1446 self._index_names = [name for (name, idx) in final_info]
1447 if 'index' in debug.debug_flags:
1448 mutter('CombinedGraphIndex reordered: %r', self._indices)
1449 return hit_names
1450
1451 def _move_to_front_by_name(self, hit_names):
1452 """Moves indices named by 'hit_names' to front of the search order, as
1453 described in _move_to_front.
1454 """
1455 # Translate names to index instances, and then call
1456 # _move_to_front_by_index.
1457 indices_info = zip(self._index_names, self._indices)
1458 hit_indices = []
1459 for name, idx in indices_info:
1460 if name in hit_names:
1461 hit_indices.append(idx)
1462 self._move_to_front_by_index(hit_indices)
13811463
1382 def find_ancestry(self, keys, ref_list_num):1464 def find_ancestry(self, keys, ref_list_num):
1383 """Find the complete ancestry for the given set of keys.1465 """Find the complete ancestry for the given set of keys.
@@ -1390,6 +1472,7 @@
1390 we care about.1472 we care about.
1391 :return: (parent_map, missing_keys)1473 :return: (parent_map, missing_keys)
1392 """1474 """
1475 # XXX: make this call _move_to_front?
1393 missing_keys = set()1476 missing_keys = set()
1394 parent_map = {}1477 parent_map = {}
1395 keys_to_lookup = set(keys)1478 keys_to_lookup = set(keys)
@@ -1475,6 +1558,11 @@
1475 ' Raising original exception.')1558 ' Raising original exception.')
1476 raise exc_type, exc_value, exc_traceback1559 raise exc_type, exc_value, exc_traceback
14771560
1561 def set_sibling_indices(self, sibling_combined_graph_indices):
1562 """Set the CombinedGraphIndex objects to reorder after reordering self.
1563 """
1564 self._sibling_indices = sibling_combined_graph_indices
1565
1478 def validate(self):1566 def validate(self):
1479 """Validate that everything in the index can be accessed."""1567 """Validate that everything in the index can be accessed."""
1480 while True:1568 while True:
14811569
=== modified file 'bzrlib/repofmt/pack_repo.py'
--- bzrlib/repofmt/pack_repo.py 2010-02-12 11:58:21 +0000
+++ bzrlib/repofmt/pack_repo.py 2010-04-08 07:11:22 +0000
@@ -587,26 +587,6 @@
587 flush_func=flush_func)587 flush_func=flush_func)
588 self.add_callback = None588 self.add_callback = None
589589
590 def replace_indices(self, index_to_pack, indices):
591 """Replace the current mappings with fresh ones.
592
593 This should probably not be used eventually, rather incremental add and
594 removal of indices. It has been added during refactoring of existing
595 code.
596
597 :param index_to_pack: A mapping from index objects to
598 (transport, name) tuples for the pack file data.
599 :param indices: A list of indices.
600 """
601 # refresh the revision pack map dict without replacing the instance.
602 self.index_to_pack.clear()
603 self.index_to_pack.update(index_to_pack)
604 # XXX: API break - clearly a 'replace' method would be good?
605 self.combined_index._indices[:] = indices
606 # the current add nodes callback for the current writable index if
607 # there is one.
608 self.add_callback = None
609
610 def add_index(self, index, pack):590 def add_index(self, index, pack):
611 """Add index to the aggregate, which is an index for Pack pack.591 """Add index to the aggregate, which is an index for Pack pack.
612592
@@ -619,7 +599,7 @@
         # expose it to the index map
         self.index_to_pack[index] = pack.access_tuple()
         # put it at the front of the linear index list
-        self.combined_index.insert_index(0, index)
+        self.combined_index.insert_index(0, index, pack.name)
 
     def add_writable_index(self, index, pack):
         """Add an index which is able to have data added to it.
@@ -645,6 +625,7 @@
         self.data_access.set_writer(None, None, (None, None))
         self.index_to_pack.clear()
         del self.combined_index._indices[:]
+        del self.combined_index._index_names[:]
         self.add_callback = None
 
     def remove_index(self, index):
@@ -653,7 +634,9 @@
         :param index: An index from the pack parameter.
         """
         del self.index_to_pack[index]
-        self.combined_index._indices.remove(index)
+        pos = self.combined_index._indices.index(index)
+        del self.combined_index._indices[pos]
+        del self.combined_index._index_names[pos]
         if (self.add_callback is not None and
             getattr(index, 'add_nodes', None) == self.add_callback):
             self.add_callback = None
@@ -1415,11 +1398,20 @@
         self.inventory_index = AggregateIndex(self.reload_pack_names, flush)
         self.text_index = AggregateIndex(self.reload_pack_names, flush)
         self.signature_index = AggregateIndex(self.reload_pack_names, flush)
+        all_indices = [self.revision_index, self.inventory_index,
+            self.text_index, self.signature_index]
         if use_chk_index:
             self.chk_index = AggregateIndex(self.reload_pack_names, flush)
+            all_indices.append(self.chk_index)
         else:
             # used to determine if we're using a chk_index elsewhere.
             self.chk_index = None
+        # Tell all the CombinedGraphIndex objects about each other, so they can
+        # share hints about which pack names to search first.
+        all_combined = [agg_idx.combined_index for agg_idx in all_indices]
+        for combined_idx in all_combined:
+            combined_idx.set_sibling_indices(
+                set(all_combined).difference([combined_idx]))
         # resumed packs
         self._resumed_packs = []
 
 
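The wiring loop added to the pack collection's `__init__` above is a plain all-to-all pattern: each combined index gets the set of all the others as siblings. A minimal sketch of that pattern in isolation (the `Node` class is hypothetical; only the `set(...).difference(...)` idiom is taken from the diff):

```python
class Node(object):
    """Stand-in for a CombinedGraphIndex that can record its siblings."""

    def __init__(self, name):
        self.name = name
        self.siblings = set()

    def set_sibling_indices(self, siblings):
        self.siblings = set(siblings)

# Wire every node to all the others, mirroring the loop in the diff.
all_combined = [Node(n) for n in ['rev', 'inv', 'txt', 'sig']]
for node in all_combined:
    node.set_sibling_indices(set(all_combined).difference([node]))
```

The `difference([node])` call keeps a node out of its own sibling set, so propagating a reordering never loops back onto the index that triggered it.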
=== modified file 'bzrlib/tests/per_pack_repository.py'
--- bzrlib/tests/per_pack_repository.py 2010-02-23 07:43:11 +0000
+++ bzrlib/tests/per_pack_repository.py 2010-04-08 07:11:22 +0000
@@ -288,6 +288,23 @@
         repo._pack_collection._clear_obsolete_packs()
         self.assertTrue(repo_transport.has('obsolete_packs/.nfsblahblah'))
 
+    def test_pack_collection_sets_sibling_indices(self):
+        """The CombinedGraphIndex objects in the pack collection are all
+        siblings of each other, so that search-order reorderings will be copied
+        to each other.
+        """
+        repo = self.make_repository('repo')
+        pack_coll = repo._pack_collection
+        indices = set([pack_coll.revision_index, pack_coll.inventory_index,
+            pack_coll.text_index, pack_coll.signature_index])
+        if pack_coll.chk_index is not None:
+            indices.add(pack_coll.chk_index)
+        combined_indices = set(idx.combined_index for idx in indices)
+        for combined_index in combined_indices:
+            self.assertEqual(
+                combined_indices.difference([combined_index]),
+                combined_index._sibling_indices)
+
     def test_pack_after_two_commits_packs_everything(self):
         format = self.get_format()
         tree = self.make_branch_and_tree('.', format=format)
 
=== modified file 'bzrlib/tests/test_index.py'
--- bzrlib/tests/test_index.py 2010-03-05 17:56:55 +0000
+++ bzrlib/tests/test_index.py 2010-04-08 07:11:22 +0000
@@ -1380,6 +1380,50 @@
         self.assertListRaises(errors.NoSuchFile, index.iter_entries_prefix,
             [('1',)])
 
+
+    def make_index_with_simple_nodes(self, name, num_nodes=1):
+        """Make an index named after 'name', with keys named after 'name' too.
+
+        Nodes will have a value of '' and no references.
+        """
+        nodes = [
+            (('index-%s-key-%s' % (name, n),), '', ())
+            for n in range(1, num_nodes+1)]
+        return self.make_index('index-%s' % name, 0, nodes=nodes)
+
+    def test_reorder_after_iter_entries(self):
+        # Four indices with one key each: key1 in index1, key2 in index2,
+        # key3 in index3, key4 in index4.
+        index = CombinedGraphIndex([])
+        index.insert_index(0, self.make_index_with_simple_nodes('1'), '1')
+        index.insert_index(1, self.make_index_with_simple_nodes('2'), '2')
+        index.insert_index(2, self.make_index_with_simple_nodes('3'), '3')
+        index.insert_index(3, self.make_index_with_simple_nodes('4'), '4')
+        index1, index2, index3, index4 = index._indices
+        # Query a key from index4 and index2.
+        self.assertLength(2, list(index.iter_entries(
+            [('index-4-key-1',), ('index-2-key-1',)])))
+        # Now index2 and index4 should be moved to the front (and index1 should
+        # still be before index3).
+        self.assertEqual([index2, index4, index1, index3], index._indices)
+        self.assertEqual(['2', '4', '1', '3'], index._index_names)
+
+    def test_reorder_propagates_to_siblings(self):
+        # Two CombinedGraphIndex objects, with the same number of indices with
+        # matching names.
+        cgi1 = CombinedGraphIndex([])
+        cgi2 = CombinedGraphIndex([])
+        cgi1.insert_index(0, self.make_index_with_simple_nodes('1-1'), 'one')
+        cgi1.insert_index(1, self.make_index_with_simple_nodes('1-2'), 'two')
+        cgi2.insert_index(0, self.make_index_with_simple_nodes('2-1'), 'one')
+        cgi2.insert_index(1, self.make_index_with_simple_nodes('2-2'), 'two')
+        index2_1, index2_2 = cgi2._indices
+        cgi1.set_sibling_indices([cgi2])
+        # Trigger a reordering in cgi1. cgi2 will be reordered as well.
+        list(cgi1.iter_entries([('index-1-2-key-1',)]))
+        self.assertEqual([index2_2, index2_1], cgi2._indices)
+        self.assertEqual(['two', 'one'], cgi2._index_names)
+
     def test_validate_reloads(self):
         index, reload_counter = self.make_combined_index_with_missing()
         index.validate()
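The tests above pin down the behaviour that makes the grub pull cheaper: once an index answers a query, it is consulted first next time. A toy demonstration of the payoff, using counting stub indices (all names here are illustrative, and the reordering is simplified to a stable sort rather than bzrlib's implementation; the generator must be fully consumed for the reorder to take effect):

```python
class StubIndex(object):
    """Fake index that records how many times it was searched."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.lookups = 0

    def iter_entries(self, keys):
        self.lookups += 1
        for key in keys:
            if key in self.keys:
                yield (self, key, '')

class Combined(object):
    """Searches stub indices in order, promoting the ones that answer."""

    def __init__(self, indices):
        self._indices = list(indices)

    def iter_entries(self, keys):
        keys = set(keys)
        hit = []
        for index in self._indices:
            if not keys:
                break  # all keys found; later indices are never touched
            for entry in index.iter_entries(list(keys)):
                keys.discard(entry[1])
                if index not in hit:
                    hit.append(index)
                yield entry
        # Stable sort: indices that answered move to the front.
        self._indices.sort(key=lambda idx: idx not in hit)
```

With one empty index ahead of one populated index, the first query touches both, but after the promotion a repeat query for the same pack stops at the first index, which is the effect the 2.42x-to-0.45x overhead numbers in the description come from.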