Merge lp:~spiv/bzr/smarter-index-search into lp:bzr

Proposed by Andrew Bennetts
Status: Merged
Approved by: Andrew Bennetts
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~spiv/bzr/smarter-index-search
Merge into: lp:bzr
Diff against target: 365 lines (+176/-29)
5 files modified
NEWS (+6/-0)
bzrlib/index.py (+95/-7)
bzrlib/repofmt/pack_repo.py (+14/-22)
bzrlib/tests/per_pack_repository.py (+17/-0)
bzrlib/tests/test_index.py (+44/-0)
To merge this branch: bzr merge lp:~spiv/bzr/smarter-index-search
Reviewer Review Type Date Requested Status
Robert Collins (community) Approve
Martin Packman (community) Approve
John A Meinel Pending
Review via email: mp+21615@code.launchpad.net

This proposal supersedes a proposal from 2010-03-12.

Description of the change

Optimise index lookups in repositories with many pack files.

First, the headline: this greatly improves "bzr pull" of one new revision of grub from savannah by HTTP (as reported on the mailing list, which has further analysis):

bzr.dev: 2424kB (50.2kB/s r:2395kB w:30kB)
this patch: 1034kB (43.3kB/s r:1022kB w:12kB)

Given that the pack data transferred is 701266 bytes (which itself seems quite large for such a small change...), that brings the index-searching overhead from 2.42x to 0.45x of the bytes read. It also halves the wall-clock time :)
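
For reference, the overhead factors above appear to come from comparing the bytes read against the ~701kB of pack data actually needed, i.e. overhead = (bytes read - pack bytes) / pack bytes. A back-of-envelope check (not from the original message):

pack_kb = 701.0               # ~701266 bytes of pack data fetched
(2395 - pack_kb) / pack_kb    # bzr.dev:    r:2395kB -> about 2.42
(1022 - pack_kb) / pack_kb    # this patch: r:1022kB -> about 0.46 (quoted as 0.45 above)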

That repo has, I think, 14 packs, and bzr.dev tries 11 indices for each of rix, iix, etc. before finding the data it needs for that fetch.

There are two parts to this change:

 1) when a CombinedGraphIndex performs a lookup, it shuffles the index or indices that contained the records to the front of self._indices on the assumption that future lookups should try those first.
 2) that reordering is propagated to the other CombinedGraphIndex objects from the same pack collection. This is done by a) associating a name (the pack name) with each element of a CombinedGraphIndex, b) linking the revisions/inventories/etc. CombinedGraphIndex objects belonging to a single pack collection by setting a _sibling_indices attribute on them, and c) using those links and names to apply the same reordering to those sibling indices. (Both parts are sketched in code just below.)
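
For readers skimming the diff at the bottom of this page, here is a heavily simplified, self-contained sketch of that reordering machinery (an illustration only, not the patch itself; the real CombinedGraphIndex also handles reload callbacks, prefix queries, and so on):

class ReorderingIndexList(object):
    """Illustrative stand-in for CombinedGraphIndex's reordering logic."""

    def __init__(self):
        self._indices = []         # current search order
        self._index_names = []     # parallel list of pack names
        self._sibling_indices = []

    def insert_index(self, pos, index, name=None):
        self._indices.insert(pos, index)
        self._index_names.insert(pos, name)

    def _move_to_front(self, hit_indices):
        # Reorder self, then ask the sibling indices to apply the same
        # reordering, matched up by pack name rather than by index object.
        hit_names = self._move_to_front_by_index(hit_indices)
        for sibling in self._sibling_indices:
            sibling._move_to_front_by_name(hit_names)

    def _move_to_front_by_index(self, hit_indices):
        # Partition into hit and unhit indices, preserving the relative
        # order within each group, and put the hit group first.
        hit_info, unhit_info, hit_names = [], [], []
        for name, idx in zip(self._index_names, self._indices):
            if idx in hit_indices:
                hit_info.append((name, idx))
                hit_names.append(name)
            else:
                unhit_info.append((name, idx))
        final = hit_info + unhit_info
        self._indices = [idx for name, idx in final]
        self._index_names = [name for name, idx in final]
        return hit_names

    def _move_to_front_by_name(self, hit_names):
        # Siblings are told which names were hit; the .rix/.iix/.tix/etc.
        # indices for the same pack all share that pack's name.
        hits = [idx for name, idx in zip(self._index_names, self._indices)
                if name in hit_names]
        self._move_to_front_by_index(hits)

After each iter_entries / iter_entries_prefix call the patch invokes _move_to_front with whichever indices actually produced results, so the next query starts with the most recently useful packs, and a pack collection's revision, inventory, text, signature and chk indices all stay in step.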

I've been pretty conservative with API changes: the new behaviour is only activated by optional keyword arguments, so existing uses of CombinedGraphIndex should see no change of behaviour (including no improvement). This is to make it as easy as possible to backport this change to 2.1 and 2.0 if we choose to.

I think this change needs some tests before it's truly ready to merge, but it's getting to the end of my work week and I think this code is ready for feedback, so here it is!

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal


Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/smarter-index-search into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
>
> Optimise index lookups in repositories with many pack files.
>
> First, the headline: this greatly improves "bzr pull" of one new revision of grub from savannah by HTTP (as reported on the mailing list, which has further analysis):
>
> bzr.dev: 2424kB (50.2kB/s r:2395kB w:30kB)
> this patch: 1034kB (43.3kB/s r:1022kB w:12kB)
>
> Given that the pack data transferred is 701266 bytes (which itself seems quite large for such a small change...), that brings the index-searching overhead from 2.42x to 0.45x of the bytes read. It also halves the wall-clock time :)
>
> That repo has I think 14 packs, and bzr.dev tries 11 indices for each of rix, iix, etc before finding the data it needs for that fetch.
>
> There are two parts to this change:
>
> 1) when a CombinedGraphIndex performs a lookup, it shuffles the index or indices that contained the records to the front of self._indices on the assumption that future lookups should try those first.
> 2) propagates that reordering to the other CombinedGraphIndex objects from the same pack collection. This is done by a) associating a name (the pack name) with the elements of CombinedGraphIndex, and b) linking the revisions/inventories/etc CombinedGraphIndex objects belonging to a single pack collection via setting a _sibling_indices attribute on them, c) using those links and names to apply the same reordering to those sibling indices.
>
> I've been pretty conservative with API changes: the new behaviour is only activated by optional keyword arguments, so existing uses of CombinedGraphIndex should see no change of behaviour (including no improvement). This is to make it as easy as possible to backport this change to 2.1 and 2.0 if we choose to.
>
> I think this change needs some tests before it's truly ready to merge, but it's getting to the end of my work week and I think this code is ready for feedback, so here it is!
>

Thanks for doing this. I did the work in Packer, but it is nice to have
it here.

I would tend to set "auto_reorder" to always on. I don't see it ever
making things worse, and there should only be a small overhead for the
reordering. As such, I would also get rid of the constructor flag.

Doing it by names is ok, I think Index already has _name for most
indices. You could also order by "access_tuple()" which I believe is a
public API and returns (transport, name) to the original .pack file. The
*nice* thing about doing that is that you don't have to introduce any
new apis here. But if it isn't as clean, then don't worry too much about it.

It also avoids having to rebuild the mapping for every move call:
+        indices_info = zip(self._index_names, self._indices)
+        hit_indices_info = []
+        hit_names = []
+        unhit_indices_info = []
+        for name, idx in indices_info:
+            if idx in hit_indices:
+                info = hit_indices_info
+                hit_names.append(name)
+            else:
+                info = unhit_indices_info
+        ...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal


Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/smarter-index-search into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
>
> Optimise index lookups in repositories with many pack files.
>
> First, the headline: this greatly improves "bzr pull" of one new revision of grub from savannah by HTTP (as reported on the mailing list, which has further analysis):
>
> bzr.dev: 2424kB (50.2kB/s r:2395kB w:30kB)
> this patch: 1034kB (43.3kB/s r:1022kB w:12kB)
>
> Given that the pack data transferred is 701266 bytes (which itself seems quite large for such a small change...), that brings the index-searching overhead from 2.42x to 0.45x of the bytes read. It also halves the wall-clock time :)
>
> That repo has I think 14 packs, and bzr.dev tries 11 indices for each of rix, iix, etc before finding the data it needs for that fetch.
>
> There are two parts to this change:
>
> 1) when a CombinedGraphIndex performs a lookup, it shuffles the index or indices that contained the records to the front of self._indices on the assumption that future lookups should try those first.
> 2) propagates that reordering to the other CombinedGraphIndex objects from the same pack collection. This is done by a) associating a name (the pack name) with the elements of CombinedGraphIndex, and b) linking the revisions/inventories/etc CombinedGraphIndex objects belonging to a single pack collection via setting a _sibling_indices attribute on them, c) using those links and names to apply the same reordering to those sibling indices.
>
> I've been pretty conservative with API changes: the new behaviour is only activated by optional keyword arguments, so existing uses of CombinedGraphIndex should see no change of behaviour (including no improvement). This is to make it as easy as possible to backport this change to 2.1 and 2.0 if we choose to.
>
> I think this change needs some tests before it's truly ready to merge, but it's getting to the end of my work week and I think this code is ready for feedback, so here it is!
>

I forgot to vote.

  review: needs_fixing

Because it certainly should have some amount of testing, if only to
prevent future regressions (like we already have :). You don't have to
test it at the "get_stream()" level, though.

John
=:->

review: Needs Fixing
Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

John A Meinel wrote:
[...]
> Thanks for doing this. I did the work in Packer, but it is nice to have
> it here.

Ah! I thought we had something like this. I'll see what code and ideas
I can reuse.

> I would tend to set "auto_reorder" to always on. I don't see it ever
> making things worse, and there should only be a small overhead for the
> reordering. As such, I would also get rid of the constructor flag.

Hmm, ok. If I was just targeting bzr.dev I would have done this
already, but if you're confident it won't cause unexpected issues I'm
happy to make it unconditional (including in backports).

> Doing it by names is ok, I think Index already has _name for most
> indices. You could also order by "access_tuple()" which I believe is a
> public API and returns (transport, name) to the original .pack file. The
> *nice* thing about doing that is that you don't have to introduce any
> new apis here. But if it isn't as clean, then don't worry too much about it.

_name is really a path for _transport, and as such includes the suffix.
So I don't think it's as clean either code-wise or conceptually to rely
on that. And as you noticed later access_tuple is on AggregateIndex, so
it's not readily available either.

> If it isn't too much overhead, you might consider:
>
> keys_remove = keys.remove
> # Be careful, if keys starts empty, we would break, and get a name error
> index_hit_count = 0
> for index in self._indices:
>     if not keys:
>         break
>     index_hit_count = 0
>     for node in index.iter_entries(keys):
>         keys_remove(node[1])
>         yield node
>         index_hit_count += 1
>
>     if index_hit_count:
>         hit_indices.append((index_hit_count, index))
>
> hit_indices.sort(reverse=True)
>
> self._move_index_order(hit_indices)
>
>
> The nice thing about it, is that it is adaptive. In that if you start a
> search, and hit it in index 1, then you keep looking there, but if the
> keys start showing up in index 2, then you'll switch. The only other bit
> of logic that I could think of is:

I don't quite follow you here. Wouldn't my logic switch to preferring
index 2 anyway?

It's not clear to me that ordering the hit_indices by hit count is
better.

> if index_hit_count == index.key_count():
> index_hit_count = 0 (or maybe 1)
>
> I realize to do this correctly, you'd need to track hit count between
> calls, which isn't realistic. However, that basic check will work for
> pack files that only have 1 or 2 keys that are found quickly. (And
> commit creates indexes with only 1 entry, so it is fairly common.) Just
> a thought, though.

This is an interesting idea. Although perhaps those lookups will
already fail very quickly, because at that point the entire index has
already been read and cached? I don't think I'll do this in this patch,
as I think it'd need more investigation and measurement to justify, but
it might make a good incremental improvement later.
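
For concreteness, the key_count() check described above might look roughly like this (a hypothetical helper, assuming per-index hit counting along the lines of the earlier snippet; it is not part of this patch):

def effective_hit_count(index, index_hit_count):
    # Hypothetical: if every key in an index was hit, the index is probably
    # tiny (commit creates single-entry indices), so don't treat those hits
    # as a strong signal for promoting it in the search order.
    if index_hit_count == index.key_count():
        return 0  # or maybe 1, as suggested above
    return index_hit_count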

> +        # Tell all the CombinedGraphIndex objects about each other, so they can
> +        # share hints about which pack names to search first.
> +        all_combined = [agg_idx.combined_index for agg_idx in all_indices]
> +        for combined_idx in all_combined:
> +            ...

Read more...

Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

Andrew Bennetts wrote:
> John A Meinel wrote:
[...]
> > The nice thing about it, is that it is adaptive. In that if you start a
> > search, and hit it in index 1, then you keep looking there, but if the
> > keys start showing up in index 2, then you'll switch. The only other bit
> > of logic that I could think of is:
>
> I don't quite follow you here. Wouldn't my logic switch to preferring
> index 2 anyway?

On reflection, I *think* what you mean is if a call to iter_entries (via
get_parent_map or otherwise) queries for [key_from_1, key_from_2,
key_from_2], my logic will keep the index order as [1, 2], but it might
be better to switch them to [2, 1]. (I initially thought that you were
talking about a search conducted over multiple calls to iter_entries,
not just one.)

That could be true, but I'm not sure it would really make a big
difference in practice. How about writing a patch for my patch and
doing some measurements? I'll happily review it :)

Or do you already have some analysis that would support this from when
you changed the Packer code?

-Andrew.

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal


Andrew Bennetts wrote:
> John A Meinel wrote:
> [...]
>> Thanks for doing this. I did the work in Packer, but it is nice to have
>> it here.
>
> Ah! I thought we had something like this. I'll see what code and ideas
> I can reuse.
>
>> I would tend to set "auto_reorder" to always on. I don't see it ever
>> making things worse, and there should only be a small overhead for the
>> reordering. As such, I would also get rid of the constructor flag.
>
> Hmm, ok. If I was just targetting bzr.dev I would have done this
> already, but if you're confident it won't cause unexpected issues I'm
> happy to make it unconditional (including in backports).
>
>> Doing it by names is ok, I think Index already has _name for most
>> indices. You could also order by "access_tuple()" which I believe is a
>> public API and returns (transport, name) to the original .pack file. The
>> *nice* thing about doing that is that you don't have to introduce any
>> new apis here. But if it isn't as clean, then don't worry too much about it.
>
> _name is really a path for _transport, and as such includes the suffix.
> So I don't think it's as clean either code-wise or conceptually to rely
> on that. And as you noticed later access_tuple is on AggregateIndex, so
> it's not readily available either.
>
>> If it isn't too much overhead, you might consider:
>>
>> keys_remove = keys.remove
>> # Be careful, if keys starts empty, we would break, and get a name error
>> index_hit_count = 0
>> for index in self._indices:
>>     if not keys:
>>         break
>>     index_hit_count = 0
>>     for node in index.iter_entries(keys):
>>         keys_remove(node[1])
>>         yield node
>>         index_hit_count += 1
>>
>>     if index_hit_count:
>>         hit_indices.append((index_hit_count, index))
>>
>> hit_indices.sort(reverse=True)
>>
>> self._move_index_order(hit_indices)
>>
>>
>> The nice thing about it, is that it is adaptive. In that if you start a
>> search, and hit it in index 1, then you keep looking there, but if the
>> keys start showing up in index 2, then you'll switch. The only other bit
>> of logic that I could think of is:
>
> I don't quite follow you here. Wouldn't my logic switch to preferring
> index 2 anyway?
>
> It's not clear to me that ordering the hit_indices by hit count is
> better.
>

The point is that if the first index in the list gets 1 key hit, but the
second gets 10, and the third gets 5, then it will get sorted into
(index2, index3, index1) rather than (index1, index2, index3). At least,
what I saw with your code was that it is just 'preserve existing order,
but move the ones hit to the front'. Put another way, say the hits are:

5 index1
10 index2
0 index3
4 index5
0 index6

your version would sort to

index1
index2
index5
index3
index6

mine would sort to

index2
index1
index5
index3
index6

I think the latter is slightly preferred. I don't have any numbers to
back that up, just a feeling. (Most likely in this case index2 is a
'bigger' index, and thus has more to *be* hit.)
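
To restate the two orderings in code (a hypothetical snippet for illustration, not code from the patch):

hit_counts = [('index1', 5), ('index2', 10), ('index3', 0),
              ('index5', 4), ('index6', 0)]
hit = [(name, count) for name, count in hit_counts if count]
unhit = [name for name, count in hit_counts if not count]

# Andrew's patch: promote the hit indices, keeping their existing relative order.
patch_order = [name for name, count in hit] + unhit
# -> ['index1', 'index2', 'index5', 'index3', 'index6']

# The suggestion above: additionally sort the promoted indices by hit count,
# highest first.
suggested_order = [name for name, count in
                   sorted(hit, key=lambda item: item[1], reverse=True)] + unhit
# -> ['index2', 'index1', 'index5', 'index3', 'index6']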

As for small indices getting rejected quickly, it is true they won't
encounter any I/O overhead, which may be sufficient. It is a function
call, but th...

Read more...

Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

John A Meinel wrote:
[...]
> I think the latter is slightly preferred. I don't have any numbers to
> back that up, just a feeling. (Most likely in this case index2 is a
> 'bigger' index, and thus has more to *be* hit.)

I agree it probably helps, a little, but I'm not going to do it without
some measurements to confirm it does, because of the risk of
counter-intuitive results from untested optimisations. I know you've done
more exploring in this area so I trust your intuitions a bit more than
mine, but I'm still reluctant as the bang for buck seems low.

Also, even my simple logic will tend to keep the more frequently hit
indices closer to the front than others after more and more queries,
just not quite as quickly as it might with hit counts. And this way we
don't need to worry about whether we should be keeping and aging hit
counts between queries, etc.

> As for small indices getting rejected quickly, it is true they won't
> encounter any I/O overhead, which may be sufficient. It is a function
> call, but that is probably tiny overhead, and they'll get sorted late
> quickly anyway.

Right. (FWIW, it's I/O overhead I'm most worried about.)

Revision history for this message
Andrew Bennetts (spiv) wrote :

I think all review comments have been addressed, but I'm sure John will tell me if I've missed something :)

Revision history for this message
Martin Packman (gz) wrote :

Don't know the code well enough to do a thorough review, but looks sane to me, tests pass, and it'll save me bandwidth.

review: Approve
Revision history for this message
Robert Collins (lifeless) wrote :

Andrew, when you and John are happy, I think you should just land this.

On the 'always reorder' side: if you do do that, please check that the end-of-transaction code that pops off the in-progress indices won't be affected; back in the dim past it used [0] to 'know' that the first index was the one being written.

review: Approve
Revision history for this message
Andrew Bennetts (spiv) wrote :

Robert: Thanks for suggesting that. I've re-read a bunch of code and grepped, and as far as I can see, nothing is cheating by assuming [0] is the in-progress index that is being written. The test suite seems to pass too :)

So no changes needed there. I'm going to land this now before it goes stale.

Revision history for this message
John A Meinel (jameinel) wrote :


Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/smarter-index-search into lp:bzr.
>
> Requested reviews:
> John A Meinel (jameinel)
>
>
> Optimise index lookups in repositories with many pack files.
>
> First, the headline: this greatly improves "bzr pull" of one new revision of grub from savannah by HTTP (as reported on the mailing list, which has further analysis):

This doesn't seem to be as cheap as I expected it to be. I'm not 100%
sure who is at fault here, but I'm doing some profiling of loggerhead +
bzr-history-db, and I came across this:

       25765 3.4509 1.8229 bzrlib.index:1425(_move_to_front_by_index)
     +489715 0.8904 0.8904 +bzrlib.btree_index:689(__eq__)
     +541075 0.6721 0.6721 +<method 'append' of 'list' objects>
      +25765 0.0654 0.0654 +<zip>
      489809 0.8906 0.8906 bzrlib.btree_index:689(__eq__)
      590334 0.7374 0.7374 <method 'append' of 'list' objects>
         153 0.2031 0.2031 <method 'write' of 'file' objects>
       20612 3.0278 0.1986 bzrlib.index:1451(_move_to_front_by_name)
      +20612 2.7483 1.4505 +bzrlib.index:1425(_move_to_front_by_index)

I'm guessing the issue is something walking over the ancestry (like bzr log)

5153 3.8017 0.0712 bzrlib.index:1410(_move_to_front)

Note that only 5k calls to _move_to_front turn into 25k calls to
_move_to_front_by_index (probably because there are
rix/iix/tix/cix/six, so a 5:1 factor).

However, we must be missing something for move to be that expensive.

I'm guessing this is:
https://bugs.launchpad.net/bugs/562429

John
=:->
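
The __eq__ time above almost certainly comes from the "if idx in hit_indices" membership test in _move_to_front_by_index: for every index that was not hit, each element of hit_indices gets compared with BTreeGraphIndex.__eq__. A minimal sketch of one way to sidestep that (a hypothetical helper, an illustration only; bug 562429 tracks the actual fix) is to compare object identities instead:

def partition_by_identity(names, indices, hit_indices):
    # Sketch: test membership by object identity (id()) rather than by
    # value equality, so BTreeGraphIndex.__eq__ is never invoked.
    hit_ids = set(id(idx) for idx in hit_indices)
    hit_info, unhit_info, hit_names = [], [], []
    for name, idx in zip(names, indices):
        if id(idx) in hit_ids:
            hit_info.append((name, idx))
            hit_names.append(name)
        else:
            unhit_info.append((name, idx))
    return hit_info + unhit_info, hit_names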

Preview Diff

1=== modified file 'NEWS'
2--- NEWS 2010-04-08 04:34:03 +0000
3+++ NEWS 2010-04-08 07:11:22 +0000
4@@ -38,6 +38,12 @@
5 generated by a template and not edited by the user.
6 (Robert Collins, #530265)
7
8+* Index lookups in pack repositories search recently hit pack files first.
9+ In repositories with many pack files this can greatly reduce the
10+ number of files accessed, the number of bytes read, and the number of
11+ read calls. An incremental pull via plain HTTP takes half the time and
12+ bytes for a moderately large repository. (Andrew Bennetts)
13+
14 * Less code is loaded at startup. (Cold-cache start time is about 10-20%
15 less.)
16 (Martin Pool, #553017)
17
18=== modified file 'bzrlib/index.py'
19--- bzrlib/index.py 2010-03-05 17:56:55 +0000
20+++ bzrlib/index.py 2010-04-08 07:11:22 +0000
21@@ -1245,10 +1245,15 @@
22 static data.
23
24 Queries against the combined index will be made against the first index,
25- and then the second and so on. The order of index's can thus influence
26+ and then the second and so on. The order of indices can thus influence
27 performance significantly. For example, if one index is on local disk and a
28 second on a remote server, the local disk index should be before the other
29 in the index list.
30+
31+ Also, queries tend to need results from the same indices as previous
32+ queries. So the indices will be reordered after every query to put the
33+ indices that had the result(s) of that query first (while otherwise
34+ preserving the relative ordering).
35 """
36
37 def __init__(self, indices, reload_func=None):
38@@ -1261,6 +1266,13 @@
39 """
40 self._indices = indices
41 self._reload_func = reload_func
42+ # Sibling indices are other CombinedGraphIndex that we should call
43+ # _move_to_front_by_name on when we auto-reorder ourself.
44+ self._sibling_indices = []
45+ # A list of names that corresponds to the instances in self._indices,
46+ # so _index_names[0] is always the name for _indices[0], etc. Sibling
47+ # indices must all use the same set of names as each other.
48+ self._index_names = [None] * len(self._indices)
49
50 def __repr__(self):
51 return "%s(%s)" % (
52@@ -1289,13 +1301,17 @@
53
54 has_key = _has_key_from_parent_map
55
56- def insert_index(self, pos, index):
57+ def insert_index(self, pos, index, name=None):
58 """Insert a new index in the list of indices to query.
59
60 :param pos: The position to insert the index.
61 :param index: The index to insert.
62+ :param name: a name for this index, e.g. a pack name. These names can
63+ be used to reflect index reorderings to related CombinedGraphIndex
64+ instances that use the same names. (see set_sibling_indices)
65 """
66 self._indices.insert(pos, index)
67+ self._index_names.insert(pos, name)
68
69 def iter_all_entries(self):
70 """Iterate over all keys within the index
71@@ -1326,22 +1342,28 @@
72 value and are only reported once.
73
74 :param keys: An iterable providing the keys to be retrieved.
75- :return: An iterable of (index, key, reference_lists, value). There is no
76- defined order for the result iteration - it will be in the most
77+ :return: An iterable of (index, key, reference_lists, value). There is
78+ no defined order for the result iteration - it will be in the most
79 efficient order for the index.
80 """
81 keys = set(keys)
82+ hit_indices = []
83 while True:
84 try:
85 for index in self._indices:
86 if not keys:
87- return
88+ break
89+ index_hit = False
90 for node in index.iter_entries(keys):
91 keys.remove(node[1])
92 yield node
93- return
94+ index_hit = True
95+ if index_hit:
96+ hit_indices.append(index)
97+ break
98 except errors.NoSuchFile:
99 self._reload_or_raise()
100+ self._move_to_front(hit_indices)
101
102 def iter_entries_prefix(self, keys):
103 """Iterate over keys within the index using prefix matching.
104@@ -1367,17 +1389,77 @@
105 if not keys:
106 return
107 seen_keys = set()
108+ hit_indices = []
109 while True:
110 try:
111 for index in self._indices:
112+ index_hit = False
113 for node in index.iter_entries_prefix(keys):
114 if node[1] in seen_keys:
115 continue
116 seen_keys.add(node[1])
117 yield node
118- return
119+ index_hit = True
120+ if index_hit:
121+ hit_indices.append(index)
122+ break
123 except errors.NoSuchFile:
124 self._reload_or_raise()
125+ self._move_to_front(hit_indices)
126+
127+ def _move_to_front(self, hit_indices):
128+ """Rearrange self._indices so that hit_indices are first.
129+
130+ Order is maintained as much as possible, e.g. the first unhit index
131+ will be the first index in _indices after the hit_indices, and the
132+ hit_indices will be present in exactly the order they are passed to
133+ _move_to_front.
134+
135+ _move_to_front propagates to all objects in self._sibling_indices by
136+ calling _move_to_front_by_name.
137+ """
138+ hit_names = self._move_to_front_by_index(hit_indices)
139+ for sibling_idx in self._sibling_indices:
140+ sibling_idx._move_to_front_by_name(hit_names)
141+
142+ def _move_to_front_by_index(self, hit_indices):
143+ """Core logic for _move_to_front.
144+
145+ Returns a list of names corresponding to the hit_indices param.
146+ """
147+ indices_info = zip(self._index_names, self._indices)
148+ if 'index' in debug.debug_flags:
149+ mutter('CombinedGraphIndex reordering: currently %r, promoting %r',
150+ indices_info, hit_indices)
151+ hit_indices_info = []
152+ hit_names = []
153+ unhit_indices_info = []
154+ for name, idx in indices_info:
155+ if idx in hit_indices:
156+ info = hit_indices_info
157+ hit_names.append(name)
158+ else:
159+ info = unhit_indices_info
160+ info.append((name, idx))
161+ final_info = hit_indices_info + unhit_indices_info
162+ self._indices = [idx for (name, idx) in final_info]
163+ self._index_names = [name for (name, idx) in final_info]
164+ if 'index' in debug.debug_flags:
165+ mutter('CombinedGraphIndex reordered: %r', self._indices)
166+ return hit_names
167+
168+ def _move_to_front_by_name(self, hit_names):
169+ """Moves indices named by 'hit_names' to front of the search order, as
170+ described in _move_to_front.
171+ """
172+ # Translate names to index instances, and then call
173+ # _move_to_front_by_index.
174+ indices_info = zip(self._index_names, self._indices)
175+ hit_indices = []
176+ for name, idx in indices_info:
177+ if name in hit_names:
178+ hit_indices.append(idx)
179+ self._move_to_front_by_index(hit_indices)
180
181 def find_ancestry(self, keys, ref_list_num):
182 """Find the complete ancestry for the given set of keys.
183@@ -1390,6 +1472,7 @@
184 we care about.
185 :return: (parent_map, missing_keys)
186 """
187+ # XXX: make this call _move_to_front?
188 missing_keys = set()
189 parent_map = {}
190 keys_to_lookup = set(keys)
191@@ -1475,6 +1558,11 @@
192 ' Raising original exception.')
193 raise exc_type, exc_value, exc_traceback
194
195+ def set_sibling_indices(self, sibling_combined_graph_indices):
196+ """Set the CombinedGraphIndex objects to reorder after reordering self.
197+ """
198+ self._sibling_indices = sibling_combined_graph_indices
199+
200 def validate(self):
201 """Validate that everything in the index can be accessed."""
202 while True:
203
204=== modified file 'bzrlib/repofmt/pack_repo.py'
205--- bzrlib/repofmt/pack_repo.py 2010-02-12 11:58:21 +0000
206+++ bzrlib/repofmt/pack_repo.py 2010-04-08 07:11:22 +0000
207@@ -587,26 +587,6 @@
208 flush_func=flush_func)
209 self.add_callback = None
210
211- def replace_indices(self, index_to_pack, indices):
212- """Replace the current mappings with fresh ones.
213-
214- This should probably not be used eventually, rather incremental add and
215- removal of indices. It has been added during refactoring of existing
216- code.
217-
218- :param index_to_pack: A mapping from index objects to
219- (transport, name) tuples for the pack file data.
220- :param indices: A list of indices.
221- """
222- # refresh the revision pack map dict without replacing the instance.
223- self.index_to_pack.clear()
224- self.index_to_pack.update(index_to_pack)
225- # XXX: API break - clearly a 'replace' method would be good?
226- self.combined_index._indices[:] = indices
227- # the current add nodes callback for the current writable index if
228- # there is one.
229- self.add_callback = None
230-
231 def add_index(self, index, pack):
232 """Add index to the aggregate, which is an index for Pack pack.
233
234@@ -619,7 +599,7 @@
235 # expose it to the index map
236 self.index_to_pack[index] = pack.access_tuple()
237 # put it at the front of the linear index list
238- self.combined_index.insert_index(0, index)
239+ self.combined_index.insert_index(0, index, pack.name)
240
241 def add_writable_index(self, index, pack):
242 """Add an index which is able to have data added to it.
243@@ -645,6 +625,7 @@
244 self.data_access.set_writer(None, None, (None, None))
245 self.index_to_pack.clear()
246 del self.combined_index._indices[:]
247+ del self.combined_index._index_names[:]
248 self.add_callback = None
249
250 def remove_index(self, index):
251@@ -653,7 +634,9 @@
252 :param index: An index from the pack parameter.
253 """
254 del self.index_to_pack[index]
255- self.combined_index._indices.remove(index)
256+ pos = self.combined_index._indices.index(index)
257+ del self.combined_index._indices[pos]
258+ del self.combined_index._index_names[pos]
259 if (self.add_callback is not None and
260 getattr(index, 'add_nodes', None) == self.add_callback):
261 self.add_callback = None
262@@ -1415,11 +1398,20 @@
263 self.inventory_index = AggregateIndex(self.reload_pack_names, flush)
264 self.text_index = AggregateIndex(self.reload_pack_names, flush)
265 self.signature_index = AggregateIndex(self.reload_pack_names, flush)
266+ all_indices = [self.revision_index, self.inventory_index,
267+ self.text_index, self.signature_index]
268 if use_chk_index:
269 self.chk_index = AggregateIndex(self.reload_pack_names, flush)
270+ all_indices.append(self.chk_index)
271 else:
272 # used to determine if we're using a chk_index elsewhere.
273 self.chk_index = None
274+ # Tell all the CombinedGraphIndex objects about each other, so they can
275+ # share hints about which pack names to search first.
276+ all_combined = [agg_idx.combined_index for agg_idx in all_indices]
277+ for combined_idx in all_combined:
278+ combined_idx.set_sibling_indices(
279+ set(all_combined).difference([combined_idx]))
280 # resumed packs
281 self._resumed_packs = []
282
283
284=== modified file 'bzrlib/tests/per_pack_repository.py'
285--- bzrlib/tests/per_pack_repository.py 2010-02-23 07:43:11 +0000
286+++ bzrlib/tests/per_pack_repository.py 2010-04-08 07:11:22 +0000
287@@ -288,6 +288,23 @@
288 repo._pack_collection._clear_obsolete_packs()
289 self.assertTrue(repo_transport.has('obsolete_packs/.nfsblahblah'))
290
291+ def test_pack_collection_sets_sibling_indices(self):
292+ """The CombinedGraphIndex objects in the pack collection are all
293+ siblings of each other, so that search-order reorderings will be copied
294+ to each other.
295+ """
296+ repo = self.make_repository('repo')
297+ pack_coll = repo._pack_collection
298+ indices = set([pack_coll.revision_index, pack_coll.inventory_index,
299+ pack_coll.text_index, pack_coll.signature_index])
300+ if pack_coll.chk_index is not None:
301+ indices.add(pack_coll.chk_index)
302+ combined_indices = set(idx.combined_index for idx in indices)
303+ for combined_index in combined_indices:
304+ self.assertEqual(
305+ combined_indices.difference([combined_index]),
306+ combined_index._sibling_indices)
307+
308 def test_pack_after_two_commits_packs_everything(self):
309 format = self.get_format()
310 tree = self.make_branch_and_tree('.', format=format)
311
312=== modified file 'bzrlib/tests/test_index.py'
313--- bzrlib/tests/test_index.py 2010-03-05 17:56:55 +0000
314+++ bzrlib/tests/test_index.py 2010-04-08 07:11:22 +0000
315@@ -1380,6 +1380,50 @@
316 self.assertListRaises(errors.NoSuchFile, index.iter_entries_prefix,
317 [('1',)])
318
319+
320+ def make_index_with_simple_nodes(self, name, num_nodes=1):
321+ """Make an index named after 'name', with keys named after 'name' too.
322+
323+ Nodes will have a value of '' and no references.
324+ """
325+ nodes = [
326+ (('index-%s-key-%s' % (name, n),), '', ())
327+ for n in range(1, num_nodes+1)]
328+ return self.make_index('index-%s' % name, 0, nodes=nodes)
329+
330+ def test_reorder_after_iter_entries(self):
331+ # Four indices: [key1] in index1, [key2,key3] in index2, [] in index3,
332+ # [key4] in index4.
333+ index = CombinedGraphIndex([])
334+ index.insert_index(0, self.make_index_with_simple_nodes('1'), '1')
335+ index.insert_index(1, self.make_index_with_simple_nodes('2'), '2')
336+ index.insert_index(2, self.make_index_with_simple_nodes('3'), '3')
337+ index.insert_index(3, self.make_index_with_simple_nodes('4'), '4')
338+ index1, index2, index3, index4 = index._indices
339+ # Query a key from index4 and index2.
340+ self.assertLength(2, list(index.iter_entries(
341+ [('index-4-key-1',), ('index-2-key-1',)])))
342+ # Now index2 and index4 should be moved to the front (and index1 should
343+ # still be before index3).
344+ self.assertEqual([index2, index4, index1, index3], index._indices)
345+ self.assertEqual(['2', '4', '1', '3'], index._index_names)
346+
347+ def test_reorder_propagates_to_siblings(self):
348+ # Two CombinedGraphIndex objects, with the same number of indicies with
349+ # matching names.
350+ cgi1 = CombinedGraphIndex([])
351+ cgi2 = CombinedGraphIndex([])
352+ cgi1.insert_index(0, self.make_index_with_simple_nodes('1-1'), 'one')
353+ cgi1.insert_index(1, self.make_index_with_simple_nodes('1-2'), 'two')
354+ cgi2.insert_index(0, self.make_index_with_simple_nodes('2-1'), 'one')
355+ cgi2.insert_index(1, self.make_index_with_simple_nodes('2-2'), 'two')
356+ index2_1, index2_2 = cgi2._indices
357+ cgi1.set_sibling_indices([cgi2])
358+ # Trigger a reordering in cgi1. cgi2 will be reordered as well.
359+ list(cgi1.iter_entries([('index-1-2-key-1',)]))
360+ self.assertEqual([index2_2, index2_1], cgi2._indices)
361+ self.assertEqual(['two', 'one'], cgi2._index_names)
362+
363 def test_validate_reloads(self):
364 index, reload_counter = self.make_combined_index_with_missing()
365 index.validate()