Code review comment for lp:~spiv/bzr/smarter-index-search

John A Meinel (jameinel) wrote:

Andrew Bennetts wrote:
> Andrew Bennetts has proposed merging lp:~spiv/bzr/smarter-index-search into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
>
> Optimise index lookups in repositories with many pack files.
>
> First, the headline: this greatly improves "bzr pull" of one new revision of grub from savannah by HTTP (as reported on the mailing list, which has further analysis):
>
> bzr.dev: 2424kB (50.2kB/s r:2395kB w:30kB)
> this patch: 1034kB (43.3kB/s r:1022kB w:12kB)
>
> Given that the pack data transferred is 701266 bytes (which itself seems quite large for such a small change...), that brings the index-searching overhead from 2.42x to 0.45x of the bytes read. It also halves the wall-clock time :)
>
> That repo has I think 14 packs, and bzr.dev tries 11 indices for each of rix, iix, etc before finding the data it needs for that fetch.
>
> There are two parts to this change:
>
> 1) when a CombinedGraphIndex performs a lookup, it shuffles the index or indices that contained the records to the front of self._indices on the assumption that future lookups should try those first.
> 2) it propagates that reordering to the other CombinedGraphIndex objects from the same pack collection. This is done by a) associating a name (the pack name) with the elements of CombinedGraphIndex, b) linking the revisions/inventories/etc CombinedGraphIndex objects belonging to a single pack collection by setting a _sibling_indices attribute on them, and c) using those links and names to apply the same reordering to those sibling indices.
>
> I've been pretty conservative with API changes: the new behaviour is only activated by optional keyword arguments, so existing uses of CombinedGraphIndex should see no change of behaviour (including no improvement). This is to make it as easy as possible to backport this change to 2.1 and 2.0 if we choose to.
>
> I think this change needs some tests before it's truly ready to merge, but it's getting to the end of my work week and I think this code is ready for feedback, so here it is!
>

Thanks for doing this. I did the work in Packer, but it is nice to have
it here.
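
For anyone skimming, point (1) of the proposal boils down to roughly this
(just a sketch with a hypothetical method name, not the code from the
patch):

def _reorder_indices(self, hit_indices):
    # Move the indices that answered the last lookup to the front of the
    # search order; the relative order of the rest is preserved.
    unhit = [idx for idx in self._indices if idx not in hit_indices]
    self._indices = list(hit_indices) + unhit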

I would tend to make "auto_reorder" always on. I don't see it ever
making things worse, and there should only be a small overhead for the
reordering. As such, I would also get rid of the constructor flag.

Doing it by names is OK; I think Index already has _name for most
indices. You could also order by "access_tuple()", which I believe is a
public API and returns (transport, name) for the original .pack file. The
*nice* thing about doing that is that you don't have to introduce any
new APIs here. But if it isn't as clean, then don't worry too much about it.

It would also avoid having to rebuild the name/index mapping on every move call:
+        indices_info = zip(self._index_names, self._indices)
+        hit_indices_info = []
+        hit_names = []
+        unhit_indices_info = []
+        for name, idx in indices_info:
+            if idx in hit_indices:
+                info = hit_indices_info
+                hit_names.append(name)
+            else:
+                info = unhit_indices_info
+            info.append((name, idx))

(Note: I realized late that access_tuple() is on AggregateIndex and not
GraphIndex, but you might be able to pull something out from there.)
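
Either way, what I mean by not rebuilding the mapping every time is
something along these lines (only a sketch: the method shape and the
cached _index_name_map attribute are made up, and it assumes hit_indices
is the list of index objects that answered the lookup):

def _move_to_front(self, hit_indices):
    # Build the index -> name mapping once and cache it; it stays valid
    # across reorderings and only needs refreshing when indices are added
    # or removed.  (Assumes __init__ sets self._index_name_map = None.)
    if self._index_name_map is None:
        self._index_name_map = dict(zip(self._indices, self._index_names))
    unhit = [idx for idx in self._indices if idx not in hit_indices]
    self._indices = list(hit_indices) + unhit
    self._index_names = [self._index_name_map[idx] for idx in self._indices]
    # Return the names that were hit so sibling indices can reorder too.
    return [self._index_name_map[idx] for idx in hit_indices]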

If it isn't too much overhead, you might consider:

keys = set(keys)  # as iter_entries already does
keys_remove = keys.remove
hit_indices = []
for index in self._indices:
    if not keys:
        break
    index_hit_count = 0
    for node in index.iter_entries(keys):
        keys_remove(node[1])
        yield node
        index_hit_count += 1
    # Remember how many keys each index satisfied, so the busiest
    # indices end up at the front of the search order.
    if index_hit_count:
        hit_indices.append((index_hit_count, index))

hit_indices.sort(reverse=True)
self._move_index_order(hit_indices)

The nice thing about it is that it is adaptive: if you start a search
and hit in index 1, you keep looking there first, but if the keys start
showing up in index 2, then you'll switch. The only other bit of logic
that I could think of is:

if index_hit_count == index.key_count():
    index_hit_count = 0  # or maybe 1

I realize that to do this correctly you'd need to track hit counts
between calls, which isn't realistic. However, that basic check will work
for pack files that only have 1 or 2 keys, which get found quickly. (And
commit creates indexes with only 1 entry, so that case is fairly common.)
Just a thought, though.

+        # Tell all the CombinedGraphIndex objects about each other, so they
+        # can share hints about which pack names to search first.
+        all_combined = [agg_idx.combined_index for agg_idx in all_indices]
+        for combined_idx in all_combined:
+            combined_idx._sibling_indices = (
+                set(all_combined).difference([combined_idx]))
         # resumed packs

^- You are accessing a private variable, _sibling_indices, here. I'd
rather make that a public API that you call.
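
Something as small as this would do (a sketch only; the method name is my
suggestion, not something already in the patch):

# on CombinedGraphIndex:
def set_sibling_indices(self, sibling_combined_graph_indices):
    """Set the CombinedGraphIndex objects to share search-order hints with."""
    self._sibling_indices = sibling_combined_graph_indices

# and the caller above becomes:
for combined_idx in all_combined:
    combined_idx.set_sibling_indices(
        set(all_combined).difference([combined_idx]))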

And certainly, this should have tests. I don't think it would be too
hard to set up a case with ~3 indices, put them in a CombinedGraphIndex,
query for rev X, make sure it is at the front, query for Y, etc.
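
Roughly along these lines (only a sketch: the make_index helper and the
auto_reorder flag name are assumptions about what the test helpers and
patch actually provide):

def test_reorders_after_hit(self):
    # Three one-key indices; make_index is assumed to behave like the
    # helper in the existing index tests, returning a GraphIndex built
    # from the given (key, value) nodes.
    idx1 = self.make_index('1', nodes=[(('rev-1',), '')])
    idx2 = self.make_index('2', nodes=[(('rev-2',), '')])
    idx3 = self.make_index('3', nodes=[(('rev-3',), '')])
    combined = CombinedGraphIndex([idx1, idx2, idx3], auto_reorder=True)
    list(combined.iter_entries([('rev-3',)]))
    # The index that answered the query should now be searched first.
    self.assertEqual(idx3, combined._indices[0])
    list(combined.iter_entries([('rev-2',)]))
    self.assertEqual(idx2, combined._indices[0])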

John
=:->
