John A Meinel wrote:
[...]
> Thanks for doing this. I did the work in Packer, but it is nice to have
> it here.
Ah! I thought we had something like this. I'll see what code and ideas
I can reuse.
> I would tend to set "auto_reorder" to always on. I don't see it ever
> making things worse, and there should only be a small overhead for the
> reordering. As such, I would also get rid of the constructor flag.
Hmm, ok. If I were just targeting bzr.dev I would have done this
already, but if you're confident it won't cause unexpected issues I'm
happy to make it unconditional (including in backports).
> Doing it by names is ok, I think Index already has _name for most
> indices. You could also order by "access_tuple()" which I believe is a
> public API and returns (transport, name) to the original .pack file. The
> *nice* thing about doing that is that you don't have to introduce any
> new apis here. But if it isn't as clean, then don't worry too much about it.
_name is really a path for _transport, and as such includes the suffix.
So I don't think it's as clean either code-wise or conceptually to rely
on that. And as you noticed later access_tuple is on AggregateIndex, so
it's not readily available either.
> If it isn't too much overhead, you might consider:
>
> keys_remove = keys.remove
> # Be careful, if keys starts empty, we would break, and get a name error
> hit_indices = []
> index_hit_count = 0
> for index in self._indices:
>     if not keys:
>         break
>     index_hit_count = 0
>     for node in index.iter_entries(keys):
>         keys_remove(node[1])
>         yield node
>         index_hit_count += 1
>     if index_hit_count:
>         hit_indices.append((index_hit_count, index))
>
> hit_indices.sort(reverse=True)
>
> self._move_index_order(hit_indices)
>
>
> The nice thing about it, is that it is adaptive. In that if you start a
> search, and hit it in index 1, then you keep looking there, but if the
> keys start showing up in index 2, then you'll switch. The only other bit
> of logic that I could think of is:
I don't quite follow you here. Wouldn't my logic switch to preferring
index 2 anyway?
It's not clear to me that ordering the hit_indices by hit count is
better.
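To make my question concrete, here is a small standalone sketch (all names invented for illustration, nothing from bzrlib) contrasting the two policies: sorting the hit indices by per-query hit count, versus simply moving any index that had hits to the front in its existing order:

```python
# Two hypothetical reorder policies (names invented for illustration).

def reorder_by_hit_count(indices, hits):
    """Sort indices so the ones with the most hits this query come first.

    sorted() is stable, so indices with equal counts keep their order."""
    return sorted(indices, key=lambda idx: hits.get(idx, 0), reverse=True)

def reorder_hits_first(indices, hits):
    """Move every index that had at least one hit to the front,
    preserving the existing relative order within each group."""
    hit = [idx for idx in indices if hits.get(idx, 0)]
    miss = [idx for idx in indices if not hits.get(idx, 0)]
    return hit + miss

indices = ['pack-a', 'pack-b', 'pack-c']
hits = {'pack-b': 1, 'pack-c': 5}    # pack-c answered more keys this query
print(reorder_by_hit_count(indices, hits))  # ['pack-c', 'pack-b', 'pack-a']
print(reorder_hits_first(indices, hits))    # ['pack-b', 'pack-c', 'pack-a']
```

In this example the two policies disagree: sorting by count puts pack-c first, while hits-first keeps pack-b ahead because it already appeared earlier in the search order.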
> if index_hit_count == index.key_count():
> index_hit_count = 0 (or maybe 1)
>
> I realize to do this correctly, you'd need to track hit count between
> calls, which isn't realistic. However, that basic check will work for
> pack files that only have 1 or 2 keys that are found quickly. (And
> commit creates indexes with only 1 entry, so it is fairly common.) Just
> a thought, though.
This is an interesting idea. Although perhaps those lookups will
already fail very quickly, because by that point the entire index has
already been read and cached? I don't think I'll do this in this patch
as I think it'd need more investigation and measurement to justify, but
it might make a good incremental improvement later.
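For reference, the heuristic John describes could be sketched like this (FakeIndex and effective_hits are made-up names, not bzrlib API):

```python
class FakeIndex:
    """Stand-in for a pack index; only key_count() matters here."""
    def __init__(self, name, keys):
        self.name = name
        self._keys = set(keys)

    def key_count(self):
        return len(self._keys)

def effective_hits(index, index_hit_count):
    # If a query found every key the index holds, the index is exhausted
    # and unlikely to satisfy future lookups, so rank it last.
    if index_hit_count == index.key_count():
        return 0
    return index_hit_count

tiny = FakeIndex('commit-pack', [('rev-1',)])    # 1-entry index from a commit
big = FakeIndex('big-pack', [('rev-%d' % i,) for i in range(100)])
print(effective_hits(tiny, 1))   # 0: one hit, but the whole index is spent
print(effective_hits(big, 1))    # 1: one hit out of 100 keys
```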
> + # Tell all the CombinedGraphIndex objects about each other, so they can
> + # share hints about which pack names to search first.
> + all_combined = [agg_idx.combined_index for agg_idx in all_indices]
> + for combined_idx in all_combined:
> +     combined_idx._sibling_indices = set(all_combined).difference(
> +         [combined_idx])
> # resumed packs
>
> ^- You are accessing a private variable _sibling_indices here. I'd
> rather make it a public api that you call.
Fair enough.
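Something like the following shape, say (the method and class bodies here are only a guess at what the public API might look like, not the real bzrlib code):

```python
class CombinedGraphIndex:
    """Minimal stand-in; only the sibling bookkeeping is shown."""
    def __init__(self):
        self._sibling_indices = set()

    def set_sibling_indices(self, sibling_combined_graph_indices):
        """Set the CombinedGraphIndex objects to share hints with."""
        self._sibling_indices = set(sibling_combined_graph_indices)

# The quoted hunk would then become:
all_combined = [CombinedGraphIndex() for _ in range(3)]
for combined_idx in all_combined:
    combined_idx.set_sibling_indices(
        set(all_combined).difference([combined_idx]))
```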
> And certainly, this should have tests. I don't think it would be too
> hard to set up a case with ~3 indices, put them in a CombinedGraph,
> query for rev X, make sure it is in the front, query for Y, etc.
Yeah, I agree. Will do that now :)
-Andrew.