John A Meinel wrote:
[...]
> Thanks for doing this. I did the work in Packer, but it is nice to have
> it here.
Ah! I thought we had something like this. I'll see what code and ideas
I can reuse.
> I would tend to set "auto_reorder" to always on. I don't see it ever
> making things worse, and there should only be a small overhead for the
> reordering. As such, I would also get rid of the constructor flag.
Hmm, ok. If I were just targeting bzr.dev I would have done this
already, but if you're confident it won't cause unexpected issues I'm
happy to make it unconditional (including in backports).
> Doing it by names is ok, I think Index already has _name for most
> indices. You could also order by "access_tuple()" which I believe is a
> public API and returns (transport, name) to the original .pack file. The
> *nice* thing about doing that is that you don't have to introduce any
> new apis here. But if it isn't as clean, then don't worry too much about it.
_name is really a path for _transport, and as such includes the suffix.
So I don't think it's as clean either code-wise or conceptually to rely
on that. And as you noticed later access_tuple is on AggregateIndex, so
it's not readily available either.
> If it isn't too much overhead, you might consider:
>
> keys_remove = keys.remove
> # Be careful, if keys starts empty, we would break, and get a name error
> hit_indices = []
> index_hit_count = 0
> for index in self._indices:
>     if not keys:
>         break
>     index_hit_count = 0
>     for node in index.iter_entries(keys):
>         keys_remove(node[1])
>         yield node
>         index_hit_count += 1
>     if index_hit_count:
>         hit_indices.append((index_hit_count, index))
>
> hit_indices.sort(reverse=True)
>
> self._move_index_order(hit_indices)
>
>
> The nice thing about it, is that it is adaptive. In that if you start a
> search, and hit it in index 1, then you keep looking there, but if the
> keys start showing up in index 2, then you'll switch. The only other bit
> of logic that I could think of is:
I don't quite follow you here. Wouldn't my logic switch to preferring
index 2 anyway?
It's not clear to me that ordering the hit_indices by hit count is
better.
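To make my question concrete, here is a small standalone sketch (all names invented for illustration, nothing from bzrlib) contrasting the two policies: sorting the hit indices by per-query hit count, versus simply moving any index that had hits to the front in its existing order:

```python
# Two hypothetical reorder policies (names invented for illustration).

def reorder_by_hit_count(indices, hits):
    """Sort indices so the ones with the most hits this query come first.

    sorted() is stable, so indices with equal counts keep their order."""
    return sorted(indices, key=lambda idx: hits.get(idx, 0), reverse=True)

def reorder_hits_first(indices, hits):
    """Move every index that had at least one hit to the front,
    preserving the existing relative order within each group."""
    hit = [idx for idx in indices if hits.get(idx, 0)]
    miss = [idx for idx in indices if not hits.get(idx, 0)]
    return hit + miss

indices = ['pack-a', 'pack-b', 'pack-c']
hits = {'pack-b': 1, 'pack-c': 5}    # pack-c answered more keys this query
print(reorder_by_hit_count(indices, hits))  # ['pack-c', 'pack-b', 'pack-a']
print(reorder_hits_first(indices, hits))    # ['pack-b', 'pack-c', 'pack-a']
```

In this example the two policies disagree: sorting by count puts pack-c first, while hits-first keeps pack-b ahead because it already appeared earlier in the search order.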
> if index_hit_count == index.key_count():
> index_hit_count = 0 (or maybe 1)
>
> I realize to do this correctly, you'd need to track hit count between
> calls, which isn't realistic. However, that basic check will work for
> pack files that only have 1 or 2 keys that are found quickly. (And
> commit creates indexes with only 1 entry, so it is fairly common.) Just
> a thought, though.
This is an interesting idea. Although perhaps those lookups will
already fail very quickly, because by that point the entire index has
already been read and cached? I don't think I'll do this in this patch
as I think it'd need more investigation and measurement to justify, but
it might make a good incremental improvement later.
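For reference, the heuristic John describes could be sketched like this (FakeIndex and effective_hits are made-up names, not bzrlib API):

```python
class FakeIndex:
    """Stand-in for a pack index; only key_count() matters here."""
    def __init__(self, name, keys):
        self.name = name
        self._keys = set(keys)

    def key_count(self):
        return len(self._keys)

def effective_hits(index, index_hit_count):
    # If a query found every key the index holds, the index is exhausted
    # and unlikely to satisfy future lookups, so rank it last.
    if index_hit_count == index.key_count():
        return 0
    return index_hit_count

tiny = FakeIndex('commit-pack', [('rev-1',)])    # 1-entry index from a commit
big = FakeIndex('big-pack', [('rev-%d' % i,) for i in range(100)])
print(effective_hits(tiny, 1))   # 0: one hit, but the whole index is spent
print(effective_hits(big, 1))    # 1: one hit out of 100 keys
```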
> + # Tell all the CombinedGraphIndex objects about each other, so they can
> + # share hints about which pack names to search first.
> + all_combined = [agg_idx.combined_index for agg_idx in all_indices]
> + for combined_idx in all_combined:
> +     combined_idx._sibling_indices = set(all_combined).difference(
> +         [combined_idx])
> # resumed packs
>
> ^- You are accessing a private variable _sibling_indices here. I'd
> rather make it a public api that you call.
Fair enough.
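Something like the following shape, say (the method and class bodies here are only a guess at what the public API might look like, not the real bzrlib code):

```python
class CombinedGraphIndex:
    """Minimal stand-in; only the sibling bookkeeping is shown."""
    def __init__(self):
        self._sibling_indices = set()

    def set_sibling_indices(self, sibling_combined_graph_indices):
        """Set the CombinedGraphIndex objects to share hints with."""
        self._sibling_indices = set(sibling_combined_graph_indices)

# The quoted hunk would then become:
all_combined = [CombinedGraphIndex() for _ in range(3)]
for combined_idx in all_combined:
    combined_idx.set_sibling_indices(
        set(all_combined).difference([combined_idx]))
```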
> And certainly, this should have tests. I don't think it would be too
> hard to set up a case with ~3 indices, put them in a CombinedGraph,
> query for rev X, make sure it is in the front, query for Y, etc.
Yeah, I agree. Will do that now :)
-Andrew.