Bazaar

Code review comment for lp:~spiv/bzr/smarter-index-search

smarter-index-search
Merge into bzr.dev

Revision history for this message

John A Meinel (jameinel) wrote on 2010-03-17:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> John A Meinel wrote:
> [...]
>> Thanks for doing this. I did the work in Packer, but it is nice to have
>> it here.
>
> Ah! I thought we had something like this. I'll see what code and ideas
> I can reuse.
>
>> I would tend to set "auto_reorder" to always on. I don't see it ever
>> making things worse, and there should only be a small overhead for the
>> reordering. As such, I would also get rid of the constructor flag.
>
> Hmm, ok. If I was just targetting bzr.dev I would have done this
> already, but if you're confident it won't cause unexpected issues I'm
> happy to make it unconditional (including in backports).
>
>> Doing it by names is ok, I think Index already has _name for most
>> indices. You could also order by "access_tuple()" which I believe is a
>> public API and returns (transport, name) to the original .pack file. The
>> *nice* thing about doing that is that you don't have to introduce any
>> new apis here. But if it isn't as clean, then don't worry too much about it.
>
> _name is really a path for _transport, and as such includes the suffix.
> So I don't think it's as clean either code-wise or conceptually to rely
> on that. And as you noticed later access_tuple is on AggregateIndex, so
> it's not readily available either.
>
>> If it isn't too much overhead, you might consider:
>>
>> keys_remove = keys.remove
>> # Be careful, if keys starts empty, we would break, and get a name error
>> index_hit_count = 0
>> for index in self._indices:
>> if not keys:
>> break
>> index_hit_count = 0
>> for node in index.iter_entries(keys):
>> keys_remove(node[1])
>> yield node
>> index_hit_count += 1
>>
>> if index_hit_count:
>> hit_indices.append((index_hit_count, index))
>>
>> hit_indices.sort(reverse=True)
>>
>> self._move_index_order(hit_indices)
>>
>>
>> The nice thing about it, is that it is adaptive. In that if you start a
>> search, and hit it in index 1, then you keep looking there, but if the
>> keys start showing up in index 2, then you'll switch. The only other bit
>> of logic that I could think of is:
>
> I don't quite follow you here. Wouldn't my logic switch to preferring
> index 2 anyway?
>
> It's not clear to me that ordering the hit_indices by hit count is
> better.
>

The point is that if the first index in the list gets 1 key hit, but the
second gets 10, and the third gets 5, then it will get sorted into
(index2, index3, index1) rather than (index1, index2, index3). At least,
what I saw with your code was that it is just 'preserve existing order,
but move the ones hit to the front'. Put another way, say the hits are:

5 index1
10 index2
0 index3
4 index5
0 index6

your version would sort to

index1
index2
index5
index3
index6

mine would sort to

index2
index1
index5
index3
index6

I think the latter is slightly preferred. I don't have any numbers to
back that up, just a feeling. (Most likely in this case index2 is a
'bigger' index, and thus has more to *be* hit.)

As for small indices getting rejected quickly, it is true they won't
encounter any I/O overhead, which may be sufficient. It is a function
call, but that is probably tiny overhead, and they'll get sorted late
quickly anyway.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkug6fEACgkQJdeBCYSNAAMFDwCeNFunVP9tcpRfDfJBVMki8W7O
pe0AoJRHXzjuKaU0a0I98H7GKuplAaBu
=gpIB
-----END PGP SIGNATURE-----

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> John A Meinel wrote:
> [...]
>> Thanks for doing this. I did the work in Packer, but it is nice to have
>> it here.
> 
> Ah!  I thought we had something like this.  I'll see what code and ideas
> I can reuse.
> 
>> I would tend to set "auto_reorder" to always on. I don't see it ever
>> making things worse, and there should only be a small overhead for the
>> reordering. As such, I would also get rid of the constructor flag.
> 
> Hmm, ok.  If I was just targetting bzr.dev I would have done this
> already, but if you're confident it won't cause unexpected issues I'm
> happy to make it unconditional (including in backports).
> 
>> Doing it by names is ok, I think Index already has _name for most
>> indices. You could also order by "access_tuple()" which I believe is a
>> public API and returns (transport, name) to the original .pack file. The
>> *nice* thing about doing that is that you don't have to introduce any
>> new apis here. But if it isn't as clean, then don't worry too much about it.
> 
> _name is really a path for _transport, and as such includes the suffix.
> So I don't think it's as clean either code-wise or conceptually to rely
> on that.  And as you noticed later access_tuple is on AggregateIndex, so
> it's not readily available either.
> 
>> If it isn't too much overhead, you might consider:
>>
>> keys_remove = keys.remove
>> # Be careful, if keys starts empty, we would break, and get a name error
>> index_hit_count = 0
>> for index in self._indices:
>>   if not keys:
>>     break
>>   index_hit_count = 0
>>   for node in index.iter_entries(keys):
>>     keys_remove(node[1])
>>     yield node
>>     index_hit_count += 1
>>
>> if index_hit_count:
>>   hit_indices.append((index_hit_count, index))
>>
>> hit_indices.sort(reverse=True)
>>
>> self._move_index_order(hit_indices)
>>
>>
>> The nice thing about it, is that it is adaptive. In that if you start a
>> search, and hit it in index 1, then you keep looking there, but if the
>> keys start showing up in index 2, then you'll switch. The only other bit
>> of logic that I could think of is:
> 
> I don't quite follow you here.  Wouldn't my logic switch to preferring
> index 2 anyway?
> 
> It's not clear to me that ordering the hit_indices by hit count is
> better.
>

5 index1
10 index2
0 index3
4 index5
0 index6

your version would sort to

index1
index2
index5
index3
index6

mine would sort to

index2
index1
index5
index3
index6

I think the latter is slightly preferred. I don't have any numbers to
back that up, just a feeling. (Most likely in this case index2 is a
'bigger' index, and thus has more to *be* hit.)

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkug6fEACgkQJdeBCYSNAAMFDwCeNFunVP9tcpRfDfJBVMki8W7O
pe0AoJRHXzjuKaU0a0I98H7GKuplAaBu
=gpIB
-----END PGP SIGNATURE-----

« Back to merge proposal