loggerhead

Code review comment for lp:~mnordhoff/loggerhead/statictuples

statictuples
Merge into trunk-rich

Revision history for this message

Matt Nordhoff (mnordhoff) wrote on 2009-10-22:

Michael Hudson wrote:
> Matt Nordhoff wrote:
>> Matt Nordhoff has proposed merging lp:~mnordhoff/loggerhead/statictuples into lp:loggerhead.
>>
>> Requested reviews:
>> Loggerhead-team (loggerhead-team)
>>
>>
>> This totally isn't ready to merge (though it does work 100%, AFAICT), but I wanted to bring it to your attention.
>>
>> This branch changes Loggerhead to make use of StaticTuples, jam's recent addition to bzrlib. (If you don't know, they're very similar to tuples, but they are faster and save memory by only holding certain types of objects that can't get in reference cycles, so they can avoid the GC overhead.)
>
> Thanks for poking at this.
>
>> I'm bundling a copy of the Python version of StaticTuple for compatibility with older bzr versions. This has a performance cost, but hopefully it's minor, and worth it.
>>
>> I don't really have information on how much of a savings this is; I haven't been running it long enough for memory usage to stabilize. But I figure it should reduce the number of tuples by 50-75%, which is 100,000-300,000 for me.
>>
>> BTW, I fixed a missing import since the last time Launchpad mirrored this branch.
>>
>> Also, you need to run it against lp:~jameinel/bzr/2.1-st-concat, which has a proposal up for merging to bzr.dev.
>>
>> What's left to do:
>>
>> 1.) It's not possible to marshal StaticTuples, so RevInfoDiskCache currently uses listcomps to convert to/from regular tuples when serializing/deserializing. This probably has a performance impact, but all I cared about at the time was getting it working.
>
> I'd definitely want to measure the performance impact of this on a
> launchpad-sized history before landing this.

I don't have a copy of Launchpad, but I'd be happy to do some sort of
simple, mostly-good-enough test (:-P) on something less scary-big like
bzr.dev or Python.

>> It's not possible to pickle them either, so we can't switch to that. Besides, pickling uses more disk space than marshal. (As a quick test, I took a random row from revinfo.sql. Without changing from tuple to StaticTuple, changing it from marshal+zlib to pickle+zlib (using pickle protocol version 2) went from 39 KB to 54 KB.)
>
> FWIW, the reason I went with marshal is that it's so much quicker than
> anything else I tried. I have this recollection that cPickle was a
> bunch slower.

Oh, thanks for the explanation.

> Really though, we need to find a tech that allows us to not deserialize
> the whole graph all the time -- a more sophisticated schema or
> something. I have other reasons to want to use a cache that can work
> over the network -- maybe couch? -- but I don't know when I'll get time
> to work on it :(

That sounds good, but it's over my head. It's definitely not going to be
a part of this patch. :-P

>> 2.) Currently I work around StaticTuple.from_sequence() only supporting sequences, not generators. I'll poke jam about that next time I see him.
>>
>> 3.) There weren't very many of them, but I saw some tuples like ('replace', 0, 1, 0, 1) in memory, and want to StaticTupleify them too, but don't know where they come from.
>
> I'd guess that's to do with diff generation.

I just tracked it down. It comes from the sequencematcher in
bzrlib.patiencediff. PatienceDiff could probably be changed to use
StaticTuples, then, not that it's necessary.

It disturbs me that such things are sticking around in memory at all.
Shouldn't they be garbage collected as soon as the diff is generated?

>> 4.) I convert the results of iter_ancestry to StaticTuples. I should probably send a patch to bzr.dev, or maybe just leave them as normal tuples, to save the cost of converting them (though it's probably very minor).
>>
>> 4.) Maybe some things should be interned? I dunno.
>
> I guess this might help if you are looking at branches that share ancestry.

Yeah, probably. I doubt I'll work on it, though.

>> 5.) Some of the things I StaticTupled probably have a negligible impact on memory usage, but hey, why not?
>>
>> 6.) After this, the only other interesting tuples I see in Dozer are related to Subversion exceptions (bug #436406) and a few like ('sha1:...',), which I haven't investigated yet. Plus a couple ('$some_revid', [$graph_cache]).

I just checked out the sha1 tuples. Looks like they're from
bzrlib.chk_map, which jam is working on converting to use StaticTuples.
So I guess I doubly don't need to do anything there.

> The latter are probably to do with the sqlite cache?

Ah, I bet it's RevInfoMemoryCache:

self._cache[key] = (revid, data)

>> 7.) It might be possible for some of Loggerhead's many lists to be turned into StaticTuples. In particular, the graph cache is a list of lists of 3 tuples. (They actually are sometimes modified where they're built, wholehistory.compute_whole_history_data(), but it may be worth the inconvenience of turning them into StaticTuples anyway.)
>
> In general I think this should land -- but only if it doesn't hurt the
> loading of a Launchpad-sized history from the cache too much.

Awesomesocks. :-)

What do you think of switching to cPickle, if it proves to be faster
than the listcomp stuff? Are you okay with the performance and disk
space cost?

> Cheers,
> mwh
--

Michael Hudson wrote:
> Matt Nordhoff wrote:
>> Matt Nordhoff has proposed merging lp:~mnordhoff/loggerhead/statictuples into lp:loggerhead.
>>
>>     Requested reviews:
>>     Loggerhead-team (loggerhead-team)
>>
>>
>> This totally isn't ready to merge (though it does work 100%, AFAICT), but I wanted to bring it to your attention.
>>
>> This branch changes Loggerhead to make use of StaticTuples, jam's recent addition to bzrlib. (If you don't know, they're very similar to tuples, but they are faster and save memory by only holding certain types of objects that can't get in reference cycles, so they can avoid the GC overhead.)
> 
> Thanks for poking at this.
> 
>> I'm bundling a copy of the Python version of StaticTuple for compatibility with older bzr versions. This has a performance cost, but hopefully it's minor, and worth it.
>>
>> I don't really have information on how much of a savings this is; I haven't been running it long enough for memory usage to stabilize. But I figure it should reduce the number of tuples by 50-75%, which is 100,000-300,000 for me.
>>
>> BTW, I fixed a missing import since the last time Launchpad mirrored this branch.
>>
>> Also, you need to run it against lp:~jameinel/bzr/2.1-st-concat, which has a proposal up for merging to bzr.dev.
>>
>> What's left to do:
>>
>> 1.) It's not possible to marshal StaticTuples, so RevInfoDiskCache currently uses listcomps to convert to/from regular tuples when serializing/deserializing. This probably has a performance impact, but all I cared about at the time was getting it working.
> 
> I'd definitely want to measure the performance impact of this on a
> launchpad-sized history before landing this.

I don't have a copy of Launchpad, but I'd be happy to do some sort of
simple, mostly-good-enough test (:-P) on something less scary-big like
bzr.dev or Python.

>>     It's not possible to pickle them either, so we can't switch to that. Besides, pickling uses more disk space than marshal. (As a quick test, I took a random row from revinfo.sql. Without changing from tuple to StaticTuple, changing it from marshal+zlib to pickle+zlib (using pickle protocol version 2) went from 39 KB to 54 KB.)
> 
> FWIW, the reason I went with marshal is that it's so much quicker than
> anything else I tried.  I have this recollection that cPickle was a
> bunch slower.

Oh, thanks for the explanation.

> Really though, we need to find a tech that allows us to not deserialize
> the whole graph all the time -- a more sophisticated schema or
> something.  I have other reasons to want to use a cache that can work
> over the network -- maybe couch? -- but I don't know when I'll get time
> to work on it :(

That sounds good, but it's over my head. It's definitely not going to be
a part of this patch. :-P

I just tracked it down. It comes from the sequencematcher in
bzrlib.patiencediff. PatienceDiff could probably be changed to use
StaticTuples, then, not that it's necessary.

It disturbs me that such things are sticking around in memory at all.
Shouldn't they be garbage collected as soon as the diff is generated?

>> 4.) I convert the results of iter_ancestry to StaticTuples. I should probably send a patch to bzr.dev, or maybe just leave them as normal tuples, to save the cost of converting them (though it's probably very minor).
>>
>> 4.) Maybe some things should be interned? I dunno.
> 
> I guess this might help if you are looking at branches that share ancestry.

Yeah, probably. I doubt I'll work on it, though.

I just checked out the sha1 tuples. Looks like they're from
bzrlib.chk_map, which jam is working on converting to use StaticTuples.
So I guess I doubly don't need to do anything there.

> The latter are probably to do with the sqlite cache?

Ah, I bet it's RevInfoMemoryCache:

self._cache[key] = (revid, data)

>> 7.) It might be possible for some of Loggerhead's many lists to be turned into StaticTuples. In particular, the graph cache is a list of lists of 3 tuples. (They actually are sometimes modified where they're built, wholehistory.compute_whole_history_data(), but it may be worth the inconvenience of turning them into StaticTuples anyway.)
> 
> In general I think this should land -- but only if it doesn't hurt the
> loading of a Launchpad-sized history from the cache too much.

Awesomesocks. :-)

What do you think of switching to cPickle, if it proves to be faster
than the listcomp stuff? Are you okay with the performance and disk
space cost?

> Cheers,
> mwh
--

« Back to merge proposal