Bazaar

Code review comment for lp:~jameinel/bzr/1.17-rework-annotate

1.17-rework-annotate
Merge into trunk-old

Revision history for this message

John A Meinel (jameinel) wrote on 2009-07-06:

This is a fairly major overhaul of the annotate code, with an eye on improving annotations overall. In the short term, it just makes it faster (~9s => ~7s for NEWS)

Overview of changes
1) Some small tweaks to BranchBuilder that I needed to write some test cases.
2) Changing from a bunch of loose functions in 'bzrlib.annotate.reannotate*' to a class Annotator.
3) Re-implement _KnitAnnotator as an implementation of this class. I didn't change much about how the texts were extracted and then compared, but there is a much better test suite against it now. It also vetted the design a bit, to ensure that the Annotator could be properly subclassed to do specialized extraction. (Knits and knitpack gives us hints as to what our delta should be, so that we don't have to re-delta every text versus all of its parents. Timing shows this to be rather significant.)
4) Implement a pyrex version of Annotator, to handle some inner-loop functionality. (Nicely enough, you can still subclass from it.)
5) This also includes a fairly fundamental change to how the annotation is produced.
   a) Switch from [(ann1, line1), (ann2, line2)] to ([ann1, ann2], [line1, line2]) style of tracking annotations. This means fewer times when we have to re-cast the data from a list of annotated lines into a list of plain lines.
   b) When computing the delta, compare all plain lines first. The prior code used to compare annotated lines, because it made computing overall annotations faster. However, that introduced bug #387294.
   c) Start tracking *multiple* sources for a given line. This means that rather than resolving 'heads()' and collisions at every revision, we wait to do the resolution on the *final* text. This removes a lot of heads() calls for stuff like NEWS (and is the primary performance improvement). I don't think this gets us a lot when dealing with a command line interface, but it has lots of potential for a GUI (which could then show all sources that introduced a line, etc.)

Still to do:
1) I'd like to break it up a bit more, and allow you to pass some sort of Policy object into the code. That would let you do things like ignore whitespace changes, only annotate based on mainline changes, etc.

2) Support _break_annotation_tie again. This is really something I'd like to fold into Policy, but I guess MySQL wanted a custom implementation. I don't really like the current api, as it is probably fixed at 2 revisions, and it passes in the lines along with the revisions. But I can certainly adapt what I need to the old api. Note that technically the api supports > 2, but I doubt that is actually supported anywhere, but I haven't seen the MySQL implementation as a comparison point.

3) GUI support. I don't really know how to expose the Annotator functionality to a GUI, or if the api really works well there until I've actually written some code. However, this patch has gotten too big already, so I'd like to get it reviewed as is.

This is a fairly major overhaul of the annotate code, with an eye on improving annotations overall. In the short term, it just makes it faster (~9s => ~7s for NEWS)

Overview of changes
1) Some small tweaks to BranchBuilder that I needed to write some test cases.
2) Changing from a bunch of loose functions in 'bzrlib.annotate.reannotate*' to a class Annotator.
3) Re-implement _KnitAnnotator as an implementation of this class. I didn't change much about how the texts were extracted and then compared, but there is a much better test suite against it now. It also vetted the design a bit, to ensure that the Annotator could be properly subclassed to do specialized extraction. (Knits and knitpack gives us hints as to what our delta should be, so that we don't have to re-delta every text versus all of its parents. Timing shows this to be rather significant.)
4) Implement a pyrex version of Annotator, to handle some inner-loop functionality. (Nicely enough, you can still subclass from it.)
5) This also includes a fairly fundamental change to how the annotation is produced. 
   a) Switch from [(ann1, line1), (ann2, line2)] to ([ann1, ann2], [line1, line2]) style of tracking annotations. This means fewer times when we have to re-cast the data from a list of annotated lines into a list of plain lines.
   b) When computing the delta, compare all plain lines first. The prior code used to compare annotated lines, because it made computing overall annotations faster. However, that introduced bug #387294.
   c) Start tracking *multiple* sources for a given line. This means that rather than resolving 'heads()' and collisions at every revision, we wait to do the resolution on the *final* text. This removes a lot of heads() calls for stuff like NEWS (and is the primary performance improvement). I don't think this gets us a lot when dealing with a command line interface, but it has lots of potential for a GUI (which could then show all sources that introduced a line, etc.)

« Back to merge proposal