Merge lp:~ian-clatworthy/bzr/faster-log-file into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Work in progress
Proposed branch: lp:~ian-clatworthy/bzr/faster-log-file
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 220 lines (has conflicts)
Text conflict in NEWS
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-log-file
Reviewer: Martin Pool (status: Needs Information)
Review via email: mp+7535@code.launchpad.net

This proposal supersedes a proposal from 2009-05-27.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

This patch speeds up 'bzr log FILE' on flat-ish histories, as commonly found after an import from svn, cvs and other centralized VCS tools. On OOo, it cuts the time to log a typical file from 29 seconds to 1.5 seconds.

The key to this improvement is starting with the per-file graph and searching the mainline until the revisions of interest are found. That works very well when the history of a project is flat or mostly flat, because it avoids the 27 seconds required to calculate the full revision graph. In a nutshell, the algorithm changes from O(full-history) to O(file-life).

There's certainly room for further smarts here, but I feel this is a useful step forward as it stands.
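The idea above can be sketched in a few lines of Python. This is an illustrative toy, not bzrlib code: `log_file_fastpath`, its arguments and the revision ids are all hypothetical, and the real patch works against bzrlib's graph objects rather than plain lists and sets.

```python
def log_file_fastpath(mainline, file_rev_ids):
    """Return (rev_id, revno) pairs for mainline revisions touching a file.

    mainline     -- mainline revision ids, newest first (hypothetical toy input)
    file_rev_ids -- revisions from the per-file graph, as a set
    """
    missing = set(file_rev_ids)
    result = []
    # Walk the mainline newest-first, pairing each revision with its revno.
    for revno, rev_id in zip(range(len(mainline), 0, -1), mainline):
        if rev_id in missing:
            result.append((rev_id, revno))
            missing.remove(rev_id)
            if not missing:
                break  # every per-file revision found; stop walking
    return result

# On a flat history the walk stops at the file's oldest revision, so the
# cost is O(file-life) rather than O(full-history).
print(log_file_fastpath(['r5', 'r4', 'r3', 'r2', 'r1'], {'r4', 'r2'}))
# → [('r4', 4), ('r2', 2)]
```

The early `break` is the whole point: on a 200k-revision mainline where the file only lived for 1k revisions, the loop never visits the other 199k.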

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

So I see that you avoided "incorrect" results on non-linear ancestries by including checks for this case. However

1) I'm not sure that the checks are complete. For example, it doesn't matter whether the per-file graph has merges or not, as to how the 'include-merges' flag should be handled. Consider the case:

  :
  A
  |\
  | B # Mod foo
  |/
  C # Merge B's changes

In that case we want to see both revisions B and C in the "bzr log foo" output. Even though the per-file graph in this case looks simply like:

 :
 B # Mod foo

2) I'm a bit concerned that we do all of this work with _linear_view_revisions which in the common case for OOo will have to walk the *entire* history (assuming 'bzr log foo' with no -r specified), which we then throw away.

At least, I'm expecting that once a project like OOo changes to a DVCS, they will actually start including merges. Which means that they'll still have 200k revisions in the mainline, but then *also* have all sorts of merge revisions after that 200k...

I guess, I'm mostly worried that while this makes some bits much faster for your OOo testing, it will actually cause regressions in a lot of other cases.

Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this code, just to have it end up deciding to return None?
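John's first example can be made concrete with a toy graph check (hypothetical names and data structures, not bzrlib code): walking only the per-file graph yields just B, while the mainline revision C that merged B has to be recovered from revision ancestry.

```python
# Toy DAG for the example above: C merges B's changes onto the mainline.
graph = {'A': [], 'B': ['A'], 'C': ['A', 'B']}
mainline = ['C', 'A']   # left-hand (mainline) history, newest first
file_revs = {'B'}       # per-file graph: only B modified foo

def ancestry(graph, rev):
    """Return all ancestors of rev, including rev, from a {rev: parents} dict."""
    seen, todo = set(), [rev]
    while todo:
        r = todo.pop()
        if r not in seen:
            seen.add(r)
            todo.extend(graph[r])
    return seen

# Mainline revisions that pulled in a per-file revision via a merge:
merge_points = [m for m in mainline
                if file_revs & (ancestry(graph, m) - {m})]
print(merge_points)  # → ['C']
```

With -n0 both B and C should appear in the log, so a fast path that reports only B would be wrong; this is why the patch falls back to the old algorithm whenever merge levels are requested.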

review: Needs Information
Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

> 1) I'm not sure that the checks are complete. For example, it doesn't matter
> whether the per-file graph has merges or not, as to how the 'include-merges'
> flag should be handled. Consider the case:
>
> :
> A
> |\
> | B # Mod foo
> |/
> C # Merge B's changes
>
> In that case we want to see both revisions B and C in the "bzr log foo"
> output. Even though the per-file graph in this case looks simply like:
>
> :
> B # Mod foo

I think the code handles this ok. If log --include-merges (or -n0) is given, it uses the old algorithm immediately. Otherwise, it will just show B.

> 2) I'm a bit concerned that we do all of this work with _linear_view_revisions
> which in the common case for OOo will have to walk the *entire* history
> (assuming 'bzr log foo' with no -r specified), which we then throw away.
>
> At least, I'm expecting that once a project like OOo changes to a DVCS, they
> will actually start including merges. Which means that they'll still have 200k
> revisions in the mainline, but then *also* have all sorts of merge revisions
> after that 200k...
>
> I guess, I'm mostly worried that while this makes some bits much faster for
> your OOo testing, it will actually cause regressions in a lot of other cases.
>
> Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this
> code, just to have it end up deciding to return None?

Good points. I ran the benchmark you suggested and it did indeed indicate a problem. I'll push an updated patch.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

This patch speeds up log FILE on flat-ish histories: 29 secs => 1 sec for OOo. As noted in the previous merge proposal, it does this by starting from the per-file graph and looking for the revisions of interest along the mainline, instead of always assuming that the full revision graph is required. The win applies to files whose history is flat or mostly flat between their creation and last edit.

In response to John's feedback on the earlier proposal, this version adds checks to ensure that we don't walk the whole mainline only to come up short. In particular, it uses revision timestamps as a periodic sanity check, and it bails out if the revision graph is clearly dense during the file's edit lifetime. So even as flat histories mutate into dense ones, the win will remain for most files.

Collectively, these new checks maintain the original win while keeping the overhead on dense trees small (IMO): around 10%. For example, 'bzr log NEWS' on Bazaar's trunk goes from 2.7 to 3.0 seconds, while 'bzr log bzrlib/builtins.py' goes from 2.4 to 2.6 seconds. That 0.2-0.3 second increase isn't noticeable in practice, while the reduction from 29 seconds to 1 obviously is.
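The two guards described above can be sketched as follows. This is an illustrative reconstruction with hypothetical names (`walk_with_guards` and its callbacks are invented); only the thresholds of 100 merges and 500 revisions are taken from the patch itself.

```python
def walk_with_guards(mainline, missing, get_timestamp, created_timestamp,
                     has_merges, max_merges=100, check_every=500):
    """Walk the mainline newest-first collecting per-file revisions.

    Returns the revisions found, or None to signal that the caller
    should fall back to the full-graph algorithm.
    """
    result, merges_seen = [], 0
    missing = set(missing)
    for index, rev_id in enumerate(mainline):
        if rev_id in missing:
            result.append(rev_id)
            missing.discard(rev_id)
            if not missing:
                return result  # found everything; stop early
        else:
            # Guard 1: too many merge revisions means a dense graph,
            # where the old algorithm is likely to win.
            if has_merges(rev_id):
                merges_seen += 1
                if merges_seen > max_merges:
                    return None
            # Guard 2: every check_every revisions, compare timestamps
            # so we never walk far past the file's creation time.
            if index and index % check_every == 0:
                if get_timestamp(rev_id) < created_timestamp:
                    return None
    return None if missing else result
```

The timestamp check is only a heuristic (clocks can be skewed across committers), which is presumably why it triggers a fallback rather than an error.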

Revision history for this message
Martin Pool (mbp) wrote :

I think you should test this on some files from MySQL or Launchpad,
which have a longer but still very bushy history.

This looks fairly plausible to me, but I think there should be a unit
test for the fairly nontrivial function you've added. The approach here
of falling back to the old code means that any bugs here may not
actually be exercised by the test suite.

=== modified file 'NEWS'
--- NEWS	2009-06-16 09:05:34 +0000
+++ NEWS	2009-06-17 05:48:37 +0000
@@ -29,6 +29,10 @@
 Improvements
 ************
 
+* ``bzr log FILE`` is now substantially faster on flat-ish histories.
+  On OpenOffice.org for example, logging a typical file now takes
+  a second or so instead of 29 seconds. (Ian Clatworthy)
+
 * Resolving a revno to a revision id on a branch accessed via ``bzr://``
   or ``bzr+ssh://`` is now much faster and involves no VFS operations.
   This speeds up commands like ``bzr pull -r 123``. (Andrew Bennetts)

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py	2009-06-15 06:47:14 +0000
+++ bzrlib/builtins.py	2009-06-16 12:46:50 +0000
@@ -2237,16 +2237,14 @@
             # the underlying repository format is faster at generating
             # deltas or can provide everything we need from the indices.
             # The default algorithm - match-using-deltas - works for
-            # multiple files and directories and is faster for small
-            # amounts of history (200 revisions say). However, it's too
+            # multiple files and directories. However, it's too
             # slow for logging a single file in a repository with deep
             # history, i.e. > 10K revisions. In the spirit of "do no
             # evil when adding features", we continue to use the
             # original algorithm - per-file-graph - for the "single
             # file that isn't a directory without showing a delta" case.
-            partial_history = revision and b.repository._format.supports_chks
             match_using_deltas = (len(file_ids) != 1 or filter_by_dir
-                or delta_type or partial_history)
+                or delta_type)
 
             # Build the LogRequest and execute it
             if len(file_ids) == 0:

=== modified file 'bzrlib/log.py'
--- bzrlib/log.py	2009-06-10 03:56:49 +0000
+++ bzrlib/log.py	2009-06-17 05:35:10 +0000
@@ -70,6 +70,7 @@
     diff,
     errors,
     foreign,
+    graph,
     repository as _mod_repository,
     revision as _mod_revision,
     revisionspec,
@@ -460,21 +461,150 @@
             direction=rqst.get('direction'))
 
     def _log_revision_iterator_using_per_file_graph(self):
-        # Get the base revisions, filtering by the revision range.
-        # Note that we always generate the merge revisions because
-        # filter_revisions_touching_file_id() requires them ...
         rqst = self.rqst
-        view_revisions = _calc_view_revisions(self.branch, self.start_rev_id,
-            self.end_rev_id, rqst.get('direction'), True)
-        if not isinstance(view_revisions, list):
-            view_revisions = list(view_revisions)
-        view_revisions = _filter_revisions_touching_file_id(self.branch,
-            rqst.get('specific_fileids')[0], view_revisio...


review: Needs Information
Revision history for this message
John A Meinel (jameinel) wrote :

What is the status of this submission? How was it impacted by my improvements to "bzr log DIR"?

It seems a bit stale, so I'd like to either mark it as "Work in Progress" or "Rejected" or heck, even "Approved" so long as we get some motion here.

Revision history for this message
Parth Malwankar (parthm) wrote :

I just ran a basic benchmark of this branch on the emacs trunk. 'bzr log FILE' takes ~11s with trunk and ~24s with this branch. Maybe something has changed since this branch is quite old, or perhaps the emacs development model is not flat-ish.
'bzr log' performance is about the same.
Maybe someone familiar with log can comment or mark this as rejected to keep wip queue size down?

[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null 34.97s user 1.76s system 97% cpu 37.860 total
[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null 35.11s user 1.66s system 99% cpu 36.997 total
[emacs-bzr]% time bzr --no-plugins log > /dev/null
bzr --no-plugins log > /dev/null 35.19s user 1.72s system 99% cpu 37.082 total
[emacs-bzr]% time bzr log autogen.sh > /dev/null
bzr log autogen.sh > /dev/null 11.26s user 0.25s system 96% cpu 11.884 total
[emacs-bzr]% time bzr log autogen.sh > /dev/null
bzr log autogen.sh > /dev/null 11.07s user 0.24s system 100% cpu 11.305 total
[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null 23.66s user 0.22s system 99% cpu 24.075 total
[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null 25.39s user 0.28s system 99% cpu 25.756 total
[emacs-bzr]%

Revision history for this message
John A Meinel (jameinel) wrote :


Parth Malwankar wrote:
> I just ran a basic benchmark of this branch on the emacs trunk. 'bzr log FILE' takes ~11s with trunk and ~24s with this branch. Maybe something has changed since this branch is quite old, or perhaps the emacs development model is not flat-ish.
> 'bzr log' performance is about the same.
> Maybe someone familiar with log can comment or mark this as rejected to keep wip queue size down?
>
emacs used to be almost completely flat, and I think Ian's work was
trying to tune for that. However, they now have a few merge revisions,
and this may get tripped up. I seem to remember that it did the work
twice. Once trying to be 'fast' when there are no merges, and then fall
back to a slower path. However, if there are merges it was 2x slower
because it did the work twice.

John
=:->


Revision history for this message
Robert Collins (lifeless) wrote :

So, is this really in progress, or halted()? Perhaps MPs should have an 'idle' state for not-moving, not-rejected. Or something.

Unmerged revisions

4387. By Ian Clatworthy

add NEWS item

4386. By Ian Clatworthy

avoid looking back too far for files created in merge revisions

4385. By Ian Clatworthy

merge bzr.dev r4446

4384. By Ian Clatworthy

faster log file -n0 for flat file history

4383. By Ian Clatworthy

speed up log file on flat-ish histories

4382. By Canonical.com Patch Queue Manager <email address hidden>

(vila) Fix blatant performance regression for annotate in gc repos

4381. By Canonical.com Patch Queue Manager <email address hidden>

(Jelmer) Add registry for the 'bzr serve' protocol.

4380. By Canonical.com Patch Queue Manager <email address hidden>

(igc) two simple log dotted revno tests (Marius Kruger)

4379. By Canonical.com Patch Queue Manager <email address hidden>

(tanner) merge 1.15final back to trunk

4378. By Canonical.com Patch Queue Manager <email address hidden>

(igc) faster branch in a shared repo for dev6rr format (Ian
 Clatworthy)

Preview Diff

=== modified file 'NEWS'
--- NEWS	2009-08-30 23:51:10 +0000
+++ NEWS	2009-08-31 04:37:43 +0000
@@ -771,6 +771,7 @@
 Improvements
 ************
 
+<<<<<<< TREE
 * ``bzr annotate`` can now be significantly faster. The time for
   ``bzr annotate NEWS`` is down to 7s from 22s in 1.16. Files with long
   histories and lots of 'duplicate insertions' will be improved more than
@@ -786,6 +787,12 @@
 * Initial commit performance in ``--2a`` repositories has been improved by
   making it cheaper to build the initial CHKMap. (John Arbash Meinel)
 
+=======
+* ``bzr log FILE`` is now substantially faster on flat-ish histories.
+  On OpenOffice.org for example, logging a typical file now takes
+  a second or so instead of 29 seconds. (Ian Clatworthy)
+
+>>>>>>> MERGE-SOURCE
 * Resolving a revno to a revision id on a branch accessed via ``bzr://``
   or ``bzr+ssh://`` is now much faster and involves no VFS operations.
   This speeds up commands like ``bzr pull -r 123``. (Andrew Bennetts)

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py	2009-08-28 05:00:33 +0000
+++ bzrlib/builtins.py	2009-08-31 04:37:44 +0000
@@ -2332,16 +2332,14 @@
             # the underlying repository format is faster at generating
             # deltas or can provide everything we need from the indices.
             # The default algorithm - match-using-deltas - works for
-            # multiple files and directories and is faster for small
-            # amounts of history (200 revisions say). However, it's too
+            # multiple files and directories. However, it's too
             # slow for logging a single file in a repository with deep
             # history, i.e. > 10K revisions. In the spirit of "do no
             # evil when adding features", we continue to use the
             # original algorithm - per-file-graph - for the "single
             # file that isn't a directory without showing a delta" case.
-            partial_history = revision and b.repository._format.supports_chks
             match_using_deltas = (len(file_ids) != 1 or filter_by_dir
-                or delta_type or partial_history)
+                or delta_type)
 
             # Build the LogRequest and execute it
             if len(file_ids) == 0:

=== modified file 'bzrlib/log.py'
--- bzrlib/log.py	2009-06-10 03:56:49 +0000
+++ bzrlib/log.py	2009-08-31 04:37:44 +0000
@@ -70,6 +70,7 @@
     diff,
     errors,
     foreign,
+    graph,
     repository as _mod_repository,
     revision as _mod_revision,
     revisionspec,
@@ -460,21 +461,150 @@
             direction=rqst.get('direction'))
 
     def _log_revision_iterator_using_per_file_graph(self):
-        # Get the base revisions, filtering by the revision range.
-        # Note that we always generate the merge revisions because
-        # filter_revisions_touching_file_id() requires them ...
         rqst = self.rqst
-        view_revisions = _calc_view_revisions(self.branch, self.start_rev_id,
-            self.end_rev_id, rqst.get('direction'), True)
-        if not isinstance(view_revisions, list):
-            view_revisions = list(view_revisions)
-        view_revisions = _filter_revisions_touching_file_id(self.branch,
-            rqst.get('specific_fileids')[0], view_revisions,
-            include_merges=rqst.get('levels') != 1)
+        direction = rqst.get('direction')
+        file_id = rqst.get('specific_fileids')[0]
+        multi_level = rqst.get('levels') != 1
+        try:
+            file_graph, graph_tip = _per_file_graph(self.branch, file_id,
+                self.end_rev_id)
+        except errors.NoSuchId:
+            # File doesn't exist at end of range - fall back to old algorithm
+            view_revisions = None
+        else:
+            # Try iterating over the revisions given by the per-file graph.
+            # This returns None if it fails.
+            view_revisions = _calc_view_revisions_for_file(self.branch,
+                file_graph, graph_tip, self.start_rev_id, self.end_rev_id,
+                direction, multi_level)
+
+        if view_revisions is None:
+            # Get the base revisions, filtering by the revision range.
+            # Note that we always generate the merge revisions because
+            # filter_revisions_touching_file_id() requires them ...
+            view_revisions = _calc_view_revisions(self.branch,
+                self.start_rev_id, self.end_rev_id, direction, True)
+            if not isinstance(view_revisions, list):
+                view_revisions = list(view_revisions)
+            # TODO: pass in the already calculated file graph and re-use it
+            view_revisions = _filter_revisions_touching_file_id(self.branch,
+                file_id, view_revisions, include_merges=multi_level)
         return make_log_rev_iterator(self.branch, view_revisions,
             rqst.get('delta_type'), rqst.get('message_search'))
 
 
+def _per_file_graph(branch, file_id, end_rev_id):
+    """Get the per file graph.
+
+    :param end_rev_id: the last interesting revision-id or None to use
+        the basis tree. If non-None, the file must exist in that revision
+        or NoSuchId will be raised.
+    :return: graph, tip where
+        graph is a Graph with (file_id,rev_id) tuple keys and
+        tip is the graph tip
+    """
+    # Find when the file was last modified
+    if end_rev_id is None:
+        rev_tree = branch.basis_tree()
+    else:
+        rev_tree = branch.repository.revision_tree(end_rev_id)
+    last_modified = rev_tree.inventory[file_id].revision
+
+    # Return the result
+    tip = (file_id, last_modified)
+    return graph.Graph(branch.repository.texts), tip
+
+
+def _calc_view_revisions_for_file(branch, file_graph, graph_tip, start_rev_id,
+    end_rev_id, direction, include_merges):
+    """Calculate the revisions to view for a file.
+
+    :param file_graph: the per-file graph
+    :param graph_tip: the tip of the per-file graph
+    :param include_merges: if True, include all revisions, not just the top
+        level
+    :return: A list of (revision_id, dotted_revno, merge_depth) tuples OR
+        None if the algorithm fails (and another one should be used).
+    """
+    br_revno, br_rev_id = branch.last_revision_info()
+    if br_revno == 0:
+        return []
+
+    # Find the revisions where the file was changed and merged
+    file_rev_ids = []
+    file_merges = []
+    for (_, rev_id), parents in file_graph.iter_ancestry([graph_tip]):
+        file_rev_ids.append(rev_id)
+        if len(parents) > 1:
+            file_merges.append(rev_id)
+
+    # Handle the simple cases
+    if len(file_rev_ids) == 1:
+        return _generate_one_revision(branch, file_rev_ids[0], br_rev_id,
+            br_revno)
+    elif len(file_rev_ids) == 0:
+        # Should this ever happen?
+        return []
+    elif file_merges and include_merges:
+        # Fall back to the old algorithm for now
+        return None
+
+    # Find all the revisions we can using a linear search
+    result = []
+    missing = set(file_rev_ids)
+    merges_to_search = 0
+    created_timestamp = None
+    try:
+        candidates = _linear_view_revisions(branch, start_rev_id, end_rev_id)
+        for index, (rev_id, revno, depth) in enumerate(candidates):
+            if rev_id in missing:
+                result.append((rev_id, revno, depth))
+                missing.remove(rev_id)
+                if len(missing) == 0:
+                    break
+            else:
+                if _has_merges(branch, rev_id):
+                    merges_to_search += 1
+                    # If this is a dense tree, this optimisation is unlikely
+                    # to result in a net win - fall back to old algorithm.
+                    if merges_to_search > 100:
+                        return None
+                # Check the timestamp to avoid going back too far on the
+                # mainline for files created in merge revisions. We don't
+                # do this every revision, just regularly, to minimise the
+                # number of revisions that we load at this point.
+                if index and index % 500 == 0:
+                    if created_timestamp is None:
+                        created_rev = branch.repository.get_revision(
+                            file_rev_ids[-1])
+                        created_timestamp = created_rev.timestamp
+                    rev = branch.repository.get_revision(rev_id)
+                    if created_timestamp > rev.timestamp:
+                        return None
+
+    except _StartNotLinearAncestor:
+        raise errors.BzrCommandError('Start revision not found in'
+            ' left-hand history of end revision.')
+
+    # If no merges were found in the revision range, then we can be
+    # certain that we've found all the revisions we care about.
+    if missing and merges_to_search:
+        # TODO: search the deltas of the merges, splicing successful
+        # matches into their rightful spots. That should work well on
+        # chk repositories for typical histories but we need to benchmark
+        # it to confirm. There's most likely a sweet spot above which
+        # the O(history) traditional way - generating the full graph of
+        # history and post-filtering - remains the best performer.
+        trace.mutter("log file fastpath failed to find %d revisions" %
+            len(missing))
+        return None
+
+    # We came, we saw, we walked away victorious ...
+    if direction == 'forward':
+        result = reversed(result)
+    return result
+
+
 def _calc_view_revisions(branch, start_rev_id, end_rev_id, direction,
     generate_merge_revisions, delayed_graph_generation=False):
     """Calculate the revisions to view.