Merge lp:~ian-clatworthy/bzr/faster-log-file into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Superseded
Proposed branch: lp:~ian-clatworthy/bzr/faster-log-file
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 180 lines (has conflicts)
Text conflict in bzrlib/log.py
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-log-file
Reviewer Review Type Date Requested Status
John A Meinel Needs Information
Review via email: mp+6805@code.launchpad.net

This proposal has been superseded by a proposal from 2009-06-17.

Ian Clatworthy (ian-clatworthy) wrote:

This patch speeds up 'bzr log FILE' on flat-ish histories, as commonly found after an import from svn, cvs and other centralized VCS tools. On OOo, it drops the time to log a typical file from 29 seconds to 1.5 seconds.

The key to this improvement is starting with the per-file graph and searching the mainline until the revisions of interest are found. That works very well when the history of a project is flat or mostly flat, because it avoids the 27 seconds required to calculate the full revision graph. In a nutshell, the algorithm changes from O(full-history) to O(file-life).
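
The idea can be sketched as follows. This is a minimal model with illustrative names (plain dicts and lists, not the actual bzrlib API): walk the small per-file graph first, then scan the mainline only until every revision that touched the file has been found.

```python
# Minimal sketch of the fast path; names are illustrative, not the
# real bzrlib API. The per-file graph is O(file-life), so walking it
# is cheap; the mainline scan can stop early once every revision that
# touched the file has been matched.

def fast_log_file(per_file_parents, file_tip, mainline):
    # Collect every revision in the file's own history.
    wanted = set()
    todo = [file_tip]
    while todo:
        rev = todo.pop()
        if rev not in wanted:
            wanted.add(rev)
            todo.extend(per_file_parents.get(rev, ()))
    # Scan the mainline (newest first), stopping as soon as all of the
    # file's revisions have been found.
    result = []
    for rev in mainline:
        if rev in wanted:
            result.append(rev)
            wanted.discard(rev)
            if not wanted:
                break
    return result
```

Note that when the file was last touched early in a long history, the mainline scan still walks most of it, which is the concern John raises below.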

There's certainly room for further smarts here, but I feel this is a useful step forward as it stands.

John A Meinel (jameinel) wrote:

So I see that you avoided "incorrect" results on non-linear ancestries by including checks for this case. However:

1) I'm not sure that the checks are complete. For example, whether the per-file graph has merges or not is irrelevant to how the 'include-merges' flag should be handled. Consider the case:

  :
  A
  |\
  | B # Mod foo
  |/
  C # Merge B's changes

In that case we want to see both revisions B and C in the "bzr log foo" output. Even though the per-file graph in this case looks simply like:

 :
 B # Mod foo
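
John's point can be made concrete with a toy model (plain dicts standing in for the real graph objects; names are illustrative):

```python
# Toy model of the example above: foo is modified only in B, and C
# merges B into the mainline. The per-file graph records just B, so a
# merge-aware 'bzr log foo' must also consult the revision graph to
# surface C.

full_graph = {'C': ['A', 'B'], 'B': ['A'], 'A': []}  # rev -> parents
file_revs = {'B'}                                    # foo's per-file graph

def merges_bringing_in(graph, file_revs):
    """Merge revisions whose non-first parents reach a file revision."""
    hits = set()
    for rev, parents in graph.items():
        for merged in parents[1:]:        # merged-in lines only
            todo = [merged]
            while todo:
                r = todo.pop()
                if r in file_revs:
                    hits.add(rev)
                    break
                todo.extend(graph.get(r, ()))
    return hits

# The per-file graph alone yields only B; adding the merges that
# brought B in also yields C.
print(sorted(file_revs | merges_bringing_in(full_graph, file_revs)))
# prints ['B', 'C']
```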

2) I'm a bit concerned that we do all of this work with _linear_view_revisions which in the common case for OOo will have to walk the *entire* history (assuming 'bzr log foo' with no -r specified), which we then throw away.

At least, I'm expecting that once a project like OOo changes to a DVCS, they will actually start including merges. Which means that they'll still have 200k revisions in the mainline, but then *also* have all sorts of merge revisions after that 200k...

I guess I'm mostly worried that while this makes some bits much faster for your OOo testing, it will actually cause regressions in a lot of other cases.

Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this code, just to have it end up deciding to return None?

review: Needs Information
4385. By Ian Clatworthy

merge bzr.dev r4446

4386. By Ian Clatworthy

avoid looking back too far for files created in merge revisions

Ian Clatworthy (ian-clatworthy) wrote:

> 1) I'm not sure that the checks are complete. For example, whether the
> per-file graph has merges or not is irrelevant to how the 'include-merges'
> flag should be handled. Consider the case:
>
> :
> A
> |\
> | B # Mod foo
> |/
> C # Merge B's changes
>
> In that case we want to see both revisions B and C in the "bzr log foo"
> output. Even though the per-file graph in this case looks simply like:
>
> :
> B # Mod foo

I think the code handles this ok. If log --include-merges (or -n0) is given, it uses the old algorithm immediately. Otherwise, it will just show B.
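
The dispatch Ian describes can be sketched roughly like this (illustrative function names, not the real bzrlib code):

```python
# Rough sketch of the dispatch described above; function names are
# illustrative. levels == 1 means mainline only; 0 (i.e. -n0 /
# --include-merges) or > 1 requests merge revisions too.

def choose_revisions(levels, fast_path, slow_path):
    if levels != 1:
        return slow_path()       # merge levels wanted: old algorithm
    revisions = fast_path()      # per-file fast path; None means it gave up
    if revisions is None:
        return slow_path()
    return revisions

# With John's example graph: the fast path sees only B, while the old
# algorithm also finds the merge revision C.
assert choose_revisions(1, lambda: ['B'], lambda: ['C', 'B']) == ['B']
assert choose_revisions(0, lambda: ['B'], lambda: ['C', 'B']) == ['C', 'B']
```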

> 2) I'm a bit concerned that we do all of this work with _linear_view_revisions
> which in the common case for OOo will have to walk the *entire* history
> (assuming 'bzr log foo' with no -r specified), which we then throw away.
>
> At least, I'm expecting that once a project like OOo changes to a DVCS, they
> will actually start including merges. Which means that they'll still have 200k
> revisions in the mainline, but then *also* have all sorts of merge revisions
> after that 200k...
>
> I guess I'm mostly worried that while this makes some bits much faster for
> your OOo testing, it will actually cause regressions in a lot of other cases.
>
> Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this
> code, just to have it end up deciding to return None?

Good points. I ran the benchmark you suggested and it did indeed indicate a problem. I'll push an updated patch.

4387. By Ian Clatworthy

add NEWS item

Unmerged revisions

4387. By Ian Clatworthy

add NEWS item

4386. By Ian Clatworthy

avoid looking back too far for files created in merge revisions

4385. By Ian Clatworthy

merge bzr.dev r4446

4384. By Ian Clatworthy

faster log file -n0 for flat file history

4383. By Ian Clatworthy

speed up log file on flat-ish histories

4382. By Canonical.com Patch Queue Manager <email address hidden>

(vila) Fix blatant performance regression for annotate in gc repos

4381. By Canonical.com Patch Queue Manager <email address hidden>

(Jelmer) Add registry for the 'bzr serve' protocol.

4380. By Canonical.com Patch Queue Manager <email address hidden>

(igc) two simple log dotted revno tests (Marius Kruger)

4379. By Canonical.com Patch Queue Manager <email address hidden>

(tanner) merge 1.15final back to trunk

4378. By Canonical.com Patch Queue Manager <email address hidden>

(igc) faster branch in a shared repo for dev6rr format (Ian
 Clatworthy)

Preview Diff

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py 2009-06-11 06:54:33 +0000
+++ bzrlib/builtins.py 2009-06-16 02:37:12 +0000
@@ -2230,16 +2230,14 @@
 # the underlying repository format is faster at generating
 # deltas or can provide everything we need from the indices.
 # The default algorithm - match-using-deltas - works for
- # multiple files and directories and is faster for small
- # amounts of history (200 revisions say). However, it's too
+ # multiple files and directories. However, it's too
 # slow for logging a single file in a repository with deep
 # history, i.e. > 10K revisions. In the spirit of "do no
 # evil when adding features", we continue to use the
 # original algorithm - per-file-graph - for the "single
 # file that isn't a directory without showing a delta" case.
- partial_history = revision and b.repository._format.supports_chks
 match_using_deltas = (len(file_ids) != 1 or filter_by_dir
- or delta_type or partial_history)
+ or delta_type)

 # Build the LogRequest and execute it
 if len(file_ids) == 0:

=== modified file 'bzrlib/log.py'
--- bzrlib/log.py 2009-06-10 03:56:49 +0000
+++ bzrlib/log.py 2009-06-16 02:37:12 +0000
@@ -69,7 +69,11 @@
 config,
 diff,
 errors,
+<<<<<<< TREE
 foreign,
+=======
+ graph,
+>>>>>>> MERGE-SOURCE
 repository as _mod_repository,
 revision as _mod_revision,
 revisionspec,
@@ -460,21 +464,131 @@
 direction=rqst.get('direction'))

 def _log_revision_iterator_using_per_file_graph(self):
- # Get the base revisions, filtering by the revision range.
- # Note that we always generate the merge revisions because
- # filter_revisions_touching_file_id() requires them ...
 rqst = self.rqst
- view_revisions = _calc_view_revisions(self.branch, self.start_rev_id,
- self.end_rev_id, rqst.get('direction'), True)
- if not isinstance(view_revisions, list):
- view_revisions = list(view_revisions)
- view_revisions = _filter_revisions_touching_file_id(self.branch,
- rqst.get('specific_fileids')[0], view_revisions,
- include_merges=rqst.get('levels') != 1)
+ direction = rqst.get('direction')
+ file_id = rqst.get('specific_fileids')[0]
+ multi_level = rqst.get('levels') != 1
+ try:
+ file_graph, graph_tip = _per_file_graph(self.branch, file_id,
+ self.end_rev_id)
+ except errors.NoSuchId:
+ # File doesn't exist at end of range - fall back to old algorithm
+ view_revisions = None
+ else:
+ # Try iterating over the revisions given by the per-file graph.
+ # This returns None if it fails.
+ view_revisions = _calc_view_revisions_for_file(self.branch,
+ file_graph, graph_tip, self.start_rev_id, self.end_rev_id,
+ direction, multi_level)
+
+ if view_revisions is None:
+ # Get the base revisions, filtering by the revision range.
+ # Note that we always generate the merge revisions because
+ # filter_revisions_touching_file_id() requires them ...
+ view_revisions = _calc_view_revisions(self.branch,
+ self.start_rev_id, self.end_rev_id, direction, True)
+ if not isinstance(view_revisions, list):
+ view_revisions = list(view_revisions)
+ # TODO: pass in the already calculated file graph and re-use it
+ view_revisions = _filter_revisions_touching_file_id(self.branch,
+ file_id, view_revisions, include_merges=multi_level)
 return make_log_rev_iterator(self.branch, view_revisions,
 rqst.get('delta_type'), rqst.get('message_search'))


+def _per_file_graph(branch, file_id, end_rev_id):
+ """Get the per file graph.
+
+ :param end_rev_id: the last interesting revision-id or None to use
+ the basis tree. If non-None, the file must exist in that revision
+ or NoSuchId will be raised.
+ :return: graph, tip where
+ graph is a Graph with (file_id,rev_id) tuple keys and
+ tip is the graph tip
+ """
+ # Find when the file was last modified
+ if end_rev_id is None:
+ rev_tree = branch.basis_tree()
+ else:
+ rev_tree = branch.repository.revision_tree(end_rev_id)
+ last_modified = rev_tree.inventory[file_id].revision
+
+ # Return the result
+ tip = (file_id, last_modified)
+ return graph.Graph(branch.repository.texts), tip
+
+
+def _calc_view_revisions_for_file(branch, file_graph, graph_tip, start_rev_id,
+ end_rev_id, direction, include_merges):
+ """Calculate the revisions to view for a file.
+
+ :param file_graph: the per-file graph
+ :param graph_tip: the tip of the per-file graph
+ :param include_merges: if True, include all revisions, not just the top
+ level
+ :return: A list of (revision_id, dotted_revno, merge_depth) tuples OR
+ None if the algorithm fails (and another one should be used).
+ """
+ br_revno, br_rev_id = branch.last_revision_info()
+ if br_revno == 0:
+ return []
+
+ # Find when the file was changed and merged
+ file_rev_ids = []
+ file_merges = []
+ for (_, rev_id), parents in file_graph.iter_ancestry([graph_tip]):
+ file_rev_ids.append(rev_id)
+ if len(parents) > 1:
+ file_merges.append(rev_id)
+
+ # Handle the simple cases
+ if len(file_rev_ids) == 1:
+ return _generate_one_revision(branch, file_rev_ids[0], br_rev_id,
+ br_revno)
+ elif len(file_rev_ids) == 0:
+ # Should this ever happen?
+ return []
+ elif file_merges and include_merges:
+ # Fall back to the old algorithm for now
+ return None
+
+ # Find all the revisions we can using a linear search
+ result = []
+ missing = set(file_rev_ids)
+ merges_to_search = 0
+ try:
+ candidates = _linear_view_revisions(branch, start_rev_id, end_rev_id)
+ for rev_id, revno, depth in candidates:
+ if rev_id in missing:
+ result.append((rev_id, revno, depth))
+ missing.remove(rev_id)
+ if len(missing) == 0:
+ break
+ if _has_merges(branch, rev_id):
+ merges_to_search += 1
+ except _StartNotLinearAncestor:
+ raise errors.BzrCommandError('Start revision not found in'
+ ' left-hand history of end revision.')
+
+ # If no merges were found in the revision range, then we can be
+ # certain that we've found all the revisions we care about.
+ if missing and merges_to_search:
+ # TODO: search the deltas of the merges, splicing successful
+ # matches into their rightful spots. That should work well on
+ # chk repositories for typical histories but we need to benchmark
+ # it to confirm. There's most likely a sweet spot above which
+ # the O(history) traditional way - generating the full graph of
+ # history and post-filtering - remains the best performer.
+ trace.mutter("log file fastpath failed to find %d revisions" %
+ len(missing))
+ return None
+
+ # We came, we saw, we walked away victorious ...
+ if direction == 'forward':
+ result = reversed(result)
+ return result
+
+
 def _calc_view_revisions(branch, start_rev_id, end_rev_id, direction,
 generate_merge_revisions, delayed_graph_generation=False):
 """Calculate the revisions to view.