Merge lp:~ian-clatworthy/bzr/faster-log-file into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Work in progress
Proposed branch: lp:~ian-clatworthy/bzr/faster-log-file
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 220 lines (has conflicts)
Text conflict in NEWS
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-log-file
Reviewer: Martin Pool
Status: Needs Information
Review via email: mp+7535@code.launchpad.net

This proposal supersedes a proposal from 2009-05-27.

Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

This patch speeds up 'bzr log FILE' on flat-ish histories, as commonly found after an import from svn, cvs and other central VCS tools. On OOo, it drops the time taken to log a typical file from 29 seconds to 1.5 seconds.

The key to this improvement is starting with the per-file graph and searching the mainline until the revisions of interest are found. That works very well when the history of a project is flat or mostly flat, because it avoids the 27 seconds required to calculate the full revision graph. In a nutshell, the algorithm changes from O(full-history) to O(file-life).
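
In rough pseudocode, the idea is something like this (an illustrative sketch only, not the patch's code; it assumes the per-file revision ids and a newest-first mainline iterator are already available):

def file_revisions_on_mainline(mainline_newest_first, per_file_rev_ids):
    """Collect the mainline revisions that touched the file.

    The scan stops as soon as every revision in the per-file graph has
    been seen, so on a flat history the cost is bounded by the file's
    life rather than by the size of the whole revision graph.
    """
    wanted = set(per_file_rev_ids)
    hits = []
    for rev_id in mainline_newest_first:
        if rev_id in wanted:
            hits.append(rev_id)
            wanted.discard(rev_id)
            if not wanted:
                break  # found them all; no need to walk further back
    return hits

On a completely flat history every per-file revision sits on the mainline, so the loop stops at the file's oldest revision instead of walking the project's entire history.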

There's certainly room for further smarts here, but I feel this is a useful step forward as it stands.

John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

So I see that you avoided "incorrect" results on non-linear ancestries by including checks for this case. However:

1) I'm not sure that the checks are complete. For example, it doesn't matter whether the per-file graph has merges or not, as to how the 'include-merges' flag should be handled. Consider the case:

  :
  A
  |\
  | B # Mod foo
  |/
  C # Merge B's changes

In that case we want to see both revisions B and C in the "bzr log foo" output. Even though the per-file graph in this case looks simply like:

 :
 B # Mod foo
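
A toy version of this, with plain dicts standing in for bzrlib's graph objects, shows why scanning the mainline for per-file graph members alone cannot surface C:

# Toy data for the history sketched above: C merges B, and only B
# modified 'foo'.
revision_parents = {'A': [], 'B': ['A'], 'C': ['A', 'B']}
mainline = ['C', 'A']            # left-hand (top-level) history, newest first
foo_file_revisions = {'B'}       # the per-file graph for 'foo'

# Filtering the mainline by per-file graph membership alone finds nothing:
assert [r for r in mainline if r in foo_file_revisions] == []

# A merge-aware filter must credit C, whose merged (non left-hand) parents
# reach a revision that touched foo. (Real code would walk the whole
# merged ancestry, not just the direct parents.)
def touched_via_merge(rev_id):
    merged_parents = set(revision_parents[rev_id][1:])
    return bool(merged_parents & foo_file_revisions)

assert [r for r in mainline
        if r in foo_file_revisions or touched_via_merge(r)] == ['C']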

2) I'm a bit concerned that we do all of this work with _linear_view_revisions which in the common case for OOo will have to walk the *entire* history (assuming 'bzr log foo' with no -r specified), which we then throw away.

At least, I'm expecting that once a project like OOo changes to a DVCS, they will actually start including merges. Which means that they'll still have 200k revisions in the mainline, but then *also* have all sorts of merge revisions after that 200k...

I guess, I'm mostly worried that while this makes some bits much faster for your OOo testing, it will actually cause regressions in a lot of other cases.

Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this code, just to have it end up deciding to return None?

review: Needs Information
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

> 1) I'm not sure that the checks are complete. For example, it doesn't matter
> whether the per-file graph has merges or not, as to how the 'include-merges'
> flag should be handled. Consider the case:
>
> :
> A
> |\
> | B # Mod foo
> |/
> C # Merge B's changes
>
> In that case we want to see both revisions B and C in the "bzr log foo"
> output. Even though the per-file graph in this case looks simply like:
>
> :
> B # Mod foo

I think the code handles this ok. If log --include-merges (or -n0) is given, it uses the old algorithm immediately. Otherwise, it will just show B.

> 2) I'm a bit concerned that we do all of this work with _linear_view_revisions
> which in the common case for OOo will have to walk the *entire* history
> (assuming 'bzr log foo' with no -r specified), which we then throw away.
>
> At least, I'm expecting that once a project like OOo changes to a DVCS, they
> will actually start including merges. Which means that they'll still have 200k
> revisions in the mainline, but then *also* have all sorts of merge revisions
> after that 200k...
>
> I guess, I'm mostly worried that while this makes some bits much faster for
> your OOo testing, it will actually cause regressions in a lot of other cases.
>
> Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this
> code, just to have it end up deciding to return None?

Good points. I ran the benchmark you suggested and it did indeed indicate a problem. I'll push an updated patch.

Ian Clatworthy (ian-clatworthy) wrote :

This patch speeds up log FILE on flat-ish histories: 29 secs => 1 sec for OOo. As noted in the previous merge proposal, it does this by starting from the per-file graph and looking for the revisions of interest along the mainline, instead of always assuming that the full revision graph is required. This win is achieved for files whose history is flat or mostly flat between their creation and their last edit.

In response to John's feedback on the earlier proposal, this version now has some additional checks to ensure that we don't walk the whole mainline only to come up short. In particular, it periodically uses revision timestamps as a sanity check. It also bails out if the revision history is clearly dense during the file's edit lifetime. So even as flat histories mutate into dense histories, the win will remain for most files.

Collectively, these new checks maintain the original win while keeping the overhead on dense histories acceptably small (IMO): roughly 10%. For example, 'bzr log NEWS' on Bazaar's trunk goes from 2.7 to 3.0 seconds, while 'bzr log bzrlib/builtins.py' goes from 2.4 to 2.6 seconds. That 0.2-0.3 second increase isn't noticeable in practice, while the reduction from 29 to 1 second obviously is.
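
Condensed, the two bail-out heuristics look roughly like this (the real implementation is _calc_view_revisions_for_file in the diff below; the 100-merge and 500-revision thresholds come from the patch, while the callables here are illustrative stand-ins for the bzrlib calls):

def scan_mainline(candidates, wanted_rev_ids, has_merges, get_timestamp,
                  created_rev_id, merge_limit=100, check_every=500):
    """Return matching (rev_id, revno, depth) tuples, newest first, or
    None if the fast path looks unprofitable and the caller should fall
    back to the full-graph algorithm.
    """
    result = []
    missing = set(wanted_rev_ids)
    merges_seen = 0
    created_timestamp = None
    for index, (rev_id, revno, depth) in enumerate(candidates):
        if rev_id in missing:
            result.append((rev_id, revno, depth))
            missing.discard(rev_id)
            if not missing:
                break
        else:
            if has_merges(rev_id):
                merges_seen += 1
                if merges_seen > merge_limit:
                    return None  # dense history: the fast path won't pay off
            if index and index % check_every == 0:
                # Periodic timestamp sanity check: if we have walked back
                # past the revision that created the file, the file most
                # likely originated in a merge and won't be found on the
                # mainline at all.
                if created_timestamp is None:
                    created_timestamp = get_timestamp(created_rev_id)
                if created_timestamp > get_timestamp(rev_id):
                    return None
    return result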

Martin Pool (mbp) wrote :

I think you should test this on some files from MySQL or Launchpad,
which, I expect, have a longer but still very bushy history.

This looks fairly plausible to me, but I think there should be a unit
test for the fairly nontrivial function you've added. The approach here
of falling back to the old code means that any bugs here may not
actually be exercised by the test suite.
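
A minimal test along those lines might look something like this (only a sketch: it assumes the helper names and return shape from this branch and has not been run against it):

from bzrlib import log
from bzrlib.tests import TestCaseWithTransport


class TestCalcViewRevisionsForFile(TestCaseWithTransport):

    def test_linear_file_history(self):
        # Two commits touching 'foo' on an otherwise flat mainline.
        tree = self.make_branch_and_tree('.')
        self.build_tree_contents([('foo', 'one\n')])
        tree.add(['foo'], ['foo-id'])
        tree.commit('add foo', rev_id='rev-1')
        self.build_tree_contents([('foo', 'two\n')])
        tree.commit('change foo', rev_id='rev-2')
        b = tree.branch
        b.lock_read()
        self.addCleanup(b.unlock)
        file_graph, tip = log._per_file_graph(b, 'foo-id', None)
        revs = log._calc_view_revisions_for_file(
            b, file_graph, tip, None, None, 'reverse', False)
        self.assertEqual(['rev-2', 'rev-1'],
                         [rev_id for rev_id, revno, depth in revs])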

=== modified file 'NEWS'
--- NEWS 2009-06-16 09:05:34 +0000
+++ NEWS 2009-06-17 05:48:37 +0000
@@ -29,6 +29,10 @@
 Improvements
 ************
 
+* ``bzr log FILE`` is now substantially faster on flat-ish histories.
+  On OpenOffice.org for example, logging a typical file now takes
+  a second or so instead of 29 seconds. (Ian Clatworthy)
+
 * Resolving a revno to a revision id on a branch accessed via ``bzr://``
   or ``bzr+ssh://`` is now much faster and involves no VFS operations.
   This speeds up commands like ``bzr pull -r 123``. (Andrew Bennetts)

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py 2009-06-15 06:47:14 +0000
+++ bzrlib/builtins.py 2009-06-16 12:46:50 +0000
@@ -2237,16 +2237,14 @@
             # the underlying repository format is faster at generating
             # deltas or can provide everything we need from the indices.
             # The default algorithm - match-using-deltas - works for
-            # multiple files and directories and is faster for small
-            # amounts of history (200 revisions say). However, it's too
+            # multiple files and directories. However, it's too
             # slow for logging a single file in a repository with deep
             # history, i.e. > 10K revisions. In the spirit of "do no
             # evil when adding features", we continue to use the
             # original algorithm - per-file-graph - for the "single
             # file that isn't a directory without showing a delta" case.
-            partial_history = revision and b.repository._format.supports_chks
             match_using_deltas = (len(file_ids) != 1 or filter_by_dir
-                or delta_type or partial_history)
+                or delta_type)
 
             # Build the LogRequest and execute it
             if len(file_ids) == 0:

=== modified file 'bzrlib/log.py'
--- bzrlib/log.py 2009-06-10 03:56:49 +0000
+++ bzrlib/log.py 2009-06-17 05:35:10 +0000
@@ -70,6 +70,7 @@
     diff,
     errors,
     foreign,
+    graph,
     repository as _mod_repository,
     revision as _mod_revision,
     revisionspec,
@@ -460,21 +461,150 @@
             direction=rqst.get('direction'))
 
     def _log_revision_iterator_using_per_file_graph(self):
-        # Get the base revisions, filtering by the revision range.
-        # Note that we always generate the merge revisions because
-        # filter_revisions_touching_file_id() requires them ...
         rqst = self.rqst
-        view_revisions = _calc_view_revisions(self.branch, self.start_rev_id,
-            self.end_rev_id, rqst.get('direction'), True)
-        if not isinstance(view_revisions, list):
-            view_revisions = list(view_revisions)
-        view_revisions = _filter_revisions_touching_file_id(self.branch,
-            rqst.get('specific_fileids')[0], view_revisio...


review: Needs Information
John A Meinel (jameinel) wrote :

What is the status of this submission? How was it impacted by my improvements to "bzr log DIR"?

It seems a bit stale, so I'd like to either mark it as "Work in Progress" or "Rejected" or heck, even "Approved" so long as we get some motion here.

Parth Malwankar (parthm) wrote :

I just ran a basic benchmark for this branch on the emacs trunk. 'bzr log FILE' takes ~11s with trunk and ~24s with this branch. Maybe something has changed, as this branch is quite old, or perhaps the emacs development model is not flat-ish.
'bzr log' performance is about the same.
Maybe someone familiar with log can comment, or mark this as rejected to keep the WIP queue size down?

[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null 34.97s user 1.76s system 97% cpu 37.860 total
[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null 35.11s user 1.66s system 99% cpu 36.997 total
[emacs-bzr]% time bzr --no-plugins log > /dev/null
bzr --no-plugins log > /dev/null 35.19s user 1.72s system 99% cpu 37.082 total
[emacs-bzr]% time bzr log autogen.sh > /dev/null
bzr log autogen.sh > /dev/null 11.26s user 0.25s system 96% cpu 11.884 total
[emacs-bzr]% time bzr log autogen.sh > /dev/null
bzr log autogen.sh > /dev/null 11.07s user 0.24s system 100% cpu 11.305 total
[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null 23.66s user 0.22s system 99% cpu 24.075 total
[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null 25.39s user 0.28s system 99% cpu 25.756 total
[emacs-bzr]%

John A Meinel (jameinel) wrote :

Parth Malwankar wrote:
> I just ran a basic benchmark for this branch on the emacs trunk. 'bzr log FILE' takes ~11s with trunk and ~24s with this branch. Maybe something has changed, as this branch is quite old, or perhaps the emacs development model is not flat-ish.
> 'bzr log' performance is about the same.
> Maybe someone familiar with log can comment, or mark this as rejected to keep the WIP queue size down?
>
emacs used to be almost completely flat, and I think Ian's work was
trying to tune for that. However, they now have a few merge revisions,
and this may get tripped up. I seem to remember that it did the work
twice: once trying to be 'fast' assuming there are no merges, and then
falling back to a slower path. So if there are merges it was 2x slower
because it did the work twice.

John
=:->


Robert Collins (lifeless) wrote :

So, is this really in progress, or halted()? Perhaps MPs should have an 'idle' state for not-moving, not-rejected. Or something.

Unmerged revisions

4387. By Ian Clatworthy

add NEWS item

4386. By Ian Clatworthy

avoid looking back too far for files created in merge revisions

4385. By Ian Clatworthy

merge bzr.dev r4446

4384. By Ian Clatworthy

faster log file -n0 for flat file history

4383. By Ian Clatworthy

speed up log file on flat-ish histories

4382. By Canonical.com Patch Queue Manager <email address hidden>

(vila) Fix blatant performance regression for annotate in gc repos

4381. By Canonical.com Patch Queue Manager <email address hidden>

(Jelmer) Add registry for the 'bzr serve' protocol.

4380. By Canonical.com Patch Queue Manager <email address hidden>

(igc) two simple log dotted revno tests (Marius Kruger)

4379. By Canonical.com Patch Queue Manager <email address hidden>

(tanner) merge 1.15final back to trunk

4378. By Canonical.com Patch Queue Manager <email address hidden>

(igc) faster branch in a shared repo for dev6rr format (Ian Clatworthy)

Preview Diff

=== modified file 'NEWS'
--- NEWS 2009-08-30 23:51:10 +0000
+++ NEWS 2009-08-31 04:37:43 +0000
@@ -771,6 +771,7 @@
 Improvements
 ************
 
+<<<<<<< TREE
 * ``bzr annotate`` can now be significantly faster. The time for
   ``bzr annotate NEWS`` is down to 7s from 22s in 1.16. Files with long
   histories and lots of 'duplicate insertions' will be improved more than
@@ -786,6 +787,12 @@
 * Initial commit performance in ``--2a`` repositories has been improved by
   making it cheaper to build the initial CHKMap. (John Arbash Meinel)
 
+=======
+* ``bzr log FILE`` is now substantially faster on flat-ish histories.
+  On OpenOffice.org for example, logging a typical file now takes
+  a second or so instead of 29 seconds. (Ian Clatworthy)
+
+>>>>>>> MERGE-SOURCE
 * Resolving a revno to a revision id on a branch accessed via ``bzr://``
   or ``bzr+ssh://`` is now much faster and involves no VFS operations.
   This speeds up commands like ``bzr pull -r 123``. (Andrew Bennetts)
 

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py 2009-08-28 05:00:33 +0000
+++ bzrlib/builtins.py 2009-08-31 04:37:44 +0000
@@ -2332,16 +2332,14 @@
             # the underlying repository format is faster at generating
             # deltas or can provide everything we need from the indices.
             # The default algorithm - match-using-deltas - works for
-            # multiple files and directories and is faster for small
-            # amounts of history (200 revisions say). However, it's too
+            # multiple files and directories. However, it's too
             # slow for logging a single file in a repository with deep
             # history, i.e. > 10K revisions. In the spirit of "do no
             # evil when adding features", we continue to use the
             # original algorithm - per-file-graph - for the "single
             # file that isn't a directory without showing a delta" case.
-            partial_history = revision and b.repository._format.supports_chks
             match_using_deltas = (len(file_ids) != 1 or filter_by_dir
-                or delta_type or partial_history)
+                or delta_type)
 
             # Build the LogRequest and execute it
             if len(file_ids) == 0:
 

=== modified file 'bzrlib/log.py'
--- bzrlib/log.py 2009-06-10 03:56:49 +0000
+++ bzrlib/log.py 2009-08-31 04:37:44 +0000
@@ -70,6 +70,7 @@
     diff,
     errors,
     foreign,
+    graph,
     repository as _mod_repository,
     revision as _mod_revision,
     revisionspec,
@@ -460,21 +461,150 @@
             direction=rqst.get('direction'))
 
     def _log_revision_iterator_using_per_file_graph(self):
-        # Get the base revisions, filtering by the revision range.
-        # Note that we always generate the merge revisions because
-        # filter_revisions_touching_file_id() requires them ...
         rqst = self.rqst
-        view_revisions = _calc_view_revisions(self.branch, self.start_rev_id,
-            self.end_rev_id, rqst.get('direction'), True)
-        if not isinstance(view_revisions, list):
-            view_revisions = list(view_revisions)
-        view_revisions = _filter_revisions_touching_file_id(self.branch,
-            rqst.get('specific_fileids')[0], view_revisions,
-            include_merges=rqst.get('levels') != 1)
+        direction = rqst.get('direction')
+        file_id = rqst.get('specific_fileids')[0]
+        multi_level = rqst.get('levels') != 1
+        try:
+            file_graph, graph_tip = _per_file_graph(self.branch, file_id,
+                self.end_rev_id)
+        except errors.NoSuchId:
+            # File doesn't exist at end of range - fall back to old algorithm
+            view_revisions = None
+        else:
+            # Try iterating over the revisions given by the per-file graph.
+            # This returns None if it fails.
+            view_revisions = _calc_view_revisions_for_file(self.branch,
+                file_graph, graph_tip, self.start_rev_id, self.end_rev_id,
+                direction, multi_level)
+
+        if view_revisions is None:
+            # Get the base revisions, filtering by the revision range.
+            # Note that we always generate the merge revisions because
+            # filter_revisions_touching_file_id() requires them ...
+            view_revisions = _calc_view_revisions(self.branch,
+                self.start_rev_id, self.end_rev_id, direction, True)
+            if not isinstance(view_revisions, list):
+                view_revisions = list(view_revisions)
+            # TODO: pass in the already calculated file graph and re-use it
+            view_revisions = _filter_revisions_touching_file_id(self.branch,
+                file_id, view_revisions, include_merges=multi_level)
         return make_log_rev_iterator(self.branch, view_revisions,
             rqst.get('delta_type'), rqst.get('message_search'))
 
 
+def _per_file_graph(branch, file_id, end_rev_id):
+    """Get the per file graph.
+
+    :param end_rev_id: the last interesting revision-id or None to use
+      the basis tree. If non-None, the file must exist in that revision
+      or NoSuchId will be raised.
+    :return: graph, tip where
+      graph is a Graph with (file_id,rev_id) tuple keys and
+      tip is the graph tip
+    """
+    # Find when the file was last modified
+    if end_rev_id is None:
+        rev_tree = branch.basis_tree()
+    else:
+        rev_tree = branch.repository.revision_tree(end_rev_id)
+    last_modified = rev_tree.inventory[file_id].revision
+
+    # Return the result
+    tip = (file_id, last_modified)
+    return graph.Graph(branch.repository.texts), tip
+
+
+def _calc_view_revisions_for_file(branch, file_graph, graph_tip, start_rev_id,
+        end_rev_id, direction, include_merges):
+    """Calculate the revisions to view for a file.
+
+    :param file_graph: the per-file graph
+    :param graph_tip: the tip of the per-file graph
+    :param include_merges: if True, include all revisions, not just the top
+      level
+    :return: An list of (revision_id, dotted_revno, merge_depth) tuples OR
+      None if the algorithm fails (and another one should be used).
+    """
+    br_revno, br_rev_id = branch.last_revision_info()
+    if br_revno == 0:
+        return []
+
+    # Find the revisions where the file was changed and merged
+    file_rev_ids = []
+    file_merges = []
+    for (_, rev_id), parents in file_graph.iter_ancestry([graph_tip]):
+        file_rev_ids.append(rev_id)
+        if len(parents) > 1:
+            file_merges.append(rev_id)
+
+    # Handle the simple cases
+    if len(file_rev_ids) == 1:
+        return _generate_one_revision(branch, file_rev_ids[0], br_rev_id,
+            br_revno)
+    elif len(file_rev_ids) == 0:
+        # Should this ever happen?
+        return []
+    elif file_merges and include_merges:
+        # Fall back to the old algorithm for now
+        return None
+
+    # Find all the revisions we can using a linear search
+    result = []
+    missing = set(file_rev_ids)
+    merges_to_search = 0
+    created_timestamp = None
+    try:
+        candidates = _linear_view_revisions(branch, start_rev_id, end_rev_id)
+        for index, (rev_id, revno, depth) in enumerate(candidates):
+            if rev_id in missing:
+                result.append((rev_id, revno, depth))
+                missing.remove(rev_id)
+                if len(missing) == 0:
+                    break
+            else:
+                if _has_merges(branch, rev_id):
+                    merges_to_search += 1
+                    # If this is a dense tree, this optimisation is unlikely
+                    # to result in a net win - fall back to old algorithm.
+                    if merges_to_search > 100:
+                        return None
+                # Check the timestamp to avoid going back too far on the
+                # mainline for files created in merge revisions. We don't
+                # do this every revision, just regularly, to minimise the
+                # number of revisions that we load at this point.
+                if index and index % 500 == 0:
+                    if created_timestamp is None:
+                        created_rev = branch.repository.get_revision(
+                            file_rev_ids[-1])
+                        created_timestamp = created_rev.timestamp
+                    rev = branch.repository.get_revision(rev_id)
+                    if created_timestamp > rev.timestamp:
+                        return None
+
+    except _StartNotLinearAncestor:
+        raise errors.BzrCommandError('Start revision not found in'
+            ' left-hand history of end revision.')
+
+    # If no merges were found in the revision range, then we can be
+    # certain that we've found all the revisions we care about.
+    if missing and merges_to_search:
+        # TODO: search the deltas of the merges, splicing successful
+        # matches into their rightful spots. That should work well on
+        # chk repositories for typical histories but we need to benchmark
+        # it to confirm. There's most likely a sweet spot above which
+        # the O(history) traditional way - generating the full graph of
+        # history and post-filtering - remains the best performer.
+        trace.mutter("log file fastpath failed to find %d revisions" %
+            len(missing))
+        return None
+
+    # We came, we saw, we walked away victorious ...
+    if direction == 'forward':
+        result = reversed(result)
+    return result
+
+
 def _calc_view_revisions(branch, start_rev_id, end_rev_id, direction,
                          generate_merge_revisions, delayed_graph_generation=False):
     """Calculate the revisions to view.