Merge lp:~ian-clatworthy/bzr/faster-log-file into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Superseded
Proposed branch: lp:~ian-clatworthy/bzr/faster-log-file
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 180 lines (has conflicts)
Text conflict in bzrlib/log.py
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-log-file
Reviewer: John A Meinel
Review status: Needs Information
Review via email: mp+6805@code.launchpad.net

This proposal has been superseded by a proposal from 2009-06-17.

Ian Clatworthy (ian-clatworthy) wrote:

This patch speeds up 'bzr log FILE' on flat-ish histories, as commonly found after an import from svn, cvs and other centralized VCS tools. On OOo, it cuts the time to log a typical file from 29 seconds to 1.5 seconds.

The key to this improvement is starting with the per-file graph and searching the mainline until the revisions of interest are found. That works very well when the history of a project is flat or mostly flat, because it avoids the 27 seconds required to calculate the full revision graph. In a nutshell, the algorithm changes from O(full-history) to O(file-life).
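The idea can be sketched as follows (a hypothetical illustration with made-up names, not bzrlib's actual API): the per-file graph yields the set of revisions that modified the file, and we then walk the mainline newest-to-oldest, stopping as soon as every one of them has been matched.

```python
def match_file_revisions(file_rev_ids, mainline):
    """Return (revno, rev_id) pairs from the mainline that touched the file.

    file_rev_ids: revision ids taken from the per-file graph.
    mainline: iterable of (revno, rev_id) tuples, newest first.
    """
    missing = set(file_rev_ids)
    matched = []
    for revno, rev_id in mainline:
        if rev_id in missing:
            matched.append((revno, rev_id))
            missing.remove(rev_id)
            if not missing:
                break  # every interesting revision found; stop early
    return matched

# A toy 5-revision mainline where the file changed in r4 and r2:
mainline = [(5, 'r5'), (4, 'r4'), (3, 'r3'), (2, 'r2'), (1, 'r1')]
print(match_file_revisions({'r4', 'r2'}, mainline))  # [(4, 'r4'), (2, 'r2')]
```

Because the walk stops once the last file revision is matched, the cost tracks the life of the file rather than the depth of the whole branch history.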

There's certainly room for further smarts here, but I feel this is a useful step forward as it stands.

John A Meinel (jameinel) wrote:

So I see that you avoided "incorrect" results on non-linear ancestries by including checks for this case. However:

1) I'm not sure that the checks are complete. For example, it doesn't matter whether the per-file graph has merges or not, as to how the 'include-merges' flag should be handled. Consider the case:

  :
  A
  |\
  | B # Mod foo
  |/
  C # Merge B's changes

In that case we want to see both revisions B and C in the "bzr log foo" output. Even though the per-file graph in this case looks simply like:

 :
 B # Mod foo

2) I'm a bit concerned that we do all of this work with _linear_view_revisions which in the common case for OOo will have to walk the *entire* history (assuming 'bzr log foo' with no -r specified), which we then throw away.

At least, I'm expecting that once a project like OOo changes to a DVCS, they will actually start including merges. Which means that they'll still have 200k revisions in the mainline, but then *also* have all sorts of merge revisions after that 200k...

I guess, I'm mostly worried that while this makes some bits much faster for your OOo testing, it will actually cause regressions in a lot of other cases.

Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this code, just to have it end up deciding to return None?
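To make that worry concrete, here's a hedged toy model (not bzrlib code; names are mine) of what the fast path pays when none of the file's revisions sit on the scanned mainline:

```python
def scan_cost(file_rev_ids, mainline):
    """Count mainline revisions visited before the scan gives up or finishes.

    Returns (revisions_visited, fell_back) where fell_back is True when
    some file revisions were never found on the mainline.
    """
    missing = set(file_rev_ids)
    visited = 0
    for _, rev_id in mainline:
        visited += 1
        if rev_id in missing:
            missing.remove(rev_id)
            if not missing:
                break
    return visited, bool(missing)

# File only ever modified in merged (non-mainline) revisions:
mainline = [(n, 'r%d' % n) for n in range(200000, 0, -1)]
print(scan_cost({'merged-rev'}, mainline))  # (200000, True)
```

In that case the scan walks all 200k mainline revisions, matches nothing, and the caller still has to fall back to the full-graph algorithm, so the fast-path work is pure overhead.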

review: Needs Information
4385. By Ian Clatworthy

merge bzr.dev r4446

4386. By Ian Clatworthy

avoid looking back too far for files created in merge revisions

Ian Clatworthy (ian-clatworthy) wrote:

> 1) I'm not sure that the checks are complete. For example, it doesn't matter
> whether the per-file graph has merges or not, as to how the 'include-merges'
> flag should be handled. Consider the case:
>
> :
> A
> |\
> | B # Mod foo
> |/
> C # Merge B's changes
>
> In that case we want to see both revisions B and C in the "bzr log foo"
> output. Even though the per-file graph in this case looks simply like:
>
> :
> B # Mod foo

I think the code handles this ok. If log --include-merges (or -n0) is given, it uses the old algorithm immediately. Otherwise, it will just show B.
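Roughly, the fallback check in the patch can be sketched like this (an illustrative reading of the diff, with made-up names):

```python
def use_fast_path(levels, file_graph_has_merges):
    """Decide whether the per-file fast path may be attempted.

    levels: the -n value; anything other than 1 requests merge revisions.
    file_graph_has_merges: True if the per-file graph contains merges.
    """
    include_merges = levels != 1
    # When merge revisions are requested and the file's own history
    # contains merges, the fast path could produce incomplete results,
    # so fall back to the old full-graph algorithm.
    return not (include_merges and file_graph_has_merges)

print(use_fast_path(1, True))   # True: mainline-only view wanted
print(use_fast_path(0, True))   # False: fall back to full history
```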

> 2) I'm a bit concerned that we do all of this work with _linear_view_revisions
> which in the common case for OOo will have to walk the *entire* history
> (assuming 'bzr log foo' with no -r specified), which we then throw away.
>
> At least, I'm expecting that once a project like OOo changes to a DVCS, they
> will actually start including merges. Which means that they'll still have 200k
> revisions in the mainline, but then *also* have all sorts of merge revisions
> after that 200k...
>
> I guess, I'm mostly worried that while this makes some bits much faster for
> your OOo testing, it will actually cause regressions in a lot of other cases.
>
> Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this
> code, just to have it end up deciding to return None?

Good points. I ran the benchmark you suggested and it did indeed indicate a problem. I'll push an updated patch.

4387. By Ian Clatworthy

add NEWS item

Unmerged revisions

4387. By Ian Clatworthy

add NEWS item

4386. By Ian Clatworthy

avoid looking back too far for files created in merge revisions

4385. By Ian Clatworthy

merge bzr.dev r4446

4384. By Ian Clatworthy

faster log file -n0 for flat file history

4383. By Ian Clatworthy

speed up log file on flat-ish histories

4382. By Canonical.com Patch Queue Manager <email address hidden>

(vila) Fix blatant performance regression for annotate in gc repos

4381. By Canonical.com Patch Queue Manager <email address hidden>

(Jelmer) Add registry for the 'bzr serve' protocol.

4380. By Canonical.com Patch Queue Manager <email address hidden>

(igc) two simple log dotted revno tests (Marius Kruger)

4379. By Canonical.com Patch Queue Manager <email address hidden>

(tanner) merge 1.15final back to trunk

4378. By Canonical.com Patch Queue Manager <email address hidden>

(igc) faster branch in a shared repo for dev6rr format (Ian
 Clatworthy)

Preview Diff

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py 2009-06-11 06:54:33 +0000
+++ bzrlib/builtins.py 2009-06-16 02:37:12 +0000
@@ -2230,16 +2230,14 @@
         # the underlying repository format is faster at generating
         # deltas or can provide everything we need from the indices.
         # The default algorithm - match-using-deltas - works for
-        # multiple files and directories and is faster for small
-        # amounts of history (200 revisions say). However, it's too
+        # multiple files and directories. However, it's too
         # slow for logging a single file in a repository with deep
         # history, i.e. > 10K revisions. In the spirit of "do no
         # evil when adding features", we continue to use the
         # original algorithm - per-file-graph - for the "single
         # file that isn't a directory without showing a delta" case.
-        partial_history = revision and b.repository._format.supports_chks
         match_using_deltas = (len(file_ids) != 1 or filter_by_dir
-            or delta_type or partial_history)
+            or delta_type)
 
         # Build the LogRequest and execute it
         if len(file_ids) == 0:
=== modified file 'bzrlib/log.py'
--- bzrlib/log.py 2009-06-10 03:56:49 +0000
+++ bzrlib/log.py 2009-06-16 02:37:12 +0000
@@ -69,7 +69,11 @@
     config,
     diff,
     errors,
+<<<<<<< TREE
     foreign,
+=======
+    graph,
+>>>>>>> MERGE-SOURCE
     repository as _mod_repository,
     revision as _mod_revision,
     revisionspec,
@@ -460,21 +464,131 @@
             direction=rqst.get('direction'))
 
     def _log_revision_iterator_using_per_file_graph(self):
-        # Get the base revisions, filtering by the revision range.
-        # Note that we always generate the merge revisions because
-        # filter_revisions_touching_file_id() requires them ...
         rqst = self.rqst
-        view_revisions = _calc_view_revisions(self.branch, self.start_rev_id,
-            self.end_rev_id, rqst.get('direction'), True)
-        if not isinstance(view_revisions, list):
-            view_revisions = list(view_revisions)
-        view_revisions = _filter_revisions_touching_file_id(self.branch,
-            rqst.get('specific_fileids')[0], view_revisions,
-            include_merges=rqst.get('levels') != 1)
+        direction = rqst.get('direction')
+        file_id = rqst.get('specific_fileids')[0]
+        multi_level = rqst.get('levels') != 1
+        try:
+            file_graph, graph_tip = _per_file_graph(self.branch, file_id,
+                self.end_rev_id)
+        except errors.NoSuchId:
+            # File doesn't exist at end of range - fall back to old algorithm
+            view_revisions = None
+        else:
+            # Try iterating over the revisions given by the per-file graph.
+            # This returns None if it fails.
+            view_revisions = _calc_view_revisions_for_file(self.branch,
+                file_graph, graph_tip, self.start_rev_id, self.end_rev_id,
+                direction, multi_level)
+
+        if view_revisions is None:
+            # Get the base revisions, filtering by the revision range.
+            # Note that we always generate the merge revisions because
+            # filter_revisions_touching_file_id() requires them ...
+            view_revisions = _calc_view_revisions(self.branch,
+                self.start_rev_id, self.end_rev_id, direction, True)
+            if not isinstance(view_revisions, list):
+                view_revisions = list(view_revisions)
+            # TODO: pass in the already calculated file graph and re-use it
+            view_revisions = _filter_revisions_touching_file_id(self.branch,
+                file_id, view_revisions, include_merges=multi_level)
         return make_log_rev_iterator(self.branch, view_revisions,
             rqst.get('delta_type'), rqst.get('message_search'))
 
 
+def _per_file_graph(branch, file_id, end_rev_id):
+    """Get the per file graph.
+
+    :param end_rev_id: the last interesting revision-id or None to use
+        the basis tree. If non-None, the file must exist in that revision
+        or NoSuchId will be raised.
+    :return: graph, tip where
+        graph is a Graph with (file_id,rev_id) tuple keys and
+        tip is the graph tip
+    """
+    # Find when the file was last modified
+    if end_rev_id is None:
+        rev_tree = branch.basis_tree()
+    else:
+        rev_tree = branch.repository.revision_tree(end_rev_id)
+    last_modified = rev_tree.inventory[file_id].revision
+
+    # Return the result
+    tip = (file_id, last_modified)
+    return graph.Graph(branch.repository.texts), tip
+
+
+def _calc_view_revisions_for_file(branch, file_graph, graph_tip, start_rev_id,
+    end_rev_id, direction, include_merges):
+    """Calculate the revisions to view for a file.
+
+    :param file_graph: the per-file graph
+    :param graph_tip: the tip of the per-file graph
+    :param include_merges: if True, include all revisions, not just the top
+        level
+    :return: An list of (revision_id, dotted_revno, merge_depth) tuples OR
+        None if the algorithm fails (and another one should be used).
+    """
+    br_revno, br_rev_id = branch.last_revision_info()
+    if br_revno == 0:
+        return []
+
+    # Find when the file was changed and merged
+    file_rev_ids = []
+    file_merges = []
+    for (_, rev_id), parents in file_graph.iter_ancestry([graph_tip]):
+        file_rev_ids.append(rev_id)
+        if len(parents) > 1:
+            file_merges.append(rev_id)
+
+    # Handle the simple cases
+    if len(file_rev_ids) == 1:
+        return _generate_one_revision(branch, file_rev_ids[0], br_rev_id,
+            br_revno)
+    elif len(file_rev_ids) == 0:
+        # Should this ever happen?
+        return []
+    elif file_merges and include_merges:
+        # Fall back to the old algorithm for now
+        return None
+
+    # Find all the revisions we can using a linear search
+    result = []
+    missing = set(file_rev_ids)
+    merges_to_search = 0
+    try:
+        candidates = _linear_view_revisions(branch, start_rev_id, end_rev_id)
+        for rev_id, revno, depth in candidates:
+            if rev_id in missing:
+                result.append((rev_id, revno, depth))
+                missing.remove(rev_id)
+                if len(missing) == 0:
+                    break
+            if _has_merges(branch, rev_id):
+                merges_to_search += 1
+    except _StartNotLinearAncestor:
+        raise errors.BzrCommandError('Start revision not found in'
+            ' left-hand history of end revision.')
+
+    # If no merges were found in the revision range, then we can be
+    # certain that we've found all the revisions we care about.
+    if missing and merges_to_search:
+        # TODO: search the deltas of the merges, splicing successful
+        # matches into their rightful spots. That should work well on
+        # chk repositories for typical histories but we need to benchmark
+        # it to confirm. There's most likely a sweet spot above which
+        # the O(history) traditional way - generating the full graph of
+        # history and post-filtering - remains the best performer.
+        trace.mutter("log file fastpath failed to find %d revisions" %
+            len(missing))
+        return None
+
+    # We came, we saw, we walked away victorious ...
+    if direction == 'forward':
+        result = reversed(result)
+    return result
+
+
 def _calc_view_revisions(branch, start_rev_id, end_rev_id, direction,
     generate_merge_revisions, delayed_graph_generation=False):
     """Calculate the revisions to view.