Merge lp:~ian-clatworthy/bzr/faster-log-file into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Superseded
Proposed branch: lp:~ian-clatworthy/bzr/faster-log-file
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 180 lines (has conflicts)
Text conflict in bzrlib/log.py
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-log-file
Reviewer Review Type Date Requested Status
John A Meinel Needs Information
Review via email: mp+6805@code.launchpad.net

This proposal has been superseded by a proposal from 2009-06-17.

Ian Clatworthy (ian-clatworthy) wrote:

This patch speeds up 'bzr log FILE' on flat-ish histories, as commonly found after an import from svn, cvs and other centralized VCS tools. On OOo, it drops the time to log a typical file from 29 seconds to 1.5 seconds.

The key to this improvement is starting with the per-file graph and searching the mainline until the revisions of interest are found. That works very well when the history of a project is flat or mostly flat, because it avoids the 27 seconds required to calculate the full revision graph. In a nutshell, the algorithm changes from O(full-history) to O(file-life).
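
The idea can be sketched as follows. This is a minimal model with illustrative names (plain dicts and lists, not the actual bzrlib API): walk the small per-file graph first, then scan the mainline only until every revision that touched the file has been found.

```python
# Minimal sketch of the fast path; names are illustrative, not the
# real bzrlib API. The per-file graph is O(file-life), so walking it
# is cheap; the mainline scan can stop early once every revision that
# touched the file has been matched.

def fast_log_file(per_file_parents, file_tip, mainline):
    # Collect every revision in the file's own history.
    wanted = set()
    todo = [file_tip]
    while todo:
        rev = todo.pop()
        if rev not in wanted:
            wanted.add(rev)
            todo.extend(per_file_parents.get(rev, ()))
    # Scan the mainline (newest first), stopping as soon as all of the
    # file's revisions have been found.
    result = []
    for rev in mainline:
        if rev in wanted:
            result.append(rev)
            wanted.discard(rev)
            if not wanted:
                break
    return result
```

Note that when the file was last touched early in a long history, the mainline scan still walks most of it, which is the concern John raises below.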

There's certainly room for further smarts here, but I feel this is a useful step forward as it stands.

John A Meinel (jameinel) wrote:

So I see that you avoided "incorrect" results on non-linear ancestries by including checks for this case. However:

1) I'm not sure that the checks are complete. For example, whether the per-file graph has merges or not is irrelevant to how the 'include-merges' flag should be handled. Consider the case:

  :
  A
  |\
  | B # Mod foo
  |/
  C # Merge B's changes

In that case we want to see both revisions B and C in the "bzr log foo" output. Even though the per-file graph in this case looks simply like:

 :
 B # Mod foo
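
John's point can be made concrete with a toy model (plain dicts standing in for the real graph objects; names are illustrative):

```python
# Toy model of the example above: foo is modified only in B, and C
# merges B into the mainline. The per-file graph records just B, so a
# merge-aware 'bzr log foo' must also consult the revision graph to
# surface C.

full_graph = {'C': ['A', 'B'], 'B': ['A'], 'A': []}  # rev -> parents
file_revs = {'B'}                                    # foo's per-file graph

def merges_bringing_in(graph, file_revs):
    """Merge revisions whose non-first parents reach a file revision."""
    hits = set()
    for rev, parents in graph.items():
        for merged in parents[1:]:        # merged-in lines only
            todo = [merged]
            while todo:
                r = todo.pop()
                if r in file_revs:
                    hits.add(rev)
                    break
                todo.extend(graph.get(r, ()))
    return hits

# The per-file graph alone yields only B; adding the merges that
# brought B in also yields C.
print(sorted(file_revs | merges_bringing_in(full_graph, file_revs)))
# prints ['B', 'C']
```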

2) I'm a bit concerned that we do all of this work with _linear_view_revisions which in the common case for OOo will have to walk the *entire* history (assuming 'bzr log foo' with no -r specified), which we then throw away.

At least, I'm expecting that once a project like OOo changes to a DVCS, they will actually start including merges. Which means that they'll still have 200k revisions in the mainline, but then *also* have all sorts of merge revisions after that 200k...

I guess I'm mostly worried that while this makes some bits much faster for your OOo testing, it will actually cause regressions in a lot of other cases.

Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this code, just to have it end up deciding to return None?

review: Needs Information
4385. By Ian Clatworthy

merge bzr.dev r4446

4386. By Ian Clatworthy

avoid looking back too far for files created in merge revisions

Ian Clatworthy (ian-clatworthy) wrote:

> 1) I'm not sure that the checks are complete. For example, whether the
> per-file graph has merges or not is irrelevant to how the 'include-merges'
> flag should be handled. Consider the case:
>
> :
> A
> |\
> | B # Mod foo
> |/
> C # Merge B's changes
>
> In that case we want to see both revisions B and C in the "bzr log foo"
> output. Even though the per-file graph in this case looks simply like:
>
> :
> B # Mod foo

I think the code handles this ok. If log --include-merges (or -n0) is given, it uses the old algorithm immediately. Otherwise, it will just show B.
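
The dispatch Ian describes can be sketched roughly like this (illustrative function names, not the real bzrlib code):

```python
# Rough sketch of the dispatch described above; function names are
# illustrative. levels == 1 means mainline only; 0 (i.e. -n0 /
# --include-merges) or > 1 requests merge revisions too.

def choose_revisions(levels, fast_path, slow_path):
    if levels != 1:
        return slow_path()       # merge levels wanted: old algorithm
    revisions = fast_path()      # per-file fast path; None means it gave up
    if revisions is None:
        return slow_path()
    return revisions

# With John's example graph: the fast path sees only B, while the old
# algorithm also finds the merge revision C.
assert choose_revisions(1, lambda: ['B'], lambda: ['C', 'B']) == ['B']
assert choose_revisions(0, lambda: ['B'], lambda: ['C', 'B']) == ['C', 'B']
```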

> 2) I'm a bit concerned that we do all of this work with _linear_view_revisions
> which in the common case for OOo will have to walk the *entire* history
> (assuming 'bzr log foo' with no -r specified), which we then throw away.
>
> At least, I'm expecting that once a project like OOo changes to a DVCS, they
> will actually start including merges. Which means that they'll still have 200k
> revisions in the mainline, but then *also* have all sorts of merge revisions
> after that 200k...
>
> I guess I'm mostly worried that while this makes some bits much faster for
> your OOo testing, it will actually cause regressions in a lot of other cases.
>
> Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this
> code, just to have it end up deciding to return None?

Good points. I ran the benchmark you suggested and it did indeed indicate a problem. I'll push an updated patch.

4387. By Ian Clatworthy

add NEWS item

Unmerged revisions

4387. By Ian Clatworthy

add NEWS item

4386. By Ian Clatworthy

avoid looking back too far for files created in merge revisions

4385. By Ian Clatworthy

merge bzr.dev r4446

4384. By Ian Clatworthy

faster log file -n0 for flat file history

4383. By Ian Clatworthy

speed up log file on flat-ish histories

4382. By Canonical.com Patch Queue Manager <email address hidden>

(vila) Fix blatant performance regression for annotate in gc repos

4381. By Canonical.com Patch Queue Manager <email address hidden>

(Jelmer) Add registry for the 'bzr serve' protocol.

4380. By Canonical.com Patch Queue Manager <email address hidden>

(igc) two simple log dotted revno tests (Marius Kruger)

4379. By Canonical.com Patch Queue Manager <email address hidden>

(tanner) merge 1.15final back to trunk

4378. By Canonical.com Patch Queue Manager <email address hidden>

(igc) faster branch in a shared repo for dev6rr format (Ian
 Clatworthy)

Preview Diff

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py 2009-06-11 06:54:33 +0000
+++ bzrlib/builtins.py 2009-06-16 02:37:12 +0000
@@ -2230,16 +2230,14 @@
 # the underlying repository format is faster at generating
 # deltas or can provide everything we need from the indices.
 # The default algorithm - match-using-deltas - works for
- # multiple files and directories and is faster for small
- # amounts of history (200 revisions say). However, it's too
+ # multiple files and directories. However, it's too
 # slow for logging a single file in a repository with deep
 # history, i.e. > 10K revisions. In the spirit of "do no
 # evil when adding features", we continue to use the
 # original algorithm - per-file-graph - for the "single
 # file that isn't a directory without showing a delta" case.
- partial_history = revision and b.repository._format.supports_chks
 match_using_deltas = (len(file_ids) != 1 or filter_by_dir
- or delta_type or partial_history)
+ or delta_type)

 # Build the LogRequest and execute it
 if len(file_ids) == 0:

=== modified file 'bzrlib/log.py'
--- bzrlib/log.py 2009-06-10 03:56:49 +0000
+++ bzrlib/log.py 2009-06-16 02:37:12 +0000
@@ -69,7 +69,11 @@
 config,
 diff,
 errors,
+<<<<<<< TREE
 foreign,
+=======
+ graph,
+>>>>>>> MERGE-SOURCE
 repository as _mod_repository,
 revision as _mod_revision,
 revisionspec,
@@ -460,21 +464,131 @@
 direction=rqst.get('direction'))

 def _log_revision_iterator_using_per_file_graph(self):
- # Get the base revisions, filtering by the revision range.
- # Note that we always generate the merge revisions because
- # filter_revisions_touching_file_id() requires them ...
 rqst = self.rqst
- view_revisions = _calc_view_revisions(self.branch, self.start_rev_id,
- self.end_rev_id, rqst.get('direction'), True)
- if not isinstance(view_revisions, list):
- view_revisions = list(view_revisions)
- view_revisions = _filter_revisions_touching_file_id(self.branch,
- rqst.get('specific_fileids')[0], view_revisions,
- include_merges=rqst.get('levels') != 1)
+ direction = rqst.get('direction')
+ file_id = rqst.get('specific_fileids')[0]
+ multi_level = rqst.get('levels') != 1
+ try:
+ file_graph, graph_tip = _per_file_graph(self.branch, file_id,
+ self.end_rev_id)
+ except errors.NoSuchId:
+ # File doesn't exist at end of range - fall back to old algorithm
+ view_revisions = None
+ else:
+ # Try iterating over the revisions given by the per-file graph.
+ # This returns None if it fails.
+ view_revisions = _calc_view_revisions_for_file(self.branch,
+ file_graph, graph_tip, self.start_rev_id, self.end_rev_id,
+ direction, multi_level)
+
+ if view_revisions is None:
+ # Get the base revisions, filtering by the revision range.
+ # Note that we always generate the merge revisions because
+ # filter_revisions_touching_file_id() requires them ...
+ view_revisions = _calc_view_revisions(self.branch,
+ self.start_rev_id, self.end_rev_id, direction, True)
+ if not isinstance(view_revisions, list):
+ view_revisions = list(view_revisions)
+ # TODO: pass in the already calculated file graph and re-use it
+ view_revisions = _filter_revisions_touching_file_id(self.branch,
+ file_id, view_revisions, include_merges=multi_level)
 return make_log_rev_iterator(self.branch, view_revisions,
 rqst.get('delta_type'), rqst.get('message_search'))


+def _per_file_graph(branch, file_id, end_rev_id):
+ """Get the per file graph.
+
+ :param end_rev_id: the last interesting revision-id or None to use
+ the basis tree. If non-None, the file must exist in that revision
+ or NoSuchId will be raised.
+ :return: graph, tip where
+ graph is a Graph with (file_id,rev_id) tuple keys and
+ tip is the graph tip
+ """
+ # Find when the file was last modified
+ if end_rev_id is None:
+ rev_tree = branch.basis_tree()
+ else:
+ rev_tree = branch.repository.revision_tree(end_rev_id)
+ last_modified = rev_tree.inventory[file_id].revision
+
+ # Return the result
+ tip = (file_id, last_modified)
+ return graph.Graph(branch.repository.texts), tip
+
+
+def _calc_view_revisions_for_file(branch, file_graph, graph_tip, start_rev_id,
+ end_rev_id, direction, include_merges):
+ """Calculate the revisions to view for a file.
+
+ :param file_graph: the per-file graph
+ :param graph_tip: the tip of the per-file graph
+ :param include_merges: if True, include all revisions, not just the top
+ level
+ :return: A list of (revision_id, dotted_revno, merge_depth) tuples OR
+ None if the algorithm fails (and another one should be used).
+ """
+ br_revno, br_rev_id = branch.last_revision_info()
+ if br_revno == 0:
+ return []
+
+ # Find when the file was changed and merged
+ file_rev_ids = []
+ file_merges = []
+ for (_, rev_id), parents in file_graph.iter_ancestry([graph_tip]):
+ file_rev_ids.append(rev_id)
+ if len(parents) > 1:
+ file_merges.append(rev_id)
+
+ # Handle the simple cases
+ if len(file_rev_ids) == 1:
+ return _generate_one_revision(branch, file_rev_ids[0], br_rev_id,
+ br_revno)
+ elif len(file_rev_ids) == 0:
+ # Should this ever happen?
+ return []
+ elif file_merges and include_merges:
+ # Fall back to the old algorithm for now
+ return None
+
+ # Find all the revisions we can using a linear search
+ result = []
+ missing = set(file_rev_ids)
+ merges_to_search = 0
+ try:
+ candidates = _linear_view_revisions(branch, start_rev_id, end_rev_id)
+ for rev_id, revno, depth in candidates:
+ if rev_id in missing:
+ result.append((rev_id, revno, depth))
+ missing.remove(rev_id)
+ if len(missing) == 0:
+ break
+ if _has_merges(branch, rev_id):
+ merges_to_search += 1
+ except _StartNotLinearAncestor:
+ raise errors.BzrCommandError('Start revision not found in'
+ ' left-hand history of end revision.')
+
+ # If no merges were found in the revision range, then we can be
+ # certain that we've found all the revisions we care about.
+ if missing and merges_to_search:
+ # TODO: search the deltas of the merges, splicing successful
+ # matches into their rightful spots. That should work well on
+ # chk repositories for typical histories but we need to benchmark
+ # it to confirm. There's most likely a sweet spot above which
+ # the O(history) traditional way - generating the full graph of
+ # history and post-filtering - remains the best performer.
+ trace.mutter("log file fastpath failed to find %d revisions" %
+ len(missing))
+ return None
+
+ # We came, we saw, we walked away victorious ...
+ if direction == 'forward':
+ result = reversed(result)
+ return result
+
+
 def _calc_view_revisions(branch, start_rev_id, end_rev_id, direction,
 generate_merge_revisions, delayed_graph_generation=False):
 """Calculate the revisions to view.