Merge lp:~ian-clatworthy/bzr/faster-log-file into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Work in progress
Proposed branch: lp:~ian-clatworthy/bzr/faster-log-file
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 220 lines (has conflicts)
Text conflict in NEWS
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-log-file
Reviewer: Martin Pool (status: Needs Information)
Review via email: mp+7535@code.launchpad.net

This proposal supersedes a proposal from 2009-05-27.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

This patch speeds up 'bzr log FILE' on flat-ish histories, as commonly found after an import from svn, cvs and other centralized VCS tools. On OOo, it cuts the time to log a typical file from 29 seconds to 1.5 seconds.

The key to this improvement is starting with the per-file graph and searching the mainline until the revisions of interest are found. That works very well when the history of a project is flat or mostly flat, because it avoids the 27 seconds required to calculate the full revision graph. In a nutshell, the algorithm changes from O(full-history) to O(file-life).

There's certainly room for further smarts here, but I feel this is a useful step forward as it stands.
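The idea above can be sketched in a few lines of Python. This is an illustrative toy, not bzrlib code: `log_file_fastpath`, its arguments and the revision ids are all hypothetical, and the real patch works against bzrlib's graph objects rather than plain lists and sets.

```python
def log_file_fastpath(mainline, file_rev_ids):
    """Return (rev_id, revno) pairs for mainline revisions touching a file.

    mainline     -- mainline revision ids, newest first (hypothetical toy input)
    file_rev_ids -- revisions from the per-file graph, as a set
    """
    missing = set(file_rev_ids)
    result = []
    # Walk the mainline newest-first, pairing each revision with its revno.
    for revno, rev_id in zip(range(len(mainline), 0, -1), mainline):
        if rev_id in missing:
            result.append((rev_id, revno))
            missing.remove(rev_id)
            if not missing:
                break  # every per-file revision found; stop walking
    return result

# On a flat history the walk stops at the file's oldest revision, so the
# cost is O(file-life) rather than O(full-history).
print(log_file_fastpath(['r5', 'r4', 'r3', 'r2', 'r1'], {'r4', 'r2'}))
# → [('r4', 4), ('r2', 2)]
```

The early `break` is the whole point: on a 200k-revision mainline where the file only lived for 1k revisions, the loop never visits the other 199k.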

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

So I see that you avoided "incorrect" results on non-linear ancestries by including checks for this case. However

1) I'm not sure that the checks are complete. For example, it doesn't matter whether the per-file graph has merges or not, as to how the 'include-merges' flag should be handled. Consider the case:

  :
  A
  |\
  | B # Mod foo
  |/
  C # Merge B's changes

In that case we want to see both revisions B and C in the "bzr log foo" output. Even though the per-file graph in this case looks simply like:

 :
 B # Mod foo

2) I'm a bit concerned that we do all of this work with _linear_view_revisions which in the common case for OOo will have to walk the *entire* history (assuming 'bzr log foo' with no -r specified), which we then throw away.

At least, I'm expecting that once a project like OOo changes to a DVCS, they will actually start including merges. Which means that they'll still have 200k revisions in the mainline, but then *also* have all sorts of merge revisions after that 200k...

I guess, I'm mostly worried that while this makes some bits much faster for your OOo testing, it will actually cause regressions in a lot of other cases.

Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this code, just to have it end up deciding to return None?
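John's first example can be made concrete with a toy graph check (hypothetical names and data structures, not bzrlib code): walking only the per-file graph yields just B, while the mainline revision C that merged B has to be recovered from revision ancestry.

```python
# Toy DAG for the example above: C merges B's changes onto the mainline.
graph = {'A': [], 'B': ['A'], 'C': ['A', 'B']}
mainline = ['C', 'A']   # left-hand (mainline) history, newest first
file_revs = {'B'}       # per-file graph: only B modified foo

def ancestry(graph, rev):
    """Return all ancestors of rev, including rev, from a {rev: parents} dict."""
    seen, todo = set(), [rev]
    while todo:
        r = todo.pop()
        if r not in seen:
            seen.add(r)
            todo.extend(graph[r])
    return seen

# Mainline revisions that pulled in a per-file revision via a merge:
merge_points = [m for m in mainline
                if file_revs & (ancestry(graph, m) - {m})]
print(merge_points)  # → ['C']
```

With -n0 both B and C should appear in the log, so a fast path that reports only B would be wrong; this is why the patch falls back to the old algorithm whenever merge levels are requested.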

review: Needs Information
Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

> 1) I'm not sure that the checks are complete. For example, it doesn't matter
> whether the per-file graph has merges or not, as to how the 'include-merges'
> flag should be handled. Consider the case:
>
> :
> A
> |\
> | B # Mod foo
> |/
> C # Merge B's changes
>
> In that case we want to see both revisions B and C in the "bzr log foo"
> output. Even though the per-file graph in this case looks simply like:
>
> :
> B # Mod foo

I think the code handles this ok. If log --include-merges (or -n0) is given, it uses the old algorithm immediately. Otherwise, it will just show B.

> 2) I'm a bit concerned that we do all of this work with _linear_view_revisions
> which in the common case for OOo will have to walk the *entire* history
> (assuming 'bzr log foo' with no -r specified), which we then throw away.
>
> At least, I'm expecting that once a project like OOo changes to a DVCS, they
> will actually start including merges. Which means that they'll still have 200k
> revisions in the mainline, but then *also* have all sorts of merge revisions
> after that 200k...
>
> I guess, I'm mostly worried that while this makes some bits much faster for
> your OOo testing, it will actually cause regressions in a lot of other cases.
>
> Consider 'bzr log bzrlib/builtins.py', how much time will be spent in this
> code, just to have it end up deciding to return None?

Good points. I ran the benchmark you suggested and it did indeed indicate a problem. I'll push an updated patch.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

This patch speeds up log FILE on flat-ish histories: 29 secs => 1 sec for OOo. As noted in the previous merge proposal, it does this by starting from the per-file graph and looking for the revisions of interest along the mainline, instead of always assuming that the full revision graph is required. The win applies to files whose history is flat or mostly flat between their creation and last edit.

In response to John's feedback on the earlier proposal, this version adds checks to ensure that we don't walk the whole mainline only to come up short. In particular, it uses revision timestamps as a periodic sanity check, and it bails out if the revision graph is clearly dense during the file's edit lifetime. So even as flat histories mutate into dense ones, the win will remain for most files.

Collectively, these new checks maintain the original win while keeping the overhead on dense trees small (IMO): around 10%. For example, 'bzr log NEWS' on Bazaar's trunk goes from 2.7 to 3.0 seconds, while 'bzr log bzrlib/builtins.py' goes from 2.4 to 2.6 seconds. That 0.2-0.3 second increase isn't noticeable in practice, while the reduction from 29 seconds to 1 obviously is.
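The two guards described above can be sketched as follows. This is an illustrative reconstruction with hypothetical names (`walk_with_guards` and its callbacks are invented); only the thresholds of 100 merges and 500 revisions are taken from the patch itself.

```python
def walk_with_guards(mainline, missing, get_timestamp, created_timestamp,
                     has_merges, max_merges=100, check_every=500):
    """Walk the mainline newest-first collecting per-file revisions.

    Returns the revisions found, or None to signal that the caller
    should fall back to the full-graph algorithm.
    """
    result, merges_seen = [], 0
    missing = set(missing)
    for index, rev_id in enumerate(mainline):
        if rev_id in missing:
            result.append(rev_id)
            missing.discard(rev_id)
            if not missing:
                return result  # found everything; stop early
        else:
            # Guard 1: too many merge revisions means a dense graph,
            # where the old algorithm is likely to win.
            if has_merges(rev_id):
                merges_seen += 1
                if merges_seen > max_merges:
                    return None
            # Guard 2: every check_every revisions, compare timestamps
            # so we never walk far past the file's creation time.
            if index and index % check_every == 0:
                if get_timestamp(rev_id) < created_timestamp:
                    return None
    return None if missing else result
```

The timestamp check is only a heuristic (clocks can be skewed across committers), which is presumably why it triggers a fallback rather than an error.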

Revision history for this message
Martin Pool (mbp) wrote :

I think you should test this on some files from MySQL or Launchpad,
which have a longer but still very bushy history.

This looks fairly plausible to me, but I think there should be a unit
test for the fairly nontrivial function you've added. The approach here
of falling back to the old code means that any bugs here may not
actually be exercised by the test suite.

=== modified file 'NEWS'
--- NEWS	2009-06-16 09:05:34 +0000
+++ NEWS	2009-06-17 05:48:37 +0000
@@ -29,6 +29,10 @@
 Improvements
 ************
 
+* ``bzr log FILE`` is now substantially faster on flat-ish histories.
+  On OpenOffice.org for example, logging a typical file now takes
+  a second or so instead of 29 seconds. (Ian Clatworthy)
+
 * Resolving a revno to a revision id on a branch accessed via ``bzr://``
   or ``bzr+ssh://`` is now much faster and involves no VFS operations.
   This speeds up commands like ``bzr pull -r 123``. (Andrew Bennetts)

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py	2009-06-15 06:47:14 +0000
+++ bzrlib/builtins.py	2009-06-16 12:46:50 +0000
@@ -2237,16 +2237,14 @@
             # the underlying repository format is faster at generating
             # deltas or can provide everything we need from the indices.
             # The default algorithm - match-using-deltas - works for
-            # multiple files and directories and is faster for small
-            # amounts of history (200 revisions say). However, it's too
+            # multiple files and directories. However, it's too
             # slow for logging a single file in a repository with deep
             # history, i.e. > 10K revisions. In the spirit of "do no
             # evil when adding features", we continue to use the
             # original algorithm - per-file-graph - for the "single
             # file that isn't a directory without showing a delta" case.
-            partial_history = revision and b.repository._format.supports_chks
             match_using_deltas = (len(file_ids) != 1 or filter_by_dir
-                or delta_type or partial_history)
+                or delta_type)
 
             # Build the LogRequest and execute it
             if len(file_ids) == 0:

=== modified file 'bzrlib/log.py'
--- bzrlib/log.py	2009-06-10 03:56:49 +0000
+++ bzrlib/log.py	2009-06-17 05:35:10 +0000
@@ -70,6 +70,7 @@
     diff,
     errors,
     foreign,
+    graph,
     repository as _mod_repository,
     revision as _mod_revision,
     revisionspec,
@@ -460,21 +461,150 @@
             direction=rqst.get('direction'))
 
     def _log_revision_iterator_using_per_file_graph(self):
-        # Get the base revisions, filtering by the revision range.
-        # Note that we always generate the merge revisions because
-        # filter_revisions_touching_file_id() requires them ...
         rqst = self.rqst
-        view_revisions = _calc_view_revisions(self.branch, self.start_rev_id,
-            self.end_rev_id, rqst.get('direction'), True)
-        if not isinstance(view_revisions, list):
-            view_revisions = list(view_revisions)
-        view_revisions = _filter_revisions_touching_file_id(self.branch,
-            rqst.get('specific_fileids')[0], view_revisio...


review: Needs Information
Revision history for this message
John A Meinel (jameinel) wrote :

What is the status of this submission? How was it impacted by my improvements to "bzr log DIR"?

It seems a bit stale, so I'd like to either mark it as "Work in Progress" or "Rejected" or heck, even "Approved" so long as we get some motion here.

Revision history for this message
Parth Malwankar (parthm) wrote :

I just ran a basic benchmark of this branch on the emacs trunk. 'bzr log FILE' takes ~11s with trunk and ~24s with this branch. Maybe something has changed since this branch is quite old, or perhaps the emacs development model is not flat-ish.
'bzr log' performance is about the same.
Maybe someone familiar with log can comment or mark this as rejected to keep wip queue size down?

[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null 34.97s user 1.76s system 97% cpu 37.860 total
[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log > /dev/null 35.11s user 1.66s system 99% cpu 36.997 total
[emacs-bzr]% time bzr --no-plugins log > /dev/null
bzr --no-plugins log > /dev/null 35.19s user 1.72s system 99% cpu 37.082 total
[emacs-bzr]% time bzr log autogen.sh > /dev/null
bzr log autogen.sh > /dev/null 11.26s user 0.25s system 96% cpu 11.884 total
[emacs-bzr]% time bzr log autogen.sh > /dev/null
bzr log autogen.sh > /dev/null 11.07s user 0.24s system 100% cpu 11.305 total
[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null 23.66s user 0.22s system 99% cpu 24.075 total
[emacs-bzr]% time ~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null
~/src/bzr.dev/faster-log-file/bzr --no-plugins log autogen.sh > /dev/null 25.39s user 0.28s system 99% cpu 25.756 total
[emacs-bzr]%

Revision history for this message
John A Meinel (jameinel) wrote :


Parth Malwankar wrote:
> I just ran a basic benchmark of this branch on the emacs trunk. 'bzr log FILE' takes ~11s with trunk and ~24s with this branch. Maybe something has changed since this branch is quite old, or perhaps the emacs development model is not flat-ish.
> 'bzr log' performance is about the same.
> Maybe someone familiar with log can comment or mark this as rejected to keep wip queue size down?
>
emacs used to be almost completely flat, and I think Ian's work was
trying to tune for that. However, they now have a few merge revisions,
and this may get tripped up. I seem to remember that it did the work
twice. Once trying to be 'fast' when there are no merges, and then fall
back to a slower path. However, if there are merges it was 2x slower
because it did the work twice.

John
=:->


Revision history for this message
Robert Collins (lifeless) wrote :

So, is this really in progress, or halted()? Perhaps MPs should have an 'idle' state for not-moving, not-rejected. Or something.

Unmerged revisions

4387. By Ian Clatworthy

add NEWS item

4386. By Ian Clatworthy

avoid looking back too far for files created in merge revisions

4385. By Ian Clatworthy

merge bzr.dev r4446

4384. By Ian Clatworthy

faster log file -n0 for flat file history

4383. By Ian Clatworthy

speed up log file on flat-ish histories

4382. By Canonical.com Patch Queue Manager <email address hidden>

(vila) Fix blatant performance regression for annotate in gc repos

4381. By Canonical.com Patch Queue Manager <email address hidden>

(Jelmer) Add registry for the 'bzr serve' protocol.

4380. By Canonical.com Patch Queue Manager <email address hidden>

(igc) two simple log dotted revno tests (Marius Kruger)

4379. By Canonical.com Patch Queue Manager <email address hidden>

(tanner) merge 1.15final back to trunk

4378. By Canonical.com Patch Queue Manager <email address hidden>

(igc) faster branch in a shared repo for dev6rr format (Ian
 Clatworthy)

Preview Diff

=== modified file 'NEWS'
--- NEWS	2009-08-30 23:51:10 +0000
+++ NEWS	2009-08-31 04:37:43 +0000
@@ -771,6 +771,7 @@
 Improvements
 ************
 
+<<<<<<< TREE
 * ``bzr annotate`` can now be significantly faster. The time for
   ``bzr annotate NEWS`` is down to 7s from 22s in 1.16. Files with long
   histories and lots of 'duplicate insertions' will be improved more than
@@ -786,6 +787,12 @@
 * Initial commit performance in ``--2a`` repositories has been improved by
   making it cheaper to build the initial CHKMap. (John Arbash Meinel)
 
+=======
+* ``bzr log FILE`` is now substantially faster on flat-ish histories.
+  On OpenOffice.org for example, logging a typical file now takes
+  a second or so instead of 29 seconds. (Ian Clatworthy)
+
+>>>>>>> MERGE-SOURCE
 * Resolving a revno to a revision id on a branch accessed via ``bzr://``
   or ``bzr+ssh://`` is now much faster and involves no VFS operations.
   This speeds up commands like ``bzr pull -r 123``. (Andrew Bennetts)

=== modified file 'bzrlib/builtins.py'
--- bzrlib/builtins.py	2009-08-28 05:00:33 +0000
+++ bzrlib/builtins.py	2009-08-31 04:37:44 +0000
@@ -2332,16 +2332,14 @@
             # the underlying repository format is faster at generating
             # deltas or can provide everything we need from the indices.
             # The default algorithm - match-using-deltas - works for
-            # multiple files and directories and is faster for small
-            # amounts of history (200 revisions say). However, it's too
+            # multiple files and directories. However, it's too
             # slow for logging a single file in a repository with deep
             # history, i.e. > 10K revisions. In the spirit of "do no
             # evil when adding features", we continue to use the
             # original algorithm - per-file-graph - for the "single
             # file that isn't a directory without showing a delta" case.
-            partial_history = revision and b.repository._format.supports_chks
             match_using_deltas = (len(file_ids) != 1 or filter_by_dir
-                or delta_type or partial_history)
+                or delta_type)
 
             # Build the LogRequest and execute it
             if len(file_ids) == 0:

=== modified file 'bzrlib/log.py'
--- bzrlib/log.py	2009-06-10 03:56:49 +0000
+++ bzrlib/log.py	2009-08-31 04:37:44 +0000
@@ -70,6 +70,7 @@
     diff,
     errors,
     foreign,
+    graph,
     repository as _mod_repository,
     revision as _mod_revision,
     revisionspec,
@@ -460,21 +461,150 @@
             direction=rqst.get('direction'))
 
     def _log_revision_iterator_using_per_file_graph(self):
-        # Get the base revisions, filtering by the revision range.
-        # Note that we always generate the merge revisions because
-        # filter_revisions_touching_file_id() requires them ...
         rqst = self.rqst
-        view_revisions = _calc_view_revisions(self.branch, self.start_rev_id,
-            self.end_rev_id, rqst.get('direction'), True)
-        if not isinstance(view_revisions, list):
-            view_revisions = list(view_revisions)
-        view_revisions = _filter_revisions_touching_file_id(self.branch,
-            rqst.get('specific_fileids')[0], view_revisions,
-            include_merges=rqst.get('levels') != 1)
+        direction = rqst.get('direction')
+        file_id = rqst.get('specific_fileids')[0]
+        multi_level = rqst.get('levels') != 1
+        try:
+            file_graph, graph_tip = _per_file_graph(self.branch, file_id,
+                self.end_rev_id)
+        except errors.NoSuchId:
+            # File doesn't exist at end of range - fall back to old algorithm
+            view_revisions = None
+        else:
+            # Try iterating over the revisions given by the per-file graph.
+            # This returns None if it fails.
+            view_revisions = _calc_view_revisions_for_file(self.branch,
+                file_graph, graph_tip, self.start_rev_id, self.end_rev_id,
+                direction, multi_level)
+
+        if view_revisions is None:
+            # Get the base revisions, filtering by the revision range.
+            # Note that we always generate the merge revisions because
+            # filter_revisions_touching_file_id() requires them ...
+            view_revisions = _calc_view_revisions(self.branch,
+                self.start_rev_id, self.end_rev_id, direction, True)
+            if not isinstance(view_revisions, list):
+                view_revisions = list(view_revisions)
+            # TODO: pass in the already calculated file graph and re-use it
+            view_revisions = _filter_revisions_touching_file_id(self.branch,
+                file_id, view_revisions, include_merges=multi_level)
         return make_log_rev_iterator(self.branch, view_revisions,
             rqst.get('delta_type'), rqst.get('message_search'))
 
 
+def _per_file_graph(branch, file_id, end_rev_id):
+    """Get the per file graph.
+
+    :param end_rev_id: the last interesting revision-id or None to use
+        the basis tree. If non-None, the file must exist in that revision
+        or NoSuchId will be raised.
+    :return: graph, tip where
+        graph is a Graph with (file_id,rev_id) tuple keys and
+        tip is the graph tip
+    """
+    # Find when the file was last modified
+    if end_rev_id is None:
+        rev_tree = branch.basis_tree()
+    else:
+        rev_tree = branch.repository.revision_tree(end_rev_id)
+    last_modified = rev_tree.inventory[file_id].revision
+
+    # Return the result
+    tip = (file_id, last_modified)
+    return graph.Graph(branch.repository.texts), tip
+
+
+def _calc_view_revisions_for_file(branch, file_graph, graph_tip, start_rev_id,
+    end_rev_id, direction, include_merges):
+    """Calculate the revisions to view for a file.
+
+    :param file_graph: the per-file graph
+    :param graph_tip: the tip of the per-file graph
+    :param include_merges: if True, include all revisions, not just the top
+        level
+    :return: A list of (revision_id, dotted_revno, merge_depth) tuples OR
+        None if the algorithm fails (and another one should be used).
+    """
+    br_revno, br_rev_id = branch.last_revision_info()
+    if br_revno == 0:
+        return []
+
+    # Find the revisions where the file was changed and merged
+    file_rev_ids = []
+    file_merges = []
+    for (_, rev_id), parents in file_graph.iter_ancestry([graph_tip]):
+        file_rev_ids.append(rev_id)
+        if len(parents) > 1:
+            file_merges.append(rev_id)
+
+    # Handle the simple cases
+    if len(file_rev_ids) == 1:
+        return _generate_one_revision(branch, file_rev_ids[0], br_rev_id,
+            br_revno)
+    elif len(file_rev_ids) == 0:
+        # Should this ever happen?
+        return []
+    elif file_merges and include_merges:
+        # Fall back to the old algorithm for now
+        return None
+
+    # Find all the revisions we can using a linear search
+    result = []
+    missing = set(file_rev_ids)
+    merges_to_search = 0
+    created_timestamp = None
+    try:
+        candidates = _linear_view_revisions(branch, start_rev_id, end_rev_id)
+        for index, (rev_id, revno, depth) in enumerate(candidates):
+            if rev_id in missing:
+                result.append((rev_id, revno, depth))
+                missing.remove(rev_id)
+                if len(missing) == 0:
+                    break
+            else:
+                if _has_merges(branch, rev_id):
+                    merges_to_search += 1
+                    # If this is a dense tree, this optimisation is unlikely
+                    # to result in a net win - fall back to old algorithm.
+                    if merges_to_search > 100:
+                        return None
+                # Check the timestamp to avoid going back too far on the
+                # mainline for files created in merge revisions. We don't
+                # do this every revision, just regularly, to minimise the
+                # number of revisions that we load at this point.
+                if index and index % 500 == 0:
+                    if created_timestamp is None:
+                        created_rev = branch.repository.get_revision(
+                            file_rev_ids[-1])
+                        created_timestamp = created_rev.timestamp
+                    rev = branch.repository.get_revision(rev_id)
+                    if created_timestamp > rev.timestamp:
+                        return None
+
+    except _StartNotLinearAncestor:
+        raise errors.BzrCommandError('Start revision not found in'
+            ' left-hand history of end revision.')
+
+    # If no merges were found in the revision range, then we can be
+    # certain that we've found all the revisions we care about.
+    if missing and merges_to_search:
+        # TODO: search the deltas of the merges, splicing successful
+        # matches into their rightful spots. That should work well on
+        # chk repositories for typical histories but we need to benchmark
+        # it to confirm. There's most likely a sweet spot above which
+        # the O(history) traditional way - generating the full graph of
+        # history and post-filtering - remains the best performer.
+        trace.mutter("log file fastpath failed to find %d revisions" %
+            len(missing))
+        return None
+
+    # We came, we saw, we walked away victorious ...
+    if direction == 'forward':
+        result = reversed(result)
+    return result
+
+
 def _calc_view_revisions(branch, start_rev_id, end_rev_id, direction,
     generate_merge_revisions, delayed_graph_generation=False):
     """Calculate the revisions to view.