Merge lp:~jameinel/loggerhead/known_graph into lp:loggerhead

Proposed by John A Meinel
Status: Work in progress
Proposed branch: lp:~jameinel/loggerhead/known_graph
Merge into: lp:loggerhead
Diff against target: 148 lines (+71/-35)
3 files modified
.bzrignore (+1/-0)
__init__.py (+2/-2)
loggerhead/wholehistory.py (+68/-33)
To merge this branch: bzr merge lp:~jameinel/loggerhead/known_graph
Reviewer Review Type Date Requested Status
Ian Clatworthy (community) Needs Information
Review via email: mp+22394@code.launchpad.net

Commit message

Use bzrlib.graph.KnownGraph when possible, to improve the time to build the revno and merge information.

Description of the change

Changes one of the internal 'compute' steps to use some of the faster APIs that we've put into bzrlib.

I tried to keep the change fairly self-contained; the results seem to show it is worthwhile.

Basically, instead of using 'branch.repository.get_graph()' and performing operations on that, we use 'branch.repository.revisions.get_known_graph_ancestry()' and perform the operations on that. This gives us:

1) The fast-path code for loading ancestry out of the btree indexes (it loads multiple revisions at a
   time, rather than iterating over get_parent_map()).

2) The faster KnownGraph API rather than merge_sort et al. topo_sort.merge_sort is currently
   *not* compile-optimized (ATM because of a recursive definition between topo_sort.merge_sort and
   the non-compiled Python code).
   This should also improve getting child keys, as the KnownGraph has already computed
   all the children (rather than iterating over the graph again and rebuilding the child tuples).

3) gc.disable()/gc.enable() around the bulk of the processing, as collector overhead is a known
   issue with some of the graph code. We build a lot of Python objects, but they all stay around
   and don't participate in cycles, so letting gc.collect() run just slows us down.

4) Results (times in seconds):

                orig    known_graph  gc.disable
    bzr.dev     2.357   0.900        0.700
    mysql       4.353   2.563        1.634

So the final result is that 'compute_whole_history_data()' on my machine drops from 4.3s to 1.6s on a MySQL branch. Of course, the real fix is figuring out how to do some of this stuff without loading the whole graph, but until we get there, this should be a fair improvement.
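The gc handling described in point 3 amounts to a small guard around the expensive step. A minimal sketch of that pattern (the `with_gc_paused` helper name is illustrative, not from the branch):

```python
import gc

def with_gc_paused(compute, *args):
    """Run compute(*args) with the cyclic garbage collector paused.

    The graph-building step allocates many long-lived objects that never
    participate in cycles, so collector passes there are pure overhead.
    """
    was_enabled = gc.isenabled()
    if was_enabled:
        gc.disable()
    try:
        return compute(*args)
    finally:
        # Restore the caller's collector setting even if compute() raises.
        if was_enabled:
            gc.enable()
```

Usage would look like `rev_info, rev_indices = with_gc_paused(_compute_graph, branch, last_revid)`; the branch itself inlines the same guard rather than using a helper.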

Revision history for this message
Matt Nordhoff (mnordhoff) wrote :

You're no longer using _strip_NULL_ghosts. Either it's a bug, or you can remove that function from wholehistory.py. :D

(I'll try this out, but there's no way I'm smart enough to really review it.)

Revision history for this message
Matt Nordhoff (mnordhoff) wrote :

Do we mind that, without this, Loggerhead should be compatible with bzrlib all the way back to 1.13? (Not that it's well-tested anymore...)

I'm okay with it -- we'll be able to get rid of some hacks -- but, well, I don't maintain any old RHEL systems. :P

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Nordhoff wrote:
> You're no longer using _strip_NULL_ghosts. Either it's a bug, or you can remove that function from wholehistory.py. :D
>
> (I'll try this out, but there's no way I'm smart enough to really review it.)

Try it, I'm hoping it can just be removed. (The old tsort.merge_sort
didn't handle ghosts. The new one handles them just fine.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuxmUoACgkQJdeBCYSNAAPuBACfUv/YNh6AXW6YFQ9zoTnhluN+
mNEAoM6RWp9XTXFKohiCIN3MUZ6eSIgF
=/cZb
-----END PGP SIGNATURE-----

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Nordhoff wrote:
> Do we mind that, without this, Loggerhead should be compatible with bzrlib all the way back to 1.13? (Not that it's well-tested anymore...)
>
> I'm okay with it -- we'll be able to get rid of some hacks -- but, well, I don't maintain any old RHEL systems. :P

If you *really* wanted, we could leave the old code in place, and add a
"if getattr(branch.repository.revisions, 'get_known_graph_ancestry',
None) is not None:"

check.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuxmakACgkQJdeBCYSNAAOZ+QCfefnOkomRUB6RYsVsCUddoiV4
YhsAn2ODj4WK8XDW7LBbGDx27LvtKwXl
=VUFd
-----END PGP SIGNATURE-----

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

This breaks support for foreign branches, so a getattr would be nice. Ideally loggerhead should use Repository.get_known_graph_ancestry() when that lands.

Revision history for this message
Matt Nordhoff (mnordhoff) wrote :

Also -- does Loggerhead's multithreading cause any sort of weirdness with gc.disable/enable?
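For context on this question: CPython's collector state is process-wide, not per-thread, so a gc.disable() in one request thread pauses collection for every thread until it is re-enabled. A small sketch demonstrating this:

```python
import gc
import threading

def observe_gc_state(results):
    # gc state is per-process, not per-thread: a disable() issued by one
    # thread is visible to every other thread.
    results.append(gc.isenabled())

results = []
gc.disable()
try:
    worker = threading.Thread(target=observe_gc_state, args=(results,))
    worker.start()
    worker.join()
finally:
    gc.enable()
# The worker observed the collector disabled, i.e. results == [False].
```

So the worst case is not corruption but a window in which no thread collects garbage; the try/finally in the branch keeps that window bounded.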

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

John,

This looks ok to me. I'm not sure why it breaks support for foreign branches though. Could you check with jelmer re that and perhaps add the getattr check as you suggested to address it?

review: Needs Information
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> Review: Needs Information
> John,
>
> This looks ok to me. I'm not sure why it breaks support for foreign branches though. Could you check with jelmer re that and perhaps add the getattr check as you suggested to address it?

This will probably be superseded by my other work anyway.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvTAbsACgkQJdeBCYSNAAPhCQCg1MEMTtuJU3JmlLzsWqsjDzcr
+CcAoIASaSAR0tcqKxIPhfjYDkFPkQ3P
=eU9/
-----END PGP SIGNATURE-----

Unmerged revisions

411. By John A Meinel

More comment cleanup.

410. By John A Meinel

Clean out some of the extra cruft. Leaving in some as comments.

409. By John A Meinel

Some timing and memory perf from using StaticTuple.

408. By John A Meinel

Some timing results using mysql.

Shows a total speed up of around 2.5x => 3.3x. Ideally we'd do better,
but at least this is better than what we had.

407. By John A Meinel

Significant wins by using get_known_graph_ancestry.

406. By John A Meinel

We don't really need nanosecond resolution.

405. By John A Meinel

A few little tweaks.

Move it out into a separate function so we can profile the work done,
disabling gc while running drops it from 900ms down to around 700ms.

404. By John A Meinel

Fix a bunch of typos, etc. It at least seems to be working.

403. By John A Meinel

An initial attempt at using KnownGraph as the graph workhorse.

Preview Diff

=== modified file '.bzrignore'
--- .bzrignore	2009-04-01 14:40:05 +0000
+++ .bzrignore	2010-03-29 18:52:17 +0000
@@ -7,3 +7,4 @@
 *.log
 _trial_temp
 loggerhead-memprofile
+./tags

=== modified file '__init__.py'
--- __init__.py	2010-03-29 06:45:54 +0000
+++ __init__.py	2010-03-29 18:52:17 +0000
@@ -1,4 +1,4 @@
-# Copyright 2009 Canonical Ltd
+# Copyright 2009, 2010 Canonical Ltd
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -65,7 +65,7 @@
 # loggerhead internal code will try to 'import loggerhead', so
 # let's put it on the path if we can't find it in the existing path
 try:
-    import loggerhead
+    import loggerhead.apps.branch
 except ImportError:
     import os.path, sys
     sys.path.append(os.path.dirname(__file__))

=== modified file 'loggerhead/wholehistory.py'
--- loggerhead/wholehistory.py	2009-10-21 14:40:23 +0000
+++ loggerhead/wholehistory.py	2010-03-29 18:52:17 +0000
@@ -17,11 +17,11 @@
 #
 """Cache the whole history data needed by loggerhead about a branch."""
 
+import gc
 import logging
 import time
 
 from bzrlib.revision import is_null, NULL_REVISION
-from bzrlib.tsort import merge_sort
 
 
 def _strip_NULL_ghosts(revision_graph):
@@ -50,36 +50,71 @@
     log = logging.getLogger('loggerhead.%s' %
                             (branch.get_config().get_nickname(),))
 
-    graph = branch.repository.get_graph()
-    parent_map = dict((key, value) for key, value in
-                      graph.iter_ancestry([last_revid]) if value is not None)
-
-    _revision_graph = _strip_NULL_ghosts(parent_map)
-
-    _rev_info = []
-    _rev_indices = {}
-
     if is_null(last_revid):
-        _merge_sort = []
-    else:
-        _merge_sort = merge_sort(
-            _revision_graph, last_revid, generate_revno=True)
-
-    for info in _merge_sort:
-        seq, revid, merge_depth, revno, end_of_merge = info
-        revno_str = '.'.join(str(n) for n in revno)
-        parents = _revision_graph[revid]
-        _rev_indices[revid] = len(_rev_info)
-        _rev_info.append([(seq, revid, merge_depth, revno_str, end_of_merge), (), parents])
-
-    for revid in _revision_graph.iterkeys():
-        if _rev_info[_rev_indices[revid]][0][2] == 0:
-            continue
-        for parent in _revision_graph[revid]:
-            c = _rev_info[_rev_indices[parent]]
-            if revid not in c[1]:
-                c[1] = c[1] + (revid,)
-
-    log.info('built revision graph cache: %r secs' % (time.time() - z,))
-
-    return (_rev_info, _rev_indices)
+        return ([], {})
+
+    gc_enabled = gc.isenabled()
+    if gc_enabled:
+        gc.disable()
+    try:
+        # from bzrlib import commands
+        # Profiling shows that the bulk of the time spent here is reading the
+        # data out of the indexes, rather than time building and sorting the
+        # graph. At least we're using code paths that can be optimized if
+        # possible. Of course, ideally we wouldn't be
+        # loading-the-whole-graph...
+        # rev_info, rev_indices = commands.apply_lsprofiled(',,prof.txt',
+        #     _compute_graph, branch, last_revid)
+        rev_info, rev_indices = _compute_graph(branch, last_revid)
+    finally:
+        if gc_enabled:
+            gc.enable()
+
+    log.info('built revision graph cache: %.3fs' % (time.time() - z,))
+    return (rev_info, rev_indices)
+
+
+def _compute_graph(branch, last_revid):
+    """Do the actual work of computing the graph information."""
+    # Using get_known_graph_ancestry drops us from 2.3s on bzr.dev down to
+    # 0.9s. Wrapping this with a gc.disable call, drops us further to 0.7s.
+    # This shows even better with mysql.
+    #            orig   known_graph  gc.disable
+    # bzr.dev    2.357  0.900        0.700
+    # mysql      4.353  2.563        1.634
+    last_key = (last_revid,)
+    graph = branch.repository.revisions.get_known_graph_ancestry([last_key])
+    # What about ghosts?
+    merge_sorted = graph.merge_sort(last_key)
+
+    rev_info = []
+    rev_indices = {}
+
+    get_parent_keys = graph.get_parent_keys
+    get_child_keys = graph.get_child_keys
+    # TODO: Use StaticTuple
+    #       Using StaticTuple does show a memory reduction (85.6MB => 81.1MB
+    #       peak on a MySQL branch). There doesn't seem to be a time-difference
+    #       wrt how long it takes to build (probably because we have gc
+    #       disabled?). StaticTuple should help in 'unrelated' code, since it
+    #       reduces overall gc overhead. StaticTuple isn't trivial, as it
+    #       interacts with the marshalling code.
+    for seq, info in enumerate(merge_sorted):
+        #seq, revid, merge_depth, revno, end_of_merge = info
+        # Switch back from a tuple key to a simple string rev_id
+        rev_id = info.key[-1]
+        revno_str = '.'.join(map(str, info.revno))
+        parent_ids = tuple([p[-1] for p in get_parent_keys(info.key)])
+        rev_indices[rev_id] = len(rev_info)
+        # TODO: Try using the original merge_sorted object. Gives us a nice
+        #       Object.foo rather than entry[0][1] syntax. However would need
+        #       special handling for the caching layer
+        basic_info = (seq, rev_id, info.merge_depth, revno_str,
+                      info.end_of_merge)
+        if info.merge_depth != 0:
+            # Find the children of this revision
+            child_ids = tuple([c[-1] for c in get_child_keys(info.key)])
+        else:
+            child_ids = ()
+        rev_info.append((basic_info, child_ids, parent_ids))
+    return rev_info, rev_indices
