Merge lp:~jameinel/loggerhead/known_graph into lp:loggerhead

Proposed by John A Meinel
Status: Work in progress
Proposed branch: lp:~jameinel/loggerhead/known_graph
Merge into: lp:loggerhead
Diff against target: 148 lines (+71/-35)
3 files modified
.bzrignore (+1/-0)
__init__.py (+2/-2)
loggerhead/wholehistory.py (+68/-33)
To merge this branch: bzr merge lp:~jameinel/loggerhead/known_graph
Reviewer Review Type Date Requested Status
Ian Clatworthy (community) Needs Information
Review via email: mp+22394@code.launchpad.net

Commit message

Use bzrlib.graph.KnownGraph when possible, to improve the time to build the revno and merge information.

Description of the change

Changes one of the internal 'compute' steps to use some of the faster APIs that we've put into bzrlib.

I tried to keep the change fairly self-contained; the results seem to show it is worthwhile.

Basically, instead of using 'branch.repository.get_graph()' and performing operations on that, we use 'branch.repository.revisions.get_known_graph_ancestry()' and perform the operations on that. This gives us:

1) The fast-path code for loading ancestry out of the btree indexes (it loads multiple revisions at a
   time, rather than iterating over get_parent_map()).

2) The faster KnownGraph API rather than merge_sort et al. topo_sort.merge_sort is currently
   *not* compile-optimized (ATM because of a recursive definition between topo_sort.merge_sort and
   the non-compiled Python code).
   This should also improve getting child keys, as the KnownGraph has already computed
   all the children (rather than iterating over the graph again and rebuilding the child tuples).

3) gc.disable()/gc.enable() around the bulk of the processing, as collector overhead is a known
   issue with some of the graph code. We build a lot of Python objects, but they all stay around
   and don't participate in cycles, so letting gc.collect() run just slows us down.

4) Results (times in seconds):

                orig    known_graph  gc.disable
    bzr.dev     2.357   0.900        0.700
    mysql       4.353   2.563        1.634

So the final result is that 'compute_whole_history_data()' on my machine drops from 4.3s to 1.6s on a MySQL branch. Of course, the real fix is figuring out how to do some of this stuff without loading the whole graph, but until we get there, this should be a fair improvement.
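The gc handling described in point 3 amounts to a small guard around the expensive step. A minimal sketch of that pattern (the `with_gc_paused` helper name is illustrative, not from the branch):

```python
import gc

def with_gc_paused(compute, *args):
    """Run compute(*args) with the cyclic garbage collector paused.

    The graph-building step allocates many long-lived objects that never
    participate in cycles, so collector passes there are pure overhead.
    """
    was_enabled = gc.isenabled()
    if was_enabled:
        gc.disable()
    try:
        return compute(*args)
    finally:
        # Restore the caller's collector setting even if compute() raises.
        if was_enabled:
            gc.enable()
```

Usage would look like `rev_info, rev_indices = with_gc_paused(_compute_graph, branch, last_revid)`; the branch itself inlines the same guard rather than using a helper.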

Revision history for this message
Matt Nordhoff (mnordhoff) wrote :

You're no longer using _strip_NULL_ghosts. Either it's a bug, or you can remove that function from wholehistory.py. :D

(I'll try this out, but there's no way I'm smart enough to really review it.)

Revision history for this message
Matt Nordhoff (mnordhoff) wrote :

Do we mind that, without this, Loggerhead should be compatible with bzrlib all the way back to 1.13? (Not that it's well-tested anymore...)

I'm okay with it -- we'll be able to get rid of some hacks -- but, well, I don't maintain any old RHEL systems. :P

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Nordhoff wrote:
> You're no longer using _strip_NULL_ghosts. Either it's a bug, or you can remove that function from wholehistory.py. :D
>
> (I'll try this out, but there's no way I'm smart enough to really review it.)

Try it, I'm hoping it can just be removed. (The old tsort.merge_sort
didn't handle ghosts. The new one handles them just fine.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuxmUoACgkQJdeBCYSNAAPuBACfUv/YNh6AXW6YFQ9zoTnhluN+
mNEAoM6RWp9XTXFKohiCIN3MUZ6eSIgF
=/cZb
-----END PGP SIGNATURE-----

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matt Nordhoff wrote:
> Do we mind that, without this, Loggerhead should be compatible with bzrlib all the way back to 1.13? (Not that it's well-tested anymore...)
>
> I'm okay with it -- we'll be able to get rid of some hacks -- but, well, I don't maintain any old RHEL systems. :P

If you *really* wanted, we could leave the old code in place, and add a
"if getattr(branch.repository.revisions, 'get_known_graph_ancestry',
None) is not None:"

check.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuxmakACgkQJdeBCYSNAAOZ+QCfefnOkomRUB6RYsVsCUddoiV4
YhsAn2ODj4WK8XDW7LBbGDx27LvtKwXl
=VUFd
-----END PGP SIGNATURE-----

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

This breaks support for foreign branches, so a getattr would be nice. Ideally loggerhead should use Repository.get_known_graph_ancestry() when that lands.

Revision history for this message
Matt Nordhoff (mnordhoff) wrote :

Also -- does Loggerhead's multithreading cause any sort of weirdness with gc.disable/enable?
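For context on this question: CPython's collector state is process-wide, not per-thread, so a gc.disable() in one request thread pauses collection for every thread until it is re-enabled. A small sketch demonstrating this:

```python
import gc
import threading

def observe_gc_state(results):
    # gc state is per-process, not per-thread: a disable() issued by one
    # thread is visible to every other thread.
    results.append(gc.isenabled())

results = []
gc.disable()
try:
    worker = threading.Thread(target=observe_gc_state, args=(results,))
    worker.start()
    worker.join()
finally:
    gc.enable()
# The worker observed the collector disabled, i.e. results == [False].
```

So the worst case is not corruption but a window in which no thread collects garbage; the try/finally in the branch keeps that window bounded.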

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

John,

This looks ok to me. I'm not sure why it breaks support for foreign branches though. Could you check with jelmer re that and perhaps add the getattr check as you suggested to address it?

review: Needs Information
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Clatworthy wrote:
> Review: Needs Information
> John,
>
> This looks ok to me. I'm not sure why it breaks support for foreign branches though. Could you check with jelmer re that and perhaps add the getattr check as you suggested to address it?

This will probably be superseded by my other work anyway.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvTAbsACgkQJdeBCYSNAAPhCQCg1MEMTtuJU3JmlLzsWqsjDzcr
+CcAoIASaSAR0tcqKxIPhfjYDkFPkQ3P
=eU9/
-----END PGP SIGNATURE-----

Unmerged revisions

411. By John A Meinel

More comment cleanup.

410. By John A Meinel

Clean out some of the extra cruft. Leaving in some as comments.

409. By John A Meinel

Some timing and memory perf from using StaticTuple.

408. By John A Meinel

Some timing results using mysql.

Shows a total speed up of around 2.5x => 3.3x. Ideally we'd do better,
but at least this is better than what we had.

407. By John A Meinel

Significant wins by using get_known_graph_ancestry.

406. By John A Meinel

We don't really need nanosecond resolution.

405. By John A Meinel

A few little tweaks.

Move it out into a separate function so we can profile the work done,
disabling gc while running drops it from 900ms down to around 700ms.

404. By John A Meinel

Fix a bunch of typos, etc. It at least seems to be working.

403. By John A Meinel

An initial attempt at using KnownGraph as the graph workhorse.

Preview Diff

=== modified file '.bzrignore'
--- .bzrignore	2009-04-01 14:40:05 +0000
+++ .bzrignore	2010-03-29 18:52:17 +0000
@@ -7,3 +7,4 @@
 *.log
 _trial_temp
 loggerhead-memprofile
+./tags

=== modified file '__init__.py'
--- __init__.py	2010-03-29 06:45:54 +0000
+++ __init__.py	2010-03-29 18:52:17 +0000
@@ -1,4 +1,4 @@
-# Copyright 2009 Canonical Ltd
+# Copyright 2009, 2010 Canonical Ltd
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -65,7 +65,7 @@
 # loggerhead internal code will try to 'import loggerhead', so
 # let's put it on the path if we can't find it in the existing path
 try:
-    import loggerhead
+    import loggerhead.apps.branch
 except ImportError:
     import os.path, sys
     sys.path.append(os.path.dirname(__file__))

=== modified file 'loggerhead/wholehistory.py'
--- loggerhead/wholehistory.py	2009-10-21 14:40:23 +0000
+++ loggerhead/wholehistory.py	2010-03-29 18:52:17 +0000
@@ -17,11 +17,11 @@
 #
 """Cache the whole history data needed by loggerhead about a branch."""
 
+import gc
 import logging
 import time
 
 from bzrlib.revision import is_null, NULL_REVISION
-from bzrlib.tsort import merge_sort
 
 
 def _strip_NULL_ghosts(revision_graph):
@@ -50,36 +50,71 @@
     log = logging.getLogger('loggerhead.%s' %
                             (branch.get_config().get_nickname(),))
 
-    graph = branch.repository.get_graph()
-    parent_map = dict((key, value) for key, value in
-                      graph.iter_ancestry([last_revid]) if value is not None)
-
-    _revision_graph = _strip_NULL_ghosts(parent_map)
-
-    _rev_info = []
-    _rev_indices = {}
-
     if is_null(last_revid):
-        _merge_sort = []
-    else:
-        _merge_sort = merge_sort(
-            _revision_graph, last_revid, generate_revno=True)
-
-    for info in _merge_sort:
-        seq, revid, merge_depth, revno, end_of_merge = info
-        revno_str = '.'.join(str(n) for n in revno)
-        parents = _revision_graph[revid]
-        _rev_indices[revid] = len(_rev_info)
-        _rev_info.append([(seq, revid, merge_depth, revno_str, end_of_merge), (), parents])
-
-    for revid in _revision_graph.iterkeys():
-        if _rev_info[_rev_indices[revid]][0][2] == 0:
-            continue
-        for parent in _revision_graph[revid]:
-            c = _rev_info[_rev_indices[parent]]
-            if revid not in c[1]:
-                c[1] = c[1] + (revid,)
-
-    log.info('built revision graph cache: %r secs' % (time.time() - z,))
-
-    return (_rev_info, _rev_indices)
+        return ([], {})
+
+    gc_enabled = gc.isenabled()
+    if gc_enabled:
+        gc.disable()
+    try:
+        # from bzrlib import commands
+        # Profiling shows that the bulk of the time spent here is reading the
+        # data out of the indexes, rather than time building and sorting the
+        # graph. At least we're using code paths that can be optimized if
+        # possible. Of course, ideally we wouldn't be
+        # loading-the-whole-graph...
+        # rev_info, rev_indices = commands.apply_lsprofiled(',,prof.txt',
+        #     _compute_graph, branch, last_revid)
+        rev_info, rev_indices = _compute_graph(branch, last_revid)
+    finally:
+        if gc_enabled:
+            gc.enable()
+
+    log.info('built revision graph cache: %.3fs' % (time.time() - z,))
+    return (rev_info, rev_indices)
+
+
+def _compute_graph(branch, last_revid):
+    """Do the actual work of computing the graph information."""
+    # Using get_known_graph_ancestry drops us from 2.3s on bzr.dev down to
+    # 0.9s. Wrapping this with a gc.disable call, drops us further to 0.7s.
+    # This shows even better with mysql.
+    #            orig   known_graph  gc.disable
+    # bzr.dev    2.357  0.900        0.700
+    # mysql      4.353  2.563        1.634
+    last_key = (last_revid,)
+    graph = branch.repository.revisions.get_known_graph_ancestry([last_key])
+    # What about ghosts?
+    merge_sorted = graph.merge_sort(last_key)
+
+    rev_info = []
+    rev_indices = {}
+
+    get_parent_keys = graph.get_parent_keys
+    get_child_keys = graph.get_child_keys
+    # TODO: Use StaticTuple
+    #       Using StaticTuple does show a memory reduction (85.6MB => 81.1MB
+    #       peak on a MySQL branch). There doesn't seem to be a time-difference
+    #       wrt how long it takes to build (probably because we have gc
+    #       disabled?). StaticTuple should help in 'unrelated' code, since it
+    #       reduces overall gc overhead. StaticTuple isn't trivial, as it
+    #       interacts with the marshalling code.
+    for seq, info in enumerate(merge_sorted):
+        #seq, revid, merge_depth, revno, end_of_merge = info
+        # Switch back from a tuple key to a simple string rev_id
+        rev_id = info.key[-1]
+        revno_str = '.'.join(map(str, info.revno))
+        parent_ids = tuple([p[-1] for p in get_parent_keys(info.key)])
+        rev_indices[rev_id] = len(rev_info)
+        # TODO: Try using the original merge_sorted object. Gives us a nice
+        #       Object.foo rather than entry[0][1] syntax. However would need
+        #       special handling for the caching layer
+        basic_info = (seq, rev_id, info.merge_depth, revno_str,
+                      info.end_of_merge)
+        if info.merge_depth != 0:
+            # Find the children of this revision
+            child_ids = tuple([c[-1] for c in get_child_keys(info.key)])
+        else:
+            child_ids = ()
+        rev_info.append((basic_info, child_ids, parent_ids))
+    return rev_info, rev_indices
