Merge lp:~jameinel/loggerhead/history_db into lp:loggerhead

Proposed by John A Meinel
Status: Merged
Merged at revision: 424
Proposed branch: lp:~jameinel/loggerhead/history_db
Merge into: lp:loggerhead
Diff against target: 0 lines
To merge this branch: bzr merge lp:~jameinel/loggerhead/history_db
Reviewer Review Type Date Requested Status
Michael Hudson-Doyle Approve
Review via email: mp+24637@code.launchpad.net

This proposal supersedes a proposal from 2010-05-03.

Description of the change

This makes bzr-history-db a strict dependency for Loggerhead. However, it does result in some nice benefits.

The key points are:

1) Strict dependency. However bzr-history-db is pretty small. If people prefer I can merge the code into the loggerhead codebase. (I can preserve file-ids so that bringing in any new changes should be trivially easy.)

2) The very first time a branch is viewed which has never been seen before and is unrelated to any other branches is slower. For example[1]:
    uncached disk cached memory cached peakmem path
old 12.5s 0.618s 0.126s 77MB /emacs/trunk/changes
new 28.5s 0.064s 0.047s 24MB[2] /emacs/trunk/changes

old 12.9s 0.895s 0.475s 93.6MB /emacs/trunk/files
new 28.8s 0.944s 0.359s 45MB[3] /emacs/trunk/files

3) The 'first time' is a lot better when the second branch is related to the first. In my testing of a 100rev old branch of emacs, the time to import it into history-db is about 200ms. So the 64ms to display from disk becomes only 260ms.

    uncached path
old 12.5s /emacs/yamaoka/changes
new 0.254s /emacs/yamaoka/changes

Also note that because the memory cache is gone, and the disk cache is shared, loading both trunk/changes an yamaoka/changes has:

old 128MB
new 24MB

Loggerhead seems to use an LRUCache with a max of 10 entries. So peak memory is potentially ~50MB*10=500MB, vs a flat 24MB.

4) The size of the disk cache increases. So it is even more important that we delete any temp dirs we create. Size with one branch:

old 2.9MB
new 31.0MB

However, the new cache does shared the data better. With 2 branches imported:

old 5.7MB
new 31.0MB

At 10x the size, after ~10 branches of the same project, the new form will actually save disk space. (After just 8 branches of bzr, I was at 8.7MB new vs 9.7MB old.)

5) The code uses a threading.Lock during import. But the code already did that. It should be multithreaded for read access. It is possible that we'll have higher sqlite concurrency. For something like codebrowse, we may want to switch the back-end to Postgres.

Overall, I think this can be a big win for loggerhead, but I'm expecting to need a little bit of polishing until it can land. (Given that my other 4 branches haven't landed yet, either...)

[1] Time is just to get changes, w/o rendering. This is also my 'integration' branch which has all of the perf improvements (tag cache, only load 2 pages, etc.)

[2] Peak memory for initial import is 120MB, and there seems to be about 80MB that is 'left' in memory. I need to debug that part of bzr-history-db.

[3] There seems to be a 'soft' memory leak on the 'files' page. If I reload, memory goes up by about 30MB, until I reload a few times and it drops dramatically. My guess is that there is a cycle, preventing garbage collection. And my guess is the 'History._inventory_cache' is the culprit. But I haven't proven that. However, the leak exists with and without my changes, so that isn't specifically relevant. So numbers are given after manually running "gc.collect()" using breakin.

To post a comment you must log in.
Revision history for this message
John A Meinel (jameinel) wrote :

Sorry about the noise, I must have somehow hit submit twice, and it kicked it back at me, so I thought it was an old merge proposal.

Revision history for this message
Matt Nordhoff (mnordhoff) wrote :

Small tidbit, but you can remove revision_graph_locks and revision_graph_check_lock around line 265 of history.py, since they're no longer used.

There might be other code that can be torn out, too.

Revision history for this message
Matt Nordhoff (mnordhoff) wrote :
Download full text (3.6 KiB)

Tried this out (see lp:~mnordhoff/loggerhead/cheezum). Got this:

> 2010-05-04 02:38:18,853 DEBUG paste.httpserver.ThreadPool: Added task (0 tasks queued)
> 2010-05-04 02:38:20,627 ERROR loggerhead./bzr-svn/trivial/check_versions_nameerror: exceptions.TypeError: list objects are unhashable
> Traceback (most recent call last):
> File "/home/mnordhoff/loggerhead/loggerhead/apps/error.py", line 31, in __call__
> return self.application(environ, start_response)
> File "/var/lib/python-support/python2.5/paste/httpexceptions.py", line 632, in __call__
> return self.application(environ, start_response)
> File "/var/lib/python-support/python2.5/paste/deploy/config.py", line 276, in __call__
> return self.app(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/main.py", line 165, in new_app
> return application(environ, start_response)
> File "/usr/local/lib/python2.5/site-packages/Dozer-0.1-py2.5.egg/dozer/__init__.py", line 76, in __call__
> return self.app(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/main.py", line 142, in new_app
> return application(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/transport.py", line 169, in __call__
> transport, self)(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/transport.py", line 126, in __call__
> return self.app_for_non_branch(environ)(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/transport.py", line 126, in __call__
> return self.app_for_non_branch(environ)(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/transport.py", line 126, in __call__
> return self.app_for_non_branch(environ)(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/transport.py", line 132, in __call__
> return self.app_for_branch(b)(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/branch.py", line 190, in app
> val = do_stuff()
> File "/home/mnordhoff/loggerhead/loggerhead/apps/branch.py", line 182, in do_stuff
> return c(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/controllers/__init__.py", line 93, in __call__
> vals.update(self.get_values(path, kwargs, headers))
> File "/home/mnordhoff/loggerhead/loggerhead/controllers/revision_ui.py", line 45, in get_values
> revid = self.get_revid()
> File "/home/mnordhoff/loggerhead/loggerhead/controllers/__init__.py", line 115, in get_revid
> return h.fix_revid(self.args[0])
> File "/home/mnordhoff/loggerhead/loggerhead/history.py", line 453, in fix_revid
> val = self.get_revid_for_revno([revid])[revid]
> File "/home/mnordhoff/loggerhead/loggerhead/history.py", line 364, in get_revid_for_revno
> revid = self._revno_revid_cache.get(revno_str)
> File "/home/mnordhoff/loggerhead/loggerhead/history.py", line 241, in get
> cached = self._cache.get((self._branch_tip, key))
> File "/usr/local/co/bzr/bzr/bzr.dev/bzrlib/lru_cache.py", line 184, in get
> node = self._cache.get(key, None)
> TypeError: list objects are unhashable
> 2010-05-04 02:38:20,6...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>> File "/home/mnordhoff/loggerhead/loggerhead/history.py", line 364, in get_revid_for_revno
>> revid = self._revno_revid_cache.get(revno_str)
>> File "/home/mnordhoff/loggerhead/loggerhead/history.py", line 241, in get
>> cached = self._cache.get((self._branch_tip, key))
>> File "/usr/local/co/bzr/bzr/bzr.dev/bzrlib/lru_cache.py", line 184, in get
>> node = self._cache.get(key, None)
>> TypeError: list objects are unhashable

Fix pushed in rev 418. At one point that api was a multi- api, so I was
passing in a list. I changed it to a single-request api, and missed one
of the calling sites. (I miss the bzrlib test suite :).

Thanks for catching it.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvfokUACgkQJdeBCYSNAAOhMACcDOQ5NXxB2bJyYfjlNHJOySw8
N7IAoIlkheyoizF/yfYAsr9q69swjzrZ
=6Sfr
-----END PGP SIGNATURE-----

lp:~jameinel/loggerhead/history_db updated
418. By John A Meinel

Use get_revid_for_revno properly.

At one point it was a multi api, but I switched it back to a single. Use it as such.

Revision history for this message
Matt Nordhoff (mnordhoff) wrote :

Thanks for fixing that. Now I've got a new one. :-D

> 2010-05-04 14:53:23,283 DEBUG paste.httpserver.ThreadPool: Added task (0 tasks queued)
> 2010-05-04 14:53:24,310 ERROR loggerhead./bzr/st-size-exceptions: exceptions.AttributeError: 'History' object has no attribute '_rev_info'
> Traceback (most recent call last):
> File "/home/mnordhoff/loggerhead/loggerhead/apps/error.py", line 31, in __call__
> return self.application(environ, start_response)
> File "/var/lib/python-support/python2.5/paste/httpexceptions.py", line 632, in __call__
> return self.application(environ, start_response)
> File "/var/lib/python-support/python2.5/paste/deploy/config.py", line 276, in __call__
> return self.app(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/main.py", line 165, in new_app
> return application(environ, start_response)
> File "/usr/local/lib/python2.5/site-packages/Dozer-0.1-py2.5.egg/dozer/__init__.py", line 76, in __call__
> return self.app(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/main.py", line 142, in new_app
> return application(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/transport.py", line 170, in __call__
> transport, self)(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/transport.py", line 127, in __call__
> return self.app_for_non_branch(environ)(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/transport.py", line 127, in __call__
> return self.app_for_non_branch(environ)(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/transport.py", line 133, in __call__
> return self.app_for_branch(b)(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/apps/branch.py", line 190, in app
> val = do_stuff()
> File "/home/mnordhoff/loggerhead/loggerhead/apps/branch.py", line 182, in do_stuff
> return c(environ, start_response)
> File "/home/mnordhoff/loggerhead/loggerhead/controllers/__init__.py", line 93, in __call__
> vals.update(self.get_values(path, kwargs, headers))
> File "/home/mnordhoff/loggerhead/loggerhead/controllers/revision_ui.py", line 102, in get_values
> ri = self._history._rev_info
> AttributeError: 'History' object has no attribute '_rev_info'
> 2010-05-04 14:53:24,311 INFO loggerhead./bzr/st-size-exceptions: Getting information for ErrorUI: 0.000 secs
> 2010-05-04 14:53:24,311 INFO wsgi: 127.0.0.1 - - [04/May/2010:14:53:23 +0000] "GET /loggerhead/bzr/st-size-exceptions/revision/4119.3.11/bzrlib/repofmt/groupcompress_repo.py HTTP/1.0" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
> 2010-05-04 14:53:24,317 INFO loggerhead./bzr/st-size-exceptions: Rendering ErrorUI: 0.006 secs, 3722 bytes
--
Matt Nordhoff

Revision history for this message
John A Meinel (jameinel) wrote :

> Thanks for fixing that. Now I've got a new one. :-D

...

> > ri = self._history._rev_info
> > AttributeError: 'History' object has no attribute '_rev_info'

Fixed in lp:~jameinel/loggerhead/history_db 422, but requires a newer lp:bzr-history-db (>= 124).

I could probably have done it so that it didn't. But for now, I prefer to hide db queries behind Querier, and keep the plugin and loggerhead in sync.

lp:~jameinel/loggerhead/history_db updated
419. By John A Meinel

Fix some bogus merge stuff for show_merge_points, and fix ghost handling.

420. By John A Meinel

Minor improvement.

421. By John A Meinel

revid not revids

422. By John A Meinel

Add a helper that computes where a given revision was merged.

This fixes a path where revision_ui was directly accessing the history private
variables.
It does require bzr-history-db/trunk >= 124.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Without looking into this in great depth, I'm pretty excited about being able to share the cache usefully between branches. I'm a bit less excited about the initial calculation time -- 44 seconds for Launchpad -- but I guess that's not too terrible.

Maybe, for Launchpad, we should look at keeping the cache up to date with a job that runs outside of loggerhead? It could be triggered from the branchChanged branch method.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Oh, and for launchpad would it make sense to have a database per project or something? A big database for all the branches on Launchpad would be optimal in some sense, but sqlite is pretty rubbish at concurrency :(

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

So after some conversation, I think the idea is a winner.

The code is not really ready yet, there are docstrings that are not accurate and there are mounds of dead code, but I guess this isn't really news. I wonder if we should just include the code from the history-db code plugin rather than depending on it.

I've cleaned up the code a bit at bzr+ssh://bazaar.launchpad.net/~mwhudson/loggerhead/history_db

review: Needs Fixing
lp:~jameinel/loggerhead/history_db updated
423. By John A Meinel

Bring in mwhudson's code cleanup passes.

424. By John A Meinel

Hackishly bring in history_db code. Still needs a lot of work to sort
out what needs to be done. And what needs to be put where.

425. By John A Meinel

Bring trunk back into history-db.

426. By John A Meinel

Start cleaning up the history-db code.

This includes removing parts that aren't needed for loggerhead specifically.
Stuff like Branch hooks, etc. Vs just having the Querier available.

427. By John A Meinel

Use ugly hacks to get BranchWSGIApp getting a cachepath in more cases.

428. By John A Meinel

Revert the ugly hacks.

Instead, we just use ':memory:' if cache_path is None.

429. By John A Meinel

Generally handle the NULL_REVISION branch (from history-db)

430. By John A Meinel

Merge in the code cleanups and better handling of NULL.

431. By John A Meinel

Handle KeyError when a revid isn't actually in the history.

Don't pdb trap if fix_revid() doesn't handle it. Raise an exception which
will generate a 404.

432. By John A Meinel

Found some bugs in the get_short_revision_history_by_fileid code.

Use direct access to the file graph, rather than the old log code.

433. By John A Meinel

implement a fallback if known graph is not available.

434. By John A Meinel

Bring in the new history-db work so we don't access SQL directly.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

I think this is reaching the point where it's a clear win.

Looking through the diff once more, the only thing that struck me was that there's a try:except ImportError: around the import of the FileChangeCache in history.py, which we can get rid of now we unconditionally depend on sqlite.

Otherwise, I think we should land this and maybe get Matt to run it for a few days, then we can start to think about deployment :-)

review: Approve
lp:~jameinel/loggerhead/history_db updated
435. By John A Meinel

Unconditionally depend on sqlite.

Preview Diff

Empty

Subscribers

People subscribed via source and target branches