loggerhead (codebrowse) hangs occasionally on launchpad

Bug #513044 reported by Max Kanat-Alexander
This bug affects 2 people
Affects: loggerhead
Status: Fix Released
Importance: Critical
Assigned to: Max Kanat-Alexander

Bug Description

This is a spin-off of Bug 118625. We fixed one type of loggerhead hang there, but on January 19, mwhudson provided me with a log that showed a real, hard hang where the system printed nothing to the log for eight minutes.

The next time it hangs, we will need an actual GDB backtrace (preferably one that has already been run through the magic backtrace-to-python-functions translator) to see what the threads are doing. That should lead to a resolution, since this is indeed a real hang now and not just a slowdown.


Changed in loggerhead:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

There's some debugging output from a hang at http://people.canonical.com/~mwh/debug-2010-02-16.tgz.

To me it just looks overloaded, but it also seems to be suffering from bug 420738. Maybe we need to synchronize access to the lru_cache used for the memory cache.
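
For illustration, here is a minimal sketch of the kind of synchronization that suggestion implies, assuming a dict-like cache object shared between Paste's worker threads. The class name and the wrapped cache's interface are assumptions, not the actual loggerhead fix:

    import threading

    class SynchronizedLRUCache(object):
        """Serialize access to a non-thread-safe LRU cache shared by worker threads."""

        def __init__(self, cache):
            self._cache = cache            # the underlying (non-thread-safe) LRU cache
            self._lock = threading.Lock()

        def get(self, key, default=None):
            self._lock.acquire()
            try:
                return self._cache.get(key, default)
            finally:
                self._lock.release()

        def add(self, key, value):
            self._lock.acquire()
            try:
                self._cache[key] = value
            finally:
                self._lock.release()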

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Yeah, it's not just overloaded:

INF [20100216-18:33:10.909] [1176082768] loggerhead.~sos-oren-alon/+junk/ECUproject: Rendering InventoryUI: 0.017993927001953125 secs, 7534 bytes
DEB [20100216-18:38:49.489] [47034350767344] paste.httpserver.ThreadPool: Added task (0 tasks queued)

There are five minutes between those two log messages. Was that stack trace taken in between 18:33:10 and 18:38:49? If so, it's odd that every single active thread is on a flush to a socket and *none* of the active threads are waiting or doing *anything* else.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Just out of curiosity, is there anything in dmesg or /var/log/messages from around the time of the hangs? Are we hitting an open-files limit, or some OS-level error that loggerhead isn't catching that prevents us from sending output?

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

I don't think the stack trace was taken in that period, but I'm not sure. I'll make sure we know when this next happens :-)

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Okay. So that we have the info on the bug, here's what I need the next time there's a hang:

* A normal GDB backtrace.
* Ideally, also a Python backtrace, parsed out of the GDB backtrace. (This means causing GDB to dump core and having the script parse it, or modifying the script to include the GDB frames for non-python frames.)
* The exact time at which these traces were taken. (Millisecond precision would help.)
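
As a supplement (not a replacement for the GDB backtrace above), here is a sketch of an in-process stack dump that could be wired up to a signal handler; the choice of SIGUSR1 and the function name are just illustrative, and this is not something loggerhead currently does:

    import signal
    import sys
    import time
    import traceback

    def dump_all_thread_stacks(signum, frame):
        # Write a Python-level traceback for every live thread to stderr.
        timestamp = time.strftime('%Y%m%d-%H:%M:%S')
        for thread_id, thread_frame in sys._current_frames().items():
            sys.stderr.write('[%s] Thread %s:\n' % (timestamp, thread_id))
            traceback.print_stack(thread_frame, file=sys.stderr)

    # After this, an admin can run `kill -USR1 <pid>` during a hang.
    signal.signal(signal.SIGUSR1, dump_all_thread_stacks)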

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Here's a sort of combo backtrace for a recent hang. It looks like we're deadlocking on the GIL, which is a bit surprising. Next stop the logfile, I guess.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

The log is entirely unremarkable:

INF [20100322-15:40:55.696] [1092381008] loggerhead: Starting to process http://bazaar.launchpad.net/%7Ecairo-dock-team/cairo-dock-plug-ins/plug-ins/atom
INF [20100322-15:40:55.747] [1092381008] lp-loggerhead: Using branch: ~cairo-dock-team/cairo-dock-plug-ins/plug-ins
INF [20100322-15:40:55.747] [1092381008] lp-loggerhead: branch_url: http://bazaar.launchpad.net/00/03/55/0a
INF [20100322-15:40:55.853] [1092381008] loggerhead.~cairo-dock-team/cairo-dock-plug-ins/plug-ins: Getting information for AtomUI: 0.0906219482421875 secs
ERR [20100322-15:40:55.853] [1092381008] root: ??.??.??.?? - - [22/Mar/2010:15:40:55 +0100] "GET /%7Ecairo-dock-team/cairo-dock-plug-ins/plug-ins/atom HTTP/1.1" 200 - "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=5394196865849145804)"
INF [20100322-15:40:55.873] [1092381008] loggerhead.~cairo-dock-team/cairo-dock-plug-ins/plug-ins: Rendering AtomUI: 0.01984095573425293 secs, 10310 bytes
INF [20100322-15:41:07.886] [47470987242736] loggerhead: Starting up...
INF [20100322-15:41:07.886] [47470987242736] loggerhead: Loading the bzr plugins...
DEB [20100322-15:41:10.364] [1080969552] paste.httpserver.ThreadPool: Started new worker 1080969552: Initial worker pool

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Actually, I think the backtrace doesn't show a deadlock at all--it shows a normally functioning Paste server that isn't receiving any connections and has no work to do.

Is there some proxy in front of loggerhead, like Apache, or something in the loggerhead-Launchpad connection that does proxying but doesn't itself log? It looks like something in front of loggerhead is crashing or refusing connections when this happens.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Here are a couple more traces.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Can I see the logs for the last two traces? (It would be helpful since the last traces have times, and the first trace doesn't.)

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Also, that log in comment #7 doesn't show a hang at all. There's activity up to *seconds* before the service was restarted, but the backtrace looks like it was taken from a totally idle server. I don't even see where there would have been time to take a core dump.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :
Revision history for this message
Steve McInerney (spm) wrote :

Max, yeah, 'hang' doesn't always mean the process has stopped entirely.
What it means, at the crux, is that the service is no longer responding to automated checks within the allotted timeframe, and thus by implication is down for end users as well. Ergo it has "hung" and needs restarting.

You get the same concept on servers that go into swap death. They are still working, just not in a reasonable timeframe, so out comes the big hammer to fix the problem. :-/

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

> Max, yeah 'hang' doesn't always mean the process has stopped entirely.

Hey spm. I know that, but *this particular bug* is about a hard hang that causes loggerhead to actually stop logging while it is happening.

There may actually be two other situations that cause loggerhead to stop responding; those would be separate bugs from this one:

1) Possibly a threading issue with loggerhead's use of the lru_cache (which could also possibly be this bug, although I have no idea yet).

2) Loggerhead getting overloaded because it can't handle the load of codebrowse.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Okay, in seg1-clean (the first crash) there is indeed a hang, but by the time the core was taken, most of the hung threads had been killed by the thread-killer:

16:07:26.423: 1101130064: loggerhead: Starting to process http://bazaar.launchpad.net/%7Evcs-imports/busybox/main/changes
16:12:55.338: Thread 1101130064 hung (working on task for 327 seconds)

There are several other hung threads that are in _parse_record_unchecked in bzrlib.knit. Each thread is at a different line in the same function, though:

File "/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/knit.py", line 1915, in _parse_record_unchecked
File "/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/knit.py", line 1905, in _parse_record_unchecked
File "/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/knit.py", line 1898, in _parse_record_unchecked

The other threads, when killed, are in:

  File "/usr/lib/python2.5/site-packages/pygments/lexer.py", line 163, in streamer
     yield t, v

  File "/usr/lib/python2.5/site-packages/pygments/formatters/html.py", line 565, in _format_lines
     if line and parts[-1]:

  File "/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/pack.py", line 200, in readline

  File "/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/btree_index.py", line 612, in __init__

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

By the way, this is technically a different hang from the one described in this bug--the one described in this bug involves no logging, and this one has valid logs at the time it happens. But it's always possible they're different versions of the same hang.

In _parse_record_unchecked:

line 1898: record_contents = df.readlines()
line 1905: if len(record_contents) != int(rec[2]):
line 1915: df.close()

df is a tuned_gzip.GzipFile.

Other bzr files:

"/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/pack.py", line 200:
  Literally "def readline(self):".
This is called from pack.py, line 239, in _read_line, which is:
  line = self._source.readline()

File "/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/btree_index.py", line 612, in __init__:
        key_list = _btree_serializer._parse_leaf_lines(bytes,
            key_length, ref_list_length)
(After this one is killed, things start to work properly again.)

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

That btree_index one is in the core from this hang, also. The top frames are:

Thread 7
#0 0x000000000041b16e in _PyObject_LengthHint (o=0x7335960) from ../Objects/abstract.c
#1 0x0000000000438e58 in listextend (self=0x6321c68, b=0x7335960) from ../Objects/listobject.c
#2 0x000000000041b7f0 in PySequence_List (v=0x7335960) from ../Objects/abstract.c
#3 0x000000000047e1d0 in builtin_sorted (self=<value optimized out>, args=0x4bed990, kwds=0x0) from ../Python/bltinmodule.c
/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/btree_index.py (1097): _walk_through_internal_nodes
/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/btree_index.py (1135): iter_entries

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Actually, here's a bit more context so that we can see where this comes from:

Thread 7
#0 0x000000000041b16e in _PyObject_LengthHint (o=0x7335960) from ../Objects/abstract.c
#1 0x0000000000438e58 in listextend (self=0x6321c68, b=0x7335960) from ../Objects/listobject.c
#2 0x000000000041b7f0 in PySequence_List (v=0x7335960) from ../Objects/abstract.c
#3 0x000000000047e1d0 in builtin_sorted (self=<value optimized out>, args=0x4bed990, kwds=0x0) from ../Python/bltinmodule.c
/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/btree_index.py (1097): _walk_through_internal_nodes
/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/btree_index.py (1135): iter_entries
/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/index.py (1315): iter_entries
/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/groupcompress.py (1938): _get_entries
/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/groupcompress.py (1965): get_parent_map
/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/groupcompress.py (1325): _get_parent_map_with_sources
/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/groupcompress.py (1313): get_parent_map
/home/pqm/for_rollouts/production/eggs/bzr-2.1.0-py2.5-linux-x86_64.egg/bzrlib/repository.py (2663): get_parent_map

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Well, I can't say whether or not the thread from the gdb trace is also the thread from the SystemExit exception--the core says it was taken at 16:13, and the SystemExit happens at 16:13:02, so it's somewhat possible, but I really have no idea. The two threads are in different places, and the gdb thread is in an uninteresting place while the SystemExit'ing thread is in a very interesting place, so it seems likely they're not actually the same thread.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Here's a spreadsheet describing what's going on in the seg1-clean.txt log (which represents the crash from 2010-04-01). There's a lot of stuff that takes a long time to complete, and a lot of tasks build up in the queue. It could be a valid deadlock, or it could just be an overload--more investigation is required.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Okay, so in analyzing the above spreadsheet, I can see that there are sort of two hangs, and each starts with a call into Pygments and ends when the thread running Pygments is killed. They were both loading the same URL. Loading that URL consistently causes codebrowse to hang for several minutes.

Locally, I can reproduce pygments being very slow on that file.

I can also reproduce pygments being very slow on a large (1.9MB) Perl file.

So, I suspect that pygments is just very slow on large files and that we should skip it over a certain threshold.
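
For reference, here is roughly how that slowness can be measured in-process (the timings in the next comment came from the pygmentize command line; the file path below is just a placeholder):

    import time
    from pygments import highlight
    from pygments.formatters import HtmlFormatter
    from pygments.lexers import guess_lexer_for_filename

    def time_highlight(path):
        # Return (size in bytes, seconds spent highlighting) for one file.
        text = open(path).read()
        lexer = guess_lexer_for_filename(path, text)
        start = time.time()
        highlight(text, lexer, HtmlFormatter())
        return len(text), time.time() - start

    size, elapsed = time_highlight('large_file.pl')   # placeholder path
    print '%dK highlighted in %.2f seconds' % (size / 1024, elapsed)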

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

On my local machine (Core 2 Duo, 2.66GHz), here are the times to run pygmentize on a Perl file, based on file size:

131K file: 1 second
261K: 2 seconds
391K: 2.8 seconds
521K: 3.8 seconds

In other words, about 1 second per 128K.

For a python file, I get these times:

115K: 0.697s
230K: 1.190s
345K: 1.773s

It continues to be linear--about 0.6 seconds for each 128K or so.

And that's on my local machine that's not really doing anything else.

So I'm guessing that 512K will be a reasonable limit for files that we syntax-highlight, and we'll put the limit in a constant in case people want to change it locally.
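
Here is a sketch of what that cutoff might look like; the constant name, the helper function, and the plain-text fallback are hypothetical, and the change actually committed to loggerhead may well differ:

    from cgi import escape
    from pygments import highlight

    # Hypothetical constant: files larger than this are served without
    # syntax highlighting, since Pygments time grows linearly with size.
    MAX_HIGHLIGHT_SIZE = 512 * 1024   # bytes

    def highlight_or_plain(text, lexer, formatter):
        if len(text) > MAX_HIGHLIGHT_SIZE:
            # Too big to highlight in a reasonable time: escape it and
            # serve it unhighlighted instead of tying up a worker thread.
            return '<pre>%s</pre>' % escape(text)
        return highlight(text, lexer, formatter)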

John A Meinel (jameinel)
Changed in loggerhead:
assignee: nobody → Max Kanat-Alexander (mkanat)
status: Triaged → Fix Released
Changed in loggerhead:
status: Fix Released → In Progress
Changed in loggerhead:
status: In Progress → Fix Committed
Revision history for this message
Robert Collins (lifeless) wrote :

We're logging hangs against this bug in an internal journal, however it's considered Fix Committed here. Max, can you expand on the status of this bug? I suspect we should be opening a new bug and being more aggressive about sorting out the hangs - on some days it's once every 90 minutes.

Revision history for this message
John A Meinel (jameinel) wrote :

This bug was about the fact that Pygments is slow (qbzr has also noted that on large files it accounts for 85% of the overhead). As such, Max disabled Pygments syntax highlighting on large files.

It sounds like you have a new bug that should be opened and investigated.

How is the hanging being monitored? Reading from a keep-alive sort of page every X minutes?

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Yeah, open a new bug, attach the logs, and subscribe me or assign it to me. I am the general loggerhead stability contact now, I believe.

Revision history for this message
Robert Collins (lifeless) wrote :

The hanging is being monitored by nagios noting that it gets no response, and the system as a whole being stuffed :)
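
(For illustration only, a minimal sketch of that kind of external liveness check; the URL, timeout, and exit codes below are assumptions, not the actual nagios plugin in use:)

    import socket
    import sys
    import urllib2

    socket.setdefaulttimeout(30)   # seconds; a hung loggerhead never answers

    def check_codebrowse(url='http://bazaar.launchpad.net/'):
        try:
            urllib2.urlopen(url).read(1024)
        except Exception, e:
            print 'CRITICAL: %s' % e
            sys.exit(2)    # nagios convention: 2 == CRITICAL
        print 'OK'
        sys.exit(0)

    if __name__ == '__main__':
        check_codebrowse()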

Apparently there are many, many threads in use, though, and that's something that points to a Paste issue; I'm filing a new bug on that. Separately, I need to figure out whether this has been deployed.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

I filed a bug against pygments, about its performance:

http://dev.pocoo.org/projects/pygments/ticket/508

If that ever gets resolved (perhaps by them starting to use pyrex or something) we can remove the limit we added here.

Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Changed in loggerhead:
status: Fix Committed → Fix Released