Merge lp:~mbp/launchpad/690021-rlimit into lp:launchpad

Proposed by Martin Pool
Status: Merged
Approved by: Robert Collins
Approved revision: no longer in the source branch.
Merged at revision: 12991
Proposed branch: lp:~mbp/launchpad/690021-rlimit
Merge into: lp:launchpad
Diff against target: 21 lines (+5/-0)
1 file modified
cronscripts/scan_branches.py (+5/-0)
To merge this branch: bzr merge lp:~mbp/launchpad/690021-rlimit
Reviewer Review Type Date Requested Status
Robert Collins (community) Approve
Martin Pool (community) Disapprove
Tim Penhey (community) Needs Information
Michael Hudson-Doyle Approve
Review via email: mp+43733@code.launchpad.net

Commit message

[r=lifeless,mwhudson][ui=none][bug=690021][no-qa] set 2GB rlimit on scan_branches.py

Description of the change

This change limits the address space of scan_branches.py to 2GB. If it tries to allocate more, it will crash. See <https://bugs.launchpad.net/launchpad-code/+bug/690021>

After some discussion with spm and mwh on irc this seemed like a good idea. It is consistent with what we do in other scripts.

2Gb is a compromise between I hope being small enough to stop the machine bogging down, but not so small that scans that would otherwise succeed will now fail.

It would be nice if the limit was not hard coded, but this seems like the most pragmatic thing for now. I added a note on https://dev.launchpad.net/LEP/FeatureFlags that we could potentially do it there.

Obviously this will not actually fix whatever the bad scaling behaviour is in bug 690021; we should probably split that to a separate bug.

I would welcome opinions on how to test this. Do these jobs run on staging or qastaging? Perhaps we should just let it get to production and then watch for it to fail cleanly.

To post a comment you must log in.
Revision history for this message
Michael Hudson-Doyle (mwhudson) :
review: Approve
Revision history for this message
Martin Pool (mbp) wrote :

I tested this locally, and it didn't explode:

mbp@lp-lucid% ./cronscripts/scan_branches.py
2010-12-15 07:53:37 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2010-12-15 07:53:43 INFO Running synchronously.
./cronscripts/scan_branches.py 7.02s user 0.55s system 71% cpu 10.547 total

I think it's ok to merge. Apparently we now have pqm permission, but I have not yet tried it.

Revision history for this message
Martin Pool (mbp) wrote :

I get an error message

PQMException: 'Failed to verify signature: gpgv exited with error code 2'

could somebody else please sponsor landing of this for me?

Revision history for this message
Aaron Bentley (abentley) wrote :

What's the impact of this on the job being run? Does it get left in a running state? Does it get re-run on the next script invocation?

Revision history for this message
Martin Pool (mbp) wrote :

On 16 December 2010 02:37, Aaron Bentley <email address hidden> wrote:
> What's the impact of this on the job being run?  Does it get left in a running state?  Does it get re-run on the next script invocation?

Great question. I had assumed that jobs that fail would get
restarted, but I don't know that to be true, and it may well not be.
If they don't, presumably that problem is happening anyhow now when
spm kills them? But automatic killing may make this more common.

--
Martin

Revision history for this message
Martin Pool (mbp) wrote :

One data point here is that spm manually killed the scanner the other day (see bug 690021), and to judge from the logs shown on <https://code.edge.launchpad.net/~vcs-imports/linux/btrfs> it did retry. (And, apparently it succeeded the second time, which is kind of interesting in a different way.)

To judge from the code and from <https://dev.launchpad.net/Foundations/JobSystem> the job that was running may be left in running state, or the transaction will be cancelled and it will stay runnable.

Ideally there would be some infrastructure that made sure that jobs that were repeatedly attempted and never completed were marked as stuck/abandoned. Apparently that does happen with imports. I don't know if branch scanners will get the same behaviour. spm says that tim has fixed thing previously such that they will.

I think I'll wait for a review from Tim next week.

Revision history for this message
Tim Penhey (thumper) wrote :

If we moved the rlimit setting to the RunScanBranches class constructor (or somewhere), we could tweak the config value and actually have tests that confirm the behaviour of jobs that fail.

Care to add a test?

review: Needs Information
Revision history for this message
Martin Pool (mbp) wrote :

On 20 December 2010 12:16, Tim Penhey <email address hidden> wrote:
> Review: Needs Information
> If we moved the rlimit setting to the RunScanBranches class constructor (or somewhere), we could tweak the config value and actually have tests that confirm the behaviour of jobs that fail.

We could move it there; putting it at a lower level would perhaps make
it easier to call from a test suite.

On the other hand, other code that sets rlimits does it from the cron
script; and I don't have a specific test in mind to run. In some ways
this should actually be at a higher level, outside of the cron script,
within whatever runs it.

> Care to add a test?

Sure. What do you think we should test? That the rlimit gets set?

I think we could do some general work across all scripts so that
 * jobs cope well with being killed off
 * limits can be controlled by losas
 * limits are set in generic job framework

but I don't want to get into that now. This was more of a drive-by
fix to get it to the same level as other jobs.

--
Martin

Revision history for this message
Martin Pool (mbp) wrote :

Robert suggests, on irc, monkey-patching out setrlimit to verify that it's called. That's one of those tests I find pretty unsatisfying: the risk here is not that straight-line code will fail to call it, but rather that we're misunderstanding the effects of calling it.

Revision history for this message
Tim Penhey (thumper) wrote :

On Tue, 21 Dec 2010 21:58:52 you wrote:
> Robert suggests, on irc, monkey-patching out setrlimit to verify that it's
> called. That's one of those tests I find pretty unsatisfying: the risk
> here is not that straight-line code will fail to call it, but rather that
> we're misunderstanding the effects of calling it.

My fear is that by adding the rlimit, we aren't solving the problem at all as
the branch will continue to fail every minute and not get marked as failing.
Until we are sure that this is not the case, I'm hesitant to land this.

Revision history for this message
Martin Pool (mbp) wrote :

> On Tue, 21 Dec 2010 21:58:52 you wrote:
> > Robert suggests, on irc, monkey-patching out setrlimit to verify that it's
> > called. That's one of those tests I find pretty unsatisfying: the risk
> > here is not that straight-line code will fail to call it, but rather that
> > we're misunderstanding the effects of calling it.
>
> My fear is that by adding the rlimit, we aren't solving the problem at all as
> the branch will continue to fail every minute and not get marked as failing.
> Until we are sure that this is not the case, I'm hesitant to land this.

I filed bug 693241 and bug 693243 for (some of) the larger work here. Given Tim's comments I'm just going to reject this.

review: Disapprove
Revision history for this message
Robert Collins (lifeless) wrote :

I'd like to see something like this, limits are good to prevent
runaway processes.

AIUI python is pretty good about converting malloc failures into
exceptions. Maybe not perfect, but certainly not terrible.

As far as repeated failure detection goes, we can and should use
opaque job failures to halt things automatically, and set a trap that
it needs looking at.

Revision history for this message
Robert Collins (lifeless) :
review: Approve
Revision history for this message
Robert Collins (lifeless) wrote :

The ayes have it = with the caveat that we need to watch for potential fail-retry-forever loops (but I think we'll be fine because we already have to deal with arbitrary errors)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'cronscripts/scan_branches.py'
2--- cronscripts/scan_branches.py 2010-04-27 19:48:39 +0000
3+++ cronscripts/scan_branches.py 2010-12-15 05:38:55 +0000
4@@ -9,6 +9,8 @@
5
6 import _pythonpath
7
8+import resource
9+
10 from lp.services.job.runner import JobCronScript
11 from lp.code.interfaces.branchjob import IBranchScanJobSource
12
13@@ -21,5 +23,8 @@
14
15
16 if __name__ == '__main__':
17+
18+ resource.setrlimit(resource.RLIMIT_AS, (2L << 30, 2L << 30))
19+
20 script = RunScanBranches()
21 script.lock_and_run()