Launchpad itself

Merge lp:~mbp/launchpad/690021-rlimit into lp:launchpad

690021-rlimit
Merge into devel

Proposed by Martin Pool on 2010-12-15

Status:

Merged

Approved by:

Robert Collins on 2011-04-06

Approved revision:

no longer in the source branch.

Merged at revision:

12991

Proposed branch:

lp:~mbp/launchpad/690021-rlimit

Merge into:

lp:launchpad

Diff against target:

21 lines (+5/-0)

1 file modified

cronscripts/scan_branches.py (+5/-0)

To merge this branch:

bzr merge lp:~mbp/launchpad/690021-rlimit

Critical

Fix Released

Link a bug report

Reviewer	Date Requested	Status
Robert Collins (community)		Approve on 2011-04-06
Martin Pool (community)		Disapprove on 2010-12-22
Tim Penhey (community)	2010-12-16	Needs Information on 2010-12-20
Michael Hudson-Doyle	2010-12-15	Approve on 2010-12-15
Review via email: mp+43733@code.launchpad.net

Commit message

[r=lifeless,mwhudson][ui=none][bug=690021][no-qa] set 2GB rlimit on scan_branches.py

Description of the change

This change limits the address space of scan_branches.py to 2GB. If it tries to allocate more, it will crash. See <https://bugs.launchpad.net/launchpad-code/+bug/690021>

After some discussion with spm and mwh on irc this seemed like a good idea. It is consistent with what we do in other scripts.

2Gb is a compromise between I hope being small enough to stop the machine bogging down, but not so small that scans that would otherwise succeed will now fail.

It would be nice if the limit was not hard coded, but this seems like the most pragmatic thing for now. I added a note on https://dev.launchpad.net/LEP/FeatureFlags that we could potentially do it there.

Obviously this will not actually fix whatever the bad scaling behaviour is in bug 690021; we should probably split that to a separate bug.

I would welcome opinions on how to test this. Do these jobs run on staging or qastaging? Perhaps we should just let it get to production and then watch for it to fail cleanly.

Revision history for this message

Michael Hudson-Doyle (mwhudson) on 2010-12-15:

review: Approve

Revision history for this message

Martin Pool (mbp) wrote on 2010-12-15:

I tested this locally, and it didn't explode:

mbp@lp-lucid% ./cronscripts/scan_branches.py
2010-12-15 07:53:37 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2010-12-15 07:53:43 INFO Running synchronously.
./cronscripts/scan_branches.py 7.02s user 0.55s system 71% cpu 10.547 total

I think it's ok to merge. Apparently we now have pqm permission, but I have not yet tried it.

Revision history for this message

Martin Pool (mbp) wrote on 2010-12-15:

I get an error message

PQMException: 'Failed to verify signature: gpgv exited with error code 2'

could somebody else please sponsor landing of this for me?

Revision history for this message

Aaron Bentley (abentley) wrote on 2010-12-15:

What's the impact of this on the job being run? Does it get left in a running state? Does it get re-run on the next script invocation?

Revision history for this message

Martin Pool (mbp) wrote on 2010-12-15:

On 16 December 2010 02:37, Aaron Bentley <email address hidden> wrote:
> What's the impact of this on the job being run? Does it get left in a running state? Does it get re-run on the next script invocation?

Great question. I had assumed that jobs that fail would get
restarted, but I don't know that to be true, and it may well not be.
If they don't, presumably that problem is happening anyhow now when
spm kills them? But automatic killing may make this more common.

--
Martin

Revision history for this message

Martin Pool (mbp) wrote on 2010-12-16:

One data point here is that spm manually killed the scanner the other day (see bug 690021), and to judge from the logs shown on <https://code.edge.launchpad.net/~vcs-imports/linux/btrfs> it did retry. (And, apparently it succeeded the second time, which is kind of interesting in a different way.)

To judge from the code and from <https://dev.launchpad.net/Foundations/JobSystem> the job that was running may be left in running state, or the transaction will be cancelled and it will stay runnable.

Ideally there would be some infrastructure that made sure that jobs that were repeatedly attempted and never completed were marked as stuck/abandoned. Apparently that does happen with imports. I don't know if branch scanners will get the same behaviour. spm says that tim has fixed thing previously such that they will.

I think I'll wait for a review from Tim next week.

Revision history for this message

Tim Penhey (thumper) wrote on 2010-12-20:

If we moved the rlimit setting to the RunScanBranches class constructor (or somewhere), we could tweak the config value and actually have tests that confirm the behaviour of jobs that fail.

Care to add a test?

review: Needs Information

Revision history for this message

Martin Pool (mbp) wrote on 2010-12-20:

On 20 December 2010 12:16, Tim Penhey <email address hidden> wrote:
> Review: Needs Information
> If we moved the rlimit setting to the RunScanBranches class constructor (or somewhere), we could tweak the config value and actually have tests that confirm the behaviour of jobs that fail.

We could move it there; putting it at a lower level would perhaps make
it easier to call from a test suite.

On the other hand, other code that sets rlimits does it from the cron
script; and I don't have a specific test in mind to run. In some ways
this should actually be at a higher level, outside of the cron script,
within whatever runs it.

> Care to add a test?

Sure. What do you think we should test? That the rlimit gets set?

I think we could do some general work across all scripts so that
* jobs cope well with being killed off
* limits can be controlled by losas
* limits are set in generic job framework

but I don't want to get into that now. This was more of a drive-by
fix to get it to the same level as other jobs.

--
Martin

Revision history for this message

Martin Pool (mbp) wrote on 2010-12-21:

Robert suggests, on irc, monkey-patching out setrlimit to verify that it's called. That's one of those tests I find pretty unsatisfying: the risk here is not that straight-line code will fail to call it, but rather that we're misunderstanding the effects of calling it.

Revision history for this message

Tim Penhey (thumper) wrote on 2010-12-21:

On Tue, 21 Dec 2010 21:58:52 you wrote:
> Robert suggests, on irc, monkey-patching out setrlimit to verify that it's
> called. That's one of those tests I find pretty unsatisfying: the risk
> here is not that straight-line code will fail to call it, but rather that
> we're misunderstanding the effects of calling it.

My fear is that by adding the rlimit, we aren't solving the problem at all as
the branch will continue to fail every minute and not get marked as failing.
Until we are sure that this is not the case, I'm hesitant to land this.

Revision history for this message

Martin Pool (mbp) wrote on 2010-12-22:

> On Tue, 21 Dec 2010 21:58:52 you wrote:
> > Robert suggests, on irc, monkey-patching out setrlimit to verify that it's
> > called. That's one of those tests I find pretty unsatisfying: the risk
> > here is not that straight-line code will fail to call it, but rather that
> > we're misunderstanding the effects of calling it.
>
> My fear is that by adding the rlimit, we aren't solving the problem at all as
> the branch will continue to fail every minute and not get marked as failing.
> Until we are sure that this is not the case, I'm hesitant to land this.

I filed bug 693241 and bug 693243 for (some of) the larger work here. Given Tim's comments I'm just going to reject this.

review: Disapprove

Revision history for this message

Robert Collins (lifeless) wrote on 2010-12-22:

I'd like to see something like this, limits are good to prevent
runaway processes.

AIUI python is pretty good about converting malloc failures into
exceptions. Maybe not perfect, but certainly not terrible.

As far as repeated failure detection goes, we can and should use
opaque job failures to halt things automatically, and set a trap that
it needs looking at.

Revision history for this message

Robert Collins (lifeless) on 2011-04-06:

review: Approve

Revision history for this message

Robert Collins (lifeless) wrote on 2011-04-06:

The ayes have it = with the caveat that we need to watch for potential fail-retry-forever loops (but I think we'll be fine because we already have to deal with arbitrary errors)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk

Subscribers

People subscribed via source and target branches

to all changes:

Barki Mustapha

Celso Providelo

Christian Reis

Christy Awad

Colin Watson

Harpianto,ANDI

James Troup

John A Meinel

Kevin bush

Launchpad code reviewers

Launchpad code reviewers from Canonical

Martin Pool

Matthew Tanner

Maximiliano Bertacchini

Oguz Ersoz

Simon Brakhane

Ubuntu-BR DevOps

William Grant

alhawiti

api.ng

pedro cavazos

todaioan

wenjingwen

to status/vote changes:

Tzaddi

Tzaddi Belding

1	=== modified file 'cronscripts/scan_branches.py'
2	--- cronscripts/scan_branches.py 2010-04-27 19:48:39 +0000
3	+++ cronscripts/scan_branches.py 2010-12-15 05:38:55 +0000
4	@@ -9,6 +9,8 @@
5
6	import _pythonpath
7
8	+import resource
9	+
10	from lp.services.job.runner import JobCronScript
11	from lp.code.interfaces.branchjob import IBranchScanJobSource
12
13	@@ -21,5 +23,8 @@
14
15
16	if __name__ == '__main__':
17	+
18	+ resource.setrlimit(resource.RLIMIT_AS, (2L << 30, 2L << 30))
19	+
20	script = RunScanBranches()
21	script.lock_and_run()