Merge lp:~danilo/launchpad/bug-408718 into lp:launchpad

Proposed by Данило Шеган
Status: Merged
Merged at revision: not available
Proposed branch: lp:~danilo/launchpad/bug-408718
Merge into: lp:launchpad
Diff against target: None lines
To merge this branch: bzr merge lp:~danilo/launchpad/bug-408718
Reviewer Review Type Date Requested Status
Henning Eggers (community) Approve
Review via email: mp+11682@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Данило Шеган (danilo) wrote :

= Bug #408717 =

Note: this is a cherrypick candidate so is kept as simple as possible.
It's critical for Launchpad Translations (poimport is blocked on this).

We had super-fast-imports query (which is a huge optimization for our
imports, resulting in 50x faster imports on average) cause Postgres to
eat up diskspace with repeated requests for more temp space. That
happened a month ago (when bug #408717 was filed and considered a
one-off), and it happened again on Friday morning. We all suspect a
Postgres bug, but that doesn't help us out.

However, since this is only an optimization, we can avoid this query
when we do hit a problem. So, how do we know when we'll hit a problem?

Simple: a timeout. 5 minute timeout (300s) should be enough to handle
all real query runs which do not exhibit the bad behaviour. If this
query takes more than 5 minutes, it's probably hitting a PG bug, so
let's kill it. Note that PG was completely operational when this
happened, so it should be able to fulfill this as well.

Note that exactly the same query that failed when it was ran manually
caused no issues: so, it's not a particular query that fails, it's a
particular set of circumstances that's out of our control (i.e.
something like a PG bug) that make the query fail.

= Pre-implementation notes =

Discussed this with Stuart: he thinks it's a great idea :)

He advised me about what I need to do to set the statement timeout and
how to best test it.

= Proposed fix =

So, introduce a statement timeout right around the potentially breaking
query. If it doesn't complete in the allotted time, kill it and go the
slow route for this particular import.

poimport.statement_timeout option allows one to set it to 'timeout'
which will trigger the timeout on purpose. This is used in a test.

= Tests =

bin/test -vvt poimport -t superfastimports

= Demo & QA =

Test poimport thoroughly.

Revision history for this message
Henning Eggers (henninge) wrote :
Download full text (5.5 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Am 14.09.2009 11:35, Данило Шеган schrieb:
> = Bug #408717 =
>
> Note: this is a cherrypick candidate so is kept as simple as possible.
> It's critical for Launchpad Translations (poimport is blocked on this).

Thanks for taking this on so quickly and finding a solution. If Stuart
thinks it's a great idea, who am I to argue? ;-)

 review approve

Just two comments below.

> === modified file 'lib/canonical/config/schema-lazr.conf'
> --- lib/canonical/config/schema-lazr.conf 2009-09-08 15:16:33 +0000
> +++ lib/canonical/config/schema-lazr.conf 2009-09-14 09:12:58 +0000
> @@ -1429,6 +1429,10 @@
> storm_cache: generational
> storm_cache_size: 500
>
> +# Statement timeout (in seconds), limited to super-fast-imports query.
> +# Set to 'timeout' to make it timeout every time (for tests).
> +statement_timeout: 300
> +
> [processmail]
> # The database user which will be used by this process.
> # datatype: string
>
> === modified file 'lib/lp/translations/utilities/tests/test_superfastimports.py'
> --- lib/lp/translations/utilities/tests/test_superfastimports.py 2009-07-17 02:25:09 +0000
> +++ lib/lp/translations/utilities/tests/test_superfastimports.py 2009-09-14 09:12:58 +0000
> @@ -5,8 +5,9 @@
>
> import unittest
>
> +from canonical.config import config
> from canonical.testing import ZopelessDatabaseLayer
> -from lp.testing import TestCaseWithFactory, verifyObject

Thanks for cleaning this.

> +from lp.testing import TestCaseWithFactory
> from lp.translations.utilities.translation_import import (
> ExistingPOFileInDatabase)
> from lp.translations.utilities.translation_common_format import (
> @@ -67,6 +68,27 @@
> self.assertFalse(cached_file.isAlreadyImportedTheSame(message_data))
> self.assertFalse(cached_file.isAlreadyTranslatedTheSame(message_data))
>
> + def test_query_timeout(self):
> + # Test that super-fast-imports doesn't cache anything when it hits
> + # a timeout.
> +
> + # Override the config option to force a timeout error.
> + new_config = ("""
> + [poimport]
> + statement_timeout: timeout
> + """)
> + config.push('super_fast_timeout', new_config)
> +
> + # Add a message that would otherwise be cached (see other tests).
> + current_message = self.factory.makeTranslationMessage(
> + pofile=self.pofile, is_imported=False)
> + message_data = self.getTranslationMessageData(current_message)
> + cached_file = ExistingPOFileInDatabase(self.pofile)
> + self.assertFalse(cached_file.isAlreadyTranslatedTheSame(message_data))
> +
> + # Restore the old configuration.
> + config.pop('super_fast_timeout')
> +
>
> def test_suite():
> return unittest.TestLoader().loadTestsFromName(__name__)
>
> === modified file 'lib/lp/translations/utilities/translation_import.py'
> --- lib/lp/translations/utilities/translation_import.py 2009-08-05 13:11:00 +0000
> +++ lib/lp/translations/utilities/translation_import.py 2009-09-14 09:12:58 +0000
> @@ -18,8 +18,13 @@
>
> from operator import attrgetter
>
> +import transaction
> +
> +from...

Read more...

review: Approve
Revision history for this message
Данило Шеган (danilo) wrote :

Hi Henning,

Thanks for the review. For documentation purposes, I am writing up what
we've discussed on IRC as well.

У пон, 14. 09 2009. у 10:24 +0000, Henning Eggers пише:
> > + except TimeoutError:
> > + # Restart the transaction and return empty SelectResults.
> > + transaction.abort()
> > + transaction.begin()
> > + cur.execute("SELECT 1 WHERE 1=0")
>
> As you mentioned on IRC, you should reset the statement_timeout back to
> what it was (if that is not done by abort() ?). Please extend the test
> to check that, too.

I've decided not to (and I don't know if transaction.abort() resets
statement_timeout value): this script run will finish in at most 10
minutes, so having a safeguard while it runs is not too bad (considering
how we just almost killed the DB server).

> Also, could you consider using EmptyResultSet instead of such useless
> queries, please? ;)

As discussed, this is actually using SQLObject, and since I am going to
ask for a CP, I am not converting this query to Storm. With
EmptyResultSet the following would not be possible:

> > rows = cur.fetchall()

And I'd have to worry a bit more about breaking something else.

Cheers,
Danilo

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'lib/canonical/config/schema-lazr.conf'
--- lib/canonical/config/schema-lazr.conf 2009-09-08 15:16:33 +0000
+++ lib/canonical/config/schema-lazr.conf 2009-09-14 09:12:58 +0000
@@ -1429,6 +1429,10 @@
1429storm_cache: generational1429storm_cache: generational
1430storm_cache_size: 5001430storm_cache_size: 500
14311431
1432# Statement timeout (in seconds), limited to super-fast-imports query.
1433# Set to 'timeout' to make it timeout every time (for tests).
1434statement_timeout: 300
1435
1432[processmail]1436[processmail]
1433# The database user which will be used by this process.1437# The database user which will be used by this process.
1434# datatype: string1438# datatype: string
14351439
=== modified file 'lib/lp/translations/utilities/tests/test_superfastimports.py'
--- lib/lp/translations/utilities/tests/test_superfastimports.py 2009-07-17 02:25:09 +0000
+++ lib/lp/translations/utilities/tests/test_superfastimports.py 2009-09-14 09:12:58 +0000
@@ -5,8 +5,9 @@
55
6import unittest6import unittest
77
8from canonical.config import config
8from canonical.testing import ZopelessDatabaseLayer9from canonical.testing import ZopelessDatabaseLayer
9from lp.testing import TestCaseWithFactory, verifyObject10from lp.testing import TestCaseWithFactory
10from lp.translations.utilities.translation_import import (11from lp.translations.utilities.translation_import import (
11 ExistingPOFileInDatabase)12 ExistingPOFileInDatabase)
12from lp.translations.utilities.translation_common_format import (13from lp.translations.utilities.translation_common_format import (
@@ -67,6 +68,27 @@
67 self.assertFalse(cached_file.isAlreadyImportedTheSame(message_data))68 self.assertFalse(cached_file.isAlreadyImportedTheSame(message_data))
68 self.assertFalse(cached_file.isAlreadyTranslatedTheSame(message_data))69 self.assertFalse(cached_file.isAlreadyTranslatedTheSame(message_data))
6970
71 def test_query_timeout(self):
72 # Test that super-fast-imports doesn't cache anything when it hits
73 # a timeout.
74
75 # Override the config option to force a timeout error.
76 new_config = ("""
77 [poimport]
78 statement_timeout: timeout
79 """)
80 config.push('super_fast_timeout', new_config)
81
82 # Add a message that would otherwise be cached (see other tests).
83 current_message = self.factory.makeTranslationMessage(
84 pofile=self.pofile, is_imported=False)
85 message_data = self.getTranslationMessageData(current_message)
86 cached_file = ExistingPOFileInDatabase(self.pofile)
87 self.assertFalse(cached_file.isAlreadyTranslatedTheSame(message_data))
88
89 # Restore the old configuration.
90 config.pop('super_fast_timeout')
91
7092
71def test_suite():93def test_suite():
72 return unittest.TestLoader().loadTestsFromName(__name__)94 return unittest.TestLoader().loadTestsFromName(__name__)
7395
=== modified file 'lib/lp/translations/utilities/translation_import.py'
--- lib/lp/translations/utilities/translation_import.py 2009-08-05 13:11:00 +0000
+++ lib/lp/translations/utilities/translation_import.py 2009-09-14 09:12:58 +0000
@@ -18,8 +18,13 @@
1818
19from operator import attrgetter19from operator import attrgetter
2020
21import transaction
22
23from canonical.config import config
21from canonical.database.sqlbase import cursor, quote24from canonical.database.sqlbase import cursor, quote
2225
26from storm.exceptions import TimeoutError
27
23from canonical.cachedproperty import cachedproperty28from canonical.cachedproperty import cachedproperty
24from lp.registry.interfaces.person import (29from lp.registry.interfaces.person import (
25 IPersonSet,30 IPersonSet,
@@ -119,7 +124,6 @@
119 # Pre-fill self.messages and self.imported with data.124 # Pre-fill self.messages and self.imported with data.
120 self._fetchDBRows()125 self._fetchDBRows()
121126
122
123 def _fetchDBRows(self):127 def _fetchDBRows(self):
124 msgstr_joins = [128 msgstr_joins = [
125 "LEFT OUTER JOIN POTranslation pt%d "129 "LEFT OUTER JOIN POTranslation pt%d "
@@ -168,8 +172,32 @@
168 TranslationTemplateItem.sequence,172 TranslationTemplateItem.sequence,
169 TranslationMessage.potemplate NULLS LAST173 TranslationMessage.potemplate NULLS LAST
170 ''' % substitutions174 ''' % substitutions
175
171 cur = cursor()176 cur = cursor()
172 cur.execute(sql)177 try:
178 # XXX 2009-09-14 DaniloSegan (bug #408718):
179 # this statement causes postgres to eat the diskspace
180 # from time to time. Let's wrap it up in a timeout.
181 timeout = config.poimport.statement_timeout
182
183 # We have to commit what we've got so far or we'll lose
184 # it when we hit TimeoutError.
185 transaction.commit()
186
187 if timeout == 'timeout':
188 # This is used in tests.
189 query = "SELECT pg_sleep(2)"
190 timeout = '1s'
191 else:
192 timeout = 1000 * int(timeout)
193 query = sql
194 cur.execute("SET statement_timeout to %s" % quote(timeout))
195 cur.execute(query)
196 except TimeoutError:
197 # Restart the transaction and return empty SelectResults.
198 transaction.abort()
199 transaction.begin()
200 cur.execute("SELECT 1 WHERE 1=0")
173 rows = cur.fetchall()201 rows = cur.fetchall()
174202
175 assert TranslationConstants.MAX_PLURAL_FORMS == 6, (203 assert TranslationConstants.MAX_PLURAL_FORMS == 6, (
176204