Merge lp:~benji/launchpad/bug-622765 into lp:launchpad

Proposed by Benji York
Status: Merged
Approved by: Benji York
Approved revision: no longer in the source branch.
Merged at revision: 11477
Proposed branch: lp:~benji/launchpad/bug-622765
Merge into: lp:launchpad
Diff against target: 250 lines (+73/-24)
4 files modified
lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py (+2/-2)
lib/lp/services/apachelogparser/base.py (+7/-3)
lib/lp/services/apachelogparser/script.py (+9/-1)
lib/lp/services/apachelogparser/tests/test_apachelogparser.py (+55/-18)
To merge this branch: bzr merge lp:~benji/launchpad/bug-622765
Reviewer Review Type Date Requested Status
Henning Eggers (community) code Approve
Review via email: mp+34077@code.launchpad.net

Description of the change

The config option "logparser_max_parsed_lines" was respected only for individual files, not across multiple files (e.g., if a max of 1000 lines were to be parsed and there were 10 files to parse 10,000 lines would be parsed instead of just 1000).

This branch fixes the problem by keeping track of the number of lines parsed across calls to parse_file.

The related tests can be run with

    bin/test -c test_apachelogparser

No lint was reported by make lint.

To post a comment you must log in.
Revision history for this message
Henning Eggers (henninge) wrote :
Download full text (11.6 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Benji,
thank you for this nice fix. It is basically good to land after you have
considered a few comments of mine.

 review approve code

Cheers,
Henning

Am 30.08.2010 17:03, schrieb Benji York:
>
> === modified file 'lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py'
> --- lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-20 20:31:18 +0000
> +++ lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-30 15:03:44 +0000
> @@ -102,7 +102,7 @@
> '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET '
> '/15018215/ul_logo_64x64.png HTTP/1.1" 200 2261 '
> '"https://launchpad.net/~ubuntulite/+archive" "Mozilla"')
> - downloads, parsed_bytes = parse_file(
> + downloads, parsed_bytes, ignored = parse_file(

I am not sure "ignored" is the best choice here. I wondered at first if it
means "ignored_bytes" or "ignored_lines". Would it hurt to call it what it is
("parsed_lines") and then simply ignore it?

There are multiple uses of "ignored" in here, as you know. ;-)

> fd, start_position=0, logger=self.logger,
> get_download_key=get_library_file_id)
> self.assertEqual(
> @@ -121,7 +121,7 @@
> fd = StringIO(
> '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET / HTTP/1.1" '
> '200 2261 "https://launchpad.net/~ubuntulite/+archive" "Mozilla"')
> - downloads, parsed_bytes = parse_file(
> + downloads, parsed_bytes, ignored = parse_file(
> fd, start_position=0, logger=self.logger,
> get_download_key=get_library_file_id)
> self.assertEqual(
>
> === modified file 'lib/lp/services/apachelogparser/base.py'
> --- lib/lp/services/apachelogparser/base.py 2010-08-27 18:41:01 +0000
> +++ lib/lp/services/apachelogparser/base.py 2010-08-30 15:03:44 +0000

Very nice. Thank you.

[...]

> === modified file 'lib/lp/services/apachelogparser/script.py'
> --- lib/lp/services/apachelogparser/script.py 2010-08-27 18:41:01 +0000
> +++ lib/lp/services/apachelogparser/script.py 2010-08-30 15:03:44 +0000
> @@ -8,6 +8,7 @@
>
> from zope.component import getUtility
>
> +from canonical.config import config
> from lp.app.errors import NotFoundError
> from lp.services.apachelogparser.base import (
> create_or_update_parsedlog_entry,
> @@ -65,8 +66,15 @@
>
> self.setUpUtilities()
> country_set = getUtility(ICountrySet)
> + parsed_lines = 0
> + max_parsed_lines = getattr(
> + config.launchpad, 'logparser_max_parsed_lines', None)
> for fd, position in files_to_parse:
> - downloads, parsed_bytes = parse_file(
> + # If we've used up our budget of lines to process, stop.
> + if (max_parsed_lines is not None
> + and parsed_lines >= max_parsed_lines):

This line must be indented and the "and" should probably go on the previous
line but multi-line "if" conditions are always bad to read because the
dangling lines align with the body of the statement. Storing the boolean in a
variable fi...

review: Approve (code)
Revision history for this message
Benji York (benji) wrote :
Download full text (4.5 KiB)

On Mon, Aug 30, 2010 at 11:54 AM, Henning Eggers
<email address hidden> wrote:
> Review: Approve code
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Benji,
> thank you for this nice fix. It is basically good to land after you have
> considered a few comments of mine.
>
>  review approve code
>
> Cheers,
> Henning
>
> Am 30.08.2010 17:03, schrieb Benji York:
>>
>> === modified file 'lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py'
>> --- lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-20 20:31:18 +0000
>> +++ lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-30 15:03:44 +0000
>> @@ -102,7 +102,7 @@
>>              '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET '
>>              '/15018215/ul_logo_64x64.png HTTP/1.1" 200 2261 '
>>              '"https://launchpad.net/~ubuntulite/+archive" "Mozilla"')
>> -        downloads, parsed_bytes = parse_file(
>> +        downloads, parsed_bytes, ignored = parse_file(
>
> I am not sure "ignored" is the best choice here. I wondered at first if it
> means "ignored_bytes" or "ignored_lines". Would it hurt to call it what it is
> ("parsed_lines") and then simply ignore it?

Done.

>> === modified file 'lib/lp/services/apachelogparser/script.py'
>> --- lib/lp/services/apachelogparser/script.py 2010-08-27 18:41:01 +0000
>> +++ lib/lp/services/apachelogparser/script.py 2010-08-30 15:03:44 +0000
>> @@ -8,6 +8,7 @@
>>
>>  from zope.component import getUtility
>>
>> +from canonical.config import config
>>  from lp.app.errors import NotFoundError
>>  from lp.services.apachelogparser.base import (
>>      create_or_update_parsedlog_entry,
>> @@ -65,8 +66,15 @@
>>
>>          self.setUpUtilities()
>>          country_set = getUtility(ICountrySet)
>> +        parsed_lines = 0
>> +        max_parsed_lines = getattr(
>> +            config.launchpad, 'logparser_max_parsed_lines', None)
>>          for fd, position in files_to_parse:
>> -            downloads, parsed_bytes = parse_file(
>> +            # If we've used up our budget of lines to process, stop.
>> +            if (max_parsed_lines is not None
>> +            and parsed_lines >= max_parsed_lines):
>
> This line must be indented and the "and" should probably go on the previous
> line but multi-line "if" conditions are always bad to read because the
> dangling lines align with the body of the statement. Storing the boolean in a
> variable first and checking that in the statement often works better.

Done.

>> +        # If we have already parsed some lines, then the number of lines
>> +        # parsed will be passed in (parsed_lines argument) and parse_file will
>> +        # take that number into account when determining if the maximum number
>> +        # of lines to parse has been reached.
>
> Hm, this is a functional test and not documentation. This description should
> probably be found in the docstring of parse_file, should it not? If found
> there, it need not be explained in such detail here.

My intent was to explain the functioning of the test. I'm sure I would
appreciate a nice explanation of what's going on if I was reading the
test....

Read more...

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py'
--- lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-20 20:31:18 +0000
+++ lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-30 19:47:43 +0000
@@ -102,7 +102,7 @@
102 '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET '102 '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET '
103 '/15018215/ul_logo_64x64.png HTTP/1.1" 200 2261 '103 '/15018215/ul_logo_64x64.png HTTP/1.1" 200 2261 '
104 '"https://launchpad.net/~ubuntulite/+archive" "Mozilla"')104 '"https://launchpad.net/~ubuntulite/+archive" "Mozilla"')
105 downloads, parsed_bytes = parse_file(105 downloads, parsed_bytes, ignored = parse_file(
106 fd, start_position=0, logger=self.logger,106 fd, start_position=0, logger=self.logger,
107 get_download_key=get_library_file_id)107 get_download_key=get_library_file_id)
108 self.assertEqual(108 self.assertEqual(
@@ -121,7 +121,7 @@
121 fd = StringIO(121 fd = StringIO(
122 '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET / HTTP/1.1" '122 '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET / HTTP/1.1" '
123 '200 2261 "https://launchpad.net/~ubuntulite/+archive" "Mozilla"')123 '200 2261 "https://launchpad.net/~ubuntulite/+archive" "Mozilla"')
124 downloads, parsed_bytes = parse_file(124 downloads, parsed_bytes, ignored = parse_file(
125 fd, start_position=0, logger=self.logger,125 fd, start_position=0, logger=self.logger,
126 get_download_key=get_library_file_id)126 get_download_key=get_library_file_id)
127 self.assertEqual(127 self.assertEqual(
128128
=== modified file 'lib/lp/services/apachelogparser/base.py'
--- lib/lp/services/apachelogparser/base.py 2010-08-27 18:41:01 +0000
+++ lib/lp/services/apachelogparser/base.py 2010-08-30 19:47:43 +0000
@@ -77,9 +77,14 @@
77 return fd, file_size77 return fd, file_size
7878
7979
80def parse_file(fd, start_position, logger, get_download_key):80def parse_file(fd, start_position, logger, get_download_key, parsed_lines=0):
81 """Parse the given file starting on the given position.81 """Parse the given file starting on the given position.
8282
83 parsed_lines accepts the number of lines that have been parsed during
84 previous calls to this function so they can be taken into account against
85 max_parsed_lines. The total number of parsed lines is then returned so it
86 can be passed back to future calls to this function.
87
83 Return a dictionary mapping file_ids (from the librarian) to days to88 Return a dictionary mapping file_ids (from the librarian) to days to
84 countries to number of downloads.89 countries to number of downloads.
85 """90 """
@@ -91,7 +96,6 @@
9196
92 geoip = getUtility(IGeoIP)97 geoip = getUtility(IGeoIP)
93 downloads = {}98 downloads = {}
94 parsed_lines = 0
9599
96 # Check for an optional max_parsed_lines config option.100 # Check for an optional max_parsed_lines config option.
97 max_parsed_lines = getattr(101 max_parsed_lines = getattr(
@@ -167,7 +171,7 @@
167 logger.info('Parsed %d lines resulting in %d download stats.' % (171 logger.info('Parsed %d lines resulting in %d download stats.' % (
168 parsed_lines, len(downloads)))172 parsed_lines, len(downloads)))
169173
170 return downloads, parsed_bytes174 return downloads, parsed_bytes, parsed_lines
171175
172176
173def create_or_update_parsedlog_entry(first_line, parsed_bytes):177def create_or_update_parsedlog_entry(first_line, parsed_bytes):
174178
=== modified file 'lib/lp/services/apachelogparser/script.py'
--- lib/lp/services/apachelogparser/script.py 2010-08-27 18:41:01 +0000
+++ lib/lp/services/apachelogparser/script.py 2010-08-30 19:47:43 +0000
@@ -8,6 +8,7 @@
88
9from zope.component import getUtility9from zope.component import getUtility
1010
11from canonical.config import config
11from lp.app.errors import NotFoundError12from lp.app.errors import NotFoundError
12from lp.services.apachelogparser.base import (13from lp.services.apachelogparser.base import (
13 create_or_update_parsedlog_entry,14 create_or_update_parsedlog_entry,
@@ -65,8 +66,15 @@
6566
66 self.setUpUtilities()67 self.setUpUtilities()
67 country_set = getUtility(ICountrySet)68 country_set = getUtility(ICountrySet)
69 parsed_lines = 0
70 max_parsed_lines = getattr(
71 config.launchpad, 'logparser_max_parsed_lines', None)
72 max_is_set = max_parsed_lines is not None
68 for fd, position in files_to_parse:73 for fd, position in files_to_parse:
69 downloads, parsed_bytes = parse_file(74 # If we've used up our budget of lines to process, stop.
75 if (max_is_set and parsed_lines >= max_parsed_lines):
76 break
77 downloads, parsed_bytes, parsed_lines = parse_file(
70 fd, position, self.logger, self.getDownloadKey)78 fd, position, self.logger, self.getDownloadKey)
71 # Use a while loop here because we want to pop items from the dict79 # Use a while loop here because we want to pop items from the dict
72 # in order to free some memory as we go along. This is a good80 # in order to free some memory as we go along. This is a good
7381
=== modified file 'lib/lp/services/apachelogparser/tests/test_apachelogparser.py'
--- lib/lp/services/apachelogparser/tests/test_apachelogparser.py 2010-08-27 18:41:01 +0000
+++ lib/lp/services/apachelogparser/tests/test_apachelogparser.py 2010-08-30 19:47:43 +0000
@@ -6,7 +6,6 @@
6import os6import os
7from StringIO import StringIO7from StringIO import StringIO
8import tempfile8import tempfile
9import textwrap
10from operator import itemgetter9from operator import itemgetter
11import unittest10import unittest
1211
@@ -125,17 +124,18 @@
125124
126 def test_parsing(self):125 def test_parsing(self):
127 # The parse_file() function returns a tuple containing a dict (mapping126 # The parse_file() function returns a tuple containing a dict (mapping
128 # days and library file IDs to number of downloads) and the total127 # days and library file IDs to number of downloads), the total number
129 # number of bytes that have been parsed from this file. In our sample128 # of bytes that have been parsed from this file, and the running total
129 # of lines parsed (for testing against the maximum). In our sample
130 # log, the file with ID 8196569 has been downloaded twice (once from130 # log, the file with ID 8196569 has been downloaded twice (once from
131 # Argentina and once from Japan) and the files with ID 12060796131 # Argentina and once from Japan) and the files with ID 12060796 and
132 # and 9096290 have been downloaded once. The file with ID 15018215132 # 9096290 have been downloaded once. The file with ID 15018215 has
133 # has also been downloaded once (last line of the sample log), but133 # also been downloaded once (last line of the sample log), but
134 # parse_file() always skips the last line as it may be truncated, so134 # parse_file() always skips the last line as it may be truncated, so
135 # it doesn't show up in the dict returned.135 # it doesn't show up in the dict returned.
136 fd = open(os.path.join(136 fd = open(os.path.join(
137 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))137 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
138 downloads, parsed_bytes = parse_file(138 downloads, parsed_bytes, parsed_lines = parse_file(
139 fd, start_position=0, logger=self.logger,139 fd, start_position=0, logger=self.logger,
140 get_download_key=get_path_download_key)140 get_download_key=get_path_download_key)
141 self.assertEqual(141 self.assertEqual(
@@ -159,7 +159,7 @@
159 # line without worrying about whether or not it's been truncated.159 # line without worrying about whether or not it's been truncated.
160 fd = open(os.path.join(160 fd = open(os.path.join(
161 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))161 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
162 downloads, parsed_bytes = parse_file(162 downloads, parsed_bytes, parsed_lines = parse_file(
163 fd, start_position=self._getLastLineStart(fd), logger=self.logger,163 fd, start_position=self._getLastLineStart(fd), logger=self.logger,
164 get_download_key=get_path_download_key)164 get_download_key=get_path_download_key)
165 self.assertEqual(165 self.assertEqual(
@@ -177,7 +177,7 @@
177 # parsed up to the line before the one where the failure occurred.177 # parsed up to the line before the one where the failure occurred.
178 # Here we force an unexpected error on the first line.178 # Here we force an unexpected error on the first line.
179 fd = StringIO('Not a log')179 fd = StringIO('Not a log')
180 downloads, parsed_bytes = parse_file(180 downloads, parsed_bytes, parsed_lines = parse_file(
181 fd, start_position=0, logger=self.logger,181 fd, start_position=0, logger=self.logger,
182 get_download_key=get_path_download_key)182 get_download_key=get_path_download_key)
183 self.assertIn('Error', self.logger.buffer.getvalue())183 self.assertIn('Error', self.logger.buffer.getvalue())
@@ -188,7 +188,7 @@
188 """Assert that responses with the given status are ignored."""188 """Assert that responses with the given status are ignored."""
189 fd = StringIO(189 fd = StringIO(
190 self.sample_line % dict(status=status, method='GET'))190 self.sample_line % dict(status=status, method='GET'))
191 downloads, parsed_bytes = parse_file(191 downloads, parsed_bytes, parsed_lines = parse_file(
192 fd, start_position=0, logger=self.logger,192 fd, start_position=0, logger=self.logger,
193 get_download_key=get_path_download_key)193 get_download_key=get_path_download_key)
194 self.assertEqual(194 self.assertEqual(
@@ -213,7 +213,7 @@
213 """Assert that requests with the given method are ignored."""213 """Assert that requests with the given method are ignored."""
214 fd = StringIO(214 fd = StringIO(
215 self.sample_line % dict(status='200', method=method))215 self.sample_line % dict(status='200', method=method))
216 downloads, parsed_bytes = parse_file(216 downloads, parsed_bytes, parsed_lines = parse_file(
217 fd, start_position=0, logger=self.logger,217 fd, start_position=0, logger=self.logger,
218 get_download_key=get_path_download_key)218 get_download_key=get_path_download_key)
219 self.assertEqual(219 self.assertEqual(
@@ -231,7 +231,7 @@
231 def test_normal_request_is_not_ignored(self):231 def test_normal_request_is_not_ignored(self):
232 fd = StringIO(232 fd = StringIO(
233 self.sample_line % dict(status=200, method='GET'))233 self.sample_line % dict(status=200, method='GET'))
234 downloads, parsed_bytes = parse_file(234 downloads, parsed_bytes, parsed_lines = parse_file(
235 fd, start_position=0, logger=self.logger,235 fd, start_position=0, logger=self.logger,
236 get_download_key=get_path_download_key)236 get_download_key=get_path_download_key)
237 self.assertEqual(237 self.assertEqual(
@@ -249,22 +249,20 @@
249 # lines.249 # lines.
250 config.push(250 config.push(
251 'log_parser config',251 'log_parser config',
252 textwrap.dedent('''\252 '[launchpad]\nlogparser_max_parsed_lines: 2')
253 [launchpad]
254 logparser_max_parsed_lines: 2
255 '''))
256 self.addCleanup(config.pop, 'log_parser config')253 self.addCleanup(config.pop, 'log_parser config')
257 fd = open(os.path.join(254 fd = open(os.path.join(
258 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))255 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
259 self.addCleanup(fd.close)256 self.addCleanup(fd.close)
260257
261 downloads, parsed_bytes = parse_file(258 downloads, parsed_bytes, parsed_lines = parse_file(
262 fd, start_position=0, logger=self.logger,259 fd, start_position=0, logger=self.logger,
263 get_download_key=get_path_download_key)260 get_download_key=get_path_download_key)
264261
265 # We have initially parsed only the first two lines of data,262 # We have initially parsed only the first two lines of data,
266 # corresponding to one download (the first line is a 404 and263 # corresponding to one download (the first line is a 404 and
267 # so ignored).264 # so ignored).
265 self.assertEqual(parsed_lines, 2)
268 date = datetime(2008, 6, 13)266 date = datetime(2008, 6, 13)
269 self.assertContentEqual(267 self.assertContentEqual(
270 downloads.items(),268 downloads.items(),
@@ -276,7 +274,7 @@
276274
277 # And the subsequent parse will be for the 3rd and 4th lines,275 # And the subsequent parse will be for the 3rd and 4th lines,
278 # corresponding to two downloads of the same file.276 # corresponding to two downloads of the same file.
279 downloads, parsed_bytes = parse_file(277 downloads, parsed_bytes, parsed_lines = parse_file(
280 fd, start_position=parsed_bytes, logger=self.logger,278 fd, start_position=parsed_bytes, logger=self.logger,
281 get_download_key=get_path_download_key)279 get_download_key=get_path_download_key)
282 self.assertContentEqual(280 self.assertContentEqual(
@@ -285,6 +283,45 @@
285 ('/8196569/mediumubuntulogo.png', {date: {'AR': 1}})])283 ('/8196569/mediumubuntulogo.png', {date: {'AR': 1}})])
286 self.assertEqual(parsed_bytes, sum(line_lengths[:4]))284 self.assertEqual(parsed_bytes, sum(line_lengths[:4]))
287285
286 def test_max_parsed_lines_exceeded(self):
287 # Show that if a non-zero parsed_lines is passed in, the number of
288 # lines parsed will be less than it would otherwise have been.
289
290 # The max_parsed_lines config option limits the number of parsed
291 # lines.
292 config.push(
293 'log_parser config',
294 '[launchpad]\nlogparser_max_parsed_lines: 2')
295 self.addCleanup(config.pop, 'log_parser config')
296 fd = open(os.path.join(
297 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
298 self.addCleanup(fd.close)
299
300 # We want to start parsing on line 2 so we will have a value in
301 # "downloads" to make a positive assertion about. (The first line is
302 # a 404 so wouldn't generate any output.)
303 start_position = len(fd.readline())
304
305 # If we have already parsed some lines, then the number of lines
306 # parsed will be passed in (parsed_lines argument) and parse_file will
307 # take that number into account when determining if the maximum number
308 # of lines to parse has been reached.
309 parsed_lines = 1
310 downloads, parsed_bytes, parsed_lines = parse_file(
311 fd, start_position=start_position, logger=self.logger,
312 get_download_key=get_path_download_key, parsed_lines=parsed_lines)
313
314 # The total number of lines parsed during the run (1 line) plus the
315 # number of lines parsed previously (1 line, as passed in via
316 # parsed_lines) is returned.
317 self.assertEqual(parsed_lines, 2)
318 # Since we told parse_file that we had already parsed 1 line and the
319 # limit is 2 lines, it only parsed a single line.
320 date = datetime(2008, 6, 13)
321 self.assertContentEqual(
322 downloads.items(),
323 [('/9096290/me-tv-icon-14x14.png', {date: {'AU': 1}})])
324
288325
289class TestParsedFilesDetection(TestCase):326class TestParsedFilesDetection(TestCase):
290 """Test the detection of already parsed logs."""327 """Test the detection of already parsed logs."""