Merge lp:~benji/launchpad/bug-622765 into lp:launchpad

Proposed by Benji York
Status: Merged
Approved by: Benji York
Approved revision: no longer in the source branch.
Merged at revision: 11477
Proposed branch: lp:~benji/launchpad/bug-622765
Merge into: lp:launchpad
Diff against target: 250 lines (+73/-24)
4 files modified
lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py (+2/-2)
lib/lp/services/apachelogparser/base.py (+7/-3)
lib/lp/services/apachelogparser/script.py (+9/-1)
lib/lp/services/apachelogparser/tests/test_apachelogparser.py (+55/-18)
To merge this branch: bzr merge lp:~benji/launchpad/bug-622765
Reviewer Review Type Date Requested Status
Henning Eggers (community) code Approve
Review via email: mp+34077@code.launchpad.net

Description of the change

The config option "logparser_max_parsed_lines" was respected only for individual files, not across multiple files (e.g., if a max of 1000 lines were to be parsed and there were 10 files to parse 10,000 lines would be parsed instead of just 1000).

This branch fixes the problem by keeping track of the number of lines parsed across calls to parse_file.

The related tests can be run with

    bin/test -c test_apachelogparser

No lint was reported by make lint.

To post a comment you must log in.
Revision history for this message
Henning Eggers (henninge) wrote :
Download full text (11.6 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Benji,
thank you for this nice fix. It is basically good to land after you have
considered a few comments of mine.

 review approve code

Cheers,
Henning

Am 30.08.2010 17:03, schrieb Benji York:
>
> === modified file 'lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py'
> --- lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-20 20:31:18 +0000
> +++ lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-30 15:03:44 +0000
> @@ -102,7 +102,7 @@
> '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET '
> '/15018215/ul_logo_64x64.png HTTP/1.1" 200 2261 '
> '"https://launchpad.net/~ubuntulite/+archive" "Mozilla"')
> - downloads, parsed_bytes = parse_file(
> + downloads, parsed_bytes, ignored = parse_file(

I am not sure "ignored" is the best choice here. I wondered at first if it
means "ignored_bytes" or "ignored_lines". Would it hurt to call it what it is
("parsed_lines") and then simply ignore it?

There are multiple uses of "ignored" in here, as you know. ;-)

> fd, start_position=0, logger=self.logger,
> get_download_key=get_library_file_id)
> self.assertEqual(
> @@ -121,7 +121,7 @@
> fd = StringIO(
> '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET / HTTP/1.1" '
> '200 2261 "https://launchpad.net/~ubuntulite/+archive" "Mozilla"')
> - downloads, parsed_bytes = parse_file(
> + downloads, parsed_bytes, ignored = parse_file(
> fd, start_position=0, logger=self.logger,
> get_download_key=get_library_file_id)
> self.assertEqual(
>
> === modified file 'lib/lp/services/apachelogparser/base.py'
> --- lib/lp/services/apachelogparser/base.py 2010-08-27 18:41:01 +0000
> +++ lib/lp/services/apachelogparser/base.py 2010-08-30 15:03:44 +0000

Very nice. Thank you.

[...]

> === modified file 'lib/lp/services/apachelogparser/script.py'
> --- lib/lp/services/apachelogparser/script.py 2010-08-27 18:41:01 +0000
> +++ lib/lp/services/apachelogparser/script.py 2010-08-30 15:03:44 +0000
> @@ -8,6 +8,7 @@
>
> from zope.component import getUtility
>
> +from canonical.config import config
> from lp.app.errors import NotFoundError
> from lp.services.apachelogparser.base import (
> create_or_update_parsedlog_entry,
> @@ -65,8 +66,15 @@
>
> self.setUpUtilities()
> country_set = getUtility(ICountrySet)
> + parsed_lines = 0
> + max_parsed_lines = getattr(
> + config.launchpad, 'logparser_max_parsed_lines', None)
> for fd, position in files_to_parse:
> - downloads, parsed_bytes = parse_file(
> + # If we've used up our budget of lines to process, stop.
> + if (max_parsed_lines is not None
> + and parsed_lines >= max_parsed_lines):

This line must be indented and the "and" should probably go on the previous
line but multi-line "if" conditions are always bad to read because the
dangling lines align with the body of the statement. Storing the boolean in a
variable fi...

review: Approve (code)
Revision history for this message
Benji York (benji) wrote :
Download full text (4.5 KiB)

On Mon, Aug 30, 2010 at 11:54 AM, Henning Eggers
<email address hidden> wrote:
> Review: Approve code
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Benji,
> thank you for this nice fix. It is basically good to land after you have
> considered a few comments of mine.
>
>  review approve code
>
> Cheers,
> Henning
>
> Am 30.08.2010 17:03, schrieb Benji York:
>>
>> === modified file 'lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py'
>> --- lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-20 20:31:18 +0000
>> +++ lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-30 15:03:44 +0000
>> @@ -102,7 +102,7 @@
>>              '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET '
>>              '/15018215/ul_logo_64x64.png HTTP/1.1" 200 2261 '
>>              '"https://launchpad.net/~ubuntulite/+archive" "Mozilla"')
>> -        downloads, parsed_bytes = parse_file(
>> +        downloads, parsed_bytes, ignored = parse_file(
>
> I am not sure "ignored" is the best choice here. I wondered at first if it
> means "ignored_bytes" or "ignored_lines". Would it hurt to call it what it is
> ("parsed_lines") and then simply ignore it?

Done.

>> === modified file 'lib/lp/services/apachelogparser/script.py'
>> --- lib/lp/services/apachelogparser/script.py 2010-08-27 18:41:01 +0000
>> +++ lib/lp/services/apachelogparser/script.py 2010-08-30 15:03:44 +0000
>> @@ -8,6 +8,7 @@
>>
>>  from zope.component import getUtility
>>
>> +from canonical.config import config
>>  from lp.app.errors import NotFoundError
>>  from lp.services.apachelogparser.base import (
>>      create_or_update_parsedlog_entry,
>> @@ -65,8 +66,15 @@
>>
>>          self.setUpUtilities()
>>          country_set = getUtility(ICountrySet)
>> +        parsed_lines = 0
>> +        max_parsed_lines = getattr(
>> +            config.launchpad, 'logparser_max_parsed_lines', None)
>>          for fd, position in files_to_parse:
>> -            downloads, parsed_bytes = parse_file(
>> +            # If we've used up our budget of lines to process, stop.
>> +            if (max_parsed_lines is not None
>> +            and parsed_lines >= max_parsed_lines):
>
> This line must be indented and the "and" should probably go on the previous
> line but multi-line "if" conditions are always bad to read because the
> dangling lines align with the body of the statement. Storing the boolean in a
> variable first and checking that in the statement often works better.

Done.

>> +        # If we have already parsed some lines, then the number of lines
>> +        # parsed will be passed in (parsed_lines argument) and parse_file will
>> +        # take that number into account when determining if the maximum number
>> +        # of lines to parse has been reached.
>
> Hm, this is a functional test and not documentation. This description should
> probably be found in the docstring of parse_file, should it not? If found
> there, it need not be explained in such detail here.

My intent was to explain the functioning of the test. I'm sure I would
appreciate a nice explanation of what's going on if I was reading the
test....

Read more...

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py'
2--- lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-20 20:31:18 +0000
3+++ lib/canonical/launchpad/scripts/tests/test_librarian_apache_log_parser.py 2010-08-30 19:47:43 +0000
4@@ -102,7 +102,7 @@
5 '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET '
6 '/15018215/ul_logo_64x64.png HTTP/1.1" 200 2261 '
7 '"https://launchpad.net/~ubuntulite/+archive" "Mozilla"')
8- downloads, parsed_bytes = parse_file(
9+ downloads, parsed_bytes, ignored = parse_file(
10 fd, start_position=0, logger=self.logger,
11 get_download_key=get_library_file_id)
12 self.assertEqual(
13@@ -121,7 +121,7 @@
14 fd = StringIO(
15 '69.233.136.42 - - [13/Jun/2008:14:55:22 +0100] "GET / HTTP/1.1" '
16 '200 2261 "https://launchpad.net/~ubuntulite/+archive" "Mozilla"')
17- downloads, parsed_bytes = parse_file(
18+ downloads, parsed_bytes, ignored = parse_file(
19 fd, start_position=0, logger=self.logger,
20 get_download_key=get_library_file_id)
21 self.assertEqual(
22
23=== modified file 'lib/lp/services/apachelogparser/base.py'
24--- lib/lp/services/apachelogparser/base.py 2010-08-27 18:41:01 +0000
25+++ lib/lp/services/apachelogparser/base.py 2010-08-30 19:47:43 +0000
26@@ -77,9 +77,14 @@
27 return fd, file_size
28
29
30-def parse_file(fd, start_position, logger, get_download_key):
31+def parse_file(fd, start_position, logger, get_download_key, parsed_lines=0):
32 """Parse the given file starting on the given position.
33
34+ parsed_lines accepts the number of lines that have been parsed during
35+ previous calls to this function so they can be taken into account against
36+ max_parsed_lines. The total number of parsed lines is then returned so it
37+ can be passed back to future calls to this function.
38+
39 Return a dictionary mapping file_ids (from the librarian) to days to
40 countries to number of downloads.
41 """
42@@ -91,7 +96,6 @@
43
44 geoip = getUtility(IGeoIP)
45 downloads = {}
46- parsed_lines = 0
47
48 # Check for an optional max_parsed_lines config option.
49 max_parsed_lines = getattr(
50@@ -167,7 +171,7 @@
51 logger.info('Parsed %d lines resulting in %d download stats.' % (
52 parsed_lines, len(downloads)))
53
54- return downloads, parsed_bytes
55+ return downloads, parsed_bytes, parsed_lines
56
57
58 def create_or_update_parsedlog_entry(first_line, parsed_bytes):
59
60=== modified file 'lib/lp/services/apachelogparser/script.py'
61--- lib/lp/services/apachelogparser/script.py 2010-08-27 18:41:01 +0000
62+++ lib/lp/services/apachelogparser/script.py 2010-08-30 19:47:43 +0000
63@@ -8,6 +8,7 @@
64
65 from zope.component import getUtility
66
67+from canonical.config import config
68 from lp.app.errors import NotFoundError
69 from lp.services.apachelogparser.base import (
70 create_or_update_parsedlog_entry,
71@@ -65,8 +66,15 @@
72
73 self.setUpUtilities()
74 country_set = getUtility(ICountrySet)
75+ parsed_lines = 0
76+ max_parsed_lines = getattr(
77+ config.launchpad, 'logparser_max_parsed_lines', None)
78+ max_is_set = max_parsed_lines is not None
79 for fd, position in files_to_parse:
80- downloads, parsed_bytes = parse_file(
81+ # If we've used up our budget of lines to process, stop.
82+ if (max_is_set and parsed_lines >= max_parsed_lines):
83+ break
84+ downloads, parsed_bytes, parsed_lines = parse_file(
85 fd, position, self.logger, self.getDownloadKey)
86 # Use a while loop here because we want to pop items from the dict
87 # in order to free some memory as we go along. This is a good
88
89=== modified file 'lib/lp/services/apachelogparser/tests/test_apachelogparser.py'
90--- lib/lp/services/apachelogparser/tests/test_apachelogparser.py 2010-08-27 18:41:01 +0000
91+++ lib/lp/services/apachelogparser/tests/test_apachelogparser.py 2010-08-30 19:47:43 +0000
92@@ -6,7 +6,6 @@
93 import os
94 from StringIO import StringIO
95 import tempfile
96-import textwrap
97 from operator import itemgetter
98 import unittest
99
100@@ -125,17 +124,18 @@
101
102 def test_parsing(self):
103 # The parse_file() function returns a tuple containing a dict (mapping
104- # days and library file IDs to number of downloads) and the total
105- # number of bytes that have been parsed from this file. In our sample
106+ # days and library file IDs to number of downloads), the total number
107+ # of bytes that have been parsed from this file, and the running total
108+ # of lines parsed (for testing against the maximum). In our sample
109 # log, the file with ID 8196569 has been downloaded twice (once from
110- # Argentina and once from Japan) and the files with ID 12060796
111- # and 9096290 have been downloaded once. The file with ID 15018215
112- # has also been downloaded once (last line of the sample log), but
113+ # Argentina and once from Japan) and the files with ID 12060796 and
114+ # 9096290 have been downloaded once. The file with ID 15018215 has
115+ # also been downloaded once (last line of the sample log), but
116 # parse_file() always skips the last line as it may be truncated, so
117 # it doesn't show up in the dict returned.
118 fd = open(os.path.join(
119 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
120- downloads, parsed_bytes = parse_file(
121+ downloads, parsed_bytes, parsed_lines = parse_file(
122 fd, start_position=0, logger=self.logger,
123 get_download_key=get_path_download_key)
124 self.assertEqual(
125@@ -159,7 +159,7 @@
126 # line without worrying about whether or not it's been truncated.
127 fd = open(os.path.join(
128 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
129- downloads, parsed_bytes = parse_file(
130+ downloads, parsed_bytes, parsed_lines = parse_file(
131 fd, start_position=self._getLastLineStart(fd), logger=self.logger,
132 get_download_key=get_path_download_key)
133 self.assertEqual(
134@@ -177,7 +177,7 @@
135 # parsed up to the line before the one where the failure occurred.
136 # Here we force an unexpected error on the first line.
137 fd = StringIO('Not a log')
138- downloads, parsed_bytes = parse_file(
139+ downloads, parsed_bytes, parsed_lines = parse_file(
140 fd, start_position=0, logger=self.logger,
141 get_download_key=get_path_download_key)
142 self.assertIn('Error', self.logger.buffer.getvalue())
143@@ -188,7 +188,7 @@
144 """Assert that responses with the given status are ignored."""
145 fd = StringIO(
146 self.sample_line % dict(status=status, method='GET'))
147- downloads, parsed_bytes = parse_file(
148+ downloads, parsed_bytes, parsed_lines = parse_file(
149 fd, start_position=0, logger=self.logger,
150 get_download_key=get_path_download_key)
151 self.assertEqual(
152@@ -213,7 +213,7 @@
153 """Assert that requests with the given method are ignored."""
154 fd = StringIO(
155 self.sample_line % dict(status='200', method=method))
156- downloads, parsed_bytes = parse_file(
157+ downloads, parsed_bytes, parsed_lines = parse_file(
158 fd, start_position=0, logger=self.logger,
159 get_download_key=get_path_download_key)
160 self.assertEqual(
161@@ -231,7 +231,7 @@
162 def test_normal_request_is_not_ignored(self):
163 fd = StringIO(
164 self.sample_line % dict(status=200, method='GET'))
165- downloads, parsed_bytes = parse_file(
166+ downloads, parsed_bytes, parsed_lines = parse_file(
167 fd, start_position=0, logger=self.logger,
168 get_download_key=get_path_download_key)
169 self.assertEqual(
170@@ -249,22 +249,20 @@
171 # lines.
172 config.push(
173 'log_parser config',
174- textwrap.dedent('''\
175- [launchpad]
176- logparser_max_parsed_lines: 2
177- '''))
178+ '[launchpad]\nlogparser_max_parsed_lines: 2')
179 self.addCleanup(config.pop, 'log_parser config')
180 fd = open(os.path.join(
181 here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
182 self.addCleanup(fd.close)
183
184- downloads, parsed_bytes = parse_file(
185+ downloads, parsed_bytes, parsed_lines = parse_file(
186 fd, start_position=0, logger=self.logger,
187 get_download_key=get_path_download_key)
188
189 # We have initially parsed only the first two lines of data,
190 # corresponding to one download (the first line is a 404 and
191 # so ignored).
192+ self.assertEqual(parsed_lines, 2)
193 date = datetime(2008, 6, 13)
194 self.assertContentEqual(
195 downloads.items(),
196@@ -276,7 +274,7 @@
197
198 # And the subsequent parse will be for the 3rd and 4th lines,
199 # corresponding to two downloads of the same file.
200- downloads, parsed_bytes = parse_file(
201+ downloads, parsed_bytes, parsed_lines = parse_file(
202 fd, start_position=parsed_bytes, logger=self.logger,
203 get_download_key=get_path_download_key)
204 self.assertContentEqual(
205@@ -285,6 +283,45 @@
206 ('/8196569/mediumubuntulogo.png', {date: {'AR': 1}})])
207 self.assertEqual(parsed_bytes, sum(line_lengths[:4]))
208
209+ def test_max_parsed_lines_exceeded(self):
210+ # Show that if a non-zero parsed_lines is passed in, the number of
211+ # lines parsed will be less than it would otherwise have been.
212+
213+ # The max_parsed_lines config option limits the number of parsed
214+ # lines.
215+ config.push(
216+ 'log_parser config',
217+ '[launchpad]\nlogparser_max_parsed_lines: 2')
218+ self.addCleanup(config.pop, 'log_parser config')
219+ fd = open(os.path.join(
220+ here, 'apache-log-files', 'launchpadlibrarian.net.access-log'))
221+ self.addCleanup(fd.close)
222+
223+ # We want to start parsing on line 2 so we will have a value in
224+ # "downloads" to make a positive assertion about. (The first line is
225+ # a 404 so wouldn't generate any output.)
226+ start_position = len(fd.readline())
227+
228+ # If we have already parsed some lines, then the number of lines
229+ # parsed will be passed in (parsed_lines argument) and parse_file will
230+ # take that number into account when determining if the maximum number
231+ # of lines to parse has been reached.
232+ parsed_lines = 1
233+ downloads, parsed_bytes, parsed_lines = parse_file(
234+ fd, start_position=start_position, logger=self.logger,
235+ get_download_key=get_path_download_key, parsed_lines=parsed_lines)
236+
237+ # The total number of lines parsed during the run (1 line) plus the
238+ # number of lines parsed previously (1 line, as passed in via
239+ # parsed_lines) is returned.
240+ self.assertEqual(parsed_lines, 2)
241+ # Since we told parse_file that we had already parsed 1 line and the
242+ # limit is 2 lines, it only parsed a single line.
243+ date = datetime(2008, 6, 13)
244+ self.assertContentEqual(
245+ downloads.items(),
246+ [('/9096290/me-tv-icon-14x14.png', {date: {'AU': 1}})])
247+
248
249 class TestParsedFilesDetection(TestCase):
250 """Test the detection of already parsed logs."""