Merge lp:~cjwatson/meliae/py3-loader-source-bytes into lp:meliae
Status: Merged
Approved by: John A Meinel
Approved revision: 224
Merged at revision: 226
Proposed branch: lp:~cjwatson/meliae/py3-loader-source-bytes
Merge into: lp:meliae
Diff against target: 523 lines (+152/-100), 4 files modified
  meliae/_loader.pyx (+16/-4)
  meliae/loader.py (+35/-24)
  meliae/tests/test__loader.py (+4/-1)
  meliae/tests/test_loader.py (+97/-71)
To merge this branch: bzr merge lp:~cjwatson/meliae/py3-loader-source-bytes
Related bugs: (none)
Reviewer: John A Meinel (Approve)
Review via email: mp+378581@code.launchpad.net
Commit message
Ensure coherent bytes/text handling of sources in meliae.loader.
Description of the change
meliae.
John A Meinel (jameinel) wrote:
Colin Watson (cjwatson) wrote:
I hadn't considered the question of large dumps. In that case the difference in size between bytes and text isn't quite the point; the issue would be that decoding would add another copy of the data in memory. Indeed, while (simple)json.loads is a more accurate parser, it unpacks it into a much less memory-efficient representation along the way. This is only an issue if there are single objects in the dump that are very large (e.g. a large string), but of course that's quite possible.
So I guess if I just arranged for the _from_line decoder to cope with bytes to save the extra intermediate representation and dropped the line.decode call, that would be good enough? (json.loads accepts either bytes or text on Python 3, so there's no type-safety issue for _from_json.)
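The point about `json.loads` can be checked directly: on Python 3 it accepts either bytes or str input, so a loader can feed raw dump lines to it without first decoding them (and without paying for a second in-memory copy of each line). A minimal illustration, using a made-up dump line:

```python
import json

# json.loads on Python 3 accepts both bytes and str, so raw dump
# lines can be parsed without an intermediate .decode() copy.
line_bytes = b'{"address": 1234, "type": "int", "size": 12, "refs": []}'
line_text = line_bytes.decode('ascii')

obj_from_bytes = json.loads(line_bytes)
obj_from_text = json.loads(line_text)

assert obj_from_bytes == obj_from_text  # identical parsed result
print(obj_from_bytes['type'])  # -> int
```

(Bytes input to `json.loads` requires Python 3.6 or later; on earlier 3.x you would still need the decode.)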
John A Meinel (jameinel) wrote:
So I believe the file format is such that it *could* be just-one-big json.loads(), but you don't get any progress, etc when doing that. I believe since it is just [\n{data}
I'm happy to continue doing so, as any given line isn't going to be a major overhead, and the memory savings come from putting it into a compacted data structure. (Though that followed the Py2 dict model of a table with holes in it, vs. the Python 3 table-with-no-holes plus an index with holes.)
I think the scanner already truncates long strings, so you only see the prefix, so we shouldn't have to worry about that too much.
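The line-at-a-time approach described above can be sketched roughly as follows (a simplified, hypothetical version of what the loader does; the real code in `meliae/loader.py` also handles progress reporting and duplicate objects):

```python
import json

# The dump format is "[\n", then one JSON object per line (usually
# ending ",\n"), then "]\n" -- so each line can be parsed on its own
# and progress reported as lines are consumed, without ever holding
# the whole decoded document in memory at once.
dump_lines = [
    b'[\n',
    b'{"address": 1, "type": "tuple", "size": 20, "refs": [2]},\n',
    b'{"address": 2, "type": "int", "size": 12, "refs": []}\n',
    b']\n',
]

objs = []
for line in dump_lines:
    if line in (b'[\n', b']\n'):   # skip the enclosing list brackets
        continue
    if line.endswith(b',\n'):      # strip the separating comma
        line = line[:-2]
    objs.append(json.loads(line))  # parse one object at a time

print([o['address'] for o in objs])  # -> [1, 2]
```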
222. By Colin Watson

Use more compact representations when loading dumps.
The natural representations of the "type_str", "name", and (in some cases) "value" fields of objects loaded from dumps would be str, but on Python 3 this is a somewhat less efficient representation of ASCII strings, and for dumps of large processes a compact representation may well matter. To that end, use bytes where possible.

In general I've tried to confine this to just the highest-volume objects. For example, meliae.loader._TypeSummary doesn't need to be as dense, so convenience makes more sense there and _TypeSummary.type_str is a str. Some methods gain affordances to encode from or decode to str where appropriate for convenience, where doing so doesn't cause other problems.

I extended meliae.tests.test_loader._example_dump to include both bytes and text objects, in order to test the slightly different representations of each.
Colin Watson (cjwatson) wrote:
OK, it took me a while to get my head around what you were driving at here, but I think I now understand. How's this? I've turned type_str and name (and sometimes value) back into bytes objects, and everything should now be more compact again even on Python 3.
John A Meinel (jameinel) wrote:
I'm happy to land this as is. I'm also willing for you to push back and say "the memory saving probably isn't worth disrupting people using the library who end up seeing b'' strings where they just expect strings".
The fact that you play around with casting type_str back to str makes me wonder.
I'll wait to merge until I hear back from you.
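The disruption John is describing is concrete: on Python 3, bytes and str never compare equal, so existing callers comparing these fields against plain string literals would silently stop matching. A small illustration (the value below is a stand-in, not real loader output):

```python
# On Python 3, bytes and str never compare equal, so code written
# against the old str-valued fields silently stops matching once a
# field becomes bytes.
value = b'mymod'  # what a bytes-valued "name" field would hold

assert value != 'mymod'    # bytes vs str: never equal on Python 3
assert value == b'mymod'   # callers must switch to bytes literals...
assert value.decode('ascii') == 'mymod'  # ...or decode explicitly
```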
223. By Colin Watson

Merge trunk.
224. By Colin Watson

Store type_str as an interned str rather than bytes.
Type strings are likely to be drawn from a relatively small pool, so this
still saves memory for large dumps while being more convenient to deal with
than bytes on Python 3.
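The saving comes from `sys.intern` guaranteeing one shared object per distinct string: the many repeated type strings in a large dump ("dict", "list", "tuple", ...) then cost one str object total rather than one per loaded row, and identity comparison becomes possible. A quick demonstration:

```python
import sys

# sys.intern returns a canonical shared object for equal strings,
# even when one of them was built at runtime rather than from a
# literal (literals alone may or may not be auto-interned).
a = sys.intern('dict')
b = sys.intern(''.join(['di', 'ct']))  # constructed at runtime

assert a == b   # equal, as any two equal strings are
assert a is b   # the very same object, thanks to interning
```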
Colin Watson (cjwatson) wrote:
As discussed on IRC, I've made type_str be an interned str instead, which indeed does make things generally easier to deal with. I think keeping the less-shareable fields as bytes rather than str is reasonably justifiable in the context of a memory debugging tool though; it's still a little surprising, but seems tolerable.
Preview Diff
1 | === modified file 'meliae/_loader.pyx' |
2 | --- meliae/_loader.pyx 2020-02-03 14:38:41 +0000 |
3 | +++ meliae/_loader.pyx 2020-05-05 11:04:32 +0000 |
4 | @@ -54,9 +54,15 @@ |
5 | PyObject *val) except -1 |
6 | |
7 | import gc |
8 | +import sys |
9 | + |
10 | from meliae import warn |
11 | |
12 | |
13 | +if sys.version_info[0] >= 3: |
14 | + intern = sys.intern |
15 | + |
16 | + |
17 | ctypedef struct RefList: |
18 | long size |
19 | PyObject *refs[0] |
20 | @@ -176,6 +182,7 @@ |
21 | addr = <PyObject *>address |
22 | Py_XINCREF(addr) |
23 | new_entry.address = addr |
24 | + type_str = intern(type_str) |
25 | new_entry.type_str = <PyObject *>type_str |
26 | Py_XINCREF(new_entry.type_str) |
27 | new_entry.size = size |
28 | @@ -550,9 +557,12 @@ |
29 | else: |
30 | # TODO: This isn't perfect, as it doesn't do proper json |
31 | # escaping |
32 | - if '"' in self.value: |
33 | - raise AssertionError(self.value) |
34 | - value = '"value": "%s", ' % self.value |
35 | + text_value = self.value |
36 | + if sys.version_info[0] >= 3 and isinstance(text_value, bytes): |
37 | + text_value = text_value.decode('latin-1') |
38 | + if '"' in text_value: |
39 | + raise AssertionError(text_value) |
40 | + value = '"value": "%s", ' % text_value |
41 | else: |
42 | value = '' |
43 | return '{"address": %d, "type": "%s", "size": %d, %s"refs": [%s]}' % ( |
44 | @@ -579,7 +589,9 @@ |
45 | # a tuple/dict/etc |
46 | if val.type_str == 'bool': |
47 | val = (val.value == 'True') |
48 | - elif val.type_str in ('int', 'long', 'str', 'unicode', 'float', |
49 | + elif val.type_str in ('int', 'long', |
50 | + 'bytes', 'str', 'unicode', |
51 | + 'float', |
52 | ) and val.value is not None: |
53 | val = val.value |
54 | elif val.type_str == 'NoneType': |
55 | |
56 | === modified file 'meliae/loader.py' |
57 | --- meliae/loader.py 2020-03-11 20:33:25 +0000 |
58 | +++ meliae/loader.py 2020-05-05 11:04:32 +0000 |
59 | @@ -38,6 +38,9 @@ |
60 | ) |
61 | |
62 | |
63 | +if sys.version_info[0] >= 3: |
64 | + intern = sys.intern |
65 | + |
66 | timer = time.time |
67 | if sys.platform == 'win32': |
68 | timer = time.clock |
69 | @@ -46,35 +49,39 @@ |
70 | # faster than simplejson without extensions, though slower than simplejson w/ |
71 | # extensions. |
72 | _object_re = re.compile( |
73 | - r'\{"address": (?P<address>\d+)' |
74 | - r', "type": "(?P<type>[^"]*)"' |
75 | - r', "size": (?P<size>\d+)' |
76 | - r'(, "name": "(?P<name>.*)")?' |
77 | - r'(, "len": (?P<len>\d+))?' |
78 | - r'(, "value": (?P<valuequote>"?)(?P<value>.*)(?P=valuequote))?' |
79 | - r', "refs": \[(?P<refs>[^]]*)\]' |
80 | - r'\}') |
81 | + br'\{"address": (?P<address>\d+)' |
82 | + br', "type": "(?P<type>[^"]*)"' |
83 | + br', "size": (?P<size>\d+)' |
84 | + br'(, "name": "(?P<name>.*)")?' |
85 | + br'(, "len": (?P<len>\d+))?' |
86 | + br'(, "value": (?P<valuequote>"?)(?P<value>.*)(?P=valuequote))?' |
87 | + br', "refs": \[(?P<refs>[^]]*)\]' |
88 | + br'\}') |
89 | |
90 | _refs_re = re.compile( |
91 | - r'(?P<ref>\d+)' |
92 | + br'(?P<ref>\d+)' |
93 | ) |
94 | |
95 | |
96 | def _from_json(cls, line, temp_cache=None): |
97 | val = simplejson.loads(line) |
98 | # simplejson likes to turn everything into unicode strings, but we know |
99 | - # everything is just a plain 'str', and we can save some bytes if we |
100 | - # cast it back |
101 | + # everything is just plain ASCII, and we can save some bytes if we cast |
102 | + # things back to `bytes`. This is a little surprising on Python 3, but |
103 | + # it makes it easier to deal with large dumps. |
104 | + name = val.get('name', None) |
105 | + if name is not None and isinstance(name, six.text_type): |
106 | + name = name.encode('ASCII') |
107 | obj = cls(address=val['address'], |
108 | - type_str=str(val['type']), |
109 | + type_str=intern(str(val['type'])), |
110 | size=val['size'], |
111 | children=val['refs'], |
112 | length=val.get('len', None), |
113 | value=val.get('value', None), |
114 | - name=val.get('name', None)) |
115 | - if (obj.type_str == 'str'): |
116 | - if type(obj.value) is unicode: |
117 | - obj.value = obj.value.encode('latin-1') |
118 | + name=name) |
119 | + if (obj.type_str != six.text_type.__name__ and |
120 | + isinstance(obj.value, six.text_type)): |
121 | + obj.value = obj.value.encode('latin-1') |
122 | if temp_cache is not None: |
123 | obj._intern_from_cache(temp_cache) |
124 | return obj |
125 | @@ -87,9 +94,11 @@ |
126 | (address, type_str, size, name, length, value, |
127 | refs) = m.group('address', 'type', 'size', 'name', 'len', |
128 | 'value', 'refs') |
129 | + if not isinstance(type_str, str): |
130 | + type_str = type_str.decode('UTF-8') |
131 | assert '\\' not in type_str |
132 | if name is not None: |
133 | - assert '\\' not in name |
134 | + assert b'\\' not in name |
135 | if length is not None: |
136 | length = int(length) |
137 | refs = [int(val) for val in _refs_re.findall(refs)] |
138 | @@ -105,9 +114,8 @@ |
139 | length=length, |
140 | value=value, |
141 | name=name) |
142 | - if (obj.type_str == 'str'): |
143 | - if type(obj.value) is unicode: |
144 | - obj.value = obj.value.encode('latin-1') |
145 | + if obj.type_str == six.text_type.__name__ and isinstance(obj.value, bytes): |
146 | + obj.value = obj.value.decode('latin-1') |
147 | if temp_cache is not None: |
148 | obj._intern_from_cache(temp_cache) |
149 | return obj |
150 | @@ -443,7 +451,10 @@ |
151 | obj.size = obj.size + dict_obj.size |
152 | obj.total_size = 0 |
153 | if obj.type_str == 'instance': |
154 | - obj.type_str = type_obj.value |
155 | + instance_type_str = type_obj.value |
156 | + if not isinstance(instance_type_str, str): |
157 | + instance_type_str = instance_type_str.decode('UTF-8') |
158 | + obj.type_str = instance_type_str |
159 | # Now that all the data has been moved into the instance, we |
160 | # will want to remove the dict from the collection. We'll do the |
161 | # actual deletion later, since we are using iteritems for this |
162 | @@ -576,7 +587,7 @@ |
163 | input_mb = input_size / 1024. / 1024. |
164 | temp_cache = {} |
165 | address_re = re.compile( |
166 | - r'{"address": (?P<address>\d+)' |
167 | + br'{"address": (?P<address>\d+)' |
168 | ) |
169 | bytes_read = count = 0 |
170 | last = 0 |
171 | @@ -589,9 +600,9 @@ |
172 | factory = _loader._MemObjectProxy_from_args |
173 | for line_num, line in enumerate(source): |
174 | bytes_read += len(line) |
175 | - if line in ("[\n", "]\n"): |
176 | + if line in (b"[\n", b"]\n"): |
177 | continue |
178 | - if line.endswith(',\n'): |
179 | + if line.endswith(b',\n'): |
180 | line = line[:-2] |
181 | if objs: |
182 | # Skip duplicate objects |
183 | |
184 | === modified file 'meliae/tests/test__loader.py' |
185 | --- meliae/tests/test__loader.py 2020-03-11 20:54:23 +0000 |
186 | +++ meliae/tests/test__loader.py 2020-05-05 11:04:32 +0000 |
187 | @@ -354,7 +354,10 @@ |
188 | mop.children = [addr876542+1, addr654320+1] |
189 | mop.parents = [addr876542+1, addr654320+1] |
190 | self.assertFalse(mop.address is addr) |
191 | - self.assertFalse(mop.type_str is t) |
192 | + # type_str always gets interned, so mop.type_str is identical to the |
193 | + # cached object even though its input string isn't. |
194 | + self.assertFalse(type_str is t) |
195 | + self.assertTrue(mop.type_str is t) |
196 | rl = mop.children |
197 | self.assertFalse(rl[0] is addr876543) |
198 | self.assertFalse(rl[1] is addr654321) |
199 | |
200 | === modified file 'meliae/tests/test_loader.py' |
201 | --- meliae/tests/test_loader.py 2020-01-29 13:19:59 +0000 |
202 | +++ meliae/tests/test_loader.py 2020-05-05 11:04:32 +0000 |
203 | @@ -19,6 +19,8 @@ |
204 | import sys |
205 | import tempfile |
206 | |
207 | +import six |
208 | + |
209 | from meliae import ( |
210 | _loader, |
211 | loader, |
212 | @@ -32,22 +34,26 @@ |
213 | # a@5 = 1 |
214 | # b@4 = 2 |
215 | # c@6 = 'a str' |
216 | -# t@7 = (a, b) |
217 | +# u@8 = u'a unicode' |
218 | +# t@7 = (a, b, u) |
219 | # d@2 = {a:b, c:t} |
220 | -# l@3 = [a, b] |
221 | +# l@3 = [a, b, u] |
222 | # l.append(l) |
223 | # outer@1 = (d, l) |
224 | _example_dump = [ |
225 | '{"address": 1, "type": "tuple", "size": 20, "len": 2, "refs": [2, 3]}', |
226 | -'{"address": 3, "type": "list", "size": 44, "len": 3, "refs": [3, 4, 5]}', |
227 | +'{"address": 3, "type": "list", "size": 44, "len": 3, "refs": [3, 4, 5, 8]}', |
228 | '{"address": 5, "type": "int", "size": 12, "value": 1, "refs": []}', |
229 | '{"address": 4, "type": "int", "size": 12, "value": 2, "refs": []}', |
230 | '{"address": 2, "type": "dict", "size": 124, "len": 2, "refs": [4, 5, 6, 7]}', |
231 | -'{"address": 7, "type": "tuple", "size": 20, "len": 2, "refs": [4, 5]}', |
232 | -'{"address": 6, "type": "str", "size": 29, "len": 5, "value": "a str"' |
233 | - ', "refs": []}', |
234 | -'{"address": 8, "type": "module", "size": 60, "name": "mymod", "refs": [2]}', |
235 | +'{"address": 7, "type": "tuple", "size": 20, "len": 2, "refs": [4, 5, 8]}', |
236 | +'{"address": 6, "type": "%s", "size": 29, "len": 5, "value": "a str"' |
237 | + ', "refs": []}' % bytes.__name__, |
238 | +'{"address": 8, "type": "%s", "size": 88, "len": 9, "value": "a unicode"' |
239 | + ', "refs": []}' % six.text_type.__name__, |
240 | +'{"address": 9, "type": "module", "size": 60, "name": "mymod", "refs": [2]}', |
241 | ] |
242 | +_example_dump = [line.encode('ASCII') for line in _example_dump] |
243 | |
244 | # Note that this doesn't have a complete copy of the references. Namely when |
245 | # you subclass object you get a lot of references, and type instances also |
246 | @@ -72,6 +78,7 @@ |
247 | '{"address": 14, "type": "module", "size": 28, "name": "sys", "refs": [15]}', |
248 | '{"address": 15, "type": "dict", "size": 140, "len": 2, "refs": [5, 6, 9, 6]}', |
249 | ] |
250 | +_instance_dump = [line.encode('ASCII') for line in _instance_dump] |
251 | |
252 | _old_instance_dump = [ |
253 | '{"address": 1, "type": "instance", "size": 36, "refs": [2, 3]}', |
254 | @@ -86,6 +93,7 @@ |
255 | ', "refs": []}', |
256 | '{"address": 8, "type": "tuple", "size": 28, "len": 0, "refs": []}', |
257 | ] |
258 | +_old_instance_dump = [line.encode('ASCII') for line in _old_instance_dump] |
259 | |
260 | _intern_dict_dump = [ |
261 | '{"address": 2, "type": "str", "size": 25, "len": 1, "value": "a", "refs": []}', |
262 | @@ -96,6 +104,7 @@ |
263 | '{"address": 7, "type": "dict", "size": 512, "refs": [6, 6, 5, 5, 4, 4, 3, 3]}', |
264 | '{"address": 8, "type": "dict", "size": 512, "refs": [2, 2, 5, 5, 4, 4, 3, 3]}', |
265 | ] |
266 | +_intern_dict_dump = [line.encode('ASCII') for line in _intern_dict_dump] |
267 | |
268 | |
269 | class TestLoad(tests.TestCase): |
270 | @@ -116,8 +125,8 @@ |
271 | |
272 | def test_load_one(self): |
273 | objs = loader.load([ |
274 | - '{"address": 1234, "type": "int", "size": 12, "value": 10' |
275 | - ', "refs": []}'], show_prog=False).objs |
276 | + b'{"address": 1234, "type": "int", "size": 12, "value": 10' |
277 | + b', "refs": []}'], show_prog=False).objs |
278 | keys = objs.keys() |
279 | self.assertEqual([1234], keys) |
280 | obj = objs[1234] |
281 | @@ -128,16 +137,19 @@ |
282 | |
283 | def test_load_without_simplejson(self): |
284 | objs = loader.load([ |
285 | - '{"address": 1234, "type": "int", "size": 12, "value": 10' |
286 | - ', "refs": []}', |
287 | - '{"address": 2345, "type": "module", "size": 60, "name": "mymod"' |
288 | - ', "refs": [1234]}', |
289 | - '{"address": 4567, "type": "str", "size": 150, "len": 126' |
290 | - ', "value": "Test \\\'whoami\\\'\\u000a\\"Your name\\""' |
291 | - ', "refs": []}' |
292 | + b'{"address": 1234, "type": "int", "size": 12, "value": 10' |
293 | + b', "refs": []}', |
294 | + b'{"address": 2345, "type": "module", "size": 60, "name": "mymod"' |
295 | + b', "refs": [1234]}', |
296 | + ('{"address": 4567, "type": "%s", "size": 150, "len": 126' |
297 | + ', "value": "Test \\/whoami\\/\\u000a\\"Your name\\""' |
298 | + ', "refs": []}' % bytes.__name__).encode('UTF-8'), |
299 | + ('{"address": 5678, "type": "%s", "size": 150, "len": 126' |
300 | + ', "value": "Test \\/whoami\\/\\u000a\\"Your name\\""' |
301 | + ', "refs": []}' % six.text_type.__name__).encode('UTF-8'), |
302 | ], using_json=False, show_prog=False).objs |
303 | keys = sorted(objs.keys()) |
304 | - self.assertEqual([1234, 2345, 4567], keys) |
305 | + self.assertEqual([1234, 2345, 4567, 5678], keys) |
306 | obj = objs[1234] |
307 | self.assertTrue(isinstance(obj, _loader._MemObjectProxy)) |
308 | # The address should be exactly the same python object as the key in |
309 | @@ -146,9 +158,15 @@ |
310 | self.assertEqual(10, obj.value) |
311 | obj = objs[2345] |
312 | self.assertEqual("module", obj.type_str) |
313 | - self.assertEqual("mymod", obj.value) |
314 | + self.assertEqual(b"mymod", obj.value) |
315 | obj = objs[4567] |
316 | - self.assertEqual("Test \\'whoami\\'\\u000a\\\"Your name\\\"", obj.value) |
317 | + self.assertTrue(isinstance(obj.value, bytes)) |
318 | + self.assertEqual( |
319 | + b"Test \\/whoami\\/\\u000a\\\"Your name\\\"", obj.value) |
320 | + obj = objs[5678] |
321 | + self.assertTrue(isinstance(obj.value, six.text_type)) |
322 | + self.assertEqual( |
323 | + u"Test \\/whoami\\/\\u000a\\\"Your name\\\"", obj.value) |
324 | |
325 | def test_load_example(self): |
326 | objs = loader.load(_example_dump, show_prog=False) |
327 | @@ -168,7 +186,7 @@ |
328 | try: |
329 | content = gzip.GzipFile(mode='wb', compresslevel=6, fileobj=f) |
330 | for line in _example_dump: |
331 | - content.write(line + '\n') |
332 | + content.write(line + b'\n') |
333 | content.flush() |
334 | content.close() |
335 | del content |
336 | @@ -197,24 +215,24 @@ |
337 | def test_remove_expensive_references(self): |
338 | lines = list(_example_dump) |
339 | lines.pop(-1) # Remove the old module |
340 | - lines.append('{"address": 8, "type": "module", "size": 12' |
341 | - ', "name": "mymod", "refs": [9]}') |
342 | - lines.append('{"address": 9, "type": "dict", "size": 124' |
343 | - ', "refs": [10, 11]}') |
344 | - lines.append('{"address": 10, "type": "module", "size": 12' |
345 | - ', "name": "mod2", "refs": [12]}') |
346 | - lines.append('{"address": 11, "type": "str", "size": 27' |
347 | - ', "value": "boo", "refs": []}') |
348 | - lines.append('{"address": 12, "type": "dict", "size": 124' |
349 | - ', "refs": []}') |
350 | + lines.append(b'{"address": 9, "type": "module", "size": 12' |
351 | + b', "name": "mymod", "refs": [10]}') |
352 | + lines.append(b'{"address": 10, "type": "dict", "size": 124' |
353 | + b', "refs": [11, 12]}') |
354 | + lines.append(b'{"address": 11, "type": "module", "size": 12' |
355 | + b', "name": "mod2", "refs": [13]}') |
356 | + lines.append(b'{"address": 12, "type": "str", "size": 27' |
357 | + b', "value": "boo", "refs": []}') |
358 | + lines.append(b'{"address": 13, "type": "dict", "size": 124' |
359 | + b', "refs": []}') |
360 | source = lambda:loader.iter_objs(lines) |
361 | - mymod_dict = list(source())[8] |
362 | - self.assertEqual([10, 11], mymod_dict.children) |
363 | + mymod_dict = list(source())[9] |
364 | + self.assertEqual([11, 12], mymod_dict.children) |
365 | result = list(loader.remove_expensive_references(source)) |
366 | null_obj = result[0][1] |
367 | self.assertEqual(0, null_obj.address) |
368 | self.assertEqual('<ex-reference>', null_obj.type_str) |
369 | - self.assertEqual([11, 0], result[9][1].children) |
370 | + self.assertEqual([12, 0], result[10][1].children) |
371 | |
372 | |
373 | class TestMemObj(tests.TestCase): |
374 | @@ -226,12 +244,15 @@ |
375 | expected = [ |
376 | '{"address": 1, "type": "tuple", "size": 20, "refs": [2, 3]}', |
377 | '{"address": 2, "type": "dict", "size": 124, "refs": [4, 5, 6, 7]}', |
378 | -'{"address": 3, "type": "list", "size": 44, "refs": [3, 4, 5]}', |
379 | +'{"address": 3, "type": "list", "size": 44, "refs": [3, 4, 5, 8]}', |
380 | '{"address": 4, "type": "int", "size": 12, "value": 2, "refs": []}', |
381 | '{"address": 5, "type": "int", "size": 12, "value": 1, "refs": []}', |
382 | -'{"address": 6, "type": "str", "size": 29, "value": "a str", "refs": []}', |
383 | -'{"address": 7, "type": "tuple", "size": 20, "refs": [4, 5]}', |
384 | -'{"address": 8, "type": "module", "size": 60, "value": "mymod", "refs": [2]}', |
385 | +'{"address": 6, "type": "%s", "size": 29, "value": "a str"' |
386 | + ', "refs": []}' % bytes.__name__, |
387 | +'{"address": 7, "type": "tuple", "size": 20, "refs": [4, 5, 8]}', |
388 | +'{"address": 8, "type": "%s", "size": 88, "value": "a unicode"' |
389 | + ', "refs": []}' % six.text_type.__name__, |
390 | +'{"address": 9, "type": "module", "size": 60, "value": "mymod", "refs": [2]}', |
391 | ] |
392 | self.assertEqual(expected, [obj.to_json() for obj in objs]) |
393 | |
394 | @@ -243,11 +264,12 @@ |
395 | objs = manager.objs |
396 | self.assertEqual((), objs[1].parents) |
397 | self.assertEqual([1, 3], objs[3].parents) |
398 | - self.assertEqual([3, 7, 8], sorted(objs[4].parents)) |
399 | - self.assertEqual([3, 7, 8], sorted(objs[5].parents)) |
400 | - self.assertEqual([8], objs[6].parents) |
401 | - self.assertEqual([8], objs[7].parents) |
402 | - self.assertEqual((), objs[8].parents) |
403 | + self.assertEqual([3, 7, 9], sorted(objs[4].parents)) |
404 | + self.assertEqual([3, 7, 9], sorted(objs[5].parents)) |
405 | + self.assertEqual([9], objs[6].parents) |
406 | + self.assertEqual([9], objs[7].parents) |
407 | + self.assertEqual([3, 7], objs[8].parents) |
408 | + self.assertEqual((), objs[9].parents) |
409 | |
410 | def test_compute_referrers(self): |
411 | # Deprecated |
412 | @@ -267,11 +289,12 @@ |
413 | warn.trap_warnings(old_func) |
414 | self.assertEqual((), objs[1].parents) |
415 | self.assertEqual([1, 3], objs[3].parents) |
416 | - self.assertEqual([3, 7, 8], sorted(objs[4].parents)) |
417 | - self.assertEqual([3, 7, 8], sorted(objs[5].parents)) |
418 | - self.assertEqual([8], objs[6].parents) |
419 | - self.assertEqual([8], objs[7].parents) |
420 | - self.assertEqual((), objs[8].parents) |
421 | + self.assertEqual([3, 7, 9], sorted(objs[4].parents)) |
422 | + self.assertEqual([3, 7, 9], sorted(objs[5].parents)) |
423 | + self.assertEqual([9], objs[6].parents) |
424 | + self.assertEqual([9], objs[7].parents) |
425 | + self.assertEqual([3, 7], objs[8].parents) |
426 | + self.assertEqual((), objs[9].parents) |
427 | |
428 | def test_compute_parents_ignore_repeated(self): |
429 | manager = loader.load(_intern_dict_dump, show_prog=False) |
430 | @@ -294,6 +317,7 @@ |
431 | for x in range(200): |
432 | content.append('{"address": %d, "type": "tuple", "size": 20,' |
433 | ' "len": 2, "refs": [2, 2]}' % (x+100)) |
434 | + content = [line.encode('UTF-8') for line in content] |
435 | # By default, we only track 100 parents |
436 | manager = loader.load(content, show_prog=False) |
437 | self.assertEqual(100, manager[2].num_parents) |
438 | @@ -307,42 +331,42 @@ |
439 | def test_compute_total_size(self): |
440 | manager = loader.load(_example_dump, show_prog=False) |
441 | objs = manager.objs |
442 | - manager.compute_total_size(objs[8]) |
443 | - self.assertEqual(257, objs[8].total_size) |
444 | + manager.compute_total_size(objs[9]) |
445 | + self.assertEqual(345, objs[9].total_size) |
446 | |
447 | def test_compute_total_size_missing_ref(self): |
448 | lines = list(_example_dump) |
449 | # 999 isn't in the dump, not sure how we get these in real life, but |
450 | # they exist. we should live with references that can't be resolved. |
451 | - lines[-1] = ('{"address": 8, "type": "tuple", "size": 16, "len": 1' |
452 | - ', "refs": [999]}') |
453 | + lines[-1] = (b'{"address": 9, "type": "tuple", "size": 16, "len": 1' |
454 | + b', "refs": [999]}') |
455 | manager = loader.load(lines, show_prog=False) |
456 | - obj = manager[8] |
457 | + obj = manager[9] |
458 | manager.compute_total_size(obj) |
459 | self.assertEqual(16, obj.total_size) |
460 | |
461 | def test_remove_expensive_references(self): |
462 | lines = list(_example_dump) |
463 | lines.pop(-1) # Remove the old module |
464 | - lines.append('{"address": 8, "type": "module", "size": 12' |
465 | - ', "name": "mymod", "refs": [9]}') |
466 | - lines.append('{"address": 9, "type": "dict", "size": 124' |
467 | - ', "refs": [10, 11]}') |
468 | - lines.append('{"address": 10, "type": "module", "size": 12' |
469 | - ', "name": "mod2", "refs": [12]}') |
470 | - lines.append('{"address": 11, "type": "str", "size": 27' |
471 | - ', "value": "boo", "refs": []}') |
472 | - lines.append('{"address": 12, "type": "dict", "size": 124' |
473 | - ', "refs": []}') |
474 | + lines.append(b'{"address": 9, "type": "module", "size": 12' |
475 | + b', "name": "mymod", "refs": [10]}') |
476 | + lines.append(b'{"address": 10, "type": "dict", "size": 124' |
477 | + b', "refs": [11, 12]}') |
478 | + lines.append(b'{"address": 11, "type": "module", "size": 12' |
479 | + b', "name": "mod2", "refs": [13]}') |
480 | + lines.append(b'{"address": 12, "type": "str", "size": 27' |
481 | + b', "value": "boo", "refs": []}') |
482 | + lines.append(b'{"address": 13, "type": "dict", "size": 124' |
483 | + b', "refs": []}') |
484 | manager = loader.load(lines, show_prog=False, collapse=False) |
485 | - mymod_dict = manager.objs[9] |
486 | - self.assertEqual([10, 11], mymod_dict.children) |
487 | + mymod_dict = manager.objs[10] |
488 | + self.assertEqual([11, 12], mymod_dict.children) |
489 | manager.remove_expensive_references() |
490 | self.assertTrue(0 in manager.objs) |
491 | null_obj = manager.objs[0] |
492 | self.assertEqual(0, null_obj.address) |
493 | self.assertEqual('<ex-reference>', null_obj.type_str) |
494 | - self.assertEqual([11, 0], mymod_dict.children) |
495 | + self.assertEqual([12, 0], mymod_dict.children) |
496 | |
497 | def test_collapse_instance_dicts(self): |
498 | manager = loader.load(_instance_dump, show_prog=False, collapse=False) |
499 | @@ -419,16 +443,18 @@ |
500 | |
501 | def test_summarize_refs(self): |
502 | manager = loader.load(_example_dump, show_prog=False) |
503 | - summary = manager.summarize(manager[8]) |
504 | + summary = manager.summarize(manager[9]) |
505 | # Note that the module is included in the summary |
506 | - self.assertEqual(['int', 'module', 'str', 'tuple'], |
507 | + self.assertEqual(sorted(['int', 'module', bytes.__name__, |
508 | + six.text_type.__name__, 'tuple']), |
509 | sorted(summary.type_summaries.keys())) |
510 | - self.assertEqual(257, summary.total_size) |
511 | + self.assertEqual(345, summary.total_size) |
512 | |
513 | def test_summarize_excluding(self): |
514 | manager = loader.load(_example_dump, show_prog=False) |
515 | - summary = manager.summarize(manager[8], excluding=[4, 5]) |
516 | + summary = manager.summarize(manager[9], excluding=[4, 5]) |
517 | # No ints when they are explicitly filtered |
518 | - self.assertEqual(['module', 'str', 'tuple'], |
519 | + self.assertEqual(sorted(['module', bytes.__name__, |
520 | + six.text_type.__name__, 'tuple']), |
521 | sorted(summary.type_summaries.keys())) |
522 | - self.assertEqual(233, summary.total_size) |
523 | + self.assertEqual(321, summary.total_size) |
This is an interesting one. I think fundamentally we should just drop the _from_line decoder, since 'json' is now a core part of the Python stdlib.
The question remains whether we want to leverage the fact that bytes objects can be more memory-efficient than PyUnicode objects.
>>> import sys
>>> for i in [0, 1, 10, 50, 500]:
... x = b'1'*i
... y = '1'*i
... print(i, sys.getsizeof(x), sys.getsizeof(y))
...
0 33 49
1 34 50
10 43 59
50 83 99
500 533 549
It is a modest win at low sizes (59 vs 43 at 10 bytes is 37% more overhead).
I don't know how much it really matters, but I do remember playing a lot of tricks to make it easier to look at a large memory dump without needing at least as much memory as the live process was using. (It's why there is a MemObjectCollection, which is functionally a dict but typed, and why there are Proxy objects that live just long enough to look like a fleshed-out Python object instead of just a C struct.)