Merge lp:~cjwatson/meliae/py3-loader-source-bytes into lp:meliae
Status: Merged
Approved by: John A Meinel
Approved revision: 224
Merged at revision: 226
Proposed branch: lp:~cjwatson/meliae/py3-loader-source-bytes
Merge into: lp:meliae
Diff against target: 523 lines (+152/-100), 4 files modified
  meliae/_loader.pyx (+16/-4)
  meliae/loader.py (+35/-24)
  meliae/tests/test__loader.py (+4/-1)
  meliae/tests/test_loader.py (+97/-71)
To merge this branch: bzr merge lp:~cjwatson/meliae/py3-loader-source-bytes
Related bugs: (none)
Reviewer: John A Meinel (Approve)
Review via email: mp+378581@code.launchpad.net
Commit message
Ensure coherent bytes/text handling of sources in meliae.loader.
Description of the change
meliae.
John A Meinel (jameinel) wrote:
Colin Watson (cjwatson) wrote:
I hadn't considered the question of large dumps. In that case the difference in size between bytes and text isn't quite the point; the issue would be that decoding would add another copy of the data in memory. Indeed, while (simple)json.loads is a more accurate parser, it unpacks it into a much less memory-efficient representation along the way. This is only an issue if there are single objects in the dump that are very large (e.g. a large string), but of course that's quite possible.
So I guess if I just arranged for the _from_line decoder to cope with bytes to save the extra intermediate representation and dropped the line.decode call, that would be good enough? (json.loads accepts either bytes or text on Python 3, so there's no type-safety issue for _from_json.)
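The point about `json.loads` can be checked directly: on Python 3 it accepts either bytes or str input, so a loader can feed raw dump lines to it without first decoding them (and without paying for a second in-memory copy of each line). A minimal illustration, using a made-up dump line:

```python
import json

# json.loads on Python 3 accepts both bytes and str, so raw dump
# lines can be parsed without an intermediate .decode() copy.
line_bytes = b'{"address": 1234, "type": "int", "size": 12, "refs": []}'
line_text = line_bytes.decode('ascii')

obj_from_bytes = json.loads(line_bytes)
obj_from_text = json.loads(line_text)

assert obj_from_bytes == obj_from_text  # identical parsed result
print(obj_from_bytes['type'])  # -> int
```

(Bytes input to `json.loads` requires Python 3.6 or later; on earlier 3.x you would still need the decode.)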
John A Meinel (jameinel) wrote:
So I believe the file format is such that it *could* be just-one-big json.loads(), but you don't get any progress, etc when doing that. I believe since it is just [\n{data}
I'm happy to continue doing so, as any given line isn't going to be a major overhead, and the memory savings come from putting it into a compacted data structure. (Though that followed the Py2 dict model of a table with holes in it, vs. the Python 3 table-with-no-holes plus an index with holes.)
I think the scanner already truncates long strings, so you only see the prefix, so we shouldn't have to worry about that too much.
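The line-at-a-time approach described above can be sketched roughly as follows (a simplified, hypothetical version of what the loader does; the real code in `meliae/loader.py` also handles progress reporting and duplicate objects):

```python
import json

# The dump format is "[\n", then one JSON object per line (usually
# ending ",\n"), then "]\n" -- so each line can be parsed on its own
# and progress reported as lines are consumed, without ever holding
# the whole decoded document in memory at once.
dump_lines = [
    b'[\n',
    b'{"address": 1, "type": "tuple", "size": 20, "refs": [2]},\n',
    b'{"address": 2, "type": "int", "size": 12, "refs": []}\n',
    b']\n',
]

objs = []
for line in dump_lines:
    if line in (b'[\n', b']\n'):   # skip the enclosing list brackets
        continue
    if line.endswith(b',\n'):      # strip the separating comma
        line = line[:-2]
    objs.append(json.loads(line))  # parse one object at a time

print([o['address'] for o in objs])  # -> [1, 2]
```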
222. By Colin Watson

Use more compact representations when loading dumps.
The natural representations of the "type_str", "name", and (in some cases) "value" fields of objects loaded from dumps would be str, but on Python 3 this is a somewhat less efficient representation of ASCII strings, and for dumps of large processes a compact representation may well matter. To that end, use bytes where possible.

In general I've tried to confine this to just the highest-volume objects. For example, meliae.loader._TypeSummary doesn't need to be as dense, so convenience makes more sense there and _TypeSummary.type_str is a str. Some methods gain affordances to encode from or decode to str where appropriate for convenience, where doing so doesn't cause other problems.

I extended meliae.tests.test_loader._example_dump to include both bytes and text objects, in order to test the slightly different representations of each.
Colin Watson (cjwatson) wrote:
OK, it took me a while to get my head around what you were driving at here, but I think I now understand. How's this? I've turned type_str and name (and sometimes value) back into bytes objects, and everything should now be more compact again even on Python 3.
John A Meinel (jameinel) wrote:
I'm happy to land this as is. I'm also willing for you to push back and say "the memory saving probably isn't worth disrupting people using the library who end up seeing b'' strings where they just expect strings".
The fact that you play around with casting type_str back to str makes me wonder.
I'll wait to merge until I hear back from you.
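The disruption John is describing is concrete: on Python 3, bytes and str never compare equal, so existing callers comparing these fields against plain string literals would silently stop matching. A small illustration (the value below is a stand-in, not real loader output):

```python
# On Python 3, bytes and str never compare equal, so code written
# against the old str-valued fields silently stops matching once a
# field becomes bytes.
value = b'mymod'  # what a bytes-valued "name" field would hold

assert value != 'mymod'    # bytes vs str: never equal on Python 3
assert value == b'mymod'   # callers must switch to bytes literals...
assert value.decode('ascii') == 'mymod'  # ...or decode explicitly
```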
223. By Colin Watson

Merge trunk.
224. By Colin Watson

Store type_str as an interned str rather than bytes.
Type strings are likely to be drawn from a relatively small pool, so this
still saves memory for large dumps while being more convenient to deal with
than bytes on Python 3.
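The saving comes from `sys.intern` guaranteeing one shared object per distinct string: the many repeated type strings in a large dump ("dict", "list", "tuple", ...) then cost one str object total rather than one per loaded row, and identity comparison becomes possible. A quick demonstration:

```python
import sys

# sys.intern returns a canonical shared object for equal strings,
# even when one of them was built at runtime rather than from a
# literal (literals alone may or may not be auto-interned).
a = sys.intern('dict')
b = sys.intern(''.join(['di', 'ct']))  # constructed at runtime

assert a == b   # equal, as any two equal strings are
assert a is b   # the very same object, thanks to interning
```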
Colin Watson (cjwatson) wrote:
As discussed on IRC, I've made type_str be an interned str instead, which indeed does make things generally easier to deal with. I think keeping the less-shareable fields as bytes rather than str is reasonably justifiable in the context of a memory debugging tool though; it's still a little surprising, but seems tolerable.
Preview Diff
1 | === modified file 'meliae/_loader.pyx' |
2 | --- meliae/_loader.pyx 2020-02-03 14:38:41 +0000 |
3 | +++ meliae/_loader.pyx 2020-05-05 11:04:32 +0000 |
4 | @@ -54,9 +54,15 @@ |
5 | PyObject *val) except -1 |
6 | |
7 | import gc |
8 | +import sys |
9 | + |
10 | from meliae import warn |
11 | |
12 | |
13 | +if sys.version_info[0] >= 3: |
14 | + intern = sys.intern |
15 | + |
16 | + |
17 | ctypedef struct RefList: |
18 | long size |
19 | PyObject *refs[0] |
20 | @@ -176,6 +182,7 @@ |
21 | addr = <PyObject *>address |
22 | Py_XINCREF(addr) |
23 | new_entry.address = addr |
24 | + type_str = intern(type_str) |
25 | new_entry.type_str = <PyObject *>type_str |
26 | Py_XINCREF(new_entry.type_str) |
27 | new_entry.size = size |
28 | @@ -550,9 +557,12 @@ |
29 | else: |
30 | # TODO: This isn't perfect, as it doesn't do proper json |
31 | # escaping |
32 | - if '"' in self.value: |
33 | - raise AssertionError(self.value) |
34 | - value = '"value": "%s", ' % self.value |
35 | + text_value = self.value |
36 | + if sys.version_info[0] >= 3 and isinstance(text_value, bytes): |
37 | + text_value = text_value.decode('latin-1') |
38 | + if '"' in text_value: |
39 | + raise AssertionError(text_value) |
40 | + value = '"value": "%s", ' % text_value |
41 | else: |
42 | value = '' |
43 | return '{"address": %d, "type": "%s", "size": %d, %s"refs": [%s]}' % ( |
44 | @@ -579,7 +589,9 @@ |
45 | # a tuple/dict/etc |
46 | if val.type_str == 'bool': |
47 | val = (val.value == 'True') |
48 | - elif val.type_str in ('int', 'long', 'str', 'unicode', 'float', |
49 | + elif val.type_str in ('int', 'long', |
50 | + 'bytes', 'str', 'unicode', |
51 | + 'float', |
52 | ) and val.value is not None: |
53 | val = val.value |
54 | elif val.type_str == 'NoneType': |
55 | |
56 | === modified file 'meliae/loader.py' |
57 | --- meliae/loader.py 2020-03-11 20:33:25 +0000 |
58 | +++ meliae/loader.py 2020-05-05 11:04:32 +0000 |
59 | @@ -38,6 +38,9 @@ |
60 | ) |
61 | |
62 | |
63 | +if sys.version_info[0] >= 3: |
64 | + intern = sys.intern |
65 | + |
66 | timer = time.time |
67 | if sys.platform == 'win32': |
68 | timer = time.clock |
69 | @@ -46,35 +49,39 @@ |
70 | # faster than simplejson without extensions, though slower than simplejson w/ |
71 | # extensions. |
72 | _object_re = re.compile( |
73 | - r'\{"address": (?P<address>\d+)' |
74 | - r', "type": "(?P<type>[^"]*)"' |
75 | - r', "size": (?P<size>\d+)' |
76 | - r'(, "name": "(?P<name>.*)")?' |
77 | - r'(, "len": (?P<len>\d+))?' |
78 | - r'(, "value": (?P<valuequote>"?)(?P<value>.*)(?P=valuequote))?' |
79 | - r', "refs": \[(?P<refs>[^]]*)\]' |
80 | - r'\}') |
81 | + br'\{"address": (?P<address>\d+)' |
82 | + br', "type": "(?P<type>[^"]*)"' |
83 | + br', "size": (?P<size>\d+)' |
84 | + br'(, "name": "(?P<name>.*)")?' |
85 | + br'(, "len": (?P<len>\d+))?' |
86 | + br'(, "value": (?P<valuequote>"?)(?P<value>.*)(?P=valuequote))?' |
87 | + br', "refs": \[(?P<refs>[^]]*)\]' |
88 | + br'\}') |
89 | |
90 | _refs_re = re.compile( |
91 | - r'(?P<ref>\d+)' |
92 | + br'(?P<ref>\d+)' |
93 | ) |
94 | |
95 | |
96 | def _from_json(cls, line, temp_cache=None): |
97 | val = simplejson.loads(line) |
98 | # simplejson likes to turn everything into unicode strings, but we know |
99 | - # everything is just a plain 'str', and we can save some bytes if we |
100 | - # cast it back |
101 | + # everything is just plain ASCII, and we can save some bytes if we cast |
102 | + # things back to `bytes`. This is a little surprising on Python 3, but |
103 | + # it makes it easier to deal with large dumps. |
104 | + name = val.get('name', None) |
105 | + if name is not None and isinstance(name, six.text_type): |
106 | + name = name.encode('ASCII') |
107 | obj = cls(address=val['address'], |
108 | - type_str=str(val['type']), |
109 | + type_str=intern(str(val['type'])), |
110 | size=val['size'], |
111 | children=val['refs'], |
112 | length=val.get('len', None), |
113 | value=val.get('value', None), |
114 | - name=val.get('name', None)) |
115 | - if (obj.type_str == 'str'): |
116 | - if type(obj.value) is unicode: |
117 | - obj.value = obj.value.encode('latin-1') |
118 | + name=name) |
119 | + if (obj.type_str != six.text_type.__name__ and |
120 | + isinstance(obj.value, six.text_type)): |
121 | + obj.value = obj.value.encode('latin-1') |
122 | if temp_cache is not None: |
123 | obj._intern_from_cache(temp_cache) |
124 | return obj |
125 | @@ -87,9 +94,11 @@ |
126 | (address, type_str, size, name, length, value, |
127 | refs) = m.group('address', 'type', 'size', 'name', 'len', |
128 | 'value', 'refs') |
129 | + if not isinstance(type_str, str): |
130 | + type_str = type_str.decode('UTF-8') |
131 | assert '\\' not in type_str |
132 | if name is not None: |
133 | - assert '\\' not in name |
134 | + assert b'\\' not in name |
135 | if length is not None: |
136 | length = int(length) |
137 | refs = [int(val) for val in _refs_re.findall(refs)] |
138 | @@ -105,9 +114,8 @@ |
139 | length=length, |
140 | value=value, |
141 | name=name) |
142 | - if (obj.type_str == 'str'): |
143 | - if type(obj.value) is unicode: |
144 | - obj.value = obj.value.encode('latin-1') |
145 | + if obj.type_str == six.text_type.__name__ and isinstance(obj.value, bytes): |
146 | + obj.value = obj.value.decode('latin-1') |
147 | if temp_cache is not None: |
148 | obj._intern_from_cache(temp_cache) |
149 | return obj |
150 | @@ -443,7 +451,10 @@ |
151 | obj.size = obj.size + dict_obj.size |
152 | obj.total_size = 0 |
153 | if obj.type_str == 'instance': |
154 | - obj.type_str = type_obj.value |
155 | + instance_type_str = type_obj.value |
156 | + if not isinstance(instance_type_str, str): |
157 | + instance_type_str = instance_type_str.decode('UTF-8') |
158 | + obj.type_str = instance_type_str |
159 | # Now that all the data has been moved into the instance, we |
160 | # will want to remove the dict from the collection. We'll do the |
161 | # actual deletion later, since we are using iteritems for this |
162 | @@ -576,7 +587,7 @@ |
163 | input_mb = input_size / 1024. / 1024. |
164 | temp_cache = {} |
165 | address_re = re.compile( |
166 | - r'{"address": (?P<address>\d+)' |
167 | + br'{"address": (?P<address>\d+)' |
168 | ) |
169 | bytes_read = count = 0 |
170 | last = 0 |
171 | @@ -589,9 +600,9 @@ |
172 | factory = _loader._MemObjectProxy_from_args |
173 | for line_num, line in enumerate(source): |
174 | bytes_read += len(line) |
175 | - if line in ("[\n", "]\n"): |
176 | + if line in (b"[\n", b"]\n"): |
177 | continue |
178 | - if line.endswith(',\n'): |
179 | + if line.endswith(b',\n'): |
180 | line = line[:-2] |
181 | if objs: |
182 | # Skip duplicate objects |
183 | |
184 | === modified file 'meliae/tests/test__loader.py' |
185 | --- meliae/tests/test__loader.py 2020-03-11 20:54:23 +0000 |
186 | +++ meliae/tests/test__loader.py 2020-05-05 11:04:32 +0000 |
187 | @@ -354,7 +354,10 @@ |
188 | mop.children = [addr876542+1, addr654320+1] |
189 | mop.parents = [addr876542+1, addr654320+1] |
190 | self.assertFalse(mop.address is addr) |
191 | - self.assertFalse(mop.type_str is t) |
192 | + # type_str always gets interned, so mop.type_str is identical to the |
193 | + # cached object even though its input string isn't. |
194 | + self.assertFalse(type_str is t) |
195 | + self.assertTrue(mop.type_str is t) |
196 | rl = mop.children |
197 | self.assertFalse(rl[0] is addr876543) |
198 | self.assertFalse(rl[1] is addr654321) |
199 | |
200 | === modified file 'meliae/tests/test_loader.py' |
201 | --- meliae/tests/test_loader.py 2020-01-29 13:19:59 +0000 |
202 | +++ meliae/tests/test_loader.py 2020-05-05 11:04:32 +0000 |
203 | @@ -19,6 +19,8 @@ |
204 | import sys |
205 | import tempfile |
206 | |
207 | +import six |
208 | + |
209 | from meliae import ( |
210 | _loader, |
211 | loader, |
212 | @@ -32,22 +34,26 @@ |
213 | # a@5 = 1 |
214 | # b@4 = 2 |
215 | # c@6 = 'a str' |
216 | -# t@7 = (a, b) |
217 | +# u@8 = u'a unicode' |
218 | +# t@7 = (a, b, u) |
219 | # d@2 = {a:b, c:t} |
220 | -# l@3 = [a, b] |
221 | +# l@3 = [a, b, u] |
222 | # l.append(l) |
223 | # outer@1 = (d, l) |
224 | _example_dump = [ |
225 | '{"address": 1, "type": "tuple", "size": 20, "len": 2, "refs": [2, 3]}', |
226 | -'{"address": 3, "type": "list", "size": 44, "len": 3, "refs": [3, 4, 5]}', |
227 | +'{"address": 3, "type": "list", "size": 44, "len": 3, "refs": [3, 4, 5, 8]}', |
228 | '{"address": 5, "type": "int", "size": 12, "value": 1, "refs": []}', |
229 | '{"address": 4, "type": "int", "size": 12, "value": 2, "refs": []}', |
230 | '{"address": 2, "type": "dict", "size": 124, "len": 2, "refs": [4, 5, 6, 7]}', |
231 | -'{"address": 7, "type": "tuple", "size": 20, "len": 2, "refs": [4, 5]}', |
232 | -'{"address": 6, "type": "str", "size": 29, "len": 5, "value": "a str"' |
233 | - ', "refs": []}', |
234 | -'{"address": 8, "type": "module", "size": 60, "name": "mymod", "refs": [2]}', |
235 | +'{"address": 7, "type": "tuple", "size": 20, "len": 2, "refs": [4, 5, 8]}', |
236 | +'{"address": 6, "type": "%s", "size": 29, "len": 5, "value": "a str"' |
237 | + ', "refs": []}' % bytes.__name__, |
238 | +'{"address": 8, "type": "%s", "size": 88, "len": 9, "value": "a unicode"' |
239 | + ', "refs": []}' % six.text_type.__name__, |
240 | +'{"address": 9, "type": "module", "size": 60, "name": "mymod", "refs": [2]}', |
241 | ] |
242 | +_example_dump = [line.encode('ASCII') for line in _example_dump] |
243 | |
244 | # Note that this doesn't have a complete copy of the references. Namely when |
245 | # you subclass object you get a lot of references, and type instances also |
246 | @@ -72,6 +78,7 @@ |
247 | '{"address": 14, "type": "module", "size": 28, "name": "sys", "refs": [15]}', |
248 | '{"address": 15, "type": "dict", "size": 140, "len": 2, "refs": [5, 6, 9, 6]}', |
249 | ] |
250 | +_instance_dump = [line.encode('ASCII') for line in _instance_dump] |
251 | |
252 | _old_instance_dump = [ |
253 | '{"address": 1, "type": "instance", "size": 36, "refs": [2, 3]}', |
254 | @@ -86,6 +93,7 @@ |
255 | ', "refs": []}', |
256 | '{"address": 8, "type": "tuple", "size": 28, "len": 0, "refs": []}', |
257 | ] |
258 | +_old_instance_dump = [line.encode('ASCII') for line in _old_instance_dump] |
259 | |
260 | _intern_dict_dump = [ |
261 | '{"address": 2, "type": "str", "size": 25, "len": 1, "value": "a", "refs": []}', |
262 | @@ -96,6 +104,7 @@ |
263 | '{"address": 7, "type": "dict", "size": 512, "refs": [6, 6, 5, 5, 4, 4, 3, 3]}', |
264 | '{"address": 8, "type": "dict", "size": 512, "refs": [2, 2, 5, 5, 4, 4, 3, 3]}', |
265 | ] |
266 | +_intern_dict_dump = [line.encode('ASCII') for line in _intern_dict_dump] |
267 | |
268 | |
269 | class TestLoad(tests.TestCase): |
270 | @@ -116,8 +125,8 @@ |
271 | |
272 | def test_load_one(self): |
273 | objs = loader.load([ |
274 | - '{"address": 1234, "type": "int", "size": 12, "value": 10' |
275 | - ', "refs": []}'], show_prog=False).objs |
276 | + b'{"address": 1234, "type": "int", "size": 12, "value": 10' |
277 | + b', "refs": []}'], show_prog=False).objs |
278 | keys = objs.keys() |
279 | self.assertEqual([1234], keys) |
280 | obj = objs[1234] |
281 | @@ -128,16 +137,19 @@ |
282 | |
283 | def test_load_without_simplejson(self): |
284 | objs = loader.load([ |
285 | - '{"address": 1234, "type": "int", "size": 12, "value": 10' |
286 | - ', "refs": []}', |
287 | - '{"address": 2345, "type": "module", "size": 60, "name": "mymod"' |
288 | - ', "refs": [1234]}', |
289 | - '{"address": 4567, "type": "str", "size": 150, "len": 126' |
290 | - ', "value": "Test \\\'whoami\\\'\\u000a\\"Your name\\""' |
291 | - ', "refs": []}' |
292 | + b'{"address": 1234, "type": "int", "size": 12, "value": 10' |
293 | + b', "refs": []}', |
294 | + b'{"address": 2345, "type": "module", "size": 60, "name": "mymod"' |
295 | + b', "refs": [1234]}', |
296 | + ('{"address": 4567, "type": "%s", "size": 150, "len": 126' |
297 | + ', "value": "Test \\/whoami\\/\\u000a\\"Your name\\""' |
298 | + ', "refs": []}' % bytes.__name__).encode('UTF-8'), |
299 | + ('{"address": 5678, "type": "%s", "size": 150, "len": 126' |
300 | + ', "value": "Test \\/whoami\\/\\u000a\\"Your name\\""' |
301 | + ', "refs": []}' % six.text_type.__name__).encode('UTF-8'), |
302 | ], using_json=False, show_prog=False).objs |
303 | keys = sorted(objs.keys()) |
304 | - self.assertEqual([1234, 2345, 4567], keys) |
305 | + self.assertEqual([1234, 2345, 4567, 5678], keys) |
306 | obj = objs[1234] |
307 | self.assertTrue(isinstance(obj, _loader._MemObjectProxy)) |
308 | # The address should be exactly the same python object as the key in |
309 | @@ -146,9 +158,15 @@ |
310 | self.assertEqual(10, obj.value) |
311 | obj = objs[2345] |
312 | self.assertEqual("module", obj.type_str) |
313 | - self.assertEqual("mymod", obj.value) |
314 | + self.assertEqual(b"mymod", obj.value) |
315 | obj = objs[4567] |
316 | - self.assertEqual("Test \\'whoami\\'\\u000a\\\"Your name\\\"", obj.value) |
317 | + self.assertTrue(isinstance(obj.value, bytes)) |
318 | + self.assertEqual( |
319 | + b"Test \\/whoami\\/\\u000a\\\"Your name\\\"", obj.value) |
320 | + obj = objs[5678] |
321 | + self.assertTrue(isinstance(obj.value, six.text_type)) |
322 | + self.assertEqual( |
323 | + u"Test \\/whoami\\/\\u000a\\\"Your name\\\"", obj.value) |
324 | |
325 | def test_load_example(self): |
326 | objs = loader.load(_example_dump, show_prog=False) |
327 | @@ -168,7 +186,7 @@ |
328 | try: |
329 | content = gzip.GzipFile(mode='wb', compresslevel=6, fileobj=f) |
330 | for line in _example_dump: |
331 | - content.write(line + '\n') |
332 | + content.write(line + b'\n') |
333 | content.flush() |
334 | content.close() |
335 | del content |
336 | @@ -197,24 +215,24 @@ |
337 | def test_remove_expensive_references(self): |
338 | lines = list(_example_dump) |
339 | lines.pop(-1) # Remove the old module |
340 | - lines.append('{"address": 8, "type": "module", "size": 12' |
341 | - ', "name": "mymod", "refs": [9]}') |
342 | - lines.append('{"address": 9, "type": "dict", "size": 124' |
343 | - ', "refs": [10, 11]}') |
344 | - lines.append('{"address": 10, "type": "module", "size": 12' |
345 | - ', "name": "mod2", "refs": [12]}') |
346 | - lines.append('{"address": 11, "type": "str", "size": 27' |
347 | - ', "value": "boo", "refs": []}') |
348 | - lines.append('{"address": 12, "type": "dict", "size": 124' |
349 | - ', "refs": []}') |
350 | + lines.append(b'{"address": 9, "type": "module", "size": 12' |
351 | + b', "name": "mymod", "refs": [10]}') |
352 | + lines.append(b'{"address": 10, "type": "dict", "size": 124' |
353 | + b', "refs": [11, 12]}') |
354 | + lines.append(b'{"address": 11, "type": "module", "size": 12' |
355 | + b', "name": "mod2", "refs": [13]}') |
356 | + lines.append(b'{"address": 12, "type": "str", "size": 27' |
357 | + b', "value": "boo", "refs": []}') |
358 | + lines.append(b'{"address": 13, "type": "dict", "size": 124' |
359 | + b', "refs": []}') |
360 | source = lambda:loader.iter_objs(lines) |
361 | - mymod_dict = list(source())[8] |
362 | - self.assertEqual([10, 11], mymod_dict.children) |
363 | + mymod_dict = list(source())[9] |
364 | + self.assertEqual([11, 12], mymod_dict.children) |
365 | result = list(loader.remove_expensive_references(source)) |
366 | null_obj = result[0][1] |
367 | self.assertEqual(0, null_obj.address) |
368 | self.assertEqual('<ex-reference>', null_obj.type_str) |
369 | - self.assertEqual([11, 0], result[9][1].children) |
370 | + self.assertEqual([12, 0], result[10][1].children) |
371 | |
372 | |
373 | class TestMemObj(tests.TestCase): |
374 | @@ -226,12 +244,15 @@ |
375 | expected = [ |
376 | '{"address": 1, "type": "tuple", "size": 20, "refs": [2, 3]}', |
377 | '{"address": 2, "type": "dict", "size": 124, "refs": [4, 5, 6, 7]}', |
378 | -'{"address": 3, "type": "list", "size": 44, "refs": [3, 4, 5]}', |
379 | +'{"address": 3, "type": "list", "size": 44, "refs": [3, 4, 5, 8]}', |
380 | '{"address": 4, "type": "int", "size": 12, "value": 2, "refs": []}', |
381 | '{"address": 5, "type": "int", "size": 12, "value": 1, "refs": []}', |
382 | -'{"address": 6, "type": "str", "size": 29, "value": "a str", "refs": []}', |
383 | -'{"address": 7, "type": "tuple", "size": 20, "refs": [4, 5]}', |
384 | -'{"address": 8, "type": "module", "size": 60, "value": "mymod", "refs": [2]}', |
385 | +'{"address": 6, "type": "%s", "size": 29, "value": "a str"' |
386 | + ', "refs": []}' % bytes.__name__, |
387 | +'{"address": 7, "type": "tuple", "size": 20, "refs": [4, 5, 8]}', |
388 | +'{"address": 8, "type": "%s", "size": 88, "value": "a unicode"' |
389 | + ', "refs": []}' % six.text_type.__name__, |
390 | +'{"address": 9, "type": "module", "size": 60, "value": "mymod", "refs": [2]}', |
391 | ] |
392 | self.assertEqual(expected, [obj.to_json() for obj in objs]) |
393 | |
394 | @@ -243,11 +264,12 @@ |
395 | objs = manager.objs |
396 | self.assertEqual((), objs[1].parents) |
397 | self.assertEqual([1, 3], objs[3].parents) |
398 | - self.assertEqual([3, 7, 8], sorted(objs[4].parents)) |
399 | - self.assertEqual([3, 7, 8], sorted(objs[5].parents)) |
400 | - self.assertEqual([8], objs[6].parents) |
401 | - self.assertEqual([8], objs[7].parents) |
402 | - self.assertEqual((), objs[8].parents) |
403 | + self.assertEqual([3, 7, 9], sorted(objs[4].parents)) |
404 | + self.assertEqual([3, 7, 9], sorted(objs[5].parents)) |
405 | + self.assertEqual([9], objs[6].parents) |
406 | + self.assertEqual([9], objs[7].parents) |
407 | + self.assertEqual([3, 7], objs[8].parents) |
408 | + self.assertEqual((), objs[9].parents) |
409 | |
410 | def test_compute_referrers(self): |
411 | # Deprecated |
412 | @@ -267,11 +289,12 @@ |
413 | warn.trap_warnings(old_func) |
414 | self.assertEqual((), objs[1].parents) |
415 | self.assertEqual([1, 3], objs[3].parents) |
416 | - self.assertEqual([3, 7, 8], sorted(objs[4].parents)) |
417 | - self.assertEqual([3, 7, 8], sorted(objs[5].parents)) |
418 | - self.assertEqual([8], objs[6].parents) |
419 | - self.assertEqual([8], objs[7].parents) |
420 | - self.assertEqual((), objs[8].parents) |
421 | + self.assertEqual([3, 7, 9], sorted(objs[4].parents)) |
422 | + self.assertEqual([3, 7, 9], sorted(objs[5].parents)) |
423 | + self.assertEqual([9], objs[6].parents) |
424 | + self.assertEqual([9], objs[7].parents) |
425 | + self.assertEqual([3, 7], objs[8].parents) |
426 | + self.assertEqual((), objs[9].parents) |
427 | |
428 | def test_compute_parents_ignore_repeated(self): |
429 | manager = loader.load(_intern_dict_dump, show_prog=False) |
430 | @@ -294,6 +317,7 @@ |
431 | for x in range(200): |
432 | content.append('{"address": %d, "type": "tuple", "size": 20,' |
433 | ' "len": 2, "refs": [2, 2]}' % (x+100)) |
434 | + content = [line.encode('UTF-8') for line in content] |
435 | # By default, we only track 100 parents |
436 | manager = loader.load(content, show_prog=False) |
437 | self.assertEqual(100, manager[2].num_parents) |
438 | @@ -307,42 +331,42 @@ |
439 | def test_compute_total_size(self): |
440 | manager = loader.load(_example_dump, show_prog=False) |
441 | objs = manager.objs |
442 | - manager.compute_total_size(objs[8]) |
443 | - self.assertEqual(257, objs[8].total_size) |
444 | + manager.compute_total_size(objs[9]) |
445 | + self.assertEqual(345, objs[9].total_size) |
446 | |
447 | def test_compute_total_size_missing_ref(self): |
448 | lines = list(_example_dump) |
449 | # 999 isn't in the dump, not sure how we get these in real life, but |
450 | # they exist. we should live with references that can't be resolved. |
451 | - lines[-1] = ('{"address": 8, "type": "tuple", "size": 16, "len": 1' |
452 | - ', "refs": [999]}') |
453 | + lines[-1] = (b'{"address": 9, "type": "tuple", "size": 16, "len": 1' |
454 | + b', "refs": [999]}') |
455 | manager = loader.load(lines, show_prog=False) |
456 | - obj = manager[8] |
457 | + obj = manager[9] |
458 | manager.compute_total_size(obj) |
459 | self.assertEqual(16, obj.total_size) |
460 | |
461 | def test_remove_expensive_references(self): |
462 | lines = list(_example_dump) |
463 | lines.pop(-1) # Remove the old module |
464 | - lines.append('{"address": 8, "type": "module", "size": 12' |
465 | - ', "name": "mymod", "refs": [9]}') |
466 | - lines.append('{"address": 9, "type": "dict", "size": 124' |
467 | - ', "refs": [10, 11]}') |
468 | - lines.append('{"address": 10, "type": "module", "size": 12' |
469 | - ', "name": "mod2", "refs": [12]}') |
470 | - lines.append('{"address": 11, "type": "str", "size": 27' |
471 | - ', "value": "boo", "refs": []}') |
472 | - lines.append('{"address": 12, "type": "dict", "size": 124' |
473 | - ', "refs": []}') |
474 | + lines.append(b'{"address": 9, "type": "module", "size": 12' |
475 | + b', "name": "mymod", "refs": [10]}') |
476 | + lines.append(b'{"address": 10, "type": "dict", "size": 124' |
477 | + b', "refs": [11, 12]}') |
478 | + lines.append(b'{"address": 11, "type": "module", "size": 12' |
479 | + b', "name": "mod2", "refs": [13]}') |
480 | + lines.append(b'{"address": 12, "type": "str", "size": 27' |
481 | + b', "value": "boo", "refs": []}') |
482 | + lines.append(b'{"address": 13, "type": "dict", "size": 124' |
483 | + b', "refs": []}') |
484 | manager = loader.load(lines, show_prog=False, collapse=False) |
485 | - mymod_dict = manager.objs[9] |
486 | - self.assertEqual([10, 11], mymod_dict.children) |
487 | + mymod_dict = manager.objs[10] |
488 | + self.assertEqual([11, 12], mymod_dict.children) |
489 | manager.remove_expensive_references() |
490 | self.assertTrue(0 in manager.objs) |
491 | null_obj = manager.objs[0] |
492 | self.assertEqual(0, null_obj.address) |
493 | self.assertEqual('<ex-reference>', null_obj.type_str) |
494 | - self.assertEqual([11, 0], mymod_dict.children) |
495 | + self.assertEqual([12, 0], mymod_dict.children) |
496 | |
497 | def test_collapse_instance_dicts(self): |
498 | manager = loader.load(_instance_dump, show_prog=False, collapse=False) |
499 | @@ -419,16 +443,18 @@ |
500 | |
501 | def test_summarize_refs(self): |
502 | manager = loader.load(_example_dump, show_prog=False) |
503 | - summary = manager.summarize(manager[8]) |
504 | + summary = manager.summarize(manager[9]) |
505 | # Note that the module is included in the summary |
506 | - self.assertEqual(['int', 'module', 'str', 'tuple'], |
507 | + self.assertEqual(sorted(['int', 'module', bytes.__name__, |
508 | + six.text_type.__name__, 'tuple']), |
509 | sorted(summary.type_summaries.keys())) |
510 | - self.assertEqual(257, summary.total_size) |
511 | + self.assertEqual(345, summary.total_size) |
512 | |
513 | def test_summarize_excluding(self): |
514 | manager = loader.load(_example_dump, show_prog=False) |
515 | - summary = manager.summarize(manager[8], excluding=[4, 5]) |
516 | + summary = manager.summarize(manager[9], excluding=[4, 5]) |
517 | # No ints when they are explicitly filtered |
518 | - self.assertEqual(['module', 'str', 'tuple'], |
519 | + self.assertEqual(sorted(['module', bytes.__name__, |
520 | + six.text_type.__name__, 'tuple']), |
521 | sorted(summary.type_summaries.keys())) |
522 | - self.assertEqual(233, summary.total_size) |
523 | + self.assertEqual(321, summary.total_size) |
This is an interesting one. I think fundamentally we should just drop the _from_line decoder, since 'json' is now a core part of the Python stdlib.
The question remains whether we want to leverage the fact that bytes objects can be more memory-efficient than PyUnicode objects.
>>> import sys
>>> for i in [0, 1, 10, 50, 500]:
... x = b'1'*i
... y = '1'*i
... print(i, sys.getsizeof(x), sys.getsizeof(y))
...
0 33 49
1 34 50
10 43 59
50 83 99
500 533 549
It is a modest win at low sizes (59 vs 43 at 10 bytes is 37% more overhead).
I don't know how much it really matters, but I do remember playing a lot of tricks to make it easier to look at a large memory dump without needing at least as much memory as the live process was using. (It's why there is a MemObjectCollection, which is functionally a dict but typed, and why there are Proxy objects that live just long enough to look like a fleshed-out Python object instead of just a C struct.)