Merge lp:~jameinel/bzr/1.15-pack-source into lp:~bzr/bzr/trunk-old
Proposed by: John A Meinel
Status: Merged
Merged at revision: not available
Proposed branch: lp:~jameinel/bzr/1.15-pack-source
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 824 lines
To merge this branch: bzr merge lp:~jameinel/bzr/1.15-pack-source
Related bugs:
Reviewer: Martin Pool (Approve)
Review via email: mp+6985@code.launchpad.net
John A Meinel (jameinel) wrote:

Martin Pool (mbp) wrote:
This looks ok to me, though you might want to run the concept past Robert.
review: Approve
Robert Collins (lifeless) wrote:
On Tue, 2009-06-16 at 05:33 +0000, Martin Pool wrote:
> Review: Approve
> This looks ok to me, though you might want to run the concept past Robert.
Conceptually fine. Using Packer was a hack when we had no interface able
to be efficient back in the days of single VersionedFile and Knits.
-Rob
Preview Diff
1 | === modified file 'bzrlib/fetch.py' |
2 | --- bzrlib/fetch.py 2009-06-10 03:56:49 +0000 |
3 | +++ bzrlib/fetch.py 2009-06-16 02:36:36 +0000 |
4 | @@ -51,9 +51,6 @@ |
5 | :param last_revision: If set, try to limit to the data this revision |
6 | references. |
7 | :param find_ghosts: If True search the entire history for ghosts. |
8 | - :param _write_group_acquired_callable: Don't use; this parameter only |
9 | - exists to facilitate a hack done in InterPackRepo.fetch. We would |
10 | - like to remove this parameter. |
11 | :param pb: ProgressBar object to use; deprecated and ignored. |
12 | This method will just create one on top of the stack. |
13 | """ |
14 | |
15 | === modified file 'bzrlib/repofmt/groupcompress_repo.py' |
16 | --- bzrlib/repofmt/groupcompress_repo.py 2009-06-12 01:11:00 +0000 |
17 | +++ bzrlib/repofmt/groupcompress_repo.py 2009-06-16 02:36:36 +0000 |
18 | @@ -48,6 +48,7 @@ |
19 | Pack, |
20 | NewPack, |
21 | KnitPackRepository, |
22 | + KnitPackStreamSource, |
23 | PackRootCommitBuilder, |
24 | RepositoryPackCollection, |
25 | RepositoryFormatPack, |
26 | @@ -736,21 +737,10 @@ |
27 | # make it raise to trap naughty direct users. |
28 | raise NotImplementedError(self._iter_inventory_xmls) |
29 | |
30 | - def _find_parent_ids_of_revisions(self, revision_ids): |
31 | - # TODO: we probably want to make this a helper that other code can get |
32 | - # at |
33 | - parent_map = self.get_parent_map(revision_ids) |
34 | - parents = set() |
35 | - map(parents.update, parent_map.itervalues()) |
36 | - parents.difference_update(revision_ids) |
37 | - parents.discard(_mod_revision.NULL_REVISION) |
38 | - return parents |
39 | - |
40 | - def _find_present_inventory_ids(self, revision_ids): |
41 | - keys = [(r,) for r in revision_ids] |
42 | - parent_map = self.inventories.get_parent_map(keys) |
43 | - present_inventory_ids = set(k[-1] for k in parent_map) |
44 | - return present_inventory_ids |
45 | + def _find_present_inventory_keys(self, revision_keys): |
46 | + parent_map = self.inventories.get_parent_map(revision_keys) |
47 | + present_inventory_keys = set(k for k in parent_map) |
48 | + return present_inventory_keys |
49 | |
50 | def fileids_altered_by_revision_ids(self, revision_ids, _inv_weave=None): |
51 | """Find the file ids and versions affected by revisions. |
52 | @@ -767,12 +757,20 @@ |
53 | file_id_revisions = {} |
54 | pb = ui.ui_factory.nested_progress_bar() |
55 | try: |
56 | - parent_ids = self._find_parent_ids_of_revisions(revision_ids) |
57 | - present_parent_inv_ids = self._find_present_inventory_ids(parent_ids) |
58 | + revision_keys = [(r,) for r in revision_ids] |
59 | + parent_keys = self._find_parent_keys_of_revisions(revision_keys) |
60 | + # TODO: instead of using _find_present_inventory_keys, change the |
61 | + # code paths to allow missing inventories to be tolerated. |
62 | + # However, we only want to tolerate missing parent |
63 | + # inventories, not missing inventories for revision_ids |
64 | + present_parent_inv_keys = self._find_present_inventory_keys( |
65 | + parent_keys) |
66 | + present_parent_inv_ids = set( |
67 | + [k[-1] for k in present_parent_inv_keys]) |
68 | uninteresting_root_keys = set() |
69 | interesting_root_keys = set() |
70 | - inventories_to_read = set(present_parent_inv_ids) |
71 | - inventories_to_read.update(revision_ids) |
72 | + inventories_to_read = set(revision_ids) |
73 | + inventories_to_read.update(present_parent_inv_ids) |
74 | for inv in self.iter_inventories(inventories_to_read): |
75 | entry_chk_root_key = inv.id_to_entry.key() |
76 | if inv.revision_id in present_parent_inv_ids: |
77 | @@ -846,7 +844,7 @@ |
78 | return super(CHKInventoryRepository, self)._get_source(to_format) |
79 | |
80 | |
81 | -class GroupCHKStreamSource(repository.StreamSource): |
82 | +class GroupCHKStreamSource(KnitPackStreamSource): |
83 | """Used when both the source and target repo are GroupCHK repos.""" |
84 | |
85 | def __init__(self, from_repository, to_format): |
86 | @@ -854,6 +852,7 @@ |
87 | super(GroupCHKStreamSource, self).__init__(from_repository, to_format) |
88 | self._revision_keys = None |
89 | self._text_keys = None |
90 | + self._text_fetch_order = 'groupcompress' |
91 | self._chk_id_roots = None |
92 | self._chk_p_id_roots = None |
93 | |
94 | @@ -898,16 +897,10 @@ |
95 | p_id_roots_set.clear() |
96 | return ('inventories', _filtered_inv_stream()) |
97 | |
98 | - def _find_present_inventories(self, revision_ids): |
99 | - revision_keys = [(r,) for r in revision_ids] |
100 | - inventories = self.from_repository.inventories |
101 | - present_inventories = inventories.get_parent_map(revision_keys) |
102 | - return [p[-1] for p in present_inventories] |
103 | - |
104 | - def _get_filtered_chk_streams(self, excluded_revision_ids): |
105 | + def _get_filtered_chk_streams(self, excluded_revision_keys): |
106 | self._text_keys = set() |
107 | - excluded_revision_ids.discard(_mod_revision.NULL_REVISION) |
108 | - if not excluded_revision_ids: |
109 | + excluded_revision_keys.discard(_mod_revision.NULL_REVISION) |
110 | + if not excluded_revision_keys: |
111 | uninteresting_root_keys = set() |
112 | uninteresting_pid_root_keys = set() |
113 | else: |
114 | @@ -915,9 +908,9 @@ |
115 | # actually present |
116 | # TODO: Update Repository.iter_inventories() to add |
117 | # ignore_missing=True |
118 | - present_ids = self.from_repository._find_present_inventory_ids( |
119 | - excluded_revision_ids) |
120 | - present_ids = self._find_present_inventories(excluded_revision_ids) |
121 | + present_keys = self.from_repository._find_present_inventory_keys( |
122 | + excluded_revision_keys) |
123 | + present_ids = [k[-1] for k in present_keys] |
124 | uninteresting_root_keys = set() |
125 | uninteresting_pid_root_keys = set() |
126 | for inv in self.from_repository.iter_inventories(present_ids): |
127 | @@ -948,14 +941,6 @@ |
128 | self._chk_p_id_roots = None |
129 | yield 'chk_bytes', _get_parent_id_basename_to_file_id_pages() |
130 | |
131 | - def _get_text_stream(self): |
132 | - # Note: We know we don't have to handle adding root keys, because both |
133 | - # the source and target are GCCHK, and those always support rich-roots |
134 | - # We may want to request as 'unordered', in case the source has done a |
135 | - # 'split' packing |
136 | - return ('texts', self.from_repository.texts.get_record_stream( |
137 | - self._text_keys, 'groupcompress', False)) |
138 | - |
139 | def get_stream(self, search): |
140 | revision_ids = search.get_keys() |
141 | for stream_info in self._fetch_revision_texts(revision_ids): |
142 | @@ -966,8 +951,9 @@ |
143 | # For now, exclude all parents that are at the edge of ancestry, for |
144 | # which we have inventories |
145 | from_repo = self.from_repository |
146 | - parent_ids = from_repo._find_parent_ids_of_revisions(revision_ids) |
147 | - for stream_info in self._get_filtered_chk_streams(parent_ids): |
148 | + parent_keys = from_repo._find_parent_keys_of_revisions( |
149 | + self._revision_keys) |
150 | + for stream_info in self._get_filtered_chk_streams(parent_keys): |
151 | yield stream_info |
152 | yield self._get_text_stream() |
153 | |
154 | @@ -991,8 +977,8 @@ |
155 | # no unavailable texts when the ghost inventories are not filled in. |
156 | yield self._get_inventory_stream(missing_inventory_keys, |
157 | allow_absent=True) |
158 | - # We use the empty set for excluded_revision_ids, to make it clear that |
159 | - # we want to transmit all referenced chk pages. |
160 | + # We use the empty set for excluded_revision_keys, to make it clear |
161 | + # that we want to transmit all referenced chk pages. |
162 | for stream_info in self._get_filtered_chk_streams(set()): |
163 | yield stream_info |
164 | |
165 | |
166 | === modified file 'bzrlib/repofmt/pack_repo.py' |
167 | --- bzrlib/repofmt/pack_repo.py 2009-06-10 03:56:49 +0000 |
168 | +++ bzrlib/repofmt/pack_repo.py 2009-06-16 02:36:36 +0000 |
169 | @@ -73,6 +73,7 @@ |
170 | MetaDirRepositoryFormat, |
171 | RepositoryFormat, |
172 | RootCommitBuilder, |
173 | + StreamSource, |
174 | ) |
175 | import bzrlib.revision as _mod_revision |
176 | from bzrlib.trace import ( |
177 | @@ -2265,6 +2266,11 @@ |
178 | pb.finished() |
179 | return result |
180 | |
181 | + def _get_source(self, to_format): |
182 | + if to_format.network_name() == self._format.network_name(): |
183 | + return KnitPackStreamSource(self, to_format) |
184 | + return super(KnitPackRepository, self)._get_source(to_format) |
185 | + |
186 | def _make_parents_provider(self): |
187 | return graph.CachingParentsProvider(self) |
188 | |
189 | @@ -2384,6 +2390,79 @@ |
190 | repo.unlock() |
191 | |
192 | |
193 | +class KnitPackStreamSource(StreamSource): |
194 | + """A StreamSource used to transfer data between same-format KnitPack repos. |
195 | + |
196 | + This source assumes: |
197 | + 1) Same serialization format for all objects |
198 | + 2) Same root information |
199 | + 3) XML format inventories |
200 | + 4) Atomic inserts (so we can stream inventory texts before text |
201 | + content) |
202 | + 5) No chk_bytes |
203 | + """ |
204 | + |
205 | + def __init__(self, from_repository, to_format): |
206 | + super(KnitPackStreamSource, self).__init__(from_repository, to_format) |
207 | + self._text_keys = None |
208 | + self._text_fetch_order = 'unordered' |
209 | + |
210 | + def _get_filtered_inv_stream(self, revision_ids): |
211 | + from_repo = self.from_repository |
212 | + parent_ids = from_repo._find_parent_ids_of_revisions(revision_ids) |
213 | + parent_keys = [(p,) for p in parent_ids] |
214 | + find_text_keys = from_repo._find_text_key_references_from_xml_inventory_lines |
215 | + parent_text_keys = set(find_text_keys( |
216 | + from_repo._inventory_xml_lines_for_keys(parent_keys))) |
217 | + content_text_keys = set() |
218 | + knit = KnitVersionedFiles(None, None) |
219 | + factory = KnitPlainFactory() |
220 | + def find_text_keys_from_content(record): |
221 | + if record.storage_kind not in ('knit-delta-gz', 'knit-ft-gz'): |
222 | + raise ValueError("Unknown content storage kind for" |
223 | + " inventory text: %s" % (record.storage_kind,)) |
224 | + # It's a knit record, it has a _raw_record field (even if it was |
225 | + # reconstituted from a network stream). |
226 | + raw_data = record._raw_record |
227 | + # read the entire thing |
228 | + revision_id = record.key[-1] |
229 | + content, _ = knit._parse_record(revision_id, raw_data) |
230 | + if record.storage_kind == 'knit-delta-gz': |
231 | + line_iterator = factory.get_linedelta_content(content) |
232 | + elif record.storage_kind == 'knit-ft-gz': |
233 | + line_iterator = factory.get_fulltext_content(content) |
234 | + content_text_keys.update(find_text_keys( |
235 | + [(line, revision_id) for line in line_iterator])) |
236 | + revision_keys = [(r,) for r in revision_ids] |
237 | + def _filtered_inv_stream(): |
238 | + source_vf = from_repo.inventories |
239 | + stream = source_vf.get_record_stream(revision_keys, |
240 | + 'unordered', False) |
241 | + for record in stream: |
242 | + if record.storage_kind == 'absent': |
243 | + raise errors.NoSuchRevision(from_repo, record.key) |
244 | + find_text_keys_from_content(record) |
245 | + yield record |
246 | + self._text_keys = content_text_keys - parent_text_keys |
247 | + return ('inventories', _filtered_inv_stream()) |
248 | + |
249 | + def _get_text_stream(self): |
250 | + # Note: We know we don't have to handle adding root keys, because both |
251 | + # the source and target are the identical network name. |
252 | + text_stream = self.from_repository.texts.get_record_stream( |
253 | + self._text_keys, self._text_fetch_order, False) |
254 | + return ('texts', text_stream) |
255 | + |
256 | + def get_stream(self, search): |
257 | + revision_ids = search.get_keys() |
258 | + for stream_info in self._fetch_revision_texts(revision_ids): |
259 | + yield stream_info |
260 | + self._revision_keys = [(rev_id,) for rev_id in revision_ids] |
261 | + yield self._get_filtered_inv_stream(revision_ids) |
262 | + yield self._get_text_stream() |
263 | + |
264 | + |
265 | + |
266 | class RepositoryFormatPack(MetaDirRepositoryFormat): |
267 | """Format logic for pack structured repositories. |
268 | |
269 | |
270 | === modified file 'bzrlib/repository.py' |
271 | --- bzrlib/repository.py 2009-06-12 01:11:00 +0000 |
272 | +++ bzrlib/repository.py 2009-06-16 02:36:36 +0000 |
273 | @@ -1919,29 +1919,25 @@ |
274 | yield line, revid |
275 | |
276 | def _find_file_ids_from_xml_inventory_lines(self, line_iterator, |
277 | - revision_ids): |
278 | + revision_keys): |
279 | """Helper routine for fileids_altered_by_revision_ids. |
280 | |
281 | This performs the translation of xml lines to revision ids. |
282 | |
283 | :param line_iterator: An iterator of lines, origin_version_id |
284 | - :param revision_ids: The revision ids to filter for. This should be a |
285 | + :param revision_keys: The revision ids to filter for. This should be a |
286 | set or other type which supports efficient __contains__ lookups, as |
287 | - the revision id from each parsed line will be looked up in the |
288 | - revision_ids filter. |
289 | + the revision key from each parsed line will be looked up in the |
290 | + revision_keys filter. |
291 | :return: a dictionary mapping altered file-ids to an iterable of |
292 | revision_ids. Each altered file-ids has the exact revision_ids that |
293 | altered it listed explicitly. |
294 | """ |
295 | seen = set(self._find_text_key_references_from_xml_inventory_lines( |
296 | line_iterator).iterkeys()) |
297 | - # Note that revision_ids are revision keys. |
298 | - parent_maps = self.revisions.get_parent_map(revision_ids) |
299 | - parents = set() |
300 | - map(parents.update, parent_maps.itervalues()) |
301 | - parents.difference_update(revision_ids) |
302 | + parent_keys = self._find_parent_keys_of_revisions(revision_keys) |
303 | parent_seen = set(self._find_text_key_references_from_xml_inventory_lines( |
304 | - self._inventory_xml_lines_for_keys(parents))) |
305 | + self._inventory_xml_lines_for_keys(parent_keys))) |
306 | new_keys = seen - parent_seen |
307 | result = {} |
308 | setdefault = result.setdefault |
309 | @@ -1949,6 +1945,33 @@ |
310 | setdefault(key[0], set()).add(key[-1]) |
311 | return result |
312 | |
313 | + def _find_parent_ids_of_revisions(self, revision_ids): |
314 | + """Find all parent ids that are mentioned in the revision graph. |
315 | + |
316 | + :return: set of revisions that are parents of revision_ids which are |
317 | + not part of revision_ids themselves |
318 | + """ |
319 | + parent_map = self.get_parent_map(revision_ids) |
320 | + parent_ids = set() |
321 | + map(parent_ids.update, parent_map.itervalues()) |
322 | + parent_ids.difference_update(revision_ids) |
323 | + parent_ids.discard(_mod_revision.NULL_REVISION) |
324 | + return parent_ids |
325 | + |
326 | + def _find_parent_keys_of_revisions(self, revision_keys): |
327 | + """Similar to _find_parent_ids_of_revisions, but used with keys. |
328 | + |
329 | + :param revision_keys: An iterable of revision_keys. |
330 | + :return: The parents of all revision_keys that are not already in |
331 | + revision_keys |
332 | + """ |
333 | + parent_map = self.revisions.get_parent_map(revision_keys) |
334 | + parent_keys = set() |
335 | + map(parent_keys.update, parent_map.itervalues()) |
336 | + parent_keys.difference_update(revision_keys) |
337 | + parent_keys.discard(_mod_revision.NULL_REVISION) |
338 | + return parent_keys |
339 | + |
340 | def fileids_altered_by_revision_ids(self, revision_ids, _inv_weave=None): |
341 | """Find the file ids and versions affected by revisions. |
342 | |
343 | @@ -3418,144 +3441,6 @@ |
344 | return self.source.revision_ids_to_search_result(result_set) |
345 | |
346 | |
347 | -class InterPackRepo(InterSameDataRepository): |
348 | - """Optimised code paths between Pack based repositories.""" |
349 | - |
350 | - @classmethod |
351 | - def _get_repo_format_to_test(self): |
352 | - from bzrlib.repofmt import pack_repo |
353 | - return pack_repo.RepositoryFormatKnitPack6RichRoot() |
354 | - |
355 | - @staticmethod |
356 | - def is_compatible(source, target): |
357 | - """Be compatible with known Pack formats. |
358 | - |
359 | - We don't test for the stores being of specific types because that |
360 | - could lead to confusing results, and there is no need to be |
361 | - overly general. |
362 | - |
363 | - InterPackRepo does not support CHK based repositories. |
364 | - """ |
365 | - from bzrlib.repofmt.pack_repo import RepositoryFormatPack |
366 | - from bzrlib.repofmt.groupcompress_repo import RepositoryFormatCHK1 |
367 | - try: |
368 | - are_packs = (isinstance(source._format, RepositoryFormatPack) and |
369 | - isinstance(target._format, RepositoryFormatPack)) |
370 | - not_packs = (isinstance(source._format, RepositoryFormatCHK1) or |
371 | - isinstance(target._format, RepositoryFormatCHK1)) |
372 | - except AttributeError: |
373 | - return False |
374 | - if not_packs or not are_packs: |
375 | - return False |
376 | - return InterRepository._same_model(source, target) |
377 | - |
378 | - @needs_write_lock |
379 | - def fetch(self, revision_id=None, pb=None, find_ghosts=False, |
380 | - fetch_spec=None): |
381 | - """See InterRepository.fetch().""" |
382 | - if (len(self.source._fallback_repositories) > 0 or |
383 | - len(self.target._fallback_repositories) > 0): |
384 | - # The pack layer is not aware of fallback repositories, so when |
385 | - # fetching from a stacked repository or into a stacked repository |
386 | - # we use the generic fetch logic which uses the VersionedFiles |
387 | - # attributes on repository. |
388 | - from bzrlib.fetch import RepoFetcher |
389 | - fetcher = RepoFetcher(self.target, self.source, revision_id, |
390 | - pb, find_ghosts, fetch_spec=fetch_spec) |
391 | - if fetch_spec is not None: |
392 | - if len(list(fetch_spec.heads)) != 1: |
393 | - raise AssertionError( |
394 | - "InterPackRepo.fetch doesn't support " |
395 | - "fetching multiple heads yet.") |
396 | - revision_id = list(fetch_spec.heads)[0] |
397 | - fetch_spec = None |
398 | - if revision_id is None: |
399 | - # TODO: |
400 | - # everything to do - use pack logic |
401 | - # to fetch from all packs to one without |
402 | - # inventory parsing etc, IFF nothing to be copied is in the target. |
403 | - # till then: |
404 | - source_revision_ids = frozenset(self.source.all_revision_ids()) |
405 | - revision_ids = source_revision_ids - \ |
406 | - frozenset(self.target.get_parent_map(source_revision_ids)) |
407 | - revision_keys = [(revid,) for revid in revision_ids] |
408 | - index = self.target._pack_collection.revision_index.combined_index |
409 | - present_revision_ids = set(item[1][0] for item in |
410 | - index.iter_entries(revision_keys)) |
411 | - revision_ids = set(revision_ids) - present_revision_ids |
412 | - # implementing the TODO will involve: |
413 | - # - detecting when all of a pack is selected |
414 | - # - avoiding as much as possible pre-selection, so the |
415 | - # more-core routines such as create_pack_from_packs can filter in |
416 | - # a just-in-time fashion. (though having a HEADS list on a |
417 | - # repository might make this a lot easier, because we could |
418 | - # sensibly detect 'new revisions' without doing a full index scan. |
419 | - elif _mod_revision.is_null(revision_id): |
420 | - # nothing to do: |
421 | - return (0, []) |
422 | - else: |
423 | - revision_ids = self.search_missing_revision_ids(revision_id, |
424 | - find_ghosts=find_ghosts).get_keys() |
425 | - if len(revision_ids) == 0: |
426 | - return (0, []) |
427 | - return self._pack(self.source, self.target, revision_ids) |
428 | - |
429 | - def _pack(self, source, target, revision_ids): |
430 | - from bzrlib.repofmt.pack_repo import Packer |
431 | - packs = source._pack_collection.all_packs() |
432 | - pack = Packer(self.target._pack_collection, packs, '.fetch', |
433 | - revision_ids).pack() |
434 | - if pack is not None: |
435 | - self.target._pack_collection._save_pack_names() |
436 | - copied_revs = pack.get_revision_count() |
437 | - # Trigger an autopack. This may duplicate effort as we've just done |
438 | - # a pack creation, but for now it is simpler to think about as |
439 | - # 'upload data, then repack if needed'. |
440 | - self.target._pack_collection.autopack() |
441 | - return (copied_revs, []) |
442 | - else: |
443 | - return (0, []) |
444 | - |
445 | - @needs_read_lock |
446 | - def search_missing_revision_ids(self, revision_id=None, find_ghosts=True): |
447 | - """See InterRepository.missing_revision_ids(). |
448 | - |
449 | - :param find_ghosts: Find ghosts throughout the ancestry of |
450 | - revision_id. |
451 | - """ |
452 | - if not find_ghosts and revision_id is not None: |
453 | - return self._walk_to_common_revisions([revision_id]) |
454 | - elif revision_id is not None: |
455 | - # Find ghosts: search for revisions pointing from one repository to |
456 | - # the other, and vice versa, anywhere in the history of revision_id. |
457 | - graph = self.target.get_graph(other_repository=self.source) |
458 | - searcher = graph._make_breadth_first_searcher([revision_id]) |
459 | - found_ids = set() |
460 | - while True: |
461 | - try: |
462 | - next_revs, ghosts = searcher.next_with_ghosts() |
463 | - except StopIteration: |
464 | - break |
465 | - if revision_id in ghosts: |
466 | - raise errors.NoSuchRevision(self.source, revision_id) |
467 | - found_ids.update(next_revs) |
468 | - found_ids.update(ghosts) |
469 | - found_ids = frozenset(found_ids) |
470 | - # Double query here: should be able to avoid this by changing the |
471 | - # graph api further. |
472 | - result_set = found_ids - frozenset( |
473 | - self.target.get_parent_map(found_ids)) |
474 | - else: |
475 | - source_ids = self.source.all_revision_ids() |
476 | - # source_ids is the worst possible case we may need to pull. |
477 | - # now we want to filter source_ids against what we actually |
478 | - # have in target, but don't try to check for existence where we know |
479 | - # we do not have a revision as that would be pointless. |
480 | - target_ids = set(self.target.all_revision_ids()) |
481 | - result_set = set(source_ids).difference(target_ids) |
482 | - return self.source.revision_ids_to_search_result(result_set) |
483 | - |
484 | - |
485 | class InterDifferingSerializer(InterRepository): |
486 | |
487 | @classmethod |
488 | @@ -3836,7 +3721,6 @@ |
489 | InterRepository.register_optimiser(InterSameDataRepository) |
490 | InterRepository.register_optimiser(InterWeaveRepo) |
491 | InterRepository.register_optimiser(InterKnitRepo) |
492 | -InterRepository.register_optimiser(InterPackRepo) |
493 | |
494 | |
495 | class CopyConverter(object): |
496 | |
497 | === modified file 'bzrlib/tests/test_pack_repository.py' |
498 | --- bzrlib/tests/test_pack_repository.py 2009-06-10 03:56:49 +0000 |
499 | +++ bzrlib/tests/test_pack_repository.py 2009-06-16 02:36:36 +0000 |
500 | @@ -38,6 +38,10 @@ |
501 | upgrade, |
502 | workingtree, |
503 | ) |
504 | +from bzrlib.repofmt import ( |
505 | + pack_repo, |
506 | + groupcompress_repo, |
507 | + ) |
508 | from bzrlib.repofmt.groupcompress_repo import RepositoryFormatCHK1 |
509 | from bzrlib.smart import ( |
510 | client, |
511 | @@ -556,58 +560,43 @@ |
512 | missing_ghost.get_inventory, 'ghost') |
513 | |
514 | def make_write_ready_repo(self): |
515 | - repo = self.make_repository('.', format=self.get_format()) |
516 | + format = self.get_format() |
517 | + if isinstance(format.repository_format, RepositoryFormatCHK1): |
518 | + raise TestNotApplicable("No missing compression parents") |
519 | + repo = self.make_repository('.', format=format) |
520 | repo.lock_write() |
521 | + self.addCleanup(repo.unlock) |
522 | repo.start_write_group() |
523 | + self.addCleanup(repo.abort_write_group) |
524 | return repo |
525 | |
526 | def test_missing_inventories_compression_parent_prevents_commit(self): |
527 | repo = self.make_write_ready_repo() |
528 | key = ('junk',) |
529 | - if not getattr(repo.inventories._index, '_missing_compression_parents', |
530 | - None): |
531 | - raise TestSkipped("No missing compression parents") |
532 | repo.inventories._index._missing_compression_parents.add(key) |
533 | self.assertRaises(errors.BzrCheckError, repo.commit_write_group) |
534 | self.assertRaises(errors.BzrCheckError, repo.commit_write_group) |
535 | - repo.abort_write_group() |
536 | - repo.unlock() |
537 | |
538 | def test_missing_revisions_compression_parent_prevents_commit(self): |
539 | repo = self.make_write_ready_repo() |
540 | key = ('junk',) |
541 | - if not getattr(repo.inventories._index, '_missing_compression_parents', |
542 | - None): |
543 | - raise TestSkipped("No missing compression parents") |
544 | repo.revisions._index._missing_compression_parents.add(key) |
545 | self.assertRaises(errors.BzrCheckError, repo.commit_write_group) |
546 | self.assertRaises(errors.BzrCheckError, repo.commit_write_group) |
547 | - repo.abort_write_group() |
548 | - repo.unlock() |
549 | |
550 | def test_missing_signatures_compression_parent_prevents_commit(self): |
551 | repo = self.make_write_ready_repo() |
552 | key = ('junk',) |
553 | - if not getattr(repo.inventories._index, '_missing_compression_parents', |
554 | - None): |
555 | - raise TestSkipped("No missing compression parents") |
556 | repo.signatures._index._missing_compression_parents.add(key) |
557 | self.assertRaises(errors.BzrCheckError, repo.commit_write_group) |
558 | self.assertRaises(errors.BzrCheckError, repo.commit_write_group) |
559 | - repo.abort_write_group() |
560 | - repo.unlock() |
561 | |
562 | def test_missing_text_compression_parent_prevents_commit(self): |
563 | repo = self.make_write_ready_repo() |
564 | key = ('some', 'junk') |
565 | - if not getattr(repo.inventories._index, '_missing_compression_parents', |
566 | - None): |
567 | - raise TestSkipped("No missing compression parents") |
568 | repo.texts._index._missing_compression_parents.add(key) |
569 | self.assertRaises(errors.BzrCheckError, repo.commit_write_group) |
570 | e = self.assertRaises(errors.BzrCheckError, repo.commit_write_group) |
571 | - repo.abort_write_group() |
572 | - repo.unlock() |
573 | |
574 | def test_supports_external_lookups(self): |
575 | repo = self.make_repository('.', format=self.get_format()) |
576 | |
577 | === modified file 'bzrlib/tests/test_repository.py' |
578 | --- bzrlib/tests/test_repository.py 2009-06-10 03:56:49 +0000 |
579 | +++ bzrlib/tests/test_repository.py 2009-06-16 02:36:36 +0000 |
580 | @@ -31,7 +31,10 @@ |
581 | UnknownFormatError, |
582 | UnsupportedFormatError, |
583 | ) |
584 | -from bzrlib import graph |
585 | +from bzrlib import ( |
586 | + graph, |
587 | + tests, |
588 | + ) |
589 | from bzrlib.branchbuilder import BranchBuilder |
590 | from bzrlib.btree_index import BTreeBuilder, BTreeGraphIndex |
591 | from bzrlib.index import GraphIndex, InMemoryGraphIndex |
592 | @@ -685,6 +688,147 @@ |
593 | self.assertEqual(65536, |
594 | inv.parent_id_basename_to_file_id._root_node.maximum_size) |
595 | |
596 | + def test_stream_source_to_gc(self): |
597 | + source = self.make_repository('source', format='development6-rich-root') |
598 | + target = self.make_repository('target', format='development6-rich-root') |
599 | + stream = source._get_source(target._format) |
600 | + self.assertIsInstance(stream, groupcompress_repo.GroupCHKStreamSource) |
601 | + |
602 | + def test_stream_source_to_non_gc(self): |
603 | + source = self.make_repository('source', format='development6-rich-root') |
604 | + target = self.make_repository('target', format='rich-root-pack') |
605 | + stream = source._get_source(target._format) |
606 | + # We don't want the child GroupCHKStreamSource |
607 | + self.assertIs(type(stream), repository.StreamSource) |
608 | + |
609 | + def test_get_stream_for_missing_keys_includes_all_chk_refs(self): |
610 | + source_builder = self.make_branch_builder('source', |
611 | + format='development6-rich-root') |
612 | + # We have to build a fairly large tree, so that we are sure the chk |
613 | + # pages will have split into multiple pages. |
614 | + entries = [('add', ('', 'a-root-id', 'directory', None))] |
615 | + for i in 'abcdefghijklmnopqrstuvwxyz123456789': |
616 | + for j in 'abcdefghijklmnopqrstuvwxyz123456789': |
617 | + fname = i + j |
618 | + fid = fname + '-id' |
619 | + content = 'content for %s\n' % (fname,) |
620 | + entries.append(('add', (fname, fid, 'file', content))) |
621 | + source_builder.start_series() |
622 | + source_builder.build_snapshot('rev-1', None, entries) |
623 | + # Now change a few of them, so we get a few new pages for the second |
624 | + # revision |
625 | + source_builder.build_snapshot('rev-2', ['rev-1'], [ |
626 | + ('modify', ('aa-id', 'new content for aa-id\n')), |
627 | + ('modify', ('cc-id', 'new content for cc-id\n')), |
628 | + ('modify', ('zz-id', 'new content for zz-id\n')), |
629 | + ]) |
630 | + source_builder.finish_series() |
631 | + source_branch = source_builder.get_branch() |
632 | + source_branch.lock_read() |
633 | + self.addCleanup(source_branch.unlock) |
634 | + target = self.make_repository('target', format='development6-rich-root') |
635 | + source = source_branch.repository._get_source(target._format) |
636 | + self.assertIsInstance(source, groupcompress_repo.GroupCHKStreamSource) |
637 | + |
638 | + # On a regular pass, getting the inventories and chk pages for rev-2 |
639 | + # would only get the newly created chk pages |
640 | + search = graph.SearchResult(set(['rev-2']), set(['rev-1']), 1, |
641 | + set(['rev-2'])) |
642 | + simple_chk_records = [] |
643 | + for vf_name, substream in source.get_stream(search): |
644 | + if vf_name == 'chk_bytes': |
645 | + for record in substream: |
646 | + simple_chk_records.append(record.key) |
647 | + else: |
648 | + for _ in substream: |
649 | + continue |
650 | + # 3 pages, the root (InternalNode), + 2 pages which actually changed |
651 | + self.assertEqual([('sha1:91481f539e802c76542ea5e4c83ad416bf219f73',), |
652 | + ('sha1:4ff91971043668583985aec83f4f0ab10a907d3f',), |
653 | + ('sha1:81e7324507c5ca132eedaf2d8414ee4bb2226187',), |
654 | + ('sha1:b101b7da280596c71a4540e9a1eeba8045985ee0',)], |
655 | + simple_chk_records) |
656 | + # Now, when we do a similar call using 'get_stream_for_missing_keys' |
657 | + # we should get a much larger set of pages. |
658 | + missing = [('inventories', 'rev-2')] |
659 | + full_chk_records = [] |
660 | + for vf_name, substream in source.get_stream_for_missing_keys(missing): |
661 | + if vf_name == 'inventories': |
662 | + for record in substream: |
663 | + self.assertEqual(('rev-2',), record.key) |
664 | + elif vf_name == 'chk_bytes': |
665 | + for record in substream: |
666 | + full_chk_records.append(record.key) |
667 | + else: |
668 | + self.fail('Should not be getting a stream of %s' % (vf_name,)) |
669 | + # We have 257 records now. This is because we have 1 root page, and 256 |
670 | + # leaf pages in a complete listing. |
671 | + self.assertEqual(257, len(full_chk_records)) |
672 | + self.assertSubset(simple_chk_records, full_chk_records) |
673 | + |
674 | + |
675 | +class TestKnitPackStreamSource(tests.TestCaseWithMemoryTransport): |
676 | + |
677 | + def test_source_to_exact_pack_092(self): |
678 | + source = self.make_repository('source', format='pack-0.92') |
679 | + target = self.make_repository('target', format='pack-0.92') |
680 | + stream_source = source._get_source(target._format) |
681 | + self.assertIsInstance(stream_source, pack_repo.KnitPackStreamSource) |
682 | + |
683 | + def test_source_to_exact_pack_rich_root_pack(self): |
684 | + source = self.make_repository('source', format='rich-root-pack') |
685 | + target = self.make_repository('target', format='rich-root-pack') |
686 | + stream_source = source._get_source(target._format) |
687 | + self.assertIsInstance(stream_source, pack_repo.KnitPackStreamSource) |
688 | + |
689 | + def test_source_to_exact_pack_19(self): |
690 | + source = self.make_repository('source', format='1.9') |
691 | + target = self.make_repository('target', format='1.9') |
692 | + stream_source = source._get_source(target._format) |
693 | + self.assertIsInstance(stream_source, pack_repo.KnitPackStreamSource) |
694 | + |
695 | + def test_source_to_exact_pack_19_rich_root(self): |
696 | + source = self.make_repository('source', format='1.9-rich-root') |
697 | + target = self.make_repository('target', format='1.9-rich-root') |
698 | + stream_source = source._get_source(target._format) |
699 | + self.assertIsInstance(stream_source, pack_repo.KnitPackStreamSource) |
700 | + |
701 | + def test_source_to_remote_exact_pack_19(self): |
702 | + trans = self.make_smart_server('target') |
703 | + trans.ensure_base() |
704 | + source = self.make_repository('source', format='1.9') |
705 | + target = self.make_repository('target', format='1.9') |
706 | + target = repository.Repository.open(trans.base) |
707 | + stream_source = source._get_source(target._format) |
708 | + self.assertIsInstance(stream_source, pack_repo.KnitPackStreamSource) |
709 | + |
710 | + def test_stream_source_to_non_exact(self): |
711 | + source = self.make_repository('source', format='pack-0.92') |
712 | + target = self.make_repository('target', format='1.9') |
713 | + stream = source._get_source(target._format) |
714 | + self.assertIs(type(stream), repository.StreamSource) |
715 | + |
716 | + def test_stream_source_to_non_exact_rich_root(self): |
717 | + source = self.make_repository('source', format='1.9') |
718 | + target = self.make_repository('target', format='1.9-rich-root') |
719 | + stream = source._get_source(target._format) |
720 | + self.assertIs(type(stream), repository.StreamSource) |
721 | + |
722 | + def test_source_to_remote_non_exact_pack_19(self): |
723 | + trans = self.make_smart_server('target') |
724 | + trans.ensure_base() |
725 | + source = self.make_repository('source', format='1.9') |
726 | + target = self.make_repository('target', format='1.6') |
727 | + target = repository.Repository.open(trans.base) |
728 | + stream_source = source._get_source(target._format) |
729 | + self.assertIs(type(stream_source), repository.StreamSource) |
730 | + |
731 | + def test_stream_source_to_knit(self): |
732 | + source = self.make_repository('source', format='pack-0.92') |
733 | + target = self.make_repository('target', format='dirstate') |
734 | + stream = source._get_source(target._format) |
735 | + self.assertIs(type(stream), repository.StreamSource) |
736 | + |
737 | |
738 | class TestDevelopment6FindParentIdsOfRevisions(TestCaseWithTransport): |
739 | """Tests for _find_parent_ids_of_revisions.""" |
740 | @@ -1204,84 +1348,3 @@ |
741 | self.assertTrue(new_pack.inventory_index._optimize_for_size) |
742 | self.assertTrue(new_pack.text_index._optimize_for_size) |
743 | self.assertTrue(new_pack.signature_index._optimize_for_size) |
744 | - |
745 | - |
746 | -class TestGCCHKPackCollection(TestCaseWithTransport): |
747 | - |
748 | - def test_stream_source_to_gc(self): |
749 | - source = self.make_repository('source', format='development6-rich-root') |
750 | - target = self.make_repository('target', format='development6-rich-root') |
751 | - stream = source._get_source(target._format) |
752 | - self.assertIsInstance(stream, groupcompress_repo.GroupCHKStreamSource) |
753 | - |
754 | - def test_stream_source_to_non_gc(self): |
755 | - source = self.make_repository('source', format='development6-rich-root') |
756 | - target = self.make_repository('target', format='rich-root-pack') |
757 | - stream = source._get_source(target._format) |
758 | - # We don't want the child GroupCHKStreamSource |
759 | - self.assertIs(type(stream), repository.StreamSource) |
760 | - |
761 | - def test_get_stream_for_missing_keys_includes_all_chk_refs(self): |
762 | - source_builder = self.make_branch_builder('source', |
763 | - format='development6-rich-root') |
764 | - # We have to build a fairly large tree, so that we are sure the chk |
765 | - # pages will have split into multiple pages. |
766 | - entries = [('add', ('', 'a-root-id', 'directory', None))] |
767 | - for i in 'abcdefghijklmnopqrstuvwxyz123456789': |
768 | - for j in 'abcdefghijklmnopqrstuvwxyz123456789': |
769 | - fname = i + j |
770 | - fid = fname + '-id' |
771 | - content = 'content for %s\n' % (fname,) |
772 | - entries.append(('add', (fname, fid, 'file', content))) |
773 | - source_builder.start_series() |
774 | - source_builder.build_snapshot('rev-1', None, entries) |
775 | - # Now change a few of them, so we get a few new pages for the second |
776 | - # revision |
777 | - source_builder.build_snapshot('rev-2', ['rev-1'], [ |
778 | - ('modify', ('aa-id', 'new content for aa-id\n')), |
779 | - ('modify', ('cc-id', 'new content for cc-id\n')), |
780 | - ('modify', ('zz-id', 'new content for zz-id\n')), |
781 | - ]) |
782 | - source_builder.finish_series() |
783 | - source_branch = source_builder.get_branch() |
784 | - source_branch.lock_read() |
785 | - self.addCleanup(source_branch.unlock) |
786 | - target = self.make_repository('target', format='development6-rich-root') |
787 | - source = source_branch.repository._get_source(target._format) |
788 | - self.assertIsInstance(source, groupcompress_repo.GroupCHKStreamSource) |
789 | - |
790 | - # On a regular pass, getting the inventories and chk pages for rev-2 |
791 | - # would only get the newly created chk pages |
792 | - search = graph.SearchResult(set(['rev-2']), set(['rev-1']), 1, |
793 | - set(['rev-2'])) |
794 | - simple_chk_records = [] |
795 | - for vf_name, substream in source.get_stream(search): |
796 | - if vf_name == 'chk_bytes': |
797 | - for record in substream: |
798 | - simple_chk_records.append(record.key) |
799 | - else: |
800 | - for _ in substream: |
801 | - continue |
802 | - # 3 pages, the root (InternalNode), + 2 pages which actually changed |
803 | - self.assertEqual([('sha1:91481f539e802c76542ea5e4c83ad416bf219f73',), |
804 | - ('sha1:4ff91971043668583985aec83f4f0ab10a907d3f',), |
805 | - ('sha1:81e7324507c5ca132eedaf2d8414ee4bb2226187',), |
806 | - ('sha1:b101b7da280596c71a4540e9a1eeba8045985ee0',)], |
807 | - simple_chk_records) |
808 | - # Now, when we do a similar call using 'get_stream_for_missing_keys' |
809 | - # we should get a much larger set of pages. |
810 | - missing = [('inventories', 'rev-2')] |
811 | - full_chk_records = [] |
812 | - for vf_name, substream in source.get_stream_for_missing_keys(missing): |
813 | - if vf_name == 'inventories': |
814 | - for record in substream: |
815 | - self.assertEqual(('rev-2',), record.key) |
816 | - elif vf_name == 'chk_bytes': |
817 | - for record in substream: |
818 | - full_chk_records.append(record.key) |
819 | - else: |
820 | - self.fail('Should not be getting a stream of %s' % (vf_name,)) |
821 | - # We have 257 records now. This is because we have 1 root page, and 256 |
822 | - # leaf pages in a complete listing. |
823 | - self.assertEqual(257, len(full_chk_records)) |
824 | - self.assertSubset(simple_chk_records, full_chk_records) |
Description of the change

This proposal changes how pack <=> pack fetching is triggered.
It removes the InterPackRepo optimizer (which uses Packer internally) in favor of a new KnitPackStreamSource.
The new source is a much simpler version of StreamSource that doesn't attempt to handle all the different cross-format issues. It only supports exact-format fetching, and streams the data in a straightforward fashion.
Specifically, it sends data in the order (signatures, revisions, inventories, texts), since it knows insertion is atomic.
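Condensed from the new KnitPackStreamSource.get_stream in the diff above (comments added here to show the ordering):

    def get_stream(self, search):
        revision_ids = search.get_keys()
        # signatures and revisions come from the inherited helper
        for stream_info in self._fetch_revision_texts(revision_ids):
            yield stream_info
        self._revision_keys = [(rev_id,) for rev_id in revision_ids]
        # inventories; this pass also collects self._text_keys
        yield self._get_filtered_inv_stream(revision_ids)
        # texts go last, relying on the insert being atomic
        yield self._get_text_stream()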
It walks the inventory pages a single time and extracts the text keys as the fetch is going, rather than in a separate pre-read pass. This is a moderate win for dumb-transport fetching (versus StreamSource, but not InterPackRepo) because it avoids reading the inventory pages twice.
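The single pass is visible in _get_filtered_inv_stream (condensed from the diff above; comments added):

    def _filtered_inv_stream():
        source_vf = from_repo.inventories
        stream = source_vf.get_record_stream(revision_keys,
                                             'unordered', False)
        for record in stream:
            if record.storage_kind == 'absent':
                raise errors.NoSuchRevision(from_repo, record.key)
            # pull the text keys out of each inventory record as it
            # streams past, so the inventory pages are only read once
            find_text_keys_from_content(record)
            yield record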
It also fixes a bug in the current InterPackRepo code. The Packer code was recently changed to make sure that all referenced file keys are fetched, rather than only the ones mentioned in the specific revisions being fetched. This was done at about the same time as the updates to fileids_altered_by_revision_ids. However, Packer was not updated to read the parent inventories and remove their text keys.
This meant that if you got a fulltext inventory, you would end up copying the data for all texts in that revision, whether they were modified or not. For bzr.dev, this often meant downloading ~3MB of extra data for a small change. I considered fixing Packer to handle this, but I figured we wanted to move to StreamSource as the one-and-only method for fetching anyway.
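The corresponding fix in KnitPackStreamSource._get_filtered_inv_stream is the subtraction of the parent inventories' text keys (taken from the diff above):

    # text keys referenced by the parent inventories; anything mentioned
    # there was not introduced by the revisions being fetched, so it
    # does not need to be sent
    parent_text_keys = set(find_text_keys(
        from_repo._inventory_xml_lines_for_keys(parent_keys)))
    # content_text_keys is accumulated while the inventory stream runs
    self._text_keys = content_text_keys - parent_text_keys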
I also made a few changes to make it clearer when a set of something was *keys* (tuples) and when it was *ids* (strings).
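For example (an illustrative snippet only; the variable names here are not from the patch):

    revision_ids = ['rev-1', 'rev-2']              # ids: plain strings
    revision_keys = [(r,) for r in revision_ids]   # keys: tuples, ('rev-1',) ...
    back_to_ids = [k[-1] for k in revision_keys]   # and back again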
I also moved some of the helpers that were added as part of the gc-stacking patch into the base Repository class, so that I could simply re-use them.
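For reference, the key-based variant as it now lives on Repository (copied from the diff above, docstring shortened):

    def _find_parent_keys_of_revisions(self, revision_keys):
        """Parents of revision_keys that are not in revision_keys themselves."""
        parent_map = self.revisions.get_parent_map(revision_keys)
        parent_keys = set()
        map(parent_keys.update, parent_map.itervalues())
        parent_keys.difference_update(revision_keys)
        parent_keys.discard(_mod_revision.NULL_REVISION)
        return parent_keys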