U1DB

Merge lp:~jameinel/u1db/get-all-docs into lp:u1db

Proposed by John A Meinel on 2011-11-01

Status:	Work in progress
Proposed branch:	lp:~jameinel/u1db/get-all-docs
Merge into:	lp:u1db
Diff against target:	98 lines (+48/-0), 4 files modified
	u1db/__init__.py (+11/-0)
	u1db/backends/inmemory.py (+6/-0)
	u1db/backends/sqlite_backend.py (+16/-0)
	u1db/tests/test_backends.py (+15/-0)
To merge this branch:	bzr merge lp:~jameinel/u1db/get-all-docs

Reviewer	Review Type	Date Requested	Status
Samuele Pedroni		2011-11-01	Approve on 2011-11-21
Review via email: mp+80939@code.launchpad.net

Description of the change

I'm not 100% sure if we want to add this api or not, but it was easy to implement.

John Lenton was working on a generic u1db introspection tool, and noted that he wanted a way to describe all the documents.

This adds a direct DB.get_all_doc_ids().

However, I realized that you can call "DB.whats_changed(0)" and get the full list as well. Internally that becomes somewhat inefficient after many generations, because it sees documents that have been updated multiple times. It filters out duplicates, but it has to process M rows if there are M generations and N documents, and N is always <= M. I suppose if we wanted, we could add rollup functionality to let us answer whats_changed(0) faster.
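The cost difference can be sketched with a toy model. This is only an illustration, not u1db's storage or its real return values: the transaction log is assumed to be a plain list with one doc_id per generation, and toy_whats_changed_from_zero and toy_get_all_doc_ids are hypothetical names.

```python
# Toy model, NOT u1db's real storage: the transaction log is assumed to be
# a list holding one doc_id per generation.

def toy_whats_changed_from_zero(transaction_log):
    """Scan all M log rows, keeping each doc_id once, at its latest position."""
    seen = set()
    changed = []
    for doc_id in reversed(transaction_log):
        if doc_id not in seen:
            seen.add(doc_id)
            changed.append(doc_id)
    changed.reverse()
    return len(transaction_log), changed


def toy_get_all_doc_ids(docs, transaction_log):
    """Read the N document rows directly, without touching the log rows."""
    return len(transaction_log), list(docs)


log = ["a", "b", "a", "a", "c", "b"]       # M = 6 generations
docs = {"a": "{}", "b": "{}", "c": "{}"}   # N = 3 documents

gen_from_log, ids_from_log = toy_whats_changed_from_zero(log)
gen_direct, ids_direct = toy_get_all_doc_ids(docs, log)
```

Both calls report the same generation and the same set of ids, but the first one walks six rows to produce three results.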

We can leave this on the side until we decide if we want to expose it or not.

Samuele Pedroni (pedronis) wrote on 2011-11-02:

The implementations don't filter out deleted documents. I would expect them to do that, or at least to offer the option. Corresponding test:

=== modified file 'u1db/tests/test_backends.py'
--- u1db/tests/test_backends.py 2011-11-01 19:03:05 +0000
+++ u1db/tests/test_backends.py 2011-11-02 13:49:10 +0000
@@ -51,11 +51,14 @@

     def test_get_all_doc_ids(self):
         self.assertEqual([], self.db.get_all_doc_ids())
-        doc1_id, _ = self.db.create_doc(simple_doc)
+        doc1_id, doc1_rev = self.db.create_doc(simple_doc)
         self.assertEqual([doc1_id], self.db.get_all_doc_ids())
         doc2_id, _ = self.db.create_doc(nested_doc)
         self.assertEqual(sorted([doc1_id, doc2_id]),
                          sorted(self.db.get_all_doc_ids()))
+        self.db.delete_doc(doc1_id, doc1_rev)
+        self.assertEqual(sorted([doc2_id]),
+                         sorted(self.db.get_all_doc_ids()))
 
     def test_get_docs(self):
         doc1_id, doc1_rev = self.db.create_doc(simple_doc)

review: Needs Fixing

John A Meinel (jameinel) wrote on 2011-11-02:

On 11/02/2011 02:52 PM, Samuele Pedroni wrote:
> Review: Needs Fixing
>
> the implementations don't filter out deleted documents, I would
> expect it to do that, or to have the option to at least.
> corresponding test:
>
> === modified file 'u1db/tests/test_backends.py'
> --- u1db/tests/test_backends.py 2011-11-01 19:03:05 +0000
> +++ u1db/tests/test_backends.py 2011-11-02 13:49:10 +0000
> @@ -51,11 +51,14 @@
>
>      def test_get_all_doc_ids(self):
>          self.assertEqual([], self.db.get_all_doc_ids())
> -        doc1_id, _ = self.db.create_doc(simple_doc)
> +        doc1_id, doc1_rev = self.db.create_doc(simple_doc)
>          self.assertEqual([doc1_id], self.db.get_all_doc_ids())
>          doc2_id, _ = self.db.create_doc(nested_doc)
>          self.assertEqual(sorted([doc1_id, doc2_id]),
>                           sorted(self.db.get_all_doc_ids()))
> +        self.db.delete_doc(doc1_id, doc1_rev)
> +        self.assertEqual(sorted([doc2_id]),
> +                         sorted(self.db.get_all_doc_ids()))
>
>      def test_get_docs(self):
>          doc1_id, doc1_rev = self.db.create_doc(simple_doc)
>
>

Note that whats_changed is going to return this doc_id as being
updated, and get_doc is going to report that it is deleted.

So I'm not 100% sure on this.
John
=:->


Samuele Pedroni (pedronis) wrote on 2011-11-02:

>
> Note that 'whatschanged' is going to return this doc_id as being
> updated, and get_doc is going to return the state that it is deleted.
>
> So I'm not 100% sure on this.
> John

On the other hand, indexes filter out deleted docs. It probably means that if we think we need this kind of functionality, we need both a way to get everything (though that may indeed be what whats_changed(0) is for) and a way to get only the docs that are not deleted.


Stuart Langridge (sil) wrote on 2011-11-09:

I personally think that get_all_docs should return only live docs, not deleted ones, since you almost always don't want deleted docs -- the only thing that really cares about them is the syncer.
If get_all_docs returns deleted ones too, I'd almost always have to do "for doc in [x for x in db.get_all_docs() if not x.deleted]" rather than just "for doc in db.get_all_docs()", which is not what I'd expect.

lp:~jameinel/u1db/get-all-docs updated on 2011-11-15

96. By John A Meinel on 2011-11-15

Merge trunk to get all the new goodness.

97. By John A Meinel on 2011-11-15

get_all_doc_ids now only returns documents that aren't marked deleted.

98. By John A Meinel on 2011-11-15

Have get_all_doc_ids return the current db_generation just like whats_changed.

So now the only real difference between the two is that get_all_doc_ids doesn't return
the ids for deleted documents.

John A Meinel (jameinel) wrote on 2011-11-15:

This now removes doc_ids that have been deleted. I also added the ability to get the database generation so that you can track 'liveness'.

So basically, this is whats_changed, but with deleted items removed.

I'm thinking this might be better expressed as whats_changed(0, include_deleted=False).

Though this may be a more obvious api for users.
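To make the comparison concrete, here is a minimal sketch of both shapes side by side, plus the "start with get_all_doc_ids, then follow whats_changed" pattern from the docstring. Everything in it is an assumption for illustration: ToyDB, put, delete and is_deleted are hypothetical, the include_deleted flag is only the idea floated above, and the (generation, set of doc_ids) return shape of whats_changed is assumed rather than taken from u1db.

```python
class ToyDB:
    """Hypothetical in-memory stand-in; not the real u1db backend."""

    def __init__(self):
        self._docs = {}   # doc_id -> content, None once deleted
        self._log = []    # one doc_id per generation

    def put(self, doc_id, content):
        self._docs[doc_id] = content
        self._log.append(doc_id)

    def delete(self, doc_id):
        # A delete is recorded as a new generation with no content.
        self.put(doc_id, None)

    def is_deleted(self, doc_id):
        return self._docs[doc_id] is None

    def get_all_doc_ids(self):
        live = [d for d, c in self._docs.items() if c is not None]
        return len(self._log), live

    def whats_changed(self, old_generation=0, include_deleted=True):
        # include_deleted is the hypothetical flag discussed above.
        changed = set(self._log[old_generation:])
        if not include_deleted:
            changed = {d for d in changed if not self.is_deleted(d)}
        return len(self._log), changed


db = ToyDB()
db.put("doc1", "{}")
db.put("doc2", "{}")

# Take a snapshot of the live ids, remembering the generation it reflects.
gen, ids = db.get_all_doc_ids()
live = set(ids)

db.delete("doc1")
db.put("doc3", "{}")

# Catch up from the remembered generation instead of rescanning everything.
new_gen, changed = db.whats_changed(gen)
for doc_id in changed:
    if db.is_deleted(doc_id):
        live.discard(doc_id)
    else:
        live.add(doc_id)
```

In this toy model whats_changed(0, include_deleted=False) and get_all_doc_ids() return the same ids, which is the equivalence described above; the open question is only which spelling reads better to users.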

Samuele Pedroni (pedronis) wrote on 2011-11-21:

+1, but I am indeed still unsure whether this is the interface we want, because it's different from both get_docs and get_from_index.

review: Approve

John A Meinel (jameinel) wrote on 2011-12-14:

Just marking this WIP, because we haven't decided if we actually want it or not, and I don't want it sitting in the queue.

Unmerged revisions

98. By John A Meinel on 2011-11-15

Have get_all_doc_ids return the current db_generation just like whats_changed.

So now the only real difference between the two is that get_all_doc_ids doesn't return
the ids for deleted documents.

97. By John A Meinel on 2011-11-15

get_all_doc_ids now only returns documents that aren't marked deleted.

96. By John A Meinel on 2011-11-15

Merge trunk to get all the new goodness.

95. By John A Meinel on 2011-11-01

Document and implement a get_all_doc_ids function.

Preview Diff

=== modified file 'u1db/__init__.py'
--- u1db/__init__.py	2011-11-14 13:52:12 +0000
+++ u1db/__init__.py	2011-11-15 14:16:24 +0000
@@ -39,6 +39,17 @@
         """
         raise NotImplementedError(self.whats_changed)
 
+    def get_all_doc_ids(self):
+        """Return the identifiers for all documents currently in the db.
+
+        This returns the database generation, so that you can use
+        get_all_doc_ids to start, and then whats_changed to keep up to date. In
+        the case of concurrent updates, we guarantee db_generation will be old
+        enough that you won't miss any new updates.
+        :return: (db_generation, [doc_id])
+        """
+        raise NotImplementedError(self.get_all_doc_ids)
+
     def get_doc(self, doc_id):
         """Get the JSON string for the given document.
=== modified file 'u1db/backends/inmemory.py'
--- u1db/backends/inmemory.py	2011-11-14 14:53:22 +0000
+++ u1db/backends/inmemory.py	2011-11-15 14:16:24 +0000
@@ -91,6 +91,12 @@
     def _has_conflicts(self, doc_id):
         return doc_id in self._conflicts
 
+    def get_all_doc_ids(self):
+        db_gen = len(self._transaction_log)
+        doc_ids = [k for k, rev_doc in self._docs.iteritems()
+                     if rev_doc[1] is not None]
+        return db_gen, doc_ids
+
     def get_doc(self, doc_id):
         doc_rev, doc = self._get_doc(doc_id)
         if doc == 'null':
=== modified file 'u1db/backends/sqlite_backend.py'
--- u1db/backends/sqlite_backend.py	2011-11-14 13:52:12 +0000
+++ u1db/backends/sqlite_backend.py	2011-11-15 14:16:24 +0000
@@ -198,6 +198,12 @@
         else:
             return True
 
+    def get_all_doc_ids(self):
+        db_gen = self._get_generation()
+        c = self._db_handle.cursor()
+        c.execute("SELECT doc_id FROM document WHERE doc IS NOT NULL")
+        return db_gen, [r[0] for r in c.fetchall()]
+
     def get_doc(self, doc_id):
         doc_rev, doc = self._get_doc(doc_id)
         if doc == 'null':
@@ -655,6 +661,16 @@
         doc = simplejson.dumps(raw_doc)
         return doc_rev, doc
 
+    def get_all_doc_ids(self):
+        db_gen = self._get_generation()
+        c = self._db_handle.cursor()
+        # OnlyExpand stores the actual content in the document_fields content,
+        # and stores NULL in document, except when it is deleted, and we store
+        # '<deleted>' in document and nothing in document_fields.
+        c.execute("SELECT doc_id FROM document WHERE doc IS NULL")
+        return db_gen, [r[0] for r in c.fetchall()]
+
+
     def get_from_index(self, index_name, key_values):
         # The base implementation does all the complex index joining. But it
         # doesn't manage to extract the actual document content correctly.
=== modified file 'u1db/tests/test_backends.py'
--- u1db/tests/test_backends.py	2011-11-14 13:52:12 +0000
+++ u1db/tests/test_backends.py	2011-11-15 14:16:24 +0000
@@ -52,6 +52,21 @@
         self.assertRaises(errors.InvalidDocId,
             self.db.put_doc, None, None, simple_doc)
 
+    def test_get_all_doc_ids(self):
+        self.assertEqual((0, []), self.db.get_all_doc_ids())
+        doc1_id, _ = self.db.create_doc(simple_doc)
+        self.assertEqual((1, [doc1_id]), self.db.get_all_doc_ids())
+        doc2_id, _ = self.db.create_doc(nested_doc)
+        db_gen, doc_ids = self.db.get_all_doc_ids()
+        self.assertEqual((2, sorted([doc1_id, doc2_id])),
+                         (db_gen, sorted(doc_ids)))
+
+    def test_get_all_doc_ids_ignores_deleted(self):
+        doc1_id, doc1_rev = self.db.create_doc(simple_doc)
+        doc2_id, _ = self.db.create_doc(nested_doc)
+        self.db.delete_doc(doc1_id, doc1_rev)
+        self.assertEqual((3, [doc2_id]), self.db.get_all_doc_ids())
+
     def test_get_docs(self):
         doc1_id, doc1_rev = self.db.create_doc(simple_doc)
         doc2_id, doc2_rev = self.db.create_doc(nested_doc)
