Merge lp:~jameinel/u1db/get-all-docs into lp:u1db

Proposed by John A Meinel
Status: Work in progress
Proposed branch: lp:~jameinel/u1db/get-all-docs
Merge into: lp:u1db
Diff against target: 98 lines (+48/-0)
4 files modified
u1db/__init__.py (+11/-0)
u1db/backends/inmemory.py (+6/-0)
u1db/backends/sqlite_backend.py (+16/-0)
u1db/tests/test_backends.py (+15/-0)
To merge this branch: bzr merge lp:~jameinel/u1db/get-all-docs
Reviewer Review Type Date Requested Status
Samuele Pedroni Approve
Review via email: mp+80939@code.launchpad.net

Description of the change

I'm not 100% sure if we want to add this api or not, but it was easy to implement.

John Lenton was working on a generic u1db introspection tool, and noted that he wanted a way to describe all the documents.

This adds a direct DB.get_all_doc_ids().

However, I realized that you can do "DB.whats_changed(0)" and get the full list as well. The latter will internally be a bit inefficient after many generations, since it will see documents that have been updated multiple times. It will filter out duplicates, but it has to process M rows if there are M generations and N documents. N will always be < M. I suppose if we wanted, we could do rollup functionality to allow us to answer whats_changed(0) faster.

We can leave this on the side until we decide if we want to expose it or not.

To post a comment you must log in.
Revision history for this message
Samuele Pedroni (pedronis) wrote :

the implementations don't filter out deleted documents, I would expect it to do that, or to have the option to at least. corresponding test:

=== modified file 'u1db/tests/test_backends.py'
--- u1db/tests/test_backends.py 2011-11-01 19:03:05 +0000
+++ u1db/tests/test_backends.py 2011-11-02 13:49:10 +0000
@@ -51,11 +51,14 @@

     def test_get_all_doc_ids(self):
         self.assertEqual([], self.db.get_all_doc_ids())
- doc1_id, _ = self.db.create_doc(simple_doc)
+ doc1_id, doc1_rev = self.db.create_doc(simple_doc)
         self.assertEqual([doc1_id], self.db.get_all_doc_ids())
         doc2_id, _ = self.db.create_doc(nested_doc)
         self.assertEqual(sorted([doc1_id, doc2_id]),
                          sorted(self.db.get_all_doc_ids()))
+ self.db.delete_doc(doc1_id, doc1_rev)
+ self.assertEqual(sorted([doc2_id]),
+ sorted(self.db.get_all_doc_ids()))

     def test_get_docs(self):
         doc1_id, doc1_rev = self.db.create_doc(simple_doc)

review: Needs Fixing
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/02/2011 02:52 PM, Samuele Pedroni wrote:
> Review: Needs Fixing
>
> the implementations don't filter out deleted documents, I would
> expect it to do that, or to have the option to at least.
> corresponding test:
>
> === modified file 'u1db/tests/test_backends.py' ---
> u1db/tests/test_backends.py 2011-11-01 19:03:05 +0000 +++
> u1db/tests/test_backends.py 2011-11-02 13:49:10 +0000 @@ -51,11
> +51,14 @@
>
> def test_get_all_doc_ids(self): self.assertEqual([],
> self.db.get_all_doc_ids()) - doc1_id, _ =
> self.db.create_doc(simple_doc) + doc1_id, doc1_rev =
> self.db.create_doc(simple_doc) self.assertEqual([doc1_id],
> self.db.get_all_doc_ids()) doc2_id, _ =
> self.db.create_doc(nested_doc) self.assertEqual(sorted([doc1_id,
> doc2_id]), sorted(self.db.get_all_doc_ids())) +
> self.db.delete_doc(doc1_id, doc1_rev) +
> self.assertEqual(sorted([doc2_id]), +
> sorted(self.db.get_all_doc_ids()))
>
> def test_get_docs(self): doc1_id, doc1_rev =
> self.db.create_doc(simple_doc)
>
>

Note that 'whatschanged' is going to return this doc_id as being
updated, and get_doc is going to return the state that it is deleted.

So I'm not 100% sure on this.
John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6xUAQACgkQJdeBCYSNAANgiQCgnmRrKeg4+dy0Ng8AnEYjcuCV
KaAAoJy725gHz3VLLmyfvPRIa3yY/p3B
=Cd3Y
-----END PGP SIGNATURE-----

Revision history for this message
Samuele Pedroni (pedronis) wrote :

>
> Note that 'whatschanged' is going to return this doc_id as being
> updated, and get_doc is going to return the state that it is deleted.
>
> So I'm not 100% sure on this.
> John

on the other hand indexes filter out deleted docs. It probably means that if think we need this kind of functionality we need both a way to get everything (though that may indeed be what whatchanged(0) is for) and a way to get only not deleted docs

> =:->
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk6xUAQACgkQJdeBCYSNAANgiQCgnmRrKeg4+dy0Ng8AnEYjcuCV
> KaAAoJy725gHz3VLLmyfvPRIa3yY/p3B
> =Cd3Y
> -----END PGP SIGNATURE-----

Revision history for this message
Stuart Langridge (sil) wrote :

I personally think that get_all_docs should return only live docs, not deleted ones, since you almost always don't want deleted docs -- the only thing that really cares about them is the syncer.
If get_all_docs returns deleted ones too, I'd almost always have to do "for doc in [x for x in db.get_all_docs() if not x.deleted]" rather than just "for doc in db.get_all_docs()", which is not what I'd expect.

lp:~jameinel/u1db/get-all-docs updated
96. By John A Meinel

Merge trunk to get all the new goodness.

97. By John A Meinel

get_all_doc_ids now only returns documents that aren't marked deleted.

98. By John A Meinel

Have get_all_doc_ids return the current db_generation just like whats_changed.

So now the only real difference between the two is that get_all_doc_ids doesn't return
the ids for deleted documents.

Revision history for this message
John A Meinel (jameinel) wrote :

This now removes doc_ids that have been deleted. I also added the ability to get the database generation so that you can track 'liveness'.

So basically, this is whats_changed, but with deleted items removed.

I'm thinking this might be better expressed as whats_changed(0, include_deleted=False).

Though this may be a more obvious api for users.

Revision history for this message
Samuele Pedroni (pedronis) wrote :

+1, but indeed still unsure if this is the interface we want because it's different from both get_docs and the get_from_index

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote :

Just marking this WIP, because we haven't decided if we actually want it or not, and I don't want it sitting in the queue.

Unmerged revisions

98. By John A Meinel

Have get_all_doc_ids return the current db_generation just like whats_changed.

So now the only real difference between the two is that get_all_doc_ids doesn't return
the ids for deleted documents.

97. By John A Meinel

get_all_doc_ids now only returns documents that aren't marked deleted.

96. By John A Meinel

Merge trunk to get all the new goodness.

95. By John A Meinel

Document and implement a get_all_doc_ids function.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'u1db/__init__.py'
2--- u1db/__init__.py 2011-11-14 13:52:12 +0000
3+++ u1db/__init__.py 2011-11-15 14:16:24 +0000
4@@ -39,6 +39,17 @@
5 """
6 raise NotImplementedError(self.whats_changed)
7
8+ def get_all_doc_ids(self):
9+ """Return the identifiers for all documents currently in the db.
10+
11+ This returns the database generation, so that you can use
12+ get_all_doc_ids to start, and then whats_changed to keep up to date. In
13+ the case of concurrent updates, we guarantee db_generation will be old
14+ enough that you won't miss any new updates.
15+ :return: (db_generation, [doc_id])
16+ """
17+ raise NotImplementedError(self.get_all_doc_ids)
18+
19 def get_doc(self, doc_id):
20 """Get the JSON string for the given document.
21
22
23=== modified file 'u1db/backends/inmemory.py'
24--- u1db/backends/inmemory.py 2011-11-14 14:53:22 +0000
25+++ u1db/backends/inmemory.py 2011-11-15 14:16:24 +0000
26@@ -91,6 +91,12 @@
27 def _has_conflicts(self, doc_id):
28 return doc_id in self._conflicts
29
30+ def get_all_doc_ids(self):
31+ db_gen = len(self._transaction_log)
32+ doc_ids = [k for k, rev_doc in self._docs.iteritems()
33+ if rev_doc[1] is not None]
34+ return db_gen, doc_ids
35+
36 def get_doc(self, doc_id):
37 doc_rev, doc = self._get_doc(doc_id)
38 if doc == 'null':
39
40=== modified file 'u1db/backends/sqlite_backend.py'
41--- u1db/backends/sqlite_backend.py 2011-11-14 13:52:12 +0000
42+++ u1db/backends/sqlite_backend.py 2011-11-15 14:16:24 +0000
43@@ -198,6 +198,12 @@
44 else:
45 return True
46
47+ def get_all_doc_ids(self):
48+ db_gen = self._get_generation()
49+ c = self._db_handle.cursor()
50+ c.execute("SELECT doc_id FROM document WHERE doc IS NOT NULL")
51+ return db_gen, [r[0] for r in c.fetchall()]
52+
53 def get_doc(self, doc_id):
54 doc_rev, doc = self._get_doc(doc_id)
55 if doc == 'null':
56@@ -655,6 +661,16 @@
57 doc = simplejson.dumps(raw_doc)
58 return doc_rev, doc
59
60+ def get_all_doc_ids(self):
61+ db_gen = self._get_generation()
62+ c = self._db_handle.cursor()
63+ # OnlyExpand stores the actual content in the document_fields content,
64+ # and stores NULL in document, except when it is deleted, and we store
65+ # '<deleted>' in document and nothing in document_fields.
66+ c.execute("SELECT doc_id FROM document WHERE doc IS NULL")
67+ return db_gen, [r[0] for r in c.fetchall()]
68+
69+
70 def get_from_index(self, index_name, key_values):
71 # The base implementation does all the complex index joining. But it
72 # doesn't manage to extract the actual document content correctly.
73
74=== modified file 'u1db/tests/test_backends.py'
75--- u1db/tests/test_backends.py 2011-11-14 13:52:12 +0000
76+++ u1db/tests/test_backends.py 2011-11-15 14:16:24 +0000
77@@ -52,6 +52,21 @@
78 self.assertRaises(errors.InvalidDocId,
79 self.db.put_doc, None, None, simple_doc)
80
81+ def test_get_all_doc_ids(self):
82+ self.assertEqual((0, []), self.db.get_all_doc_ids())
83+ doc1_id, _ = self.db.create_doc(simple_doc)
84+ self.assertEqual((1, [doc1_id]), self.db.get_all_doc_ids())
85+ doc2_id, _ = self.db.create_doc(nested_doc)
86+ db_gen, doc_ids = self.db.get_all_doc_ids()
87+ self.assertEqual((2, sorted([doc1_id, doc2_id])),
88+ (db_gen, sorted(doc_ids)))
89+
90+ def test_get_all_doc_ids_ignores_deleted(self):
91+ doc1_id, doc1_rev = self.db.create_doc(simple_doc)
92+ doc2_id, _ = self.db.create_doc(nested_doc)
93+ self.db.delete_doc(doc1_id, doc1_rev)
94+ self.assertEqual((3, [doc2_id]), self.db.get_all_doc_ids())
95+
96 def test_get_docs(self):
97 doc1_id, doc1_rev = self.db.create_doc(simple_doc)
98 doc2_id, doc2_rev = self.db.create_doc(nested_doc)

Subscribers

People subscribed via source and target branches