Merge lp:~ian-clatworthy/bzr/faster-dirstate-saving into lp:~bzr/bzr/trunk-old

Proposed by Ian Clatworthy
Status: Rejected
Rejected by: John A Meinel
Proposed branch: lp:~ian-clatworthy/bzr/faster-dirstate-saving
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 351 lines (has conflicts)
Text conflict in NEWS
To merge this branch: bzr merge lp:~ian-clatworthy/bzr/faster-dirstate-saving
Reviewer: Martin Pool
Review status: Needs Fixing
Review via email: mp+7537@code.launchpad.net

This proposal supersedes a proposal from 2009-05-28.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

Many commonly used commands like status, add and commit update the dirstate, triggering a dirstate serialisation & save. On huge trees like OpenOffice, this is slower than it needs to be. In particular, 'xxx status file' takes 0.4 seconds in hg and 1.0 seconds in bzr and a good percentage of the difference is due to the time we take to serialise the new dirstate.

This branch is an experiment/RFC in fixing that. It drops the time for 'bzr status file' by 30-35% down to 0.65-0.70 seconds. It does that by remembering the serialised form of entries and only re-serialising entries that are known to be changed. Right now, this smart remembering of what's changed is only effectively implemented for status, though the internal API is in place for extending that to other use cases.
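For illustration, here is a minimal sketch of that idea; the class and method names below are hypothetical, not the actual DirState API:

# Illustrative sketch only; LineCachingState and its methods are invented
# names, not part of DirState.
class LineCachingState(object):

    def __init__(self):
        self._entries = {}        # key -> entry tuple
        self._line_cache = {}     # key -> serialised line
        self._changed_keys = set()

    def record_change(self, key, entry):
        """Update one entry and remember that its cached line is stale."""
        self._entries[key] = entry
        self._changed_keys.add(key)

    def get_lines(self):
        """Serialise the state, re-encoding only the changed entries."""
        for key in self._changed_keys:
            self._line_cache[key] = self._serialise_entry(self._entries[key])
        self._changed_keys.clear()
        return [self._line_cache[key] for key in sorted(self._entries)]

    def _serialise_entry(self, entry):
        # Stand-in for the real (and relatively expensive) entry-to-line step.
        return '\0'.join(str(field) for field in entry) + '\n'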

Of course, there are other ways of skinning this cat. One option is to write a pyrex serialiser. That ought to be fast but it still doesn't solve the root problem: serialisation time is O(size-of-tree) currently because we only keep a modified vs unmodified flag at the whole-of-dirstate level. Another option is to append 'overlays' to the dirstate file, i.e. entries which have been added or changed vs the base entries. Deletes or renames would trigger a full clean write but the common cases of add and/or change would just append entries. That's non-trivial but potentially very fast.
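A rough sketch of the overlay alternative, purely for illustration (the helper and record handling below are invented, not something bzr implements):

# Illustrative only: an append-only "overlay" save. Adds and content changes
# append records; deletes and renames force a clean rewrite.
def save_with_overlay(state_file, full_lines, overlay_lines, needs_full_rewrite):
    if needs_full_rewrite:
        # Deletes and renames invalidate base entries: rewrite everything.
        state_file.seek(0)
        state_file.writelines(full_lines)
        state_file.truncate()
    else:
        # Adds and content changes: append overlay records; a reader would
        # apply them over the base entries when loading the dirstate.
        state_file.seek(0, 2)   # seek to end of file
        state_file.writelines(overlay_lines)
    state_file.flush()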

More broadly, I think the important thing is to begin recording the changes as this patch allows. So my current thoughts are that we ought to start with this patch, make the changes to enable smarter recording for add and commit, and build from there. At any point, we can separately do a pyrex serialiser and it will complement this work.

Having said all that, dirstate is my least favourite part of the Bazaar code base: indexing into tuples using magic integers may be fast but it sucks from an understandability perspective (vs objects + attributes). There are people far more qualified than I to say how this ought to proceed and to write the code, but they're rather busy tackling other things. Regardless, it's been a good exercise for me in getting dirstate paged into my head for other work I'm doing. It's a step forward but I can't definitively say it's in the right direction.

Thoughts?

Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

Brief thought: perhaps simply don't bother updating the dirstate file if the number of changes is very small? (Or do we already do this?)

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

> Brief thought: perhaps simply don't bother updating the dirstate file if the
> number of changes is very small? (Or do we already do this?)

Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.

We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or less files are changed. If someone has a better number, let me know.

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

Ian Clatworthy wrote:
>> Brief thought: perhaps simply don't bother updating the dirstate file if the
>> number of changes is very small? (Or do we already do this?)
>
> Poolie has agreed with this so I'll go ahead and resubmit accordingly. The updated benchmark time for 'bzr status file' on OOo is 0.5 seconds, down from 1.0 second.
>
> We just need to agree on a # of changes below which it's not worth saving. I suggest we should save for 3 or more changes - so we skip iff 2 or less files are changed. If someone has a better number, let me know.

A guideline is: "Don't save if it would cost as much data as you avoid
reading".

The cost of not updating the dirstate is having to re-read the file
whose sha is now different. So if the dirstate is 1MB, and the file is
10k, then you should not update if there are 10 files that are changed.

If you *want* to look at text_size and compare it to len(lines) we could
do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
something like that.

John
=:->

Revision history for this message
Martin Pool (mbp) wrote : Posted in a previous version of this proposal

2009/5/28 John A Meinel <email address hidden>:
> A guideline is: "Don't save if it would cost as much data as you avoid
> reading".
>
> The cost of not updating the dirstate is having to re-read the file
> whose sha is now different. So if the dirstate is 1MB, and the file is
> 10k, then you should not update if there are 10 files that are changed.
>
> If you *want* to look at text_size and compare it to len(lines) we could
> do that. Otherwise I'm fine with a heuristic of < 1% of the tree, or
> something like that.

It's possibly a bit more work, but as I commented on the bug, I'd
actually like to try turning off writing it altogether for logical
readonly operations.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

This updated merge proposal makes it configurable as to how many changes are ok before the dirstate should be saved. As Martin requested, there's the ability to turn off saving altogether. After this lands, we can use that to implement various policies (e.g. don't save on a logical read-only operation) as we see fit.
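A sketch of what such a policy hook could look like (hypothetical; the diff in this proposal hard-codes a small number and leaves the read-only case as a comment):

# Illustrative policy only; the read-only flag and the numbers are assumptions.
def worth_saving_limit(operation_is_read_only):
    """How many hash-only changes to tolerate before saving the dirstate.

    Returns -1 to mean "never save", matching the convention in the patch.
    """
    if operation_is_read_only:
        # e.g. status: don't rewrite the dirstate just to refresh stat hashes.
        return -1
    return 10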

This drops 'bzr st [file]' down by 0.5+ seconds on OOo. Nice. It doesn't yet make a difference for add or commit. Those are small follow-on patches after we're happy with this one.

Revision history for this message
Andrew Bennetts (spiv) wrote : Posted in a previous version of this proposal

I'm no dirstate expert, so depending on how confident you are with this patch you might like to wait for another review before merging.

Overall, this change seems ok to me. It deserves careful benchmarking of course, but I know you've been doing that :)

It's interesting to me that _build_line_cache actually re-reads the lines from the file. I had expected that maybe we'd keep the lines around from the original read, but perhaps this is just as good. If we mmap'ed the dirstate then certainly re-reading from the mmap rather than keeping a copy would clearly make sense.

+        :param worth_saving_limit: when the exact number of changed entries
+            is known, only bother saving the dirstate if more than this
+            count of entries have changed. -1 means never save.

I'd personally prefer "None means never save", but that's not a big deal.

I believe the dirstate can record changes to tree state, not just cache stat info... so is there a risk with this patch that e.g. "bzr mv foo bar" will go unrecorded because it's under the _worth_saving_limit? I understand the value in not updating the cache for only a small number of changes, but AFAIK we always want to update the dirstate file when actual state is changed. I think the header_too=True flag is used to make sure this works, but that seems like a pretty indirect way of expressing the concept that this is a non-optional update.

I wonder: if you unconditionally set _worth_saving_limit to something very high (e.g. 1000000) does the test suite still pass? My guess is that except perhaps for one or two hyper-sensitive tests it ought to, and it might make for an interesting sanity check.

+        # We only enable smart saving is it hasn't already been disabled

"is" -> "if"

+        if self._use_smart_saving is not False:
+            self._use_smart_saving = True

So if _use_smart_saving is True, set it to True? Is that right? Perhaps you mean "is not None"?

(If you really are just testing for True/False, it's generally a bad idea to test for True or False with "is". Use "if [not] self._use_smart_saving:" instead, which is also easier to read.)
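As a small illustration of the tri-state being discussed (sketch only, following the "is None" test Andrew suggests): None means undecided, True means smart saving is in effect, False means it has been explicitly disabled.

# Illustrative tri-state flag; not the patch itself.
class SmartSavingFlag(object):

    def __init__(self):
        self._use_smart_saving = None   # undecided

    def note_hash_only_change(self):
        if self._use_smart_saving is None:   # enable only if not yet disabled
            self._use_smart_saving = True

    def note_semantic_change(self):
        self._use_smart_saving = False       # explicitly disable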

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

Andrew Bennetts wrote:
> Review: Approve
> I'm no dirstate expert, so depending on how confident you are with this patch you might like to wait for another review before merging.
>
> Overall, this change seems ok to me. It deserves careful benchmarking of course, but I know you've been doing that :)
>
> It's interesting to me that _build_line_cache actually re-reads the lines from the file.

I'm a bit surprised that re-reading the file is significantly faster
than converting a tuple of strings back into a serialized string. If it
is significant, then I'd certainly rather have a pyrex extension than
read from disk...

>
> +        :param worth_saving_limit: when the exact number of changed entries
> +            is known, only bother saving the dirstate if more than this
> +            count of entries have changed. -1 means never save.
>
> I'd personally prefer "None means never save", but that's not a big deal.
>
> I believe the dirstate can record changes to tree state, not just cache stat info...
> so is there a risk with this patch that e.g. "bzr mv foo bar" will go
> unrecorded because it's under the _worth_saving_limit? I understand the
> value in not updating the cache for only a small number of changes, but
> AFAIK we always want to update the dirstate file when actual state is
> changed. I think the header_too=True flag is used to make sure this
> works, but that seems like a pretty indirect way of expressing the
> concept that this is a non-optional update.

The *big* thing to watch out for is that there is "IN_MEMORY_MODIFIED"
that happens from a stat cache update, which can be trivially ignored.
And there is "IN_MEMORY_MODIFIED" that happens from "bzr add" which
*must* be written out.
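One way to keep those two cases apart, sketched here with invented names (the patch as proposed instead keeps a single MODIFIED state plus the _use_smart_saving flag):

# Sketch only: a separate in-memory state for "only cached hashes changed".
IN_MEMORY_UNMODIFIED = 0
IN_MEMORY_HASH_MODIFIED = 1   # stat/sha cache refreshed; safe to skip saving
IN_MEMORY_MODIFIED = 2        # real tree state changed; must be written out

def must_save(dirblock_state, changed_hash_count, worth_saving_limit):
    if dirblock_state == IN_MEMORY_MODIFIED:
        return True            # e.g. 'bzr add', 'bzr mv'
    if dirblock_state == IN_MEMORY_HASH_MODIFIED:
        return changed_hash_count > worth_saving_limit
    return False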

I didn't look at the code to see if Ian clearly distinguished them; if
he has, then we are fine.

John
=:->

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Posted in a previous version of this proposal

Andrew Bennetts wrote:
> Review: Approve
>
> It's interesting to me that _build_line_cache actually re-reads the lines from the file. I had expected that maybe we'd keep the lines around from the original read, but perhaps this is just as good. If we mmap'ed the dirstate then certainly re-reading from the mmap rather than keeping a copy would clearly make sense.
>
>
Andrew,

Thanks for the review!

I did carefully benchmark a few approaches here and, as discussed on
IRC, not doing anything unless we absolutely have to gives the best
result for status. For other operations (that don't leverage the ability
to track some changes yet), the best time to read is probably just
before we mutate the dirstate. FWIW, profiling showed file reading time
to be a small factor - it was rebuilding the keys in the right order
that mattered. Before mutation, we simply grab them straight out of the
dirstate. After that, we need to rebuild them from the file content.

> I believe the dirstate can record changes to tree state, not just cache stat info... so is there a risk with this patch that e.g. "bzr mv foo bar" will go unrecorded because it's under the _worth_saving_limit? I understand the value in not updating the cache for only a small number of changes, but AFAIK we always want to update the dirstate file when actual state is changed. I think the header_too=True flag is used to make sure this works, but that seems like a pretty indirect way of expressing the concept that this is a non-optional update.
>
>
With this patch, there's no risk here. When follow-on patches arrive for
operations other than status, I'll need to handle those sorts of cases.

> I wonder: if you unconditionally set _worth_saving_limit to something very high (e.g. 1000000) does the test suite still pass? My guess is that except perhaps for one or two hyper-sensitive tests it ought to, and it might make for an interesting sanity check.
>
>
I tried this during development. Some low level tests fail that "make
change then read disk content". Otherwise, it looked ok to me.

> +        # We only enable smart saving is it hasn't already been disabled
>
> "is" -> "if"
>
> +        if self._use_smart_saving is not False:
> +            self._use_smart_saving = True
>
> So if _use_smart_saving is True, set it to True? Is that right? Perhaps you mean "is not None"?
>
>
As discussed on IRC, I'll make this "if xxx is None".

Once again, thanks for the review.

Ian C.

Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal

I'd just like to make it clear that I'm "hesitant" on this patch. I'll admit I haven't looked at it closely, but it seems a bit along the lines of a band-aid for now sort of thing.

Especially given the complexity of keeping multiple inventories in sync...

221 +        if self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED:
222 +            keys = []
223 +            for directory in self._dirblocks:
224 +                keys.extend([e[0] for e in directory[1]])
225 +        else:

What is the use case for calling _build_line_cache when the current state is IN_MEMORY_UNMODIFIED? Certainly save() is already defined as a no-op in that case. I'd rather avoid having extra code paths that aren't really being used.
I guess maybe it is if the header changes and the blocks don't...

I'll just mention that if you look over the code, the only time _header_state is set to ...MODIFIED is when _dirblock_state is also set to ...MODIFIED.

The only reason it has a separate state flag, is because you can read the header without reading the rest of the dirblocks. (At least to date, certainly the knob exists for that to not be true.)
Though if you consider what is in the header:

1) num parents / parent ids (if this changes, the dirblock state certainly should)
2) ghost parents (see 1)
3) crc (Also indicative of actual content changes)
4) num_entries (if this changes, we better have dirblock changes)

review: Needs Information
Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

re: safety.

Can I propose that we have an explicit API for just updating the hash
cache (we may already). And that the current state of MODIFIED *always*
refer to having semantic changes rather than hash cache updates.

Then, the hash cache updating code should not alter the dirstate state.
Rather it should just accumulate into a 'hash cache changes counter'.
And save can check that.

This may be very similar to the code. The key thing is that I hope this
will structure things so as to avoid needing to be careful in commands.
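A sketch of the shape Robert describes, with invented method and attribute names: hash-cache updates go through their own entry point, never flip the MODIFIED state, and only bump a counter that save() consults.

# Illustrative only; not the proposed patch.
class HashCacheCountingDirState(object):
    IN_MEMORY_UNMODIFIED = 0
    IN_MEMORY_MODIFIED = 1        # reserved for semantic changes only

    def __init__(self, worth_saving_limit=10):
        self._dirblock_state = self.IN_MEMORY_UNMODIFIED
        self._hash_cache_updates = 0
        self._worth_saving_limit = worth_saving_limit

    def update_hash_cache(self, entry, sha1, packed_stat):
        """Refresh a cached sha1/stat without marking the state modified."""
        # Update the entry's cached details in place, then count it rather
        # than flipping the MODIFIED state.
        entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3], packed_stat)
        self._hash_cache_updates += 1

    def mark_modified(self):
        """Record a semantic change (add, rename, parent change, ...)."""
        self._dirblock_state = self.IN_MEMORY_MODIFIED

    def _worth_saving(self):
        if self._dirblock_state == self.IN_MEMORY_MODIFIED:
            return True
        return self._hash_cache_updates > self._worth_saving_limit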

-Rob

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

Thanks to Andrew, John and Robert for their feedback on my earlier merge proposal. This updated one takes their comments into account, particularly the bit about optional vs mandatory (hash change vs semantically important) updating. It also drops the "line cache" as there are better ways (e.g. pyrex) of solving serialisation speed (which we can look into later).

As before, this improves 'bzr status' performance on OpenOffice.org by 0.5 seconds after changing a small number of files. As mentioned in NEWS, full status drops from 1.6 to 1.1 seconds while status on just a FILE drops from 1.0 to 0.5 seconds.

Revision history for this message
Martin Pool (mbp) wrote :

It looks like you're using _worth_saving_limit as both a method and a variable on similar classes, which is at least potentially confusing if not actually buggy....
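For reference, the overlap being pointed at, condensed from the diff below (class names simplified for illustration):

# Condensed illustration only; not runnable bzrlib code as-is.
class WorkingTreeSide(object):
    def _worth_saving_limit(self):      # a method on the tree side...
        return 10

class DirStateSide(object):
    def __init__(self, worth_saving_limit=0):
        self._worth_saving_limit = worth_saving_limit   # ...an attribute here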

Revision history for this message
Martin Pool (mbp) wrote :

> It looks like you're using _worth_saving_limit as both a method and a variable
> on similar classes, which is at least potentially confusing if not actually
> buggy....

lifeless: poolie: the last one I read looked like it could cause confusion about the internal state; jam's comments were along those lines and I agreed.
lifeless: I don't know if it's been improved since then; and while it's a functional improvement it is a risky area of code.

Combined with my comments I think this may need to be resubmitted.

review: Needs Fixing
Revision history for this message
Martin Pool (mbp) wrote :

I'm going to say Reject to get it out of the queue; the idea is good but this needs updates.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

I was happy to take this out of the queue while we focussed on 2.0. Right now though, I'm pretty confused here as to why this is rejected. Can someone do a proper review please and tell me exactly what's wrong?

To my understanding, I addressed the first round of comments. If I need to rename a variable or method, that's approve-after-tweak, not reject surely.

Revision history for this message
John A Meinel (jameinel) wrote :

Ian Clatworthy wrote:
> I was happy to take this out of the queue while we focussed on 2.0. Right now though, I'm pretty confused here as to why this is rejected. Can someone do a proper review please and tell me exactly what's wrong?
>
> To my understanding, I addressed the first round of comments. If I need to rename a variable or method, that's approve-after-tweak, not reject surely.
>

To clarify

I was cleaning up the queue a bit, where we had lots of stuff marked one
way or another in votes, but not in the actual merge status. The last
statement on this was "I'll Reject for now" from Martin, but he had not
actually set the status, so I did.

My understanding was that this was meant to be, please address some of
the issues and resubmit.

Unfortunately, there is no "Resubmit" object for a merge proposal.
Potentially we could have used "Work in Progress", which I think takes
it out of the queue.

John
=:->

Preview Diff

=== modified file 'NEWS'
--- NEWS 2009-08-30 23:51:10 +0000
+++ NEWS 2009-08-31 04:37:34 +0000
@@ -771,6 +771,7 @@
 Improvements
 ************
 
+<<<<<<< TREE
 * ``bzr annotate`` can now be significantly faster. The time for
   ``bzr annotate NEWS`` is down to 7s from 22s in 1.16. Files with long
   histories and lots of 'duplicate insertions' will be improved more than
@@ -786,6 +787,12 @@
 * Initial commit performance in ``--2a`` repositories has been improved by
   making it cheaper to build the initial CHKMap. (John Arbash Meinel)
 
+=======
+* ``bzr status [FILE]`` is now faster. On OpenOffice.org after changing a
+  file, the time improves from 1.6 to 1.1 seconds for full status and
+  1.0 to 0.5 seconds for status on the file. (Ian Clatworthy)
+
+>>>>>>> MERGE-SOURCE
 * Resolving a revno to a revision id on a branch accessed via ``bzr://``
   or ``bzr+ssh://`` is now much faster and involves no VFS operations.
   This speeds up commands like ``bzr pull -r 123``. (Andrew Bennetts)

=== modified file 'bzrlib/_dirstate_helpers_pyx.pyx'
--- bzrlib/_dirstate_helpers_pyx.pyx 2009-08-28 05:00:33 +0000
+++ bzrlib/_dirstate_helpers_pyx.pyx 2009-08-31 04:37:34 +0000
@@ -911,7 +911,7 @@
         else:
             entry[1][0] = ('l', '', stat_value.st_size,
                            False, DirState.NULLSTAT)
-    self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+    self._mark_modified([entry])
     return link_or_sha1
 
 

=== modified file 'bzrlib/dirstate.py'
--- bzrlib/dirstate.py 2009-08-28 05:00:33 +0000
+++ bzrlib/dirstate.py 2009-08-31 04:37:34 +0000
@@ -363,11 +363,14 @@
     HEADER_FORMAT_2 = '#bazaar dirstate flat format 2\n'
     HEADER_FORMAT_3 = '#bazaar dirstate flat format 3\n'
 
-    def __init__(self, path, sha1_provider):
+    def __init__(self, path, sha1_provider, worth_saving_limit=0):
         """Create a DirState object.
 
         :param path: The path at which the dirstate file on disk should live.
         :param sha1_provider: an object meeting the SHA1Provider interface.
+        :param worth_saving_limit: when the exact number of hash changed
+            entries is known, only bother saving the dirstate if more than
+            this count of entries have changed. -1 means never save.
         """
         # _header_state and _dirblock_state represent the current state
         # of the dirstate metadata and the per-row data respectiely.
@@ -410,11 +413,47 @@
         # during commit.
         self._last_block_index = None
         self._last_entry_index = None
+        # The set of known hash changes
+        self._known_hash_changes = set()
+        # How many hash changed entries can we have without saving
+        self._worth_saving_limit = worth_saving_limit
+        # If True, consider the worth saving limit when deciding whether to
+        # save the dirstate or not. If False, ignore it. If None, it can be
+        # set True but isn't True yet.
+        self._use_smart_saving = None
 
     def __repr__(self):
         return "%s(%r)" % \
             (self.__class__.__name__, self._filename)
 
+    def _mark_modified(self, hash_changed_entries=None, header_too=False):
+        """Mark this dirstate as modified.
+
+        :param hash_changed_entries: if non-None, mark just these entries as
+            having their hash modified.
+        :param header_too: mark the header modified as well, not just the
+            dirblocks.
+        """
+        #trace.mutter_callsite(3, "modified hash entries: %s", hash_changed_entries)
+        if hash_changed_entries:
+            self._known_hash_changes.update([e[0] for e in hash_changed_entries])
+            # We only enable smart saving is it hasn't already been disabled
+            if self._use_smart_saving is None:
+                self._use_smart_saving = True
+        else:
+            # We don't know exactly what changed so disable smart saving
+            self._use_smart_saving = False
+        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        if header_too:
+            self._header_state = DirState.IN_MEMORY_MODIFIED
+
+    def _mark_unmodified(self):
+        """Mark this dirstate as unmodified."""
+        self._header_state = DirState.IN_MEMORY_UNMODIFIED
+        self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+        self._use_smart_saving = None
+        self._known_hash_changes = set()
+
     def add(self, path, file_id, kind, stat, fingerprint):
         """Add a path to be tracked.
 
@@ -546,7 +585,7 @@
         if kind == 'directory':
             # insert a new dirblock
             self._ensure_block(block_index, entry_index, utf8path)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index:
             self._id_index.setdefault(entry_key[2], set()).add(entry_key)
 
@@ -1018,8 +1057,7 @@
 
         self._ghosts = []
         self._parents = [parents[0]]
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
 
     def _empty_parent_info(self):
         return [DirState.NULL_PARENT_DETAILS] * (len(self._parents) -
@@ -1555,8 +1593,7 @@
             # the active tree.
             raise errors.InconsistentDeltaDelta(delta, "error from _get_entry.")
 
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
-        self._header_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = None
         return
 
@@ -1735,7 +1772,7 @@
                 and stat_value.st_ctime < self._cutoff_time):
                 entry[1][0] = ('f', sha1, entry[1][0][2], entry[1][0][3],
                                packed_stat)
-                self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+                self._mark_modified()
 
     def _sha_cutoff_time(self):
         """Return cutoff time.
@@ -1799,14 +1836,13 @@
         """Serialise the entire dirstate to a sequence of lines."""
         if (self._header_state == DirState.IN_MEMORY_UNMODIFIED and
             self._dirblock_state == DirState.IN_MEMORY_UNMODIFIED):
-            # read whats on disk.
+            # read what's on disk.
             self._state_file.seek(0)
             return self._state_file.readlines()
         lines = []
         lines.append(self._get_parents_line(self.get_parent_ids()))
         lines.append(self._get_ghosts_line(self._ghosts))
-        # append the root line which is special cased
-        lines.extend(map(self._entry_to_line, self._iter_entries()))
+        lines.extend(self._get_entry_lines())
         return self._get_output_lines(lines)
 
     def _get_ghosts_line(self, ghost_ids):
@@ -1817,6 +1853,10 @@
         """Create a line for the state file for parents information."""
         return '\0'.join([str(len(parent_ids))] + parent_ids)
 
+    def _get_entry_lines(self):
+        """Create lines for entries."""
+        return map(self._entry_to_line, self._iter_entries())
+
     def _get_fields_to_entry(self):
         """Get a function which converts entry fields into a entry record.
 
@@ -2175,17 +2215,21 @@
         return len(self._parents) - len(self._ghosts)
 
     @staticmethod
-    def on_file(path, sha1_provider=None):
+    def on_file(path, sha1_provider=None, worth_saving_limit=0):
         """Construct a DirState on the file at path "path".
 
         :param path: The path at which the dirstate file on disk should live.
         :param sha1_provider: an object meeting the SHA1Provider interface.
             If None, a DefaultSHA1Provider is used.
+        :param worth_saving_limit: when the exact number of hash changed
+            entries is known, only bother saving the dirstate if more than
+            this count of entries have changed. -1 means never save.
         :return: An unlocked DirState object, associated with the given path.
         """
         if sha1_provider is None:
             sha1_provider = DefaultSHA1Provider()
-        result = DirState(path, sha1_provider)
+        result = DirState(path, sha1_provider,
+                          worth_saving_limit=worth_saving_limit)
         return result
 
     def _read_dirblocks_if_needed(self):
@@ -2283,9 +2327,7 @@
             trace.mutter('Not saving DirState because '
                 '_changes_aborted is set.')
             return
-        if (self._header_state == DirState.IN_MEMORY_MODIFIED or
-            self._dirblock_state == DirState.IN_MEMORY_MODIFIED):
-
+        if self._worth_saving():
             grabbed_write_lock = False
             if self._lock_state != 'w':
                 grabbed_write_lock, new_lock = self._lock_token.temporary_write_lock()
@@ -2299,12 +2341,12 @@
                     # We couldn't grab a write lock, so we switch back to a read one
                     return
             try:
+                lines = self.get_lines()
                 self._state_file.seek(0)
-                self._state_file.writelines(self.get_lines())
+                self._state_file.writelines(lines)
                 self._state_file.truncate()
                 self._state_file.flush()
-                self._header_state = DirState.IN_MEMORY_UNMODIFIED
-                self._dirblock_state = DirState.IN_MEMORY_UNMODIFIED
+                self._mark_unmodified()
             finally:
                 if grabbed_write_lock:
                     self._lock_token = self._lock_token.restore_read_lock()
@@ -2313,6 +2355,25 @@
                     #       not changed contents. Since restore_read_lock may
                     #       not be an atomic operation.
 
+    def _worth_saving(self):
+        """Is it worth saving the dirstate or not?"""
+        if self._header_state == DirState.IN_MEMORY_MODIFIED:
+            return True
+        if (self._dirblock_state == DirState.IN_MEMORY_MODIFIED and
+            self._worth_saving_limit != -1):
+            # If we're using smart saving and only a small number of
+            # entries have changed their hash, don't bother saving. John has
+            # suggested using a heuristic here based on the size of the
+            # changed files and/or tree. For now, we go with a configurable
+            # number of changes, keeping the calculation time
+            # as low overhead as possible. (This also keeps all existing
+            # tests passing as the default is 0, i.e. always save.)
+            if self._use_smart_saving:
+                return len(self._known_hash_changes) > self._worth_saving_limit
+            else:
+                return True
+        return False
+
     def _set_data(self, parent_ids, dirblocks):
         """Set the full dirstate data in memory.
 
@@ -2326,8 +2387,7 @@
         """
         # our memory copy is now authoritative.
         self._dirblocks = dirblocks
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._parents = list(parent_ids)
         self._id_index = None
         self._packed_stat_index = None
@@ -2353,7 +2413,7 @@
         self._make_absent(entry)
         self.update_minimal(('', '', new_id), 'd',
             path_utf8='', packed_stat=entry[1][0][4])
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         if self._id_index is not None:
             self._id_index.setdefault(new_id, set()).add(entry[0])
 
@@ -2493,8 +2553,7 @@
         self._entries_to_current_state(new_entries)
         self._parents = [rev_id for rev_id, tree in trees]
         self._ghosts = list(ghosts)
-        self._header_state = DirState.IN_MEMORY_MODIFIED
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified(header_too=True)
         self._id_index = id_index
 
     def _sort_entries(self, entry_list):
@@ -2637,7 +2696,7 @@
                     current_old[0][1].decode('utf8'))
             self._make_absent(current_old)
             current_old = advance(old_iterator)
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         self._id_index = None
         self._packed_stat_index = None
         if tracing:
@@ -2692,7 +2751,7 @@
                 if update_tree_details[0][0] == 'a': # absent
                     raise AssertionError('bad row %r' % (update_tree_details,))
                 update_tree_details[0] = DirState.NULL_PARENT_DETAILS
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
         return last_reference
 
     def update_minimal(self, key, minikind, executable=False, fingerprint='',
@@ -2857,7 +2916,7 @@
             if not present:
                 self._dirblocks.insert(block_index, (subdir_key[0], []))
 
-        self._dirblock_state = DirState.IN_MEMORY_MODIFIED
+        self._mark_modified()
 
     def _maybe_remove_row(self, block, index, id_index):
         """Remove index if it is absent or relocated across the row.
@@ -3158,7 +3217,7 @@
         else:
            entry[1][0] = ('l', '', stat_value.st_size,
                           False, DirState.NULLSTAT)
-    state._dirblock_state = DirState.IN_MEMORY_MODIFIED
+    state._mark_modified([entry])
     return link_or_sha1
 
 

=== modified file 'bzrlib/workingtree_4.py'
--- bzrlib/workingtree_4.py 2009-08-25 04:43:21 +0000
+++ bzrlib/workingtree_4.py 2009-08-31 04:37:35 +0000
@@ -219,7 +219,7 @@
         local_path = self.bzrdir.get_workingtree_transport(None
             ).local_abspath('dirstate')
         self._dirstate = dirstate.DirState.on_file(local_path,
-            self._sha1_provider())
+            self._sha1_provider(), self._worth_saving_limit())
         return self._dirstate
 
     def _sha1_provider(self):
@@ -234,6 +234,15 @@
         else:
             return None
 
+    def _worth_saving_limit(self):
+        """How many hash changes are ok before we must save the dirstate.
+
+        :return: an integer. -1 means never save.
+        """
+        # XXX: In the future, we could return -1 for logical read-only
+        # operations like status. For now, just use a small number.
+        return 10
+
     def filter_unversioned_files(self, paths):
         """Filter out paths that are versioned.
 
@@ -839,7 +848,7 @@
                 rollback_rename()
                 raise
             result.append((from_rel, to_rel))
-        state._dirblock_state = dirstate.DirState.IN_MEMORY_MODIFIED
+        state._mark_modified()
         self._make_dirty(reset_inventory=False)
 
         return result