Merge lp:~spiv/bzr/graceful-broken-lockdir-619872-2.0 into lp:bzr/2.0

Proposed by Andrew Bennetts on 2010-09-06

Status:	Merged
Approved by:	Martin Pool on 2010-09-06
Approved revision:	no longer in the source branch.
Merged at revision:	4757
Proposed branch:	lp:~spiv/bzr/graceful-broken-lockdir-619872-2.0
Merge into:	lp:bzr/2.0
Diff against target:	256 lines (+172/-2), 5 files modified
	bzrlib/errors.py (+12/-0)
	bzrlib/lockdir.py (+44/-2)
	bzrlib/tests/test_errors.py (+7/-0)
	bzrlib/tests/test_lockdir.py (+101/-0)
	bzrlib/tests/test_rio.py (+8/-0)
To merge this branch:	bzr merge lp:~spiv/bzr/graceful-broken-lockdir-619872-2.0

Reviewer	Review Type	Date Requested	Status
Martin Pool		2010-09-06	Approve on 2010-09-06
Review via email: mp+34657@code.launchpad.net

Commit message

more gracefully handle corrupt lockdirs (missing/null held files)

Description of the change

This patch addresses bug 619872. It adds a LockCorrupt error so that the
lockdir code can clearly distinguish apparently-held-but-damaged locks from
other conditions like not-held. Specifically this deals with the case where the
info file is present, but unparseable, e.g. contains only NUL bytes. This
immediately improves the error reporting to the user when a corrupt
lock/held/info file is encountered (avoids a traceback, and suggests the user
try 'bzr break-lock').
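
As a rough, standalone sketch of that idea (not the bzrlib code: the toy
read_stanza parser and parse_info helper below are simplified stand-ins, and
only the LockCorrupt constructor signature and message follow the patch), a
parse failure can be converted into a dedicated error that carries the raw
file contents:

```python
# Standalone sketch. LockError, read_stanza and parse_info are simplified
# stand-ins written for illustration; they are not the bzrlib implementations.

class LockError(Exception):
    pass


class LockCorrupt(LockError):
    def __init__(self, corruption_info, file_data=None):
        super().__init__(
            "Lock is apparently held, but corrupted: %s\n"
            "Use 'bzr break-lock' to clear it" % corruption_info)
        self.corruption_info = corruption_info
        self.file_data = file_data


def read_stanza(lines):
    """Toy 'key: value' parser that raises ValueError on malformed lines."""
    stanza = {}
    for line in lines:
        if not line.strip():
            break  # a blank line ends the stanza
        key, sep, value = line.partition(': ')
        if not sep or not key.isidentifier():
            raise ValueError('invalid line: %r' % line)
        stanza[key] = value.rstrip('\n')
    return stanza or None


def parse_info(lines):
    """Return the parsed info, None if empty, or raise LockCorrupt."""
    try:
        return read_stanza(lines)
    except ValueError as e:
        # Keep the raw lines so a later break can check it is clearing
        # the same corrupt lock that was observed here.
        raise LockCorrupt('could not parse lock info file: ' + str(e), lines)
```

A caller can then treat "held but unreadable" separately from "not held"
(a None result) instead of crashing on the underlying ValueError.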

Further, it improves break_lock to cope gracefully with LockCorrupt, so that it
will prompt “Break (corrupt $lock_repr)” [similar to the usual “Break
($lock_info)”], and successfully clears the corrupt lock.
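
The control flow can be sketched roughly as follows. This is an in-memory
illustration only: ToyLock and RecordingUI are hypothetical names, and the
real code works through a transport and renames the held directory rather
than resetting an attribute.

```python
# Illustrative sketch; ToyLock and RecordingUI are hypothetical stand-ins.

class LockCorrupt(Exception):
    def __init__(self, corruption_info, file_data=None):
        super().__init__(corruption_info)
        self.file_data = file_data


class RecordingUI:
    """UI stub that records each prompt and returns a fixed answer."""

    def __init__(self, answer=True):
        self.prompts = []
        self.answer = answer

    def get_boolean(self, prompt):
        self.prompts.append(prompt)
        return self.answer


class ToyLock:
    """In-memory lock whose info lines may be unparseable."""

    def __init__(self, info_lines):
        self.info_lines = info_lines  # None means "not held"

    def __repr__(self):
        return 'ToyLock()'

    def peek(self):
        if self.info_lines is None:
            return None
        if any('\0' in line for line in self.info_lines):
            raise LockCorrupt('could not parse lock info file',
                              self.info_lines)
        return self.info_lines

    def break_lock(self, ui):
        try:
            holder = self.peek()
        except LockCorrupt as e:
            if ui.get_boolean('Break (corrupt %r)' % (self,)):
                # Only clear the lock if it still holds the data we saw.
                if self.info_lines == e.file_data:
                    self.info_lines = None
            return
        if holder is not None and ui.get_boolean('Break %s' % ''.join(holder)):
            self.info_lines = None
```

With a lock created as ToyLock(['\0']), break_lock issues a single
"Break (corrupt ...)" prompt and a subsequent peek() reports the lock as
not held.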

I've added quite a few tests to try to make sure we're covering all the interesting
cases. Ideally any time there is a held/ directory in a LockDir without a
parseable info file we'd raise LockCorrupt, but actually there are many cases
already where we silently treat an empty held/ directory as the same as an
unheld lock:
* many places assume NoSuchFile when reading held/info implies there is no lock
* attempt_lock assumes that if rename(new-dir, 'held') raises no error, then
   there was no lock, but on POSIX rename(new-dir, empty-dir) silently passes
   [but not on Windows, IIUC, so it would raise LockContention there... but
   wait_lock would then try to peek after the timeout and get LockCorrupt!]
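
The rename behaviour in the second bullet can be checked with a small
standalone script. It uses plain os.rename on a temporary directory rather
than a bzrlib transport, and the helper name is made up for illustration:

```python
import os
import tempfile


def rename_over_empty_dir():
    """Return True if renaming a directory onto an empty one succeeds."""
    with tempfile.TemporaryDirectory() as base:
        new_dir = os.path.join(base, 'pending')
        held = os.path.join(base, 'held')
        os.mkdir(new_dir)
        os.mkdir(held)  # an empty 'held' directory with no info file
        try:
            os.rename(new_dir, held)
        except OSError:
            # Expected on Windows, where the destination must not exist.
            return False
        return True
```

On POSIX systems this is expected to return True, which is why an empty
held/ directory is indistinguishable from an unheld lock at that point.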

I don't think we can fix these cases without possibly impacting performance, and
arguably at least some of these cases are harmless, so I've left this behaviour
untouched, although I've explicitly allowed some tests to tolerate LockCorrupt
if a future implementation chooses to notice these issues. Most importantly,
there are now at least tests checking that we don't explode in these cases.
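
For reference, tolerating more than one outcome in a test relies on
assertRaises accepting a tuple of exception classes. A minimal sketch with
the standard unittest module (the exception classes and attempt_lock
function here are placeholders, not the bzrlib ones):

```python
import unittest


class LockContention(Exception):
    pass


class LockCorrupt(Exception):
    pass


def attempt_lock(behaviour):
    """Placeholder that fails in whichever way the caller asks for."""
    if behaviour == 'contention':
        raise LockContention('lock is held')
    raise LockCorrupt('info file is unparseable')


class TolerantLockTest(unittest.TestCase):

    def test_current_behaviour(self):
        # Passes with today's result...
        self.assertRaises(
            (LockCorrupt, LockContention), attempt_lock, 'contention')

    def test_possible_future_behaviour(self):
        # ...and would keep passing if the implementation became stricter.
        self.assertRaises(
            (LockCorrupt, LockContention), attempt_lock, 'corrupt')
```

Any exception outside the tuple still fails the test, so the "don't
explode" guarantee is kept.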

I've written this patch against 2.0.x, although perhaps this is a bit too much
like a new feature to belong on the stable branch. (Certainly if we had
translations this patch would violate Ubuntu's SRU policy by introducing new
strings.) I haven't yet written the NEWS entries for this patch, but if we
decide it is ok for 2.0 I certainly will!

Martin Pool (mbp) wrote on 2010-09-06:

That looks good.

I do wonder how we end up with lock files containing nulls.

I think the LoggingUIFactory should be moved into somewhere more generally reusable, and definitely not copy&pasted.

The error could have a link to the bug reporting the kind of corruption we saw.

review: Approve

Martin Pool (mbp) wrote on 2010-09-08:

sent to pqm by email

Preview Diff

 === modified file 'bzrlib/errors.py'
 --- bzrlib/errors.py	2009-08-27 05:22:14 +0000
 +++ bzrlib/errors.py	2010-09-06 07:46:12 +0000
@@ -1054,6 +1054,18 @@
          self.target = target
++class LockCorrupt(LockError):
++
++    _fmt = ("Lock is apparently held, but corrupted: %(corruption_info)s\n"
++            "Use 'bzr break-lock' to clear it")
++
++    internal_error = False
++
++    def __init__(self, corruption_info, file_data=None):
++        self.corruption_info = corruption_info
++        self.file_data = file_data
++
++
  class LockNotHeld(LockError):
      _fmt = "Lock not held: %(lock)s"
 === modified file 'bzrlib/lockdir.py'
 --- bzrlib/lockdir.py	2010-02-26 06:51:51 +0000
 +++ bzrlib/lockdir.py	2010-09-06 07:46:12 +0000
@@ -118,6 +118,7 @@
          LockBreakMismatch,
          LockBroken,
          LockContention,
++        LockCorrupt,
          LockFailed,
          LockNotHeld,
          NoSuchFile,
@@ -345,7 +346,13 @@
          it possibly being still active.
          """
          self._check_not_locked()
--        holder_info = self.peek()
++        try:
++            holder_info = self.peek()
++        except LockCorrupt, e:
++            # The lock info is corrupt.
++            if bzrlib.ui.ui_factory.get_boolean("Break (corrupt %r)" % (self,)):
++                self.force_break_corrupt(e.file_data)
++            return
          if holder_info is not None:
              lock_info = '\n'.join(self._format_lock_info(holder_info))
              if bzrlib.ui.ui_factory.get_boolean("Break %s" % lock_info):
@@ -392,6 +399,35 @@
          for hook in self.hooks['lock_broken']:
              hook(result)
++    def force_break_corrupt(self, corrupt_info_lines):
++        """Release a lock that has been corrupted.
++
++        This is very similar to force_break, except it doesn't assume that
++        self.peek() can work.
++
++        :param corrupt_info_lines: the lines of the corrupted info file, used
++            to check that the lock hasn't changed between reading the (corrupt)
++            info file and calling force_break_corrupt.
++        """
++        # XXX: this copes with unparseable info files, but what about missing
++        # info files?  Or missing lock dirs?
++        self._check_not_locked()
++        tmpname = '%s/broken.%s.tmp' % (self.path, rand_chars(20))
++        self.transport.rename(self._held_dir, tmpname)
++        # check that we actually broke the right lock, not someone else;
++        # there's a small race window between checking it and doing the
++        # rename.
++        broken_info_path = tmpname + self.__INFO_NAME
++        f = self.transport.get(broken_info_path)
++        broken_lines = f.readlines()
++        if broken_lines != corrupt_info_lines:
++            raise LockBreakMismatch(self, broken_lines, corrupt_info_lines)
++        self.transport.delete(broken_info_path)
++        self.transport.rmdir(tmpname)
++        result = lock.LockResult(self.transport.abspath(self.path))
++        for hook in self.hooks['lock_broken']:
++            hook(result)
++
      def _check_not_locked(self):
          """If the lock is held by this instance, raise an error."""
          if self._lock_held:
@@ -456,7 +492,13 @@
          return s.to_string()
      def _parse_info(self, info_file):
--        stanza = rio.read_stanza(info_file.readlines())
++        lines = info_file.readlines()
++        try:
++            stanza = rio.read_stanza(lines)
++        except ValueError, e:
++            mutter('Corrupt lock info file: %r', lines)
++            raise LockCorrupt("could not parse lock info file: " + str(e),
++                              lines)
          if stanza is None:
              # see bug 185013; we fairly often end up with the info file being
              # empty after an interruption; we could log a message here but
 === modified file 'bzrlib/tests/test_errors.py'
 --- bzrlib/tests/test_errors.py	2009-08-21 02:37:18 +0000
 +++ bzrlib/tests/test_errors.py	2010-09-06 07:46:12 +0000
@@ -132,6 +132,13 @@
              "cannot be broken.",
              str(error))
++    def test_lock_corrupt(self):
++        error = errors.LockCorrupt("corruption info")
++        self.assertEqualDiff("Lock is apparently held, but corrupted: "
++            "corruption info\n"
++            "Use 'bzr break-lock' to clear it",
++            str(error))
++
      def test_knit_data_stream_incompatible(self):
          error = errors.KnitDataStreamIncompatible(
              'stream format', 'target format')
 === modified file 'bzrlib/tests/test_lockdir.py'
 --- bzrlib/tests/test_lockdir.py	2010-02-26 06:51:51 +0000
 +++ bzrlib/tests/test_lockdir.py	2010-09-06 07:46:12 +0000
@@ -568,6 +568,62 @@
          finally:
              bzrlib.ui.ui_factory = orig_factory
++    def test_break_lock_corrupt_info(self):
++        """break_lock works even if the info file is corrupt (and tells the UI
++        that it is corrupt).
++        """
++        ld = self.get_lock()
++        ld2 = self.get_lock()
++        ld.create()
++        ld.lock_write()
++        ld.transport.put_bytes_non_atomic('test_lock/held/info', '\0')
++        class LoggingUIFactory(bzrlib.ui.SilentUIFactory):
++            def __init__(self):
++                self.prompts = []
++            def get_boolean(self, prompt):
++                self.prompts.append(('boolean', prompt))
++                return True
++        ui = LoggingUIFactory()
++        orig_factory = bzrlib.ui.ui_factory
++        bzrlib.ui.ui_factory = ui
++        try:
++            ld2.break_lock()
++            self.assertLength(1, ui.prompts)
++            self.assertEqual('boolean', ui.prompts[0][0])
++            self.assertStartsWith(ui.prompts[0][1], 'Break (corrupt LockDir')
++            self.assertRaises(LockBroken, ld.unlock)
++        finally:
++            bzrlib.ui.ui_factory = orig_factory
++
++    def test_break_lock_missing_info(self):
++        """break_lock works even if the info file is missing (and tells the UI
++        that it is corrupt).
++        """
++        ld = self.get_lock()
++        ld2 = self.get_lock()
++        ld.create()
++        ld.lock_write()
++        ld.transport.delete('test_lock/held/info')
++        class LoggingUIFactory(bzrlib.ui.SilentUIFactory):
++            def __init__(self):
++                self.prompts = []
++            def get_boolean(self, prompt):
++                self.prompts.append(('boolean', prompt))
++                return True
++        ui = LoggingUIFactory()
++        orig_factory = bzrlib.ui.ui_factory
++        bzrlib.ui.ui_factory = ui
++        try:
++            ld2.break_lock()
++            self.assertRaises(LockBroken, ld.unlock)
++            self.assertLength(0, ui.prompts)
++        finally:
++            bzrlib.ui.ui_factory = orig_factory
++        # Suppress warnings due to ld not being unlocked
++        # XXX: if lock_broken hook was invoked in this case, this hack would
++        # not be necessary.  - Andrew Bennetts, 2010-09-06.
++        del self._lock_actions[:]
++
      def test_create_missing_base_directory(self):
          """If LockDir.path doesn't exist, it can be created
@@ -688,6 +744,51 @@
               'locked (unknown)'],
              formatted_info)
++    def test_corrupt_lockdir_info(self):
++        """We can cope with corrupt (and thus unparseable) info files."""
++        # This seems like a fairly common failure case too - see
++        # <https://bugs.edge.launchpad.net/bzr/+bug/619872> for instance.
++        # In particular some systems tend to fill recently created files with
++        # nul bytes after recovering from a system crash.
++        t = self.get_transport()
++        t.mkdir('test_lock')
++        t.mkdir('test_lock/held')
++        t.put_bytes('test_lock/held/info', '\0')
++        lf = LockDir(t, 'test_lock')
++        self.assertRaises(errors.LockCorrupt, lf.peek)
++        # Currently attempt_lock gives LockContention, but LockCorrupt would be
++        # a reasonable result too.
++        self.assertRaises(
++            (errors.LockCorrupt, errors.LockContention), lf.attempt_lock)
++        self.assertRaises(errors.LockCorrupt, lf.validate_token, 'fake token')
++
++    def test_missing_lockdir_info(self):
++        """We can cope with absent info files."""
++        t = self.get_transport()
++        t.mkdir('test_lock')
++        t.mkdir('test_lock/held')
++        lf = LockDir(t, 'test_lock')
++        # In this case we expect the 'not held' result from peek, because peek
++        # cannot be expected to notice that there is a 'held' directory with no
++        # 'info' file.
++        self.assertEqual(None, lf.peek())
++        # And lock/unlock may work or give LockContention (but not any other
++        # error).
++        try:
++            lf.attempt_lock()
++        except LockContention:
++            # LockContention is ok, and expected on Windows
++            pass
++        else:
++            # no error is ok, and expected on POSIX (because POSIX allows
++            # os.rename over an empty directory).
++            lf.unlock()
++        # Currently raises TokenMismatch, but LockCorrupt would be reasonable
++        # too.
++        self.assertRaises(
++            (errors.TokenMismatch, errors.LockCorrupt),
++            lf.validate_token, 'fake token')
++
  class TestLockDirHooks(TestCaseWithTransport):
 === modified file 'bzrlib/tests/test_rio.py'
 --- bzrlib/tests/test_rio.py	2009-03-23 14:59:43 +0000
 +++ bzrlib/tests/test_rio.py	2010-09-06 07:46:12 +0000
@@ -188,6 +188,14 @@
          self.assertEqual(s, None)
          self.assertTrue(s is None)
++    def test_read_nul_byte(self):
++        """File consisting of a nul byte causes an error."""
++        self.assertRaises(ValueError, read_stanza, ['\0'])
++
++    def test_read_nul_bytes(self):
++        """File consisting of many nul bytes causes an error."""
++        self.assertRaises(ValueError, read_stanza, ['\0' * 100])
++
      def test_read_iter(self):
          """Read several stanzas from file"""
          tmpf = TemporaryFile()