OpenStack Compute (nova)

Bug #1353939
Comment #31

OpenStack Infra (hudson-openstack) wrote on 2019-10-14: Related fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/668111
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=16b0fb01afac7b81094aa891588f4bd9017ee235
Submitter: Zuul
Branch: stable/queens

commit 16b0fb01afac7b81094aa891588f4bd9017ee235
Author: Kashyap Chamarthy <email address hidden>
Date: Mon Feb 25 13:26:24 2019 +0100

libvirt: Rework 'EBUSY' (SIGKILL) error handling code path

    Change ID I128bf6b939 (libvirt: handle code=38 + sigkill (ebusy) in
    _destroy()) handled the case where a QEMU process "refuses to die" within
    a given timeout period set by libvirt.

    Originally, libvirt sent SIGTERM (allowing the process to clean up
    resources) and waited 10 seconds. If the guest didn't go away, it
    then sent the more lethal SIGKILL and waited another 5 seconds for it
    to take effect.

    From libvirt v4.7.0 onwards, libvirt increased[1][2] the time it waits
    for a guest hard shutdown to complete. It now waits for 30 seconds for
    SIGKILL to work (instead of 5). Also, additional wait time is added if
    there are assigned PCI devices, as some of those tend to slow things
    down.

    In this change:

      - Increase the number of _destroy() retries from 3 to 6, thus
        increasing the total time from 15 to 30 seconds before SIGKILL
        takes effect. This matches the (more graceful) behaviour of
        libvirt v4.7.0. It also gives breathing room for Nova instances
        running in environments with large compute nodes and high instance
        creation or deletion churn, where the current timeout may not be
        sufficient.

      - Retry the _destroy() API call _only_ if MIN_LIBVIRT_VERSION is lower
        than 4.7.0.

    [1] https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=9a4e4b9
        (process: wait longer 5->30s on hard shutdown)
    [2] https://libvirt.org/git/?p=libvirt.git;a=commit;h=be2ca04 ("process:
        wait longer on kill per assigned Hostdev")

    Conflicts:
        nova/virt/libvirt/driver.py

        (Trivial conflict: Rocky didn't have the QEMU-native TLS feature
        yet.)

    Conflicts (stable/queens):
        nova/tests/unit/virt/libvirt/test_driver.py

    Related-bug: #1353939

    Change-Id: If2035cac931c42c440d61ba97ebc7e9e92141a28
    Signed-off-by: Kashyap Chamarthy <email address hidden>
    (cherry picked from commit 10d50ca4e210039aeae84cb9bd5d18895948af54)
    (cherry picked from commit 75985e25bc147369efb90d4fa9f046631766c14c)
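
The retry policy described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the nova change: GuestBusyError, destroy_with_retry and the constant names are hypothetical, the version is passed in as a plain tuple in place of whatever version check the real driver performs, and the 5-second figure is the pre-4.7.0 SIGKILL wait quoted in the commit message.

```python
# Hypothetical stand-ins; none of these names come from nova or libvirt.
SIGKILL_WAIT_SECONDS = 5           # pre-4.7.0 libvirt wait after SIGKILL
OLD_ATTEMPTS = 3                   # retry count before this change
NEW_ATTEMPTS = 6                   # retry count after this change
LONGER_WAIT_VERSION = (4, 7, 0)    # libvirt version that waits 30s itself


class GuestBusyError(Exception):
    """Stand-in for the libvirt code=38 'EBUSY' failure after SIGKILL."""


def destroy_with_retry(destroy, libvirt_version, attempts=NEW_ATTEMPTS):
    """Call destroy(), retrying on EBUSY only for libvirt older than 4.7.0.

    Returns the number of calls that were needed. Re-raises the last
    GuestBusyError once the attempts are exhausted.
    """
    if libvirt_version >= LONGER_WAIT_VERSION:
        # Newer libvirt already waits 30 seconds, so do not retry on top.
        attempts = 1
    for attempt in range(1, attempts + 1):
        try:
            destroy()
            return attempt
        except GuestBusyError:
            if attempt == attempts:
                raise


# Total time budget implied by the retry count on older libvirt.
old_budget = OLD_ATTEMPTS * SIGKILL_WAIT_SECONDS
new_budget = NEW_ATTEMPTS * SIGKILL_WAIT_SECONDS
```

With these assumptions, a guest that only goes away on the fourth attempt would have failed under the old limit of 3 but succeeds under the new limit of 6, while on libvirt 4.7.0 or later the first EBUSY error is propagated immediately.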
