Source side post Live Migration Logic cannot disconnect multipath iSCSI devices cleanly

Bug #1357368 reported by Jeegn Chen
This bug affects 2 people
Affects                    Status         Importance   Assigned to   Milestone
OpenStack Compute (nova)   Fix Released   Undecided    Jeegn Chen
  Juno                     Fix Released   Undecided    Unassigned
nova (Ubuntu)              Fix Released   Undecided    Unassigned
  Trusty                   Fix Released   Undecided    Unassigned

Bug Description

[Impact]

When a volume is attached to a VM on the source compute node through multipath, the related files in /dev/disk/by-path/ look like this:

stack@ubuntu-server12:~/devstack$ ls /dev/disk/by-path/*24
/dev/disk/by-path/ip-192.168.3.50:3260-iscsi-iqn.1992-04.com.emc:cx.fnm00124500890.a5-lun-24
/dev/disk/by-path/ip-192.168.4.51:3260-iscsi-iqn.1992-04.com.emc:cx.fnm00124500890.b4-lun-24

The corresponding multipath device looks like this:

stack@ubuntu-server12:~/devstack$ sudo multipath -l 3600601602ba03400921130967724e411
3600601602ba03400921130967724e411 dm-3 DGC,VRAID
size=1.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=-1 status=active
| `- 19:0:0:24 sdl 8:176 active undef running
`-+- policy='round-robin 0' prio=-1 status=enabled
  `- 18:0:0:24 sdj 8:144 active undef running

After the VM is migrated to the destination, the same volume can look like the following example, because we CANNOT guarantee that every node reaches the volume through the same iSCSI portals and the same target LUN number. This destination-side information overwrites connection_info in the DB before the post live migration logic is executed.

stack@ubuntu-server13:~/devstack$ ls /dev/disk/by-path/*24
/dev/disk/by-path/ip-192.168.3.51:3260-iscsi-iqn.1992-04.com.emc:cx.fnm00124500890.b5-lun-100
/dev/disk/by-path/ip-192.168.4.51:3260-iscsi-iqn.1992-04.com.emc:cx.fnm00124500890.b4-lun-100

stack@ubuntu-server13:~/devstack$ sudo multipath -l 3600601602ba03400921130967724e411
3600601602ba03400921130967724e411 dm-3 DGC,VRAID
size=1.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=-1 status=active
| `- 19:0:0:100 sdf 8:176 active undef running
`-+- policy='round-robin 0' prio=-1 status=enabled
  `- 18:0:0:100 sdg 8:144 active undef running

As a result, if the post live migration logic on the source side uses <IP>, <IQN> and <TARGET LUN Number> to find the devices to clean up, it may use the destination's values: 192.168.3.51, iqn.1992-04.com.emc:cx.fnm00124500890.b5 and 100.
However, the correct values on the source are 192.168.3.50, iqn.1992-04.com.emc:cx.fnm00124500890.a5 and 24.
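
The mismatch can be illustrated with a small Python sketch. It is not nova code: the regular expression and the helper name parse_by_path are assumptions made for illustration, and the entries are copied from the listings in this report. It splits each by-path entry into (portal, IQN, LUN) and shows that none of the source paths can be found from the destination's values.

```python
import re

# Assumed pattern for the iSCSI entries under /dev/disk/by-path/ shown in this report.
BY_PATH_RE = re.compile(
    r"ip-(?P<portal>[\d.]+:\d+)-iscsi-(?P<iqn>\S+)-lun-(?P<lun>\d+)$"
)


def parse_by_path(entry):
    """Split a by-path entry into (portal, iqn, lun). Hypothetical helper."""
    match = BY_PATH_RE.search(entry)
    if match is None:
        raise ValueError("not an iSCSI by-path entry: %r" % entry)
    return match.group("portal"), match.group("iqn"), int(match.group("lun"))


# Entries as listed on the source node (ubuntu-server12).
source_entries = [
    "/dev/disk/by-path/ip-192.168.3.50:3260-iscsi-iqn.1992-04.com.emc:cx.fnm00124500890.a5-lun-24",
    "/dev/disk/by-path/ip-192.168.4.51:3260-iscsi-iqn.1992-04.com.emc:cx.fnm00124500890.b4-lun-24",
]
# Entries as listed on the destination node (ubuntu-server13).
destination_entries = [
    "/dev/disk/by-path/ip-192.168.3.51:3260-iscsi-iqn.1992-04.com.emc:cx.fnm00124500890.b5-lun-100",
    "/dev/disk/by-path/ip-192.168.4.51:3260-iscsi-iqn.1992-04.com.emc:cx.fnm00124500890.b4-lun-100",
]

source_paths = {parse_by_path(e) for e in source_entries}
destination_paths = {parse_by_path(e) for e in destination_entries}

# Source paths that a cleanup keyed on the destination's values would never find.
missed = source_paths - destination_paths
for portal, iqn, lun in sorted(missed):
    print(portal, iqn, lun)
```

Even the portal the two nodes share (192.168.4.51:3260) is missed, because the LUN number differs between the nodes.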

The same approach as in https://bugs.launchpad.net/nova/+bug/1327497 can be used to fix it: use the multipath_id, which does not change between nodes, to find the correct devices to delete.
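
As a rough sketch of that idea (not the actual nova patch, which is in the commit linked further down), the devices behind a multipath ID can be read from the output of "multipath -l <id>" on the node itself instead of being derived from connection_info. The function name devices_for_multipath and the regular expression are assumptions for illustration; the sample text is the source-node output quoted above.

```python
import re

# Assumed shape of a path line, e.g. "| `- 19:0:0:24 sdl 8:176 active undef running".
PATH_LINE_RE = re.compile(r"(\d+):(\d+):(\d+):(\d+)\s+(sd[a-z]+)\s+\d+:\d+")


def devices_for_multipath(multipath_output):
    """Return [(device, lun), ...] for the path lines of one multipath map.

    Hypothetical helper: it only parses text already captured from
    `multipath -l <multipath_id>`; it does not run any command itself.
    """
    devices = []
    for line in multipath_output.splitlines():
        match = PATH_LINE_RE.search(line)
        if match:
            devices.append(("/dev/" + match.group(5), int(match.group(4))))
    return devices


# Output of `sudo multipath -l 3600601602ba03400921130967724e411` on the source node.
SOURCE_OUTPUT = """\
3600601602ba03400921130967724e411 dm-3 DGC,VRAID
size=1.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=-1 status=active
| `- 19:0:0:24 sdl 8:176 active undef running
`-+- policy='round-robin 0' prio=-1 status=enabled
  `- 18:0:0:24 sdj 8:144 active undef running
"""

for device, lun in devices_for_multipath(SOURCE_OUTPUT):
    print(device, lun)
```

Because the lookup starts from the multipath ID, it yields the source node's own devices and LUN number no matter what the destination wrote into connection_info.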

[Test Case]

Live migrate an instance that uses iSCSI multipath. Verify that the correct target is removed on the source hypervisor.
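
One possible way to script the verification step is sketched below in Python. The helper leftover_entries is hypothetical and only filters a list of entry names; on a real source hypervisor the names would presumably come from os.listdir("/dev/disk/by-path"), run once per target IQN the volume used there.

```python
def leftover_entries(entry_names, iqn, lun):
    """Return the by-path entry names that still reference the given IQN and LUN.

    Hypothetical helper for the test case: an empty result after the
    migration suggests the source-side paths were cleaned up.
    """
    suffix = "-iscsi-%s-lun-%d" % (iqn, lun)
    return sorted(name for name in entry_names if name.endswith(suffix))


# Entry names as they appeared on the source node before the migration.
before_migration = [
    "ip-192.168.3.50:3260-iscsi-iqn.1992-04.com.emc:cx.fnm00124500890.a5-lun-24",
    "ip-192.168.4.51:3260-iscsi-iqn.1992-04.com.emc:cx.fnm00124500890.b4-lun-24",
]
# What the listing should look like once cleanup has worked (illustrative).
after_migration = []

iqn_a5 = "iqn.1992-04.com.emc:cx.fnm00124500890.a5"
print(leftover_entries(before_migration, iqn_a5, 24))
print(leftover_entries(after_migration, iqn_a5, 24))
```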

[Regression Potential]

Low. The fix is already included in the next release (Juno). The change adds a check for a field that the Fibre Channel multipath connections already use but that the iSCSI multipath code path did not use on cleanup. If the check fails, the previous behavior remains: the iSCSI sessions/paths are not cleaned up.

Jeegn Chen (jeegn-chen)
Changed in nova:
assignee: nobody → Jeegn Chen (jeegn-chen)
Jeegn Chen (jeegn-chen)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/114539

Changed in nova:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/114539
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aa9104ccedb3ff13cc34a498b11f5e8ff100fd99
Submitter: Jenkins
Branch: master

commit aa9104ccedb3ff13cc34a498b11f5e8ff100fd99
Author: Jeegn Chen <email address hidden>
Date: Fri Aug 15 21:40:14 2014 +0800

    Clean up iSCSI multipath devices in Post Live Migration

    When a volume is attached to a VM in the source compute node through
    multipath, the related files in /dev/disk/by-path/ are like this

    stack@ubuntu-server12:~/devstack$ ls /dev/disk/by-path/*24
    /dev/disk/by-path/ip-192.168.3.50:3260-iscsi-iqn.1992-04.com.emc:cx.
    fnm00124500890.a5-lun-24
    /dev/disk/by-path/ip-192.168.4.51:3260-iscsi-iqn.1992-04.com.emc:cx.
    fnm00124500890.b4-lun-24

    The information on its corresponding multipath device is like this
    stack@ubuntu-server12:~/devstack$ sudo multipath -l 3600601602ba034
    00921130967724e411
    3600601602ba03400921130967724e411 dm-3 DGC,VRAID
    size=1.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    |-+- policy='round-robin 0' prio=-1 status=active
    | `- 19:0:0:24 sdl 8:176 active undef running
    `-+- policy='round-robin 0' prio=-1 status=enabled
      `- 18:0:0:24 sdj 8:144 active undef running

    But when the VM is migrated to the destination, the related information is
    like the following example since we CANNOT guarantee that all nodes are able
    to access the same iSCSI portals and the same target LUN number. And the
    information is used to overwrite connection_info in the DB before the post
    live migration logic is executed.

    stack@ubuntu-server13:~/devstack$ ls /dev/disk/by-path/*24
    /dev/disk/by-path/ip-192.168.3.51:3260-iscsi-iqn.1992-04.com.emc:cx.
    fnm00124500890.b5-lun-100
    /dev/disk/by-path/ip-192.168.4.51:3260-iscsi-iqn.1992-04.com.emc:cx.
    fnm00124500890.b4-lun-100

    stack@ubuntu-server13:~/devstack$ sudo multipath -l 3600601602ba034
    00921130967724e411
    3600601602ba03400921130967724e411 dm-3 DGC,VRAID
    size=1.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
    |-+- policy='round-robin 0' prio=-1 status=active
    | `- 19:0:0:100 sdf 8:176 active undef running
    `-+- policy='round-robin 0' prio=-1 status=enabled
      `- 18:0:0:100 sdg 8:144 active undef running

    As a result, if post live migration in source side uses <IP>, <IQN> and
    <TARGET LUN Number> to find the devices to clean up, it may use 192.168.3.51,
    iqn.1992-04.com.emc:cx.fnm00124500890.a5 and 100.
    However, the correct one should be 192.168.3.50, iqn.1992-04.com.emc:cx.
    fnm00124500890.a5 and 24.

    Similar philosophy in (https://bugs.launchpad.net/nova/+bug/1327497) can be
    used to fix it: Leverage the unchanged multipath_id to find correct devices
    to delete.

    Change-Id: I875293c3ade9423caa2b8afe9eca25a74606d262
    Closes-Bug: #1357368

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/134116

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/134116
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9c3ec16576e2f7c9d5aff6e4b620d708e6636568
Submitter: Jenkins
Branch: stable/juno

commit 9c3ec16576e2f7c9d5aff6e4b620d708e6636568
Author: Jeegn Chen <email address hidden>
Date: Fri Aug 15 21:40:14 2014 +0800

    Clean up iSCSI multipath devices in Post Live Migration

    Change-Id: I875293c3ade9423caa2b8afe9eca25a74606d262
    Closes-Bug: #1357368
    (cherry picked from commit aa9104ccedb3ff13cc34a498b11f5e8ff100fd99)

tags: added: in-stable-juno
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-1 → 2015.1.0
Martin Pitt (pitti) wrote :

There is an SRU waiting in the trusty-proposed queue for this. Please clarify in which Ubuntu release(s) this is already fixed, or upload the fix to yakkety, so that the trusty SRU can proceed.

Corey Bryant (corey.bryant) wrote :

Hi Martin. This was fixed upstream in nova for OpenStack Juno which mapped to Utopic. So it is already fixed in Utopic and all releases after that.

Martin Pitt (pitti) wrote : Please test proposed package

Hello Jeegn, or anyone else affected,

Accepted nova into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/1:2014.1.5-0ubuntu1.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu):
status: New → Fix Released
Changed in nova (Ubuntu Trusty):
status: New → Fix Committed
tags: added: verification-needed
description: updated
tags: removed: in-stable-juno
tags: added: in-stable-juno verification-done
removed: verification-needed
Martin Pitt (pitti) wrote : Update Released

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 1:2014.1.5-0ubuntu1.5

---------------
nova (1:2014.1.5-0ubuntu1.5) trusty; urgency=medium

  * Fix live migration usage of the wrong connector (LP: #1475411)
    - d/p/Fix-live-migrations-usage-of-the-wrong-connector-inf.patch
  * Fix wrong used ProcessExecutionError exception (LP: #1308839)
    - d/p/Fix-wrong-used-ProcessExecutionError-exception.patch
  * Clean up iSCSI multipath devices in Post Live Migration (LP: #1357368)
    - d/p/Clean-up-iSCSI-multipath-devices-in-Post-Live-Migrat.patch
  * Detach iSCSI latest path for latest disk (LP: #1374999)
    - d/p/Detach-iSCSI-latest-path-for-latest-disk.patch

 -- Billy Olsen <email address hidden> Fri, 29 Apr 2016 15:35:01 -0700

Changed in nova (Ubuntu Trusty):
status: Fix Committed → Fix Released