2.6.19: EFAIL on MPATH failback

From: Charles C. Bennett, Jr.
Date: Tue Feb 13 2007 - 09:50:34 EST



Hi All -

I have two systems running 2.6.19 from fedora-updates.

System 1: vanilla x86 SMP server box with 2x Emulex LP-101
System 2: vanilla x86 SMP server box with 2x QLogic 2310F

Both boxes are multipath to Hitachi USP TagmaStore. All LUNs including
LUN0 (/, /boot, swap, /var via dm-linear/kpartx) are multipathed.

Userspace multipath-tools is 0.4.7, latest from fedora-updates.

With no IO load, multipath failover and failback work flawlessly after
fibre pull/reinsert.

Under IO load, failback fails on both nodes.

With QLogic:
Feb 12 12:48:12 arc-node-112 kernel: qla2xxx 0000:06:01.0: LOOP DOWN
detected (4
).
Feb 12 12:48:47 arc-node-112 kernel: rport-1:0-0: blocked FC remote
port time o
ut: removing target and saving binding
Feb 12 12:48:47 arc-node-112 kernel: Synchronizing SCSI cache for disk
sde:
Feb 12 12:48:47 arc-node-112 kernel: FAILED
Feb 12 12:48:47 arc-node-112 kernel: status = 0, message = 00, host =
1, drive
r = 00
Feb 12 12:48:47 arc-node-112 kernel: <5>Synchronizing SCSI cache for
disk sdf:

Feb 12 12:48:47 arc-node-112 kernel: FAILED
Feb 12 12:48:47 arc-node-112 kernel: status = 0, message = 00, host =
1, drive
r = 00
Feb 12 12:48:47 arc-node-112 kernel: <5>Synchronizing SCSI cache for
disk sdg:

Feb 12 12:48:47 arc-node-112 kernel: FAILED
Feb 12 12:48:47 arc-node-112 kernel: status = 0, message = 00, host =
1, drive
r = 00
Feb 12 12:48:47 arc-node-112 kernel: <5>Synchronizing SCSI cache for
disk sdh:

Feb 12 12:48:47 arc-node-112 kernel: FAILED
Feb 12 12:48:47 arc-node-112 kernel: status = 0, message = 00, host =
1, drive
r = 00
Feb 12 12:49:15 arc-node-112 kernel: <6>qla2xxx 0000:06:01.0: LIP
reset occure
d (f7f7).
Feb 12 12:49:16 arc-node-112 kernel: qla2xxx 0000:06:01.0: LOOP UP
detected (2 G
bps).
Feb 12 12:49:18 arc-node-112 kernel: scsi 1:0:0:0: Direct-Access
HITACHI OP
EN-V 5004 PQ: 0 ANSI: 3
Feb 12 12:49:18 arc-node-112 kernel: SCSI device sdi: 211077120 512-byte
hdwr se
ctors (108071 MB)
Feb 12 12:49:18 arc-node-112 kernel: sdi: Write Protect is off
Feb 12 12:49:18 arc-node-112 kernel: SCSI device sdi: drive cache: write
back
Feb 12 12:49:18 arc-node-112 kernel: SCSI device sdi: 211077120 512-byte
hdwr se
ctors (108071 MB)
Feb 12 12:49:18 arc-node-112 kernel: sdi: Write Protect is off
Feb 12 12:49:18 arc-node-112 kernel: SCSI device sdi: drive cache: write
back
Feb 12 12:49:18 arc-node-112 kernel: sdi: sdi1 sdi2 sdi3 sdi4 < sdi5
sdi6 sdi7
sdi8 >
Feb 12 12:49:18 arc-node-112 kernel: sd 1:0:0:0: Attached scsi disk sdi
Feb 12 12:49:18 arc-node-112 kernel: sd 1:0:0:0: Attached scsi generic
sg4 type
0
Feb 12 12:49:18 arc-node-112 kernel: scsi 1:0:0:0: Direct-Access
HITACHI OP
EN-V 5004 PQ: 0 ANSI: 3
Feb 12 12:49:18 arc-node-112 kernel: kobject_add failed for 1:0:0:0 with
-EEXIST
, don't try to register things with the same name in the same directory.
Feb 12 12:49:18 arc-node-112 kernel: [<c0405018>] dump_trace+0x69/0x1b6
Feb 12 12:49:18 arc-node-112 kernel: [<c040517d>] show_trace_log_lvl
+0x18/0x2c
Feb 12 12:49:18 arc-node-112 kernel: [<c0405778>] show_trace+0xf/0x11
Feb 12 12:49:18 arc-node-112 kernel: [<c0405875>] dump_stack+0x15/0x17
Feb 12 12:49:18 arc-node-112 kernel: [<c04eef7a>] kobject_add
+0x16d/0x196
Feb 12 12:49:19 arc-node-112 kernel: [<c055cdd3>] device_add+0x9f/0x46f
Feb 12 12:49:19 arc-node-112 kernel: [<f8863168>] scsi_sysfs_add_sdev
+0x2a/0x1d
f [scsi_mod]
Feb 12 12:49:19 arc-node-112 kernel: [<f8861847>]
scsi_probe_and_add_lun+0x82e/
0x93e [scsi_mod]
Feb 12 12:49:19 arc-node-112 kernel: [<f8862259>] __scsi_scan_target
+0x447/0x60
a [scsi_mod]
Feb 12 12:49:19 arc-node-112 kernel: [<f88626c0>] scsi_scan_target
+0x69/0x7b [s
csi_mod]
Feb 12 12:49:19 arc-node-112 kernel: [<f88b5a13>] fc_scsi_scan_rport
+0x53/0x71
[scsi_transport_fc]
Feb 12 12:49:19 arc-node-112 kernel: [<c04368c7>] run_workqueue
+0x97/0xdd
Feb 12 12:49:19 arc-node-112 kernel: [<c0437284>] worker_thread
+0xd9/0x10d
Feb 12 12:49:19 arc-node-112 kernel: [<c0439810>] kthread+0xc0/0xec
Feb 12 12:49:19 arc-node-112 kernel: [<c0404c03>] kernel_thread_helper
+0x7/0x10
Feb 12 12:49:19 arc-node-112 kernel: =======================
Feb 12 12:49:19 arc-node-112 kernel: error 1
Feb 12 12:49:19 arc-node-112 kernel: scsi 1:0:0:0: Unexpected response
from lun
0 while scanning, scan aborted

Similar on Emulex:
Feb 12 12:52:03 arc-node-109 kernel: lpfc 0000:06:01.0: 1:1305 Link Down
Event x2 received Data: x2 x20 x110
Feb 12 12:52:03 arc-node-109 kernel: lpfc 0000:06:01.0: 1:1305 Link Down
Event x4 received Data: x4 x4 x100
Feb 12 12:52:29 arc-node-109 mgetty[3772]: failed dev=ttyS0, pid=3772,
login time out
Feb 12 12:52:33 arc-node-109 kernel: rport-1:0-2: blocked FC remote
port time out: removing target and saving binding
Feb 12 12:52:33 arc-node-109 kernel: lpfc 0000:06:01.0: 1:0203 Devloss
timeout on WWPN 50:6:e:80:4:2b:e5:44 NPort x102ae Data: x8 x7 x4
Feb 12 12:52:33 arc-node-109 kernel: Synchronizing SCSI cache for disk
sde:
Feb 12 12:52:33 arc-node-109 kernel: FAILED
Feb 12 12:52:33 arc-node-109 kernel: status = 0, message = 00, host =
1, driver = 00
Feb 12 12:52:33 arc-node-109 kernel: <5>Synchronizing SCSI cache for
disk sdf:
Feb 12 12:52:33 arc-node-109 kernel: FAILED
Feb 12 12:52:33 arc-node-109 kernel: status = 0, message = 00, host =
1, driver = 00
Feb 12 12:52:33 arc-node-109 kernel: <5>Synchronizing SCSI cache for
disk sdg:
Feb 12 12:52:33 arc-node-109 kernel: FAILED
Feb 12 12:52:33 arc-node-109 kernel: status = 0, message = 00, host =
1, driver = 00
Feb 12 12:52:33 arc-node-109 kernel: <5>Synchronizing SCSI cache for
disk sdh:
Feb 12 12:52:33 arc-node-109 kernel: FAILED
Feb 12 12:52:33 arc-node-109 kernel: status = 0, message = 00, host =
1, driver = 00
Feb 12 12:53:00 arc-node-109 kernel: <3>lpfc 0000:06:01.0: 1:1303 Link
Up Event x5 received Data: x5 x0 x8 x2
Feb 12 12:53:02 arc-node-109 kernel: scsi 1:0:0:0: Direct-Access
HITACHI OPEN-V 5004 PQ: 0 ANSI: 3
Feb 12 12:53:03 arc-node-109 kernel: SCSI device sdi: 211077120 512-byte
hdwr sectors (108071 MB)
Feb 12 12:53:03 arc-node-109 kernel: sdi: Write Protect is off
Feb 12 12:53:03 arc-node-109 kernel: SCSI device sdi: drive cache: write
back
Feb 12 12:53:03 arc-node-109 kernel: SCSI device sdi: 211077120 512-byte
hdwr sectors (108071 MB)
Feb 12 12:53:03 arc-node-109 kernel: sdi: Write Protect is off
Feb 12 12:53:03 arc-node-109 kernel: SCSI device sdi: drive cache: write
back
Feb 12 12:53:03 arc-node-109 kernel: sdi: sdi1 sdi2 sdi3 sdi4 < sdi5
sdi6 sdi7 sdi8 >
Feb 12 12:53:03 arc-node-109 kernel: sd 1:0:0:0: Attached scsi disk sdi
Feb 12 12:53:03 arc-node-109 kernel: sd 1:0:0:0: Attached scsi generic
sg4 type 0
Feb 12 12:53:03 arc-node-109 kernel: scsi 1:0:0:0: Direct-Access
HITACHI OPEN-V 5004 PQ: 0 ANSI: 3
Feb 12 12:53:03 arc-node-109 kernel: kobject_add failed for 1:0:0:0 with
-EEXIST, don't try to register things with the same name in the same
directory.
Feb 12 12:53:03 arc-node-109 kernel: [<c0405018>] dump_trace+0x69/0x1b6
Feb 12 12:53:03 arc-node-109 kernel: [<c040517d>] show_trace_log_lvl
+0x18/0x2c
Feb 12 12:53:03 arc-node-109 kernel: [<c0405778>] show_trace+0xf/0x11
Feb 12 12:53:03 arc-node-109 kernel: [<c0405875>] dump_stack+0x15/0x17
Feb 12 12:53:03 arc-node-109 kernel: [<c04eef7a>] kobject_add
+0x16d/0x196
Feb 12 12:53:03 arc-node-109 kernel: [<c055cdd3>] device_add+0x9f/0x46f
Feb 12 12:53:03 arc-node-109 kernel: [<f8863168>] scsi_sysfs_add_sdev
+0x2a/0x1df [scsi_mod]
Feb 12 12:53:03 arc-node-109 kernel: [<f8861847>]
scsi_probe_and_add_lun+0x82e/0x93e [scsi_mod]
Feb 12 12:53:03 arc-node-109 kernel: [<f8862259>] __scsi_scan_target
+0x447/0x60a [scsi_mod]
Feb 12 12:53:04 arc-node-109 kernel: [<f88626c0>] scsi_scan_target
+0x69/0x7b [scsi_mod]
Feb 12 12:53:04 arc-node-109 kernel: [<f88b5a13>] fc_scsi_scan_rport
+0x53/0x71 [scsi_transport_fc]
Feb 12 12:53:04 arc-node-109 kernel: [<c04368c7>] run_workqueue
+0x97/0xdd
Feb 12 12:53:04 arc-node-109 kernel: [<c0437284>] worker_thread
+0xd9/0x10d
Feb 12 12:53:04 arc-node-109 kernel: [<c0439810>] kthread+0xc0/0xec
Feb 12 12:53:04 arc-node-109 kernel: [<c0404c03>] kernel_thread_helper
+0x7/0x10
Feb 12 12:53:04 arc-node-109 kernel: =======================
Feb 12 12:53:04 arc-node-109 kernel: error 1
Feb 12 12:53:04 arc-node-109 kernel: scsi 1:0:0:0: Unexpected response
from lun 0 while scanning, scan aborted


Have I configured something poorly? Are there known fixes out there?


Thanks in Advance,
ccb

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/