[GIT PULL] libsas fixes for 3.4-rc4

From: Dan Williams
Date: Fri Apr 20 2012 - 18:29:16 EST

Next message: Andrew Morton: "Re: [RFC 2/3] sched: add type checks to for_each_cpu_mask()"
Previous message: Alan Cox: "Re: [PATCH v2] [SCSI] scsi_dh: change scsi_dh_detach export to EXPORT_SYMBOL"
Next in thread: James Bottomley: "Re: [GIT PULL] libsas fixes for 3.4-rc4"

Linus,

The following changes since commit cd8df932d894f3128c884e3ae1b2b484540513db:

[SCSI] qla4xxx: Update driver version to 5.02.00-k15 (2012-02-29 17:03:03 -0600)

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/isci.git tags/libsas-fixes

for you to fetch changes up to 3385b6baa9f3bbf69d4c1fc58342936e75d095b1:

Revert "[SCSI] libsas: fix sas port naming" (2012-04-19 23:48:12 -0700)

----------------------------------------------------------------
libsas-fixes for 3.4-rc4

Regression fixes to stabilize the new workqueue and ata asynchronous
error handling implementation that was merged for v3.4-rc1.

1/ fix regression in sas_drain_work() which was stomping on 'work'
entries while the workqueue was manipulating them. User sees
random crashes when trying to use scsi_transport_sas attributes for
resets, or during discovery.

2/ (2) fixes for long-standing bugs related to the fact that libata
(inventor and primary host_eh_scheduled user) had built-in assumptions
of a 1:1 Scsi_Host-to-ata_port relationship. The libsas 1:N arrangement
magnified these problems when it gained async eh and began scheduling
eh in more scenarios (sas-transport resets) in 3.4-rc1.

3/ lifetime fixes for the rphy since code that has a domain_device
reference expects to be able to de-reference rphy parameters.

4/ (3) fixes for expander discovery bugs: a recent regression with
ata-eh clobbering expander-phy data as it polled, leading to system
crashes; a long-standing bug that made libsas incompatible with
expanders that advertise "PHY_VACANT" in low-order phy indexes; and a
quirk for expanders that sometimes fail to zero the sas address when
no device is attached.

5/ fix for a long-standing bug whereby hotunplug events during initial
host scan can cause a system crash

6/ fix for a mvsas regression caused by the new end-device naming in
libsas making the incorrect assumption that all phy ids
exported by an lldd are unique.

----------------------------------------------------------------

These patches, save for the new "scsi: fix eh wakeup (scsi_schedule_eh
vs scsi_restart_operations)" and 'Revert "[SCSI] libsas: fix sas port
naming"', were all originally posted before the merge
window opened, and have also appeared in -next for the same timeframe.

The commit dates are not that aged (9 days old) because they were
rebased out of a larger set of updates that were pending for 3.4.

There is a mix of pure regression fixes and fixes for long-standing bugs
in libsas. Some of the long-standing bugs are made worse / easier
to trigger by the new async error handling scheme.

The largest patch in the series is "libata, libsas: introduce sched_eh
and end_eh port ops"; it has been on the list since March 10th.

Jack Wang has independently tested this set with pm8001 and reports
success. [1]

Apologies if scsi-rc-fixes was in the process of picking these up. With
-rc4 looming I lost my nerve and pulled the trigger.

--
Dan

[1]: http://www.spinics.net/lists/linux-scsi/msg58761.html

Dan Williams (11):
libsas: introduce sas_work to fix sas_drain_work vs sas_queue_work
libata, libsas: introduce sched_eh and end_eh port ops
libsas: fix sas_get_port_device regression
libsas: unify domain_device sas_rphy lifetimes
libsas: fix ata_eh clobbering ex_phys via smp_ata_check_ready
libata: make ata_print_id atomic
libsas, libata: fix start of life for a sas ata_port
scsi: fix eh wakeup (scsi_schedule_eh vs scsi_restart_operations)
libsas: fix false positive 'device attached' conditions
scsi_transport_sas: fix delete vs scan race
Revert "[SCSI] libsas: fix sas port naming"

Maciej Trela (1):
libsas: cleanup spurious calls to scsi_schedule_eh

Thomas Jackson (1):
libsas: fix sas_find_bcast_phy() in the presence of 'vacant' phys

drivers/ata/libata-core.c | 8 +++-
drivers/ata/libata-eh.c | 57 +++++++++++++++++++++------
drivers/ata/libata-scsi.c | 35 +++++++++--------
drivers/ata/libata.h | 2 +-
drivers/scsi/ipr.c | 6 ++-
drivers/scsi/libsas/sas_ata.c | 72 +++++++++++++++++++++--------------
drivers/scsi/libsas/sas_discover.c | 67 ++++++++++++++++++--------------
drivers/scsi/libsas/sas_event.c | 36 +++++++++---------
drivers/scsi/libsas/sas_expander.c | 56 +++++++++++++++++++++------
drivers/scsi/libsas/sas_init.c | 25 ++++++------
drivers/scsi/libsas/sas_internal.h | 6 +--
drivers/scsi/libsas/sas_phy.c | 21 ++++------
drivers/scsi/libsas/sas_port.c | 17 +++------
drivers/scsi/libsas/sas_scsi_host.c | 28 ++++++++++----
drivers/scsi/scsi_error.c | 14 +++++++
drivers/scsi/scsi_transport_sas.c | 6 ++-
include/linux/libata.h | 7 +++-
include/scsi/libsas.h | 44 ++++++++++++++++++---
include/scsi/sas_ata.h | 9 ++++-
19 files changed, 344 insertions(+), 172 deletions(-)

commit 3385b6baa9f3bbf69d4c1fc58342936e75d095b1
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Thu Apr 19 23:48:12 2012 -0700

Revert "[SCSI] libsas: fix sas port naming"

This reverts commit a692b0eec5efae382dfa800e8b4b083f172921a7.

Tom reports:

[ 8.741033] ------------[ cut here ]------------
[ 8.741038] WARNING: at fs/sysfs/dir.c:508 sysfs_add_one+0xc1/0xf0()
[ 8.741040] Hardware name: To Be Filled By O.E.M.
[ 8.741041] sysfs: cannot create duplicate filename

...and missing 2 out of 4 drives connected to mvsas. Commit a692b0ee
made the assumption that all the phy ids an lldd registers to libsas are
unique. However, in the "multi-chip" case mvsas does a rather annoying
duplication of phy ids in the array passed to libsas. So, for example,
chip0 has phy0-3 at ha phy index 0-3 and chip1 has its phy0-3 at ha phy
index 4-7. The more natural model would be to create a scsi_host (and
sas_ha) per chip (controller), but for now revert the naming fix which
unfortunately means dealing with unpredictable end-device names for a
bit longer.

Cc: Xiangliang Yu <yuxiangl@xxxxxxxxxxx>
Cc: Patrick Thomson <patrick.s.thomson@xxxxxxxxx>
Reported-by: Tom Rini <trini@xxxxxx>
Tested-by: Tom Rini <trini@xxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
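
The collision described above can be reproduced in a small userspace
model. This is only a sketch: struct mc_phy, fill_multi_chip() and
count_duplicate_names() are hypothetical names, and the "port-%d:%d"
format is an assumption used to stand in for a name derived from the
phy id alone. It is not the libsas or mvsas code.

```c
#include <stdio.h>
#include <string.h>

#define HA_NUM_PHYS   8   /* two chips x four phys, as in the example above */
#define PHYS_PER_CHIP 4

/* Hypothetical model of one entry in the phy array an lldd hands to libsas. */
struct mc_phy {
    int ha_index;   /* position in the host adapter's phy array: 0-7 */
    int id;         /* id reported by the driver: 0-3, repeated per chip */
};

/* Fill the array the way the multi-chip case is described: ids repeat per chip. */
static void fill_multi_chip(struct mc_phy *phys)
{
    for (int i = 0; i < HA_NUM_PHYS; i++) {
        phys[i].ha_index = i;
        phys[i].id = i % PHYS_PER_CHIP;
    }
}

/* Count names that collide when each name is built from phy->id alone. */
static int count_duplicate_names(const struct mc_phy *phys, int host_no)
{
    char names[HA_NUM_PHYS][32];
    int dups = 0;

    for (int i = 0; i < HA_NUM_PHYS; i++) {
        snprintf(names[i], sizeof(names[i]), "port-%d:%d", host_no, phys[i].id);
        for (int j = 0; j < i; j++) {
            if (strcmp(names[i], names[j]) == 0) {
                dups++;   /* an earlier entry already claimed this name */
                break;
            }
        }
    }
    return dups;
}
```

In this model every phy on the second chip reuses a name already taken
by the first chip, which is the kind of duplicate that sysfs rejects.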

commit e81dcce46fdbb2c968d7314c2f19da3c2bba24d1
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Tue Mar 20 10:58:38 2012 -0700

scsi_transport_sas: fix delete vs scan race

The following crash results from cases where the end_device has been
removed before scsi_sysfs_add_sdev has had a chance to run.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
IP: [<ffffffff8115e100>] sysfs_create_dir+0x32/0xb6
...
Call Trace:
[<ffffffff8125e4a8>] kobject_add_internal+0x120/0x1e3
[<ffffffff81075149>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff8125e641>] kobject_add_varg+0x41/0x50
[<ffffffff8125e70b>] kobject_add+0x64/0x66
[<ffffffff8131122b>] device_add+0x12d/0x63a
[<ffffffff814b65ea>] ? _raw_spin_unlock_irqrestore+0x47/0x56
[<ffffffff8107de15>] ? module_refcount+0x89/0xa0
[<ffffffff8132f348>] scsi_sysfs_add_sdev+0x4e/0x28a
[<ffffffff8132dcbb>] do_scan_async+0x9c/0x145

...teach sas_rphy_remove to wait for async scanning to quiesce before
removing the end_device. It seems this is a more general problem [1],
but this patch only addresses sas transport.

[1]: 23edb6e [SCSI] mpt2sas: Do not set sas_device->starget to NULL from
the slave_destroy callback when all the LUNS have been deleted

Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit 55c53f6aed389e9e789df8d8e65d728ac125dba1
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Tue Mar 20 10:50:27 2012 -0700

libsas: fix false positive 'device attached' conditions

Normalize phy->attached_sas_addr to return a zero-address in the case
when device-type == NO_DEVICE or the linkrate is invalid to handle
expanders that put non-zero sas addresses in the discovery response:

sas: ex 5001b4da000f903f phy02:U:0 attached: 0100000000000000 (no device)
sas: ex 5001b4da000f903f phy01:U:0 attached: 0100000000000000 (no device)
sas: ex 5001b4da000f903f phy03:U:0 attached: 0100000000000000 (no device)
sas: ex 5001b4da000f903f phy00:U:0 attached: 0100000000000000 (no device)

Reported-by: Andrzej Jakowski <andrzej.jakowski@xxxxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
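
The normalization rule can be sketched in userspace C as follows. The
struct, the enum values and the helper names are hypothetical stand-ins
chosen for illustration; only the rule itself (zero the address when
the device type is "no device" or the link rate is not a valid
negotiated rate) comes from the description above.

```c
#include <stdint.h>
#include <string.h>

#define SAS_ADDR_SIZE 8

/* Hypothetical stand-ins for the libsas device-type and link-rate values. */
enum model_dev_type { MODEL_NO_DEVICE = 0, MODEL_END_DEVICE = 1 };
enum model_linkrate {
    MODEL_RATE_UNKNOWN  = 0,
    MODEL_RATE_DISABLED = 1,
    MODEL_RATE_1_5_GBPS = 8,   /* first rate treated as valid in this sketch */
    MODEL_RATE_3_0_GBPS = 9,
};

struct model_ex_phy {
    enum model_dev_type attached_dev_type;
    enum model_linkrate linkrate;
    uint8_t attached_sas_addr[SAS_ADDR_SIZE];
};

/* Zero the attached address whenever the phy cannot have a device behind it. */
static void normalize_attached_addr(struct model_ex_phy *phy)
{
    if (phy->attached_dev_type == MODEL_NO_DEVICE ||
        phy->linkrate < MODEL_RATE_1_5_GBPS)
        memset(phy->attached_sas_addr, 0, SAS_ADDR_SIZE);
}

/* Returns 1 when the attached address is all zeroes. */
static int addr_is_zero(const struct model_ex_phy *phy)
{
    static const uint8_t zero[SAS_ADDR_SIZE];

    return memcmp(phy->attached_sas_addr, zero, SAS_ADDR_SIZE) == 0;
}
```

With this in place a response such as "attached: 0100000000000000 (no
device)" is treated the same as an all-zero address by later checks.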

commit fcc1ce20ffbc553b25b6c635f4bb838940f58d2d
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Fri Apr 6 16:35:36 2012 -0700

scsi: fix eh wakeup (scsi_schedule_eh vs scsi_restart_operations)

Rapid ata hotplug on a libsas controller results in cases where libsas
is waiting indefinitely on eh to perform an ata probe.

A race exists between scsi_schedule_eh() and scsi_restart_operations()
in the case when scsi_restart_operations() issues i/o to other devices
in the sas domain. When this happens the host state transitions from
SHOST_RECOVERY (set by scsi_schedule_eh) back to SHOST_RUNNING and
->host_busy is non-zero so we put the eh thread to sleep even though
->host_eh_scheduled is active.

Before putting the error handler to sleep we need to check if the
host_state needs to return to SHOST_RECOVERY for another trip through
eh.

Cc: Tejun Heo <tj@xxxxxxxxxx>
Reported-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
Tested-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
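
A single-threaded model of the race, for illustration only. The struct
and function names are hypothetical, locking is omitted, and the
"recheck" flag simply switches the described fix on and off; this is
not the scsi_error.c code.

```c
#include <stdbool.h>

enum model_host_state { MODEL_SHOST_RUNNING, MODEL_SHOST_RECOVERY };

/* Hypothetical subset of the host fields named in the description above. */
struct model_host {
    enum model_host_state state;
    unsigned int host_busy;          /* commands currently outstanding */
    unsigned int host_eh_scheduled;  /* explicit eh requests not yet serviced */
};

/*
 * End of one recovery pass: the host leaves recovery and i/o to other
 * devices is restarted. With recheck set, a pending eh request puts the
 * host straight back into recovery before the eh thread goes to sleep.
 */
static void model_restart_operations(struct model_host *h,
                                     unsigned int restarted_io, bool recheck)
{
    h->state = MODEL_SHOST_RUNNING;
    h->host_busy += restarted_io;

    if (recheck && h->host_eh_scheduled)
        h->state = MODEL_SHOST_RECOVERY;
}

/* An eh request is stranded when it is pending but the host is not in recovery. */
static bool model_eh_request_stranded(const struct model_host *h)
{
    return h->host_eh_scheduled > 0 && h->state != MODEL_SHOST_RECOVERY;
}
```

Without the recheck, a request that arrived during the pass is left
pending on a running host with non-zero host_busy, which matches the
indefinite wait described above.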

commit fcf62bdd26101fe6ae8760c5e9eb4d5e49e0a5ec
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Wed Mar 21 21:09:07 2012 -0700

libsas, libata: fix start of life for a sas ata_port

This changes the ordering of initialization and probing events from:
1/ allocate rphy in PORTE_BYTES_DMAED, DISCE_REVALIDATE_DOMAIN
2/ allocate ata_port and schedule port probe in DISCE_PROBE
...to:
1/ allocate ata_port in PORTE_BYTES_DMAED, DISCE_REVALIDATE_DOMAIN
2/ allocate rphy in PORTE_BYTES_DMAED, DISCE_REVALIDATE_DOMAIN
3/ schedule port probe in DISCE_PROBE

This ordering prevents PHYE_SIGNAL_LOSS_EVENTS from sneaking in to
destroy ata devices before they have been fully initialized:

BUG: unable to handle kernel paging request at 0000000000003b10
IP: [<ffffffffa0053d7e>] sas_ata_end_eh+0x12/0x5e [libsas]
...
[<ffffffffa004d1af>] sas_unregister_common_dev+0x78/0xc9 [libsas]
[<ffffffffa004d4d4>] sas_unregister_dev+0x4f/0xad [libsas]
[<ffffffffa004d5b1>] sas_unregister_domain_devices+0x7f/0xbf [libsas]
[<ffffffffa004c487>] sas_deform_port+0x61/0x1b8 [libsas]
[<ffffffffa004bed0>] sas_phye_loss_of_signal+0x29/0x2b [libsas]

...and kills the awkward "sata domain_device briefly existing in the
domain without an ata_port" state.

Reported-by: Michal Kosciowski <michal.kosciowski@xxxxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit cb7e940b56fc8a67a6a17bc7935268f7b128f90d
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Wed Mar 21 21:09:05 2012 -0700

libata: make ata_print_id atomic

This variable is incremented from multiple contexts (module_init via
libata-lldds and the libsas discovery thread). Make it atomic to head
off any chance of libsas and libata creating duplicate ids.

Acked-by: Jacek Danecki <jacek.danecki@xxxxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
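
The idea can be shown with C11 atomics in userspace. This is a sketch
under assumptions: model_print_id and model_next_print_id() are
hypothetical names, and kernel code would use the kernel's own atomic_t
helpers (for example atomic_inc_return()) rather than <stdatomic.h>.

```c
#include <stdatomic.h>

/* Shared counter; every caller, whatever its context, draws ids from here. */
static atomic_int model_print_id = 0;

/*
 * Hand out the next id. The read-modify-write is a single atomic step,
 * so two concurrent callers can never observe the same value.
 */
static int model_next_print_id(void)
{
    return atomic_fetch_add(&model_print_id, 1) + 1;
}
```

A plain "id = ++counter" is a separate load, add and store, which is
what allows two contexts to end up with the same id.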

commit 6ec4dacc7c11b5999abe78f9a7e0125062b1d660
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Tue Mar 20 13:24:29 2012 -0700

libsas: fix ata_eh clobbering ex_phys via smp_ata_check_ready

The check_ready implementation in the expander-attached ata device case
polls on sas_ex_phy_discover(). The effect is that the ex_phy fields
(critically ->attached_sas_addr) can change. When ata_eh ends and
libsas comes along to revalidate the domain
sas_unregister_devs_sas_addr() can fail to lookup devices to remove, or
fail to re-add an ata device that ata_eh marked as disabled. So change
the code to skip the sas_address and change count updates when ata_eh is
active.

Cc: Jack Wang <jack_wang@xxxxxxxxx>
Tested-by: Maciej Patelczyk <maciej.patelczyk@xxxxxxxxx>
Tested-by: Bartek Nowakowski <bartek.nowakowski@xxxxxxxxx>
Tested-by: Jacek Danecki <jacek.danecki@xxxxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit db25a56d901cfc259240d6b6cf999170d7f35fff
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Tue Mar 20 10:53:24 2012 -0700

libsas: unify domain_device sas_rphy lifetimes

Since the domain_device can outlive the scsi_target we need the rphy to
follow suit otherwise we run into issues like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
IP: [<ffffffffa011561b>] sas_ata_printk+0x43/0x6f [libsas]
PGD 0
Oops: 0000 [#1] SMP
CPU 1
Modules linked in: ses enclosure isci libsas scsi_transport_sas fuse sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf microcode pcspkr igb joydev iTCO_wdt ioatdma iTCO_vendor_support i2c_i801 i2c_core dca wmi hed ipv6 pata_acpi ata_generic [last unloaded: scsi_wait_scan]

Pid: 129, comm: kworker/u:3 Not tainted 3.3.0-rc5-isci+ #1 Intel Corporation SandyBridge Platform/To be filled by O.E.M.
RIP: 0010:[<ffffffffa011561b>] [<ffffffffa011561b>] sas_ata_printk+0x43/0x6f [libsas]
RSP: 0018:ffff88042232dd70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff8804283165b8 RCX: ffff88042232dda0
RDX: ffff88042232dd78 RSI: ffff8804283165b8 RDI: ffffffffa01188d7
RBP: ffff88042232ddd0 R08: ffff880388454000 R09: ffff8803edfde1f8
R10: ffff8803edfde1f8 R11: ffff8803edfde1f8 R12: ffff880428316750
R13: ffff880388454000 R14: ffff8803f88b31d0 R15: ffff8803f8b21d50
FS: 0000000000000000(0000) GS:ffff88042ee20000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000050 CR3: 0000000001a05000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:3 (pid: 129, threadinfo ffff88042232c000, task ffff88042230c920)
Stack:
0000000000000000 ffff880400000018 ffff88042232dde0 ffff88042232dda0
ffffffffa01188c4 ffff88042ee93af0 ffff88042232ddb0 ffffffff8100e047
ffff88042232de10 ffff880420e5a2c8 ffff8803f8b21d50 ffff8803edfde1f8
Call Trace:
[<ffffffff8100e047>] ? load_TLS+0xb/0xf
[<ffffffffa01156ad>] async_sas_ata_eh+0x66/0x95 [libsas]
[<ffffffff810655e1>] async_run_entry_fn+0x9e/0x131

Reported-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
Tested-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit 6be254f019fd8dadc63cc63ded75d2422e2057b7
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Mon Mar 12 11:38:26 2012 -0700

libsas: fix sas_get_port_device regression

Commit 899fcf4 "[SCSI] libsas: set attached device type and target
protocols for local phys" set up 'phy' to be dereferenced after
list_for_each_entry(phy, &port->phy_list, port_phy_el) (i.e. phy ==
&port->phy_list) resulting in reports like:

BUG: unable to handle kernel NULL pointer dereference at 00000000000002b0
IP: [<ffffffffa00ce948>] sas_discover_domain+0x29e/0x4fb [libsas]

...fix by deferring sas_phy_set_target() to the end of
sas_get_port_device().

Reported-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
Tested-by: Tom Jackson <thomas.p.jackson@xxxxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
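
The pitfall can be demonstrated with a minimal userspace re-creation of
the kernel list idiom. Everything here (the list helpers, struct
model_phy, struct model_port, find_phy()) is a simplified stand-in
written for this sketch; it shows why the cursor is unusable once the
walk completes, not the actual fix, which defers the
sas_phy_set_target() call.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct model_phy {
    int id;
    struct list_head port_phy_el;
};

struct model_port {
    struct list_head phy_list;
};

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Safe lookup: return the match or NULL, never the cursor left after a full walk. */
static struct model_phy *find_phy(struct model_port *port, int id)
{
    struct list_head *pos;

    for (pos = port->phy_list.next; pos != &port->phy_list; pos = pos->next) {
        struct model_phy *phy = container_of(pos, struct model_phy, port_phy_el);

        if (phy->id == id)
            return phy;
    }
    return NULL;
}

/*
 * After a walk that runs to completion, the entry-style cursor is
 * computed from the list head itself, so it does not point at a phy.
 */
static bool cursor_after_full_walk_is_head(struct model_port *port)
{
    struct list_head *pos = port->phy_list.next;

    while (pos != &port->phy_list)
        pos = pos->next;

    struct model_phy *cursor = container_of(pos, struct model_phy, port_phy_el);

    return &cursor->port_phy_el == &port->phy_list;
}
```

Any field read through such a cursor lands in whatever memory surrounds
the head, which is consistent with the oops at a small offset shown
above.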

commit 71cb71d183256fbe77f35558606989c8f47c4ff0
Author: Thomas Jackson <thomas.p.jackson@xxxxxxxxx>
Date: Fri Feb 17 18:33:10 2012 -0800

libsas: fix sas_find_bcast_phy() in the presence of 'vacant' phys

If an expander reports 'PHY VACANT' for a phy index prior to the one
that generated a BCN libsas fails rediscovery. Since a vacant phy is
defined as a valid phy index that will never have an attached device
just continue the search.

Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Thomas Jackson <thomas.p.jackson@xxxxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
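
A simplified model of the search, assuming each phy's SMP response has
already been collected into an array. The enum, struct and function
names are hypothetical; the point is only that a vacant phy is skipped
rather than treated as a failure.

```c
#include <stddef.h>

/* Hypothetical per-phy outcome of asking the expander for a change count. */
enum model_smp_result {
    MODEL_SMP_FUNC_ACC,    /* request accepted, change_count is valid */
    MODEL_SMP_PHY_VACANT,  /* valid index that will never have a device */
    MODEL_SMP_FAILED,      /* any other error */
};

struct model_phy_report {
    enum model_smp_result result;
    int change_count;
};

/*
 * Find the first phy at or after from_phy whose change count differs
 * from the cached value. Returns 0 and stores the index in *phy_id on
 * success, or -1 when nothing changed or a real error was reported.
 */
static int model_find_bcast_phy(const struct model_phy_report *reports,
                                const int *cached_counts, size_t num_phys,
                                size_t from_phy, size_t *phy_id)
{
    for (size_t i = from_phy; i < num_phys; i++) {
        switch (reports[i].result) {
        case MODEL_SMP_PHY_VACANT:
            continue;          /* keep searching past vacant phys */
        case MODEL_SMP_FUNC_ACC:
            break;
        default:
            return -1;         /* genuine failure ends the search */
        }

        if (reports[i].change_count != cached_counts[i]) {
            *phy_id = i;
            return 0;
        }
    }
    return -1;
}
```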

commit 705885cb7b906ebddafbaedd693c355f8350ac4e
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Thu Mar 1 18:44:25 2012 -0800

libata, libsas: introduce sched_eh and end_eh port ops

When managing shost->host_eh_scheduled libata assumes that there is a
1:1 shost-to-ata_port relationship. libsas creates a 1:N relationship
so it needs to manage host_eh_scheduled cumulatively at the host level.
The sched_eh and end_eh port ops allow libsas to track when domain
devices enter/leave the "eh-pending" state under ha->lock (previously
named ha->state_lock, but it is no longer just a lock for ha->state
changes).

Since host_eh_scheduled indicates eh without backing commands pinning
the device it can be deallocated at any time. Move the taking of the
domain_device reference under the port_lock to guarantee that the
ata_port stays around for the duration of eh.

Cc: Tejun Heo <tj@xxxxxxxxxx>
Acked-by: Jacek Danecki <jacek.danecki@xxxxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
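
One way to picture the cumulative, host-level accounting is the
following userspace sketch. The struct and function names are
hypothetical, the lock that would guard these updates is omitted, and
the exact bookkeeping in the patch may differ; the sketch only shows N
devices sharing a single host-level eh request.

```c
#include <stdbool.h>

/* Hypothetical host-adapter state shared by all devices behind one host. */
struct eh_model_ha {
    int eh_active;          /* devices currently in the eh-pending state */
    int host_eh_scheduled;  /* host-level request seen by the eh thread */
};

struct eh_model_dev {
    struct eh_model_ha *ha;
    bool eh_pending;
};

/* Device asks for eh: only the first pending device raises the host request. */
static void model_sched_eh(struct eh_model_dev *dev)
{
    if (dev->eh_pending)
        return;                     /* already counted */

    dev->eh_pending = true;
    if (dev->ha->eh_active++ == 0)
        dev->ha->host_eh_scheduled++;
}

/* Device finishes eh: only the last pending device drops the host request. */
static void model_end_eh(struct eh_model_dev *dev)
{
    if (!dev->eh_pending)
        return;                     /* nothing to undo */

    dev->eh_pending = false;
    if (--dev->ha->eh_active == 0)
        dev->ha->host_eh_scheduled--;
}
```

With a 1:1 assumption the first device to finish would clear the host
request while other devices were still waiting for their turn in eh.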

commit 3c1dbbd2529c659745c047c449037e4f94d326cb
Author: Maciej Trela <maciej.trela@xxxxxxxxx>
Date: Sun Mar 4 17:58:55 2012 -0800

libsas: cleanup spurious calls to scsi_schedule_eh

eh is woken up automatically by the presence of failed commands,
scsi_schedule_eh is reserved for cases where there are no failed
commands. This guarantees that host_eh_scheduled is only incremented
when an explicit eh request is made.

Reviewed-by: Jacek Danecki <jacek.danecki@xxxxxxxxx>
Signed-off-by: Maciej Trela <maciej.trela@xxxxxxxxx>
[fixed spurious delete of sas_ata_task_abort]
Signed-off-by: Artur Wojcik <artur.wojcik@xxxxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

commit 63494f1cc2022fd9271c0af3399df3bc7dbec55c
Author: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Fri Mar 9 11:00:06 2012 -0800

libsas: introduce sas_work to fix sas_drain_work vs sas_queue_work

When requeuing work to a draining workqueue the last work instance may
not be idle, so sas_queue_work() must not touch work->entry. Introduce
sas_work with a drain_node list_head to have a private list for
collecting work deferred due to drain collision.

Fixes reports like:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810410d4>] process_one_work+0x2e/0x338

Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
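
The private-list idea can be sketched in userspace as below. All names
are hypothetical stand-ins: model_work plays the role of the
workqueue-owned work item, drain_next plays the role of the drain_node
list_head, and locking is omitted. It illustrates the approach rather
than reproducing the libsas code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the work item whose linkage belongs to the workqueue. */
struct model_work {
    int id;
    bool queued;
};

/* Wrapper with its own private link for work deferred during a drain. */
struct model_sas_work {
    struct model_work work;
    struct model_sas_work *drain_next;  /* plays the role of drain_node */
    bool deferred;
};

struct drain_model_ha {
    bool draining;
    struct model_sas_work *defer_head, *defer_tail;
};

/* Queue work, or park it on the private list while a drain is in progress. */
static void model_queue_work(struct drain_model_ha *ha, struct model_sas_work *sw)
{
    if (!ha->draining) {
        sw->work.queued = true;     /* stand-in for handing off to the workqueue */
        return;
    }
    if (sw->deferred)
        return;                     /* already parked, do not link twice */

    /* Only the private link is written; the work item itself is untouched. */
    sw->deferred = true;
    sw->drain_next = NULL;
    if (ha->defer_tail)
        ha->defer_tail->drain_next = sw;
    else
        ha->defer_head = sw;
    ha->defer_tail = sw;
}

/* Drain finished: requeue everything that was parked. Returns the count. */
static int model_drain_done(struct drain_model_ha *ha)
{
    struct model_sas_work *sw = ha->defer_head;
    int requeued = 0;

    ha->draining = false;
    ha->defer_head = ha->defer_tail = NULL;

    while (sw) {
        struct model_sas_work *next = sw->drain_next;

        sw->drain_next = NULL;
        sw->deferred = false;
        model_queue_work(ha, sw);
        requeued++;
        sw = next;
    }
    return requeued;
}
```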
