Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling

From: Niklas Cassel
Date: Tue Oct 04 2022 - 09:06:04 EST


On Tue, Sep 27, 2022 at 03:04:51PM +0800, John Garry wrote:
> As reported in [0], the pm8001 driver NCQ error handling more or less
> duplicates what libata does in link error handling, as follows:
> - abort all commands
> - do autopsy with read log ext 10 command
> - reset the target to recover, if necessary
>
> Indeed for the hisi_sas driver we want to add similar handling for NCQ
> errors.
>
> This series add a new libsas API - sas_ata_device_link_abort() - to handle
> host NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it.
>
> A difference in the pm8001 driver NCQ error handling is that we send
> SATA_ABORT per-task prior to read log ext10, but I feel that this should
> not make a difference to the error handling.
>
> Damien kindly tested previous the series for pm8001, but any further pm8001
> testing would be appreciated as I have since tweaked pm8001 handling again.
> This is because the pm8001 driver hangs on my arm64 machine read log ext10
> command.
>
> Finally with these changes we can make the libsas task alloc/free APIs
> private, which they should always have been.
>
> Based on mkp-scsi @ 6.1/scsi-staging 57569c37f0ad ("scsi: iscsi:
> iscsi_tcp: Fix null-ptr-deref while calling getpeername()")
>
> [0] https://lore.kernel.org/linux-scsi/8fb3b093-55f0-1fab-81f4-e8519810a978@xxxxxxxxxx/
>
> Changes since v4:
> - Add Jason's tags (thanks)
> - Rebase
>
> Changes since v3:
> - Add Damien's tags (thanks)
> - Modify hisi_sas processing as follows:
> - use sas_task_abort() for rejected IO
> - Modify abort task processing to issue softreset in certain circumstances
> - rebase
>
> Changes since v2:
> - Stop sending SATA_ABORT all for pm8001 handling
> - Make "reset" optional in sas_ata_device_link_abort()
> - Drop Jack's ACK
>
> John Garry (5):
> scsi: libsas: Add sas_ata_device_link_abort()
> scsi: hisi_sas: Move slot variable definition in hisi_sas_abort_task()
> scsi: pm8001: Modify task abort handling for SATA task
> scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors
> scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private
>
> Xingui Yang (2):
> scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw
> scsi: hisi_sas: Modify v3 HW SATA disk error state completion
> processing
>
> drivers/scsi/hisi_sas/hisi_sas.h | 1 +
> drivers/scsi/hisi_sas/hisi_sas_main.c | 26 +++-
> drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 53 ++++++-
> drivers/scsi/libsas/sas_ata.c | 12 ++
> drivers/scsi/libsas/sas_init.c | 3 -
> drivers/scsi/libsas/sas_internal.h | 4 +
> drivers/scsi/pm8001/pm8001_hwi.c | 186 ++++---------------------
> drivers/scsi/pm8001/pm8001_sas.c | 8 ++
> drivers/scsi/pm8001/pm8001_sas.h | 4 -
> drivers/scsi/pm8001/pm80xx_hwi.c | 177 +++--------------------
> include/scsi/libsas.h | 4 -
> include/scsi/sas_ata.h | 6 +
> 12 files changed, 143 insertions(+), 341 deletions(-)
>
> --
> 2.35.3
>

For pm80xx (pm8001 changes untested):
Tested-by: Niklas Cassel <niklas.cassel@xxxxxxx>



Notes unrelated to this patch:

Both before and after this series, this driver prints:
[ 215.845053] ata21.00: exception Emask 0x0 SAct 0xfc0000 SErr 0x0 action 0x6
[ 215.852308] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.857801] ata21.00: cmd 61/00:00:00:3a:d3/01:00:b3:04:00/40 tag 18 ncq dma 131072 out
res 43/04:00:ff:3a:d3/00:00:b3:04:00/40 Emask 0x400 (NCQ error) <F>
[ 215.874396] ata21.00: status: { DRDY SENSE ERR }
[ 215.879192] ata21.00: error: { ABRT }
[ 215.882997] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.888479] ata21.00: cmd 61/00:00:00:3b:d3/01:00:b3:04:00/40 tag 19 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.904814] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.910311] ata21.00: cmd 61/00:00:00:3c:d3/01:00:b3:04:00/40 tag 20 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.932679] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.941203] ata21.00: cmd 61/00:00:00:3d:d3/01:00:b3:04:00/40 tag 21 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.963616] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.972150] ata21.00: cmd 61/00:00:00:3e:d3/01:00:b3:04:00/40 tag 22 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.994532] ata21.00: failed command: WRITE FPDMA QUEUED
[ 216.003124] ata21.00: cmd 61/00:00:00:3f:d3/01:00:b3:04:00/40 tag 23 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)

HSM (Host State Machine) violation errors.

For the same SATA drive connected via AHCI this will instead give:

[ 3796.944923] ata14.00: exception Emask 0x0 SAct 0x80800003 SErr 0xc0000 action 0x0
[ 3796.959375] ata14.00: irq_stat 0x40000008
[ 3796.970140] ata14: SError: { CommWake 10B8B }
[ 3796.981231] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3796.993237] ata14.00: cmd 61/00:08:00:7e:73/02:00:8e:08:00/40 tag 1 ncq dma 262144 out
res 43/04:01:00:00:00/00:00:00:00:00/40 Emask 0x1 (device error)
[ 3797.017984] ata14.00: status: { DRDY SENSE ERR }
[ 3797.026833] ata14.00: error: { ABRT }
[ 3797.034664] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3797.043015] ata14.00: cmd 61/00:b8:00:60:73/0a:00:8e:08:00/40 tag 23 ncq dma 1310720 out
res 43/04:00:df:67:73/00:00:8e:08:00/40 Emask 0x400 (NCQ error) <F>
[ 3797.065224] ata14.00: status: { DRDY SENSE ERR }
[ 3797.072914] ata14.00: error: { ABRT }
[ 3797.079598] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3797.087920] ata14.00: cmd 61/00:f8:00:6a:73/0a:00:8e:08:00/40 tag 31 ncq dma 1310720 out
res 43/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[ 3797.109800] ata14.00: status: { DRDY SENSE ERR }
[ 3797.117451] ata14.00: error: { ABRT }

device error errors.


Except for the I/O that caused the NCQ error, the remaining outstanding I/Os,
regardless if they were aborted by the drive, as a side-effect of reading the
NCQ error log (see 13.7.4 Queued Error Log (10h) in SATA 3.5a spec),
or if they were aborted by the host (by sas_ata_device_link_abort()),
I don't think it is correct to report these as HSM violation errors.

HSM violation errors are e.g. when you try to issue a command to a drive
that has ATA_BUSY bit set.


Kind regards,
Niklas