Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling

From: John Garry
Date: Tue Oct 04 2022 - 10:05:04 EST


On 04/10/2022 14:05, Niklas Cassel wrote:
2.35.3

For pm80xx (pm8001 changes untested):
Tested-by: Niklas Cassel<niklas.cassel@xxxxxxx>



Thanks!


Notes unrelated to this patch:

Both before and after this series, this driver prints:
[ 215.845053] ata21.00: exception Emask 0x0 SAct 0xfc0000 SErr 0x0 action 0x6
[ 215.852308] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.857801] ata21.00: cmd 61/00:00:00:3a:d3/01:00:b3:04:00/40 tag 18 ncq dma 131072 out
res 43/04:00:ff:3a:d3/00:00:b3:04:00/40 Emask 0x400 (NCQ error) <F>
[ 215.874396] ata21.00: status: { DRDY SENSE ERR }
[ 215.879192] ata21.00: error: { ABRT }
[ 215.882997] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.888479] ata21.00: cmd 61/00:00:00:3b:d3/01:00:b3:04:00/40 tag 19 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.904814] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.910311] ata21.00: cmd 61/00:00:00:3c:d3/01:00:b3:04:00/40 tag 20 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.932679] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.941203] ata21.00: cmd 61/00:00:00:3d:d3/01:00:b3:04:00/40 tag 21 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.963616] ata21.00: failed command: WRITE FPDMA QUEUED
[ 215.972150] ata21.00: cmd 61/00:00:00:3e:d3/01:00:b3:04:00/40 tag 22 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[ 215.994532] ata21.00: failed command: WRITE FPDMA QUEUED
[ 216.003124] ata21.00: cmd 61/00:00:00:3f:d3/01:00:b3:04:00/40 tag 23 ncq dma 131072 out
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)

HSM (Host State Machine) violation errors.

For the same SATA drive connected via AHCI this will instead give:

[ 3796.944923] ata14.00: exception Emask 0x0 SAct 0x80800003 SErr 0xc0000 action 0x0
[ 3796.959375] ata14.00: irq_stat 0x40000008
[ 3796.970140] ata14: SError: { CommWake 10B8B }
[ 3796.981231] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3796.993237] ata14.00: cmd 61/00:08:00:7e:73/02:00:8e:08:00/40 tag 1 ncq dma 262144 out
res 43/04:01:00:00:00/00:00:00:00:00/40 Emask 0x1 (device error)
[ 3797.017984] ata14.00: status: { DRDY SENSE ERR }
[ 3797.026833] ata14.00: error: { ABRT }
[ 3797.034664] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3797.043015] ata14.00: cmd 61/00:b8:00:60:73/0a:00:8e:08:00/40 tag 23 ncq dma 1310720 out
res 43/04:00:df:67:73/00:00:8e:08:00/40 Emask 0x400 (NCQ error) <F>
[ 3797.065224] ata14.00: status: { DRDY SENSE ERR }
[ 3797.072914] ata14.00: error: { ABRT }
[ 3797.079598] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3797.087920] ata14.00: cmd 61/00:f8:00:6a:73/0a:00:8e:08:00/40 tag 31 ncq dma 1310720 out
res 43/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[ 3797.109800] ata14.00: status: { DRDY SENSE ERR }
[ 3797.117451] ata14.00: error: { ABRT }

device error errors.


Except for the I/O that caused the NCQ error, the remaining outstanding I/Os,
regardless if they were aborted by the drive, as a side-effect of reading the
NCQ error log (see 13.7.4 Queued Error Log (10h) in SATA 3.5a spec),
or if they were aborted by the host (by sas_ata_device_link_abort()),
I don't think it is correct to report these as HSM violation errors.

HSM violation errors are e.g. when you try to issue a command to a drive
that has ATA_BUSY bit set.

We had a similar issue for hisi_sas and solved in patch 4/7: don't set ATA_ERR in the fis for those IO which complete with error, but abort the IO via sas_abort_task().

For pm80xx the IO is either rejected (actually completes with rejection) or is aborted via internal abort command. Maybe we can do similar for pm8001 as we allow the IO to complete in both cases with error. I'll check.

Thanks,
John