Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code

From: Damien Le Moal
Date: Thu Feb 03 2022 - 04:44:49 EST


On 2/1/22 00:58, John Garry wrote:
> On 28/01/2022 09:09, John Garry wrote:
>>> I ran some more tests. In particular, I ran libzbc compliance tests on a
>>> 20TB SMR drives. All tests pass with 5.17-rc1, but after applying your
>>> series, I see command timeout that take forever to recover from, with
>>> the drive revalidation failing after that.
>>>
>>> [  385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
>>> [  385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73
>>> [  405.561099] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task
>>> timeout.
>>> [  405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is
>>> aborted
>>> [  405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is
>>> aborted
>>> [  411.192602] ata21.00: qc timeout (cmd 0xec)
>>> [  431.672122] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task
>>> timeout.
>>> [  431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>>> [  431.685544] ata21.00: revalidation failed (errno=-5)
>>> [  441.911948] ata21.00: qc timeout (cmd 0xec)
>>> [  462.391545] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task
>>> timeout.
>>> [  462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>>> [  462.404992] ata21.00: revalidation failed (errno=-5)
>>> [  492.598769] ata21.00: qc timeout (cmd 0xec)
>>> ...
>>>
>>> So there is a problem. Need to dig into this. I see this issue only with
>>> libzbc passthrough tests. fio runs with libaio are fine.
>>
>> Thanks for the notice. I think that I also saw a hang, but, IIRC, it
>> happened on mainline for me - but it's hard to know if I broke something
>> if it is already broke in another way. That is why I wanted this card
>> working properly...
>
> Hi Damien,
>
> From testing mainline, I can see a hang on my arm64 system for SAS
> disks. I think that the reason is the we don't finish some commands in
> EH properly for pm8001:
> - In EH, we attempt to abort the task in sas_scsi_find_task() ->
> lldd_abort_task()
> The default return from pm8001_exec_internal_tmf_task() is
> -TMF_RESP_FUNC_FAILED, so if the TMF does not execute properly we return
> this value
> - sas_scsi_find_task() cannot handle -TMF_RESP_FUNC_FAILED, and returns
> -TMF_RESP_FUNC_FAILED directly to sas_eh_handle_sas_errors(), which,
> again, does not handle -TMF_RESP_FUNC_FAILED. So we don't progress to
> ever finish the comand.
>
> This looks like the correct fix for mainline:
>
> --- a/drivers/scsi/pm8001/pm8001_sas.c
> +++ b/drivers/scsi/pm8001/pm8001_sas.c
> @@ -766,7 +766,7 @@ static int pm8001_exec_internal_tmf_task(struct
> domain_device *dev,
> pm8001_dev, DS_OPERATIONAL);
> wait_for_completion(&completion_setstate);
> }
> - res = -TMF_RESP_FUNC_FAILED;
> + res = TMF_RESP_FUNC_FAILED;
>
> That's effectively the same as what I have in this series in
> sas_execute_tmf().
>
> However your testing is a SATA device, which I'll check further.

This did not help. Still seeing 100% reproducible hangs.

I did a lot of testing/digging today, and the hang cause seems to be
missing task completions. At random, a task times out as its completion
does not come, and subsequent abort trial for the task fail, revalidate
fails and the device is dropped (capacity goes to 0). But at that point,
doing rmmod/modprobe to reset the device does not work. sync cache
command issued at rmmod time never completes. I end up needing to power
cycle the machine every time...

No clue about the root cause yet, but it definitely seem to be related
to NCQ/high QD operation. If I force my tests to use non-NCQ commands,
everything is fine and the tests run to completion without any issue.

I wonder if their is a tag management bug somewhere...

I did stumble on something very ugly in libsas too: sas_ata_qc_issue()
drops and retake the ata port lock. No other ATA driver do that since
the ata completion also take that lock. The ata port lock is taken
before ata_qc_issue() is called with IRQ disabled (spin_lock_irqsave()).
So doing a spin_unlock()/spin_lock() in sas_ata_qc_issue() (called from
ata_qc_issue()) seems like a very bad idea. I removed that and
everything work the same way (the lld execute does not sleep). But that
did not solve the hang problem.

Of note is this is all with your libsas patches applied. Without the
patches, I have KASAN screaming at me about use-after-free in completion
context. With your patches, KASAN is silent.

Another thing: this driver does not allow changing the max qd... Very
annoying.

echo 1 > /sys/block/sdX/device/queue_depth

has no effect. QD stays at 32 for an ATA drive. Need to look into that too.


--
Damien Le Moal
Western Digital Research