Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code

From: Damien Le Moal
Date: Fri Feb 04 2022 - 06:27:56 EST


On 2/4/22 19:36, John Garry wrote:
> On 04/02/2022 03:02, Damien Le Moal wrote:
>>> The inconsistency is this line says:
>>> [ 137.193944] pm80xx0:: pm80xx_chip_sata_req 4581:no data
>>> Which seems to be sensical for NCQ_NON_DATA command, but then, this line
>>> seems wrong:
>>> [ 137.228015] pm80xx0:: mpi_sata_completion 2515:FPDMA len = 8
>>>
>>> I need to go and check the specs what the FIS reply format is for
>>> NCQ_NON_DATA.
>>>
>>>
>>> [ 137.187184] pm80xx0:: pm8001_queue_command 408:pm8001_task_exec device
>>> [ 137.193944] pm80xx0:: pm80xx_chip_sata_req 4581:no data
>>> [ 137.199339] pm80xx0:: pm80xx_chip_sata_req 4682:Sending Normal SATA
>>> command 0x63 inb 4
>>> [ 137.207577] pm80xx0:: pm8001_mpi_msg_consume 1446:: CI=46 PI=47
>>> msgHeader=8104200d
>>> [ 137.215399] pm80xx0:: mpi_sata_completion 2481:IO_SUCCESS
>>> [ 137.220961] pm80xx0:: mpi_sata_completion 2503:SAS_PROTO_RESPONSE
>>> len = 20
>>> [ 137.228015] pm80xx0:: mpi_sata_completion 2515:FPDMA len = 8
>>> [ 137.233878] pm80xx0:: pm8001_mpi_msg_free_set 1403: CI=47 PI=47
>>> [ 137.236696] pm80xx0:: pm8001_queue_command 408:pm8001_task_exec device
>>> [ 137.247102] pm80xx0:: pm80xx_chip_sata_req 4585:DMA
>>> [ 137.252186] pm80xx0:: pm80xx_chip_sata_req 4593:FPDMA
>>> [ 137.257400] pm80xx0:: pm80xx_chip_sata_req 4682:Sending Normal SATA
>>> command 0x65 inb f
>>> [ 167.506280] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
>>> [ 167.512363] sas: sas_scsi_find_task: aborting task 0x00000000aa372627
>>> [ 167.519049] pm80xx0:: pm8001_chip_abort_task 4607:cmd_tag = 2, abort
>>> task tag = 0x1
>>> [ 187.969173] pm80xx0:: pm8001_exec_internal_task_abort 753:TMF task
>>> timeout.
>
> As I mentioned, having this fail is a red flag. If I was pushed to guess
> what has happened, I'd say the FW is faulting due to some erroneous
> driver behaviour.

I am still thinking that there is something wrong with the handling of
NCQ NON DATA command. There are several places where the code determines
non-data vs pio vs dma vs fpdma (ncq), and NCQ NON DATA always falls in
the fpdma bucket, which is wrong.

>> This is the submission path, not completion. The code is:
>>
>> (gdb) list *(pm8001_queue_command+0x842)
>> 0x3d42 is in pm8001_queue_command (drivers/scsi/pm8001/pm8001_sas.c:491).
>> 486 atomic_dec(&pm8001_dev->running_req);
>> 487 goto err_out_tag;
>> 488 }
>> 489 /* TODO: select normal or high priority */
>> 490 spin_lock(&t->task_state_lock);
>> 491 t->task_state_flags |= SAS_TASK_AT_INITIATOR;
>> 492 spin_unlock(&t->task_state_lock);
>> 493 } while (0);
>> 494 rc = 0;
>> 495 goto out_done;
>>
>> So the task is already completed when the submission path tries to set
>> the state flag ? Debugging...
>
> Yeah, that's how it looks.
>
> I already mentioned this problem here:
>
> https://lore.kernel.org/linux-scsi/0cc0c435-b4f2-9c76-258d-865ba50a29dd@xxxxxxxxxx/
>
> Maybe we should just fix it now to rule it out of possibly causing other
> issues... I was reluctant to fix it as many places seems to need to be
> touched. Let me check it.

Here is my current fix:

diff --git a/drivers/scsi/pm8001/pm8001_sas.c
b/drivers/scsi/pm8001/pm8001_sas.c
index 1b95c73d12d1..16c101577dd3 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -453,13 +453,18 @@ int pm8001_queue_command(struct sas_task *task,
gfp_t gfp_flags)
ccb->ccb_tag = tag;
ccb->task = t;
ccb->device = pm8001_dev;
+
+ /* TODO: select normal or high priority */
+ atomic_inc(&pm8001_dev->running_req);
+ spin_lock(&t->task_state_lock);
+ t->task_state_flags |= SAS_TASK_AT_INITIATOR;
+ spin_unlock(&t->task_state_lock);
+
switch (task_proto) {
case SAS_PROTOCOL_SMP:
- atomic_inc(&pm8001_dev->running_req);
rc = pm8001_task_prep_smp(pm8001_ha, ccb);
break;
case SAS_PROTOCOL_SSP:
- atomic_inc(&pm8001_dev->running_req);
if (is_tmf)
rc = pm8001_task_prep_ssp_tm(pm8001_ha,
ccb, tmf);
@@ -468,7 +473,6 @@ int pm8001_queue_command(struct sas_task *task,
gfp_t gfp_flags)
break;
case SAS_PROTOCOL_SATA:
case SAS_PROTOCOL_STP:
- atomic_inc(&pm8001_dev->running_req);
rc = pm8001_task_prep_ata(pm8001_ha, ccb);
break;
default:
@@ -480,13 +484,12 @@ int pm8001_queue_command(struct sas_task *task,
gfp_t gfp_flags)

if (rc) {
pm8001_dbg(pm8001_ha, IO, "rc is %x\n", rc);
+ spin_lock(&t->task_state_lock);
+ t->task_state_flags &= ~SAS_TASK_AT_INITIATOR;
+ spin_unlock(&t->task_state_lock);
atomic_dec(&pm8001_dev->running_req);
goto err_out_tag;
}
- /* TODO: select normal or high priority */
- spin_lock(&t->task_state_lock);
- t->task_state_flags |= SAS_TASK_AT_INITIATOR;
- spin_unlock(&t->task_state_lock);
} while (0);
rc = 0;
goto out_done;

With this, No KASAN complaint. I will send a proper patch ASAP.

Of note is that I cannot see what the flag SAS_TASK_AT_INITIATOR is for.
It is set and unset only, never tested anywhere in libsas nor pm8001
driver. This flag seems totally useless to me, unless this is something
that the HW can see ?

--
Damien Le Moal
Western Digital Research