Re: Re: [PATCH] ahci: libahci: clear pending interrupt status

From: Szuying Chen
Date: Mon Aug 14 2023 - 00:34:31 EST


On 8/10/23 17:59, Damien Le Moal wrote:
>On 8/10/23 18:31, Szuying Chen wrote:
>>> On 8/10/23 14:12, Damien Le Moal wrote:
>>> On 8/10/23 14:05, Szuying Chen wrote:
>>> > When ISR handle interface fatal error with error recovery after clear PxIS
>>> > and PxIE. Another FIS(SDB FIS with err) that set PxIS.IFS to 1 is recevied
>>> > during error recovery, which causing the HBA not issue any new commands
>>> > after cmd.ST set 1.
>>>
>>> This is not very clear. If there was a fatal error, the drive should be in
>>> error state and no other SDB FIS can be received as the drive does absolutely
>>> nothing while in error state (it only waits for a read log 10h command to be>
>>> issued to get it out of error state). So if you are seeing 2 SDB FIS with
>>> errors one after the other, you have a buggy device...
>>>
>>> However, I may be misunderstanding your issue here. Could you provide more
>>> details and a dmesg output example of the issue ?
>>
>> According to AHCI 1.3.1 specification ch6.1.9, when an R_ERR is received
>> on an H2D Data FIS in normal operation, the HBA sets PxIS.IFS to 1
>> (fatal error) and halts operation. Referring to SATA 3.0 specification we
>> know the device will halt queued command processing and transmit SDB FIS to
>> host with ERR bit in Status field set to one(set PxIS.TFES to 1).
>
>Sure, but that SBD FIS should be completely ignored by the adapter since it
>stopped operation. If you see it, then it means that the handling of the first
>error was incomplete.
>
The SDB FIS happens between pxIS.IE was cleared and PxCMD.ST not cleared
to 0. Because the PxCMD.ST still setting, so the SDB FIS can't ignored.

>> In our case, the ISR handles fatal errors(PxIS.IFS) and enters error
>> recovery after cleaning up PxIS and PxIE. Then a SDB FIS is received
>> with interrupt bit(PxIS.TFES) set to 1. According to AHCI 1.3.1
>> specification ch6.2.2, HBA can't issue(cmd.ST set to 1) any new commands
>> under PxIS.TFES alive during error recovery.
>
>But how come you see a new command being issued ? This entire situation should
>result in a port reset with the first error. I do not see how this is possible.
>Are you saying that the port reset is not cleaning things up properly ? Could
>you share the dmesg output of this case ?
>
>
dmesg:
[ 654.244958] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 654.460522] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 656.180375] ata7.00: exception Emask 0x10 SAct 0x7c0 SErr 0x200100
action 0x6 frozen
[ 656.180384] ata7.00: irq_stat 0x08000000, interface fatal error
[ 656.180386] ata7: SError: { UnrecovData BadCRC }
[ 656.180389] ata7.00: failed command: READ FPDMA QUEUED
[ 656.180390] ata7.00: cmd 60/00:30:28:a9:9c/0a:00:01:00:00/40 tag 6 ncq
dma 1310720 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 656.180395] ata7.00: status: { DRDY }
[ 656.180397] ata7.00: failed command: READ FPDMA QUEUED
[ 656.180398] ata7.00: cmd 60/00:38:28:b3:9c/0a:00:01:00:00/40 tag 7 ncq
dma 1310720 in
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 656.180402] ata7.00: status: { DRDY }
[ 656.180403] ata7.00: failed command: READ FPDMA QUEUED
[ 656.180404] ata7.00: cmd 60/00:40:28:bd:9c/0a:00:01:00:00/40 tag 8 ncq
dma 1310720 in
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 656.180408] ata7.00: status: { DRDY }
[ 656.180410] ata7.00: failed command: READ FPDMA QUEUED
[ 656.180411] ata7.00: cmd 60/00:48:28:c7:9c/0a:00:01:00:00/40 tag 9 ncq
dma 1310720 in
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 656.180414] ata7.00: status: { DRDY }
[ 656.180416] ata7.00: failed command: READ FPDMA QUEUED
[ 656.180417] ata7.00: cmd 60/00:50:28:d1:9c/0a:00:01:00:00/40 tag 10 ncq
dma 1310720 in
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 656.180421] ata7.00: status: { DRDY }
[ 656.180423] ata7: hard resetting link
[ 659.210328] amd_iommu_report_page_fault: 48 callbacks suppressed
[ 659.210333] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 659.210345] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 659.210352] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 659.811069] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 659.811080] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 659.811086] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 659.827669] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 659.827676] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 659.827680] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 659.878319] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 664.426805] amd_iommu_report_page_fault: 248 callbacks suppressed
[ 664.426809] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 664.426819] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 664.426825] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 664.676294] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 664.676305] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 664.676313] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 664.693501] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 664.693513] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 664.693521] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 664.711480] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 666.180398] ata7: softreset failed (1st FIS failed)
[ 666.180405] ata7: hard resetting link
[ 670.054987] amd_iommu_report_page_fault: 99 callbacks suppressed
[ 670.054992] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 670.055003] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]
[ 670.055010] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0019 address=0x0 flags=0x0000]


dmesg (add clear pending Interrupt):
[ 120.236847] ahci_error_intr: interface fatal error
[ 120.236854] ahci_error_intr: freeze (set PxIE to 0)
[ 120.265268] ata7: limiting SATA link speed to 3.0 Gbps
[ 120.265275] ata7.00: exception Emask 0x10 SAct 0x3f0 SErr 0x400000
action 0x6 frozen
[ 120.265279] ata7.00: irq_stat 0x08000000, interface fatal error
[ 120.265281] ata7: SError: { Handshk }
[ 120.265285] ata7.00: failed command: WRITE FPDMA QUEUED
[ 120.265287] ata7.00: cmd 61/00:20:a8:b9:50/0a:00:00:00:00/40 tag 4 ncq
dma 1310720 ou
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 120.265295] ata7.00: status: { DRDY }
[ 120.265297] ata7.00: failed command: WRITE FPDMA QUEUED
[ 120.265299] ata7.00: cmd 61/20:28:a8:c3:50/03:00:00:00:00/40 tag 5 ncq
dma 409600 out
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 120.265305] ata7.00: status: { DRDY }
[ 120.265307] ata7.00: failed command: WRITE FPDMA QUEUED
[ 120.265309] ata7.00: cmd 61/00:30:c8:c6:50/0a:00:00:00:00/40 tag 6 ncq
dma 1310720 ou
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 120.265314] ata7.00: status: { DRDY }
[ 120.265316] ata7.00: failed command: WRITE FPDMA QUEUED
[ 120.265318] ata7.00: cmd 61/00:38:c8:d0:50/0a:00:00:00:00/40 tag 7 ncq
dma 1310720 ou
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 120.265324] ata7.00: status: { DRDY }
[ 120.265325] ata7.00: failed command: WRITE FPDMA QUEUED
[ 120.265327] ata7.00: cmd 61/00:40:c8:da:50/0a:00:00:00:00/40 tag 8 ncq
dma 1310720 ou
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 120.265333] ata7.00: status: { DRDY }
[ 120.265334] ata7.00: failed command: WRITE FPDMA QUEUED
[ 120.265336] ata7.00: cmd 61/00:48:c8:e4:50/0a:00:00:00:00/40 tag 9 ncq
dma 1310720 ou
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[ 120.265342] ata7.00: status: { DRDY }
[ 120.265346] ata7: hard resetting link
[ 120.265348] ata7: ahci_hardreset *****
[ 120.265352] ahci 0000:0a:00.0: ahci_stop_engine: PORT_IRQ_STAT 0x40000008
PORT_CMD 0x0004c617
[ 120.265356] ahci 0000:0a:00.0: ahci_stop_engine: setting HBA to idle
(setting PxCMD.ST to 0)
[ 120.937522] ahci 0000:0a:00.0: ahci_start_engine: check
PORT_SCR_ERR 0x00000000, and to clear
[ 120.937534] ahci 0000:0a:00.0: ahci_start_engine: check
PORT_IRQ_STAT 0x40000008, and to clear
[ 120.937544] ata7: ahci_softreset *****
[ 120.937550] ahci 0000:0a:00.0: ahci_stop_engine: setting HBA to idle
(setting PxCMD.ST to 0)
[ 120.937555] ahci 0000:0a:00.0: ahci_start_engine: check
PORT_SCR_ERR 0x00000000, and to clear
[ 120.937559] ahci 0000:0a:00.0: ahci_start_engine: check
PORT_IRQ_STAT 0x00000000, and to clear
[ 121.097231] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 121.113493] ata7.00: configured for UDMA/133
[ 121.113570] ata7: EH complete

I have printk PxIS & PxCMD values to check pending interrupt.
It SBD FIS with ERR to setting PxIS.TFES and PxIS.SDBS bit.

>>
>>> >
>>> > Signed-off-by: Szuying Chen <Chloe_Chen@xxxxxxxxxxxxxx>
>>> > ---
>>> > drivers/ata/libahci.c | 12 ++++++++++++
>>> > 1 file changed, 12 insertions(+)
>>> >
>>> > diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
>>> > index 06aec35f88f2..0ae51fd95d46 100644
>>> > --- a/drivers/ata/libahci.c
>>> > +++ b/drivers/ata/libahci.c
>>> > @@ -679,9 +679,21 @@ static int ahci_scr_write(struct ata_link *link, unsigned int sc_reg, u32 val)
>>> >
>>> > void ahci_start_engine(struct ata_port *ap)
>>> > {
>>> > + struct ahci_host_priv *hpriv = ap->host->private_data;
>>> > void __iomem *port_mmio = ahci_port_base(ap);
>>> > u32 tmp;
>>> >
>>> > + /* clear SError */
>>> > + tmp = readl(port_mmio + PORT_SCR_ERR);
>>> > + writel(tmp, port_mmio + PORT_SCR_ERR);
>>> > +
>>> > + /* clear port IRQ */
>>> > + tmp = readl(port_mmio + PORT_IRQ_STAT);
>>> > + if (tmp)
>>> > + writel(tmp, port_mmio + PORT_IRQ_STAT);
>>> > +
>>> > + writel(1 << ap->port_no, hpriv->mmio + PORT_IRQ_STAT);
>>> > +
>>> > /* start DMA */
>>> > tmp = readl(port_mmio + PORT_CMD);
>>> > tmp |= PORT_CMD_START;
>>> > --
>>> > 2.39.2
>>> >
>>>
>>> --
>>> Damien Le Moal
>>> Western Digital Research
>>>
>> Thanks.
>>
>
>--
>Damien Le Moal
>Western Digital Research