Re: Re: [PATCH] ahci: libahci: clear pending interrupt status

From: Niklas Cassel
Date: Tue Aug 22 2023 - 04:41:30 EST


On Mon, Aug 14, 2023 at 12:17:20PM +0800, Szuying Chen wrote:
> On 8/10/23 17:59, Damien Le Moal wrote:
> >On 8/10/23 18:31, Szuying Chen wrote:
> >>> On 8/10/23 14:12, Damien Le Moal wrote:
> >>> On 8/10/23 14:05, Szuying Chen wrote:
> >>> > When ISR handle interface fatal error with error recovery after clear PxIS
> >>> > and PxIE. Another FIS(SDB FIS with err) that set PxIS.IFS to 1 is recevied
> >>> > during error recovery, which causing the HBA not issue any new commands
> >>> > after cmd.ST set 1.
> >>>
> >>> This is not very clear. If there was a fatal error, the drive should be in
> >>> error state and no other SDB FIS can be received as the drive does absolutely
> >>> nothing while in error state (it only waits for a read log 10h command to be>
> >>> issued to get it out of error state). So if you are seeing 2 SDB FIS with
> >>> errors one after the other, you have a buggy device...
> >>>
> >>> However, I may be misunderstanding your issue here. Could you provide more
> >>> details and a dmesg output example of the issue ?
> >>
> >> According to AHCI 1.3.1 specification ch6.1.9, when an R_ERR is received
> >> on an H2D Data FIS in normal operation, the HBA sets PxIS.IFS to 1
> >> (fatal error) and halts operation. Referring to SATA 3.0 specification we
> >> know the device will halt queued command processing and transmit SDB FIS to
> >> host with ERR bit in Status field set to one(set PxIS.TFES to 1).
> >
> >Sure, but that SBD FIS should be completely ignored by the adapter since it
> >stopped operation. If you see it, then it means that the handling of the first
> >error was incomplete.
> >
> The SDB FIS happens between pxIS.IE was cleared and PxCMD.ST not cleared
> to 0. Because the PxCMD.ST still setting, so the SDB FIS can't ignored.
>
> >> In our case, the ISR handles fatal errors(PxIS.IFS) and enters error
> >> recovery after cleaning up PxIS and PxIE. Then a SDB FIS is received
> >> with interrupt bit(PxIS.TFES) set to 1. According to AHCI 1.3.1
> >> specification ch6.2.2, HBA can't issue(cmd.ST set to 1) any new commands
> >> under PxIS.TFES alive during error recovery.
> >
> >But how come you see a new command being issued ? This entire situation should
> >result in a port reset with the first error. I do not see how this is possible.
> >Are you saying that the port reset is not cleaning things up properly ? Could
> >you share the dmesg output of this case ?
> >
> >
> dmesg:
> [ 654.244958] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 654.460522] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 656.180375] ata7.00: exception Emask 0x10 SAct 0x7c0 SErr 0x200100

Hello Szuying Chen,

From your logs we can see that SErr 0x200100:
This means that PxSERR.DIAG.C was set.

This means that:
"Calculated Different CRC than Received: When this occurs, the HBA returns
R_ERR and returns to idle."

(Since PxIS.IFS is set, we know that it was a D2H Data FIS that had invalid
CRC. (If it was a D2H non-Data FIS PxIS.INFS would have been set instead.))


So your commit message is wrong. In your case it is the HBA that sends
a R_ERR to the device, so you should look at "6.1.10 CRC error in received
FIS" and not "6.1.9 Device responds to FIS with R_ERR".

"When a CRC error occurs on a D2H Data FIS, the HBA sets PxIS.IFS to ‘1’.
After a CRC error on a D2H Data FIS, the device may send a D2H Register FIS
to indicate the reason for the error. The host should allow posting of
non-Data FISes before PxCMD.ST is cleared to ‘0’."

Looking at the state machine "5.3.11.2 DR:Receive" you will jump to state
ERR:Fatal. (Meaning that we need to clear PxCMD.ST in order to issue new
commands.)


The HBA should this trigger an error irq when noticing the bad CRC.

It is possible that the HBA triggers another error IRQ when receiving
a D2H with TFES bit set, if it is received before PxCMD.ST is cleared
to 0, see NDR:Accept:
FIS Type is D2H Register and PxCMD.ST=0, jump to P:RegFISUpdate, which
does not trigger any IRQ.
If PxCMD.ST is set, the state machine will instead jump to RegFIS:Entry.


So, in order to understand what is going on, can you please add
unconditional prints in ahci_error_intr(), so we can actually see
how many times this is called. (Right now it seems to only be called
once). Please also print when PxCMD.ST is cleared to 0.
(So that we can verify that ahci_error_intr() is never called after
that). Also please remove all prints from other drivers, e.g. nouveau.


Kind regards,
Niklas



> action 0x6 frozen
> [ 656.180384] ata7.00: irq_stat 0x08000000, interface fatal error
> [ 656.180386] ata7: SError: { UnrecovData BadCRC }
> [ 656.180389] ata7.00: failed command: READ FPDMA QUEUED
> [ 656.180390] ata7.00: cmd 60/00:30:28:a9:9c/0a:00:01:00:00/40 tag 6 ncq
> dma 1310720 in
> res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 656.180395] ata7.00: status: { DRDY }
> [ 656.180397] ata7.00: failed command: READ FPDMA QUEUED
> [ 656.180398] ata7.00: cmd 60/00:38:28:b3:9c/0a:00:01:00:00/40 tag 7 ncq
> dma 1310720 in
> res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 656.180402] ata7.00: status: { DRDY }
> [ 656.180403] ata7.00: failed command: READ FPDMA QUEUED
> [ 656.180404] ata7.00: cmd 60/00:40:28:bd:9c/0a:00:01:00:00/40 tag 8 ncq
> dma 1310720 in
> res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 656.180408] ata7.00: status: { DRDY }
> [ 656.180410] ata7.00: failed command: READ FPDMA QUEUED
> [ 656.180411] ata7.00: cmd 60/00:48:28:c7:9c/0a:00:01:00:00/40 tag 9 ncq
> dma 1310720 in
> res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 656.180414] ata7.00: status: { DRDY }
> [ 656.180416] ata7.00: failed command: READ FPDMA QUEUED
> [ 656.180417] ata7.00: cmd 60/00:50:28:d1:9c/0a:00:01:00:00/40 tag 10 ncq
> dma 1310720 in
> res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 656.180421] ata7.00: status: { DRDY }
> [ 656.180423] ata7: hard resetting link
> [ 659.210328] amd_iommu_report_page_fault: 48 callbacks suppressed
> [ 659.210333] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 659.210345] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 659.210352] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 659.811069] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 659.811080] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 659.811086] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 659.827669] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 659.827676] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 659.827680] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 659.878319] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 664.426805] amd_iommu_report_page_fault: 248 callbacks suppressed
> [ 664.426809] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 664.426819] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 664.426825] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 664.676294] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 664.676305] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 664.676313] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 664.693501] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 664.693513] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 664.693521] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 664.711480] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 666.180398] ata7: softreset failed (1st FIS failed)
> [ 666.180405] ata7: hard resetting link
> [ 670.054987] amd_iommu_report_page_fault: 99 callbacks suppressed
> [ 670.054992] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 670.055003] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
> [ 670.055010] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
> domain=0x0019 address=0x0 flags=0x0000]
>
>
> dmesg (add clear pending Interrupt):
> [ 120.236847] ahci_error_intr: interface fatal error
> [ 120.236854] ahci_error_intr: freeze (set PxIE to 0)
> [ 120.265268] ata7: limiting SATA link speed to 3.0 Gbps
> [ 120.265275] ata7.00: exception Emask 0x10 SAct 0x3f0 SErr 0x400000
> action 0x6 frozen
> [ 120.265279] ata7.00: irq_stat 0x08000000, interface fatal error
> [ 120.265281] ata7: SError: { Handshk }
> [ 120.265285] ata7.00: failed command: WRITE FPDMA QUEUED
> [ 120.265287] ata7.00: cmd 61/00:20:a8:b9:50/0a:00:00:00:00/40 tag 4 ncq
> dma 1310720 ou
> res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 120.265295] ata7.00: status: { DRDY }
> [ 120.265297] ata7.00: failed command: WRITE FPDMA QUEUED
> [ 120.265299] ata7.00: cmd 61/20:28:a8:c3:50/03:00:00:00:00/40 tag 5 ncq
> dma 409600 out
> res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 120.265305] ata7.00: status: { DRDY }
> [ 120.265307] ata7.00: failed command: WRITE FPDMA QUEUED
> [ 120.265309] ata7.00: cmd 61/00:30:c8:c6:50/0a:00:00:00:00/40 tag 6 ncq
> dma 1310720 ou
> res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 120.265314] ata7.00: status: { DRDY }
> [ 120.265316] ata7.00: failed command: WRITE FPDMA QUEUED
> [ 120.265318] ata7.00: cmd 61/00:38:c8:d0:50/0a:00:00:00:00/40 tag 7 ncq
> dma 1310720 ou
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 120.265324] ata7.00: status: { DRDY }
> [ 120.265325] ata7.00: failed command: WRITE FPDMA QUEUED
> [ 120.265327] ata7.00: cmd 61/00:40:c8:da:50/0a:00:00:00:00/40 tag 8 ncq
> dma 1310720 ou
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 120.265333] ata7.00: status: { DRDY }
> [ 120.265334] ata7.00: failed command: WRITE FPDMA QUEUED
> [ 120.265336] ata7.00: cmd 61/00:48:c8:e4:50/0a:00:00:00:00/40 tag 9 ncq
> dma 1310720 ou
> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
> [ 120.265342] ata7.00: status: { DRDY }
> [ 120.265346] ata7: hard resetting link
> [ 120.265348] ata7: ahci_hardreset *****
> [ 120.265352] ahci 0000:0a:00.0: ahci_stop_engine: PORT_IRQ_STAT 0x40000008
> PORT_CMD 0x0004c617
> [ 120.265356] ahci 0000:0a:00.0: ahci_stop_engine: setting HBA to idle
> (setting PxCMD.ST to 0)
> [ 120.937522] ahci 0000:0a:00.0: ahci_start_engine: check
> PORT_SCR_ERR 0x00000000, and to clear
> [ 120.937534] ahci 0000:0a:00.0: ahci_start_engine: check
> PORT_IRQ_STAT 0x40000008, and to clear
> [ 120.937544] ata7: ahci_softreset *****
> [ 120.937550] ahci 0000:0a:00.0: ahci_stop_engine: setting HBA to idle
> (setting PxCMD.ST to 0)
> [ 120.937555] ahci 0000:0a:00.0: ahci_start_engine: check
> PORT_SCR_ERR 0x00000000, and to clear
> [ 120.937559] ahci 0000:0a:00.0: ahci_start_engine: check
> PORT_IRQ_STAT 0x00000000, and to clear
> [ 121.097231] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
> [ 121.113493] ata7.00: configured for UDMA/133
> [ 121.113570] ata7: EH complete
>
> I have printk PxIS & PxCMD values to check pending interrupt.
> It SBD FIS with ERR to setting PxIS.TFES and PxIS.SDBS bit.
>
> >>
> >>> >
> >>> > Signed-off-by: Szuying Chen <Chloe_Chen@xxxxxxxxxxxxxx>
> >>> > ---
> >>> > drivers/ata/libahci.c | 12 ++++++++++++
> >>> > 1 file changed, 12 insertions(+)
> >>> >
> >>> > diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
> >>> > index 06aec35f88f2..0ae51fd95d46 100644
> >>> > --- a/drivers/ata/libahci.c
> >>> > +++ b/drivers/ata/libahci.c
> >>> > @@ -679,9 +679,21 @@ static int ahci_scr_write(struct ata_link *link, unsigned int sc_reg, u32 val)
> >>> >
> >>> > void ahci_start_engine(struct ata_port *ap)
> >>> > {
> >>> > + struct ahci_host_priv *hpriv = ap->host->private_data;
> >>> > void __iomem *port_mmio = ahci_port_base(ap);
> >>> > u32 tmp;
> >>> >
> >>> > + /* clear SError */
> >>> > + tmp = readl(port_mmio + PORT_SCR_ERR);
> >>> > + writel(tmp, port_mmio + PORT_SCR_ERR);
> >>> > +
> >>> > + /* clear port IRQ */
> >>> > + tmp = readl(port_mmio + PORT_IRQ_STAT);
> >>> > + if (tmp)
> >>> > + writel(tmp, port_mmio + PORT_IRQ_STAT);
> >>> > +
> >>> > + writel(1 << ap->port_no, hpriv->mmio + PORT_IRQ_STAT);
> >>> > +
> >>> > /* start DMA */
> >>> > tmp = readl(port_mmio + PORT_CMD);
> >>> > tmp |= PORT_CMD_START;
> >>> > --
> >>> > 2.39.2
> >>> >
> >>>
> >>> --
> >>> Damien Le Moal
> >>> Western Digital Research
> >>>
> >> Thanks.
> >>
> >
> >--
> >Damien Le Moal
> >Western Digital Research