Re: [PATCH v2] mmc: sdhci-pci-gli: GL975x: Mask rootport's replay timer timeout during suspend

From: Kai-Heng Feng
Date: Thu Jan 18 2024 - 01:41:23 EST


On Sat, Jan 13, 2024 at 1:37 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Fri, Jan 12, 2024 at 01:14:42PM +0800, Kai-Heng Feng wrote:
> > On Sat, Jan 6, 2024 at 5:19 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote:
> > > > Spamming `lspci -vv` can still observe the replay timer timeout error
> > > > even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the
> > > > replay timer timeout of AER"), albeit with a lower reproduce rate.
> > >
> > > I'm not sure what this is telling me. By "spamming `lspci -vv`, do
> > > you mean that if you run lspci continually, you still see Replay Timer
> > > Timeout logged, e.g.,
> > >
> > > CESta: ... Timeout+
> >
> > Yes it's logged and the AER IRQ is raised.
>
> IIUC the AER IRQ is the important thing.
>
> Neither 015c9cbcf0ad nor this patch affects logging in
> PCI_ERR_COR_STATUS, so the lspci output won't change and mentioning it
> here doesn't add useful information.

You are right. That's just a way to access config space to reproduce the issue.

>
> I'd suggest more specific wording than "spamming `lspci -vv`", e.g.,
>
> 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the replay timer
> timeout of AER") masks Replay Timer Timeout errors at the GL975x
> Endpoint. When the Endpoint detects these errors, it still logs
> them in its PCI_ERR_COR_STATUS, but masking prevents it from sending
> ERR_COR messages upstream.
>
> The Downstream Port leading to a GL975x Endpoint is unaffected by
> 015c9cbcf0ad. Previously, when that Port detected a Replay Timer
> Timeout, it sent an ERR_COR message upstream, which eventually
> caused an AER IRQ, which prevented the system from suspending.
>
> Mask Replay Timer Timeout errors at the Downstream Port. The errors
> will still be logged in PCI_ERR_COR_STATUS, but no ERR_COR will be
> sent.

That's phrased much better then mine :)

>
> > > 015c9cbcf0ad uses hard-coded PCI_GLI_9750_CORRERR_MASK offset and
> > > PCI_GLI_9750_CORRERR_MASK_REPLAY_TIMER_TIMEOUT value, which look like
> > > they *could* be PCI_ERR_COR_MASK and PCI_ERR_COR_REP_TIMER, but
> > > without the lspci output I can't tell for sure. If they are, it would
> > > be nice to use the generic macros instead of defining new ones so it's
> > > easier to analyze PCI_ERR_COR_MASK usage.
> > >
> > > If 015c9cbcf0ad is updating the generic PCI_ERR_COR_MASK, it should
> > > only prevent sending ERR_COR. It should not affect the *logging* in
> > > PCI_ERR_COR_STATUS (see PCIe r6.0, sec 6.2.3.2.2), so it shouldn't
> > > affect the lspci output.
> >
> > PCI_GLI_9750_CORRERR_MASK is specific to GLI 975x devices, so it
> > doesn't conform to generic PCI_ERR_COR_STATUS behavior.
>
> *Could* 015c9cbcf0ad have used the generic PCI_ERR_COR_MASK to
> accomplish the same effect? Is there an advantage to using the
> device-specific PCI_GLI_9750_CORRERR_MASK?
>
> If masking via PCI_ERR_COR_MASK would work, that would be much better
> because the PCI core can see, manage, and make that visible, e.g., via
> sysfs. The core doesn't do that today, but people are working on it.

I don't think so. Please see below.

>
> > > If 015c9cbcf0ad is actually updating PCI_ERR_COR_MASK, it would be
> > > nice to clean that up, too. And maybe PCI_ERR_COR_REP_TIMER should be
> > > masked/restored at the same place for both the Downstream Port and the
> > > Endpoint?
> >
> > Since PCI_ERR_COR_REP_TIMER is already masked before 015c9cbcf0ad,
> > so I didn't think that's necessary. Do you think it should still be
> > masked just to be safe?
>
> Did you mean "PCI_ERR_COR_REP_TIMER is already masked *by*
> 015c9cbcf0ad"?

No. The PCI_ERR_COR_REP_TIMER is masked with or without 015c9cbcf0ad.
That means before 015c9cbcf0ad, Reply Timeout error was reported with
PCI_ERR_COR_REP_TIMER masked.

So using PCI_GLI_9750_CORRERR_MASK is necessary for the endpoint.

>
> If masking PCI_ERR_COR_REP_TIMER using the generic PCI_ERR_COR_MASK in
> the GL975x would have the same effect as masking it with
> PCI_GLI_9750_CORRERR_MASK, then I think you should *only* use the
> generic PCI_ERR_COR_MASK.
>
> No need to do both if the generic one is sufficient. And I think both
> should be done in the same place since they're basically solving the
> same problem, just at both ends of the link.

Do you mean only mask PCI_GLI_9750_CORRERR_MASK during suspend?
That will not be ideal because accessing its config space (e.g. `lspci
-vv`) will have many errors logged.

Kai-Heng

>
> Bjorn