Re: [PATCH v2] mmc: sdhci-pci-gli: GL975x: Mask rootport's replay timer timeout during suspend

From: Kai-Heng Feng
Date: Fri Jan 12 2024 - 00:15:11 EST


On Sat, Jan 6, 2024 at 5:19 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote:
> > Spamming `lspci -vv` can still observe the replay timer timeout error
> > even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the
> > replay timer timeout of AER"), albeit with a lower reproduce rate.
>
> I'm not sure what this is telling me. By "spamming `lspci -vv`, do
> you mean that if you run lspci continually, you still see Replay Timer
> Timeout logged, e.g.,
>
> CESta: ... Timeout+

Yes it's logged and the AER IRQ is raised.

>
> 015c9cbcf0ad uses hard-coded PCI_GLI_9750_CORRERR_MASK offset and
> PCI_GLI_9750_CORRERR_MASK_REPLAY_TIMER_TIMEOUT value, which look like
> they *could* be PCI_ERR_COR_MASK and PCI_ERR_COR_REP_TIMER, but
> without the lspci output I can't tell for sure. If they are, it would
> be nice to use the generic macros instead of defining new ones so it's
> easier to analyze PCI_ERR_COR_MASK usage.
>
> If 015c9cbcf0ad is updating the generic PCI_ERR_COR_MASK, it should
> only prevent sending ERR_COR. It should not affect the *logging* in
> PCI_ERR_COR_STATUS (see PCIe r6.0, sec 6.2.3.2.2), so it shouldn't
> affect the lspci output.

PCI_GLI_9750_CORRERR_MASK is specific to GLI 975x devices, so it
doesn't conform to generic PCI_ERR_COR_STATUS behavior.

The Timeout is masked with or without commit 015c9cbcf0ad:
CEMsk: ... Timeout+


>
> > Such AER interrupt can still prevent the system from suspending, so let
> > root port mask and unmask replay timer timeout during suspend and
> > resume, respectively.
>
> 015c9cbcf0ad looks like it masks PCI_ERR_COR_REP_TIMER in the gl975x
> Endpoint, while this patch masks it in the upstream bridge (which
> might be either a Root Port or a Switch Downstream Port, so the
> subject and this sentence are not quite right).

OK, will change it to upstream bridge in next revision.

>
> 015c9cbcf0ad says it is related to a hardware defect, and maybe this
> patch is also (mention it if so). Both patches can prevent ERR_COR
> messages and the eventual AER interrupt, depending on whether the
> Downstream Port or the Endpoint detects the Replay Timer Timeout.
> Maybe this should have a Fixes: tag for 015c9cbcf0ad to try to keep
> these together?

Sure. This patch is intend to cover more ground based on 015c9cbcf0ad.


>
> If 015c9cbcf0ad is actually updating PCI_ERR_COR_MASK, it would be
> nice to clean that up, too. And maybe PCI_ERR_COR_REP_TIMER should be
> masked/restored at the same place for both the Downstream Port and the
> Endpoint?

Since PCI_ERR_COR_REP_TIMER is already masked before 015c9cbcf0ad, so
I didn't think that's necessary.
Do you think it should still be masked just to be safe?

>
> > +#ifdef CONFIG_PCIEAER
> > +static void mask_replay_timer_timeout(struct pci_dev *pdev)
> > +{
> > + struct pci_dev *parent = pci_upstream_bridge(pdev);
> > + u32 val;
> > +
> > + if (!parent || !parent->aer_cap)
> > + return;
> > +
> > + pci_read_config_dword(parent, parent->aer_cap + PCI_ERR_COR_MASK, &val);
> > + val |= PCI_ERR_COR_REP_TIMER;
> > + pci_write_config_dword(parent, parent->aer_cap + PCI_ERR_COR_MASK, val);
> > +}
> > +
> > +static void unmask_replay_timer_timeout(struct pci_dev *pdev)
> > +{
> > + struct pci_dev *parent = pci_upstream_bridge(pdev);
> > + u32 val;
> > +
> > + if (!parent || !parent->aer_cap)
> > + return;
> > +
> > + pci_read_config_dword(pdev, parent->aer_cap + PCI_ERR_COR_MASK, &val);
> > + val &= ~PCI_ERR_COR_REP_TIMER;
>
> I think I would save the previous PCI_ERR_COR_REP_TIMER value and
> restore it here, so it is preserved if there is ever a generic
> interface via sysfs or similar to manage correctable error masking.

Makes sense, will do in next revision.

Kai-Heng

>
> > + pci_write_config_dword(pdev, parent->aer_cap + PCI_ERR_COR_MASK, val);
> > +}