Re: Question: Clearing error bits in the root port post enumeration

From: Vidya Sagar
Date: Mon Nov 06 2023 - 22:15:11 EST




On 11/3/2023 11:50 PM, Bjorn Helgaas wrote:
External email: Use caution opening links or attachments


On Tue, Oct 31, 2023 at 12:26:31PM +0000, Vidya Sagar wrote:
Hi folks,

I would like to know your comments on the following scenario where
we are observing the root port logging errors because of the
enumeration flow being followed.

DUT information:
- Has a root port and an endpoint connected to it
- Uses ECAM mechanism to access the configuration space
- Booted through ACPI flow
- Has a Firmware-First approach for handling the errors
- System is configured to treat Unsupported Requests as
AdvisoryNon-Fatal errors

As we all know, when a configuration read request comes in for a
device number that is not implemented, a UR would be returned as per
the PCIe spec.

As part of the enumeration flow on DUT, when the kernel reads offset
0x0 of B:D:F=0:0:0, the root port responds with its valid Vendor-ID
and Device-ID values. But, when B:D:F=0:1:0 is probed, since there
is no device present there, the root port responds with an
Unsupported Request and simultaneously logs the same in the Device
Status register (i.e. bit-3). Because of it, there is a UR logged
in the Device Status register of the RP by the time enumeration is
complete.

In the case of AER capability natively owned by the kernel, the AER
driver's init call would clear all such pending bits.

Since we are going with the Firmware-First approach, and the system
is configured to treat Unsupported Requests as AdvisoryNon-Fatal
errors, only a correctable error interrupt can be raised to the
Firmware which takes care of clearing the corresponding status
registers. The firmware can't know about the UnsupReq bit being set
as the interrupt it received is for a correctable error hence it
clears only bits related to correctable error.

All these events leave a freshly booted system with the following
bits set.

Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR- (MAbort)
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend- (UnsupReq)
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- (UnsupReq)

Since the reason for UR is well understood at this point, I would
like to weigh in on the idea of clearing the aforementioned bits in
the root port once the enumeration is done particularly to cater to
the configurations where Firmware-First approach is in place.
Please let me know your comments on this approach.

I think Secondary status (PCI_SEC_STATUS) is always owned by the OS
and is not affected by _OSC negotiation, right? Linux does basically
nothing with that today, but I think it *could* clear the "Received
Master Abort" bit.
Yes. PCI_SEC_STATUS is always owned by the OS and _OSC negotiation doesn't really affect that.


I'm not very familiar with Advisory Non-Fatal errors. I'm curious
about the UESta situation: why can't firmware know about UnsupReq
being set? I assume PCI_ERR_COR_ADV_NFAT is the Correctable Error
Status bit the firmware *does* see and clear.
Yes, PCI_ERR_COR_ADV_NFAT is indeed cleared by the firmware.

But isn't the whole point of Advisory Non-Fatal errors that an error
that is logged as an Uncorrectable Error and that normally would be
signaled with ERR_NONFATAL is signaled with ERR_COR instead? So
doesn't PCI_ERR_COR_ADV_NFAT being set imply that some
PCI_ERR_UNCOR_STATUS must be set as well? If so, I would think
firmware *could* figure that out and clear the PCI_ERR_UNCOR_STATUS
bit.
So, are you suggesting that let the firmware only clear the PCI_ERR_UNCOR_STATUS also? if so, then, I can even make the firmware clear the PCI_SEC_STATUS also thereby leaving the firmware responsible for clearing all the error bits. Does that sound ok?

Bjorn