Re: [RFC PATCH 0/6] Add support for root port RAS error handling

From: Dan Williams
Date: Thu Mar 14 2024 - 21:45:54 EST


Li Ming wrote:
> Protocol errors signaled to a CXL root port may be captured by a Root
> Complex Event Collector (RCEC). If those errors are not cleared and
> reported, the system owner loses forensic information for system failure
> analysis.
>
> Per CXL r3.1 section 9.18.1.5, the specification's recommendation for
> this case is the 'Else' statement in the 'IMPLEMENTATION NOTE' under
> 'Table 9-24 RDPAS Structure':
>
> "Probe all CXL Downstream Ports and determine whether they have logged an
> error in the CXL.io or CXL.cachemem status registers."
>
> The CXL subsystem already supports RCH RAS Error handling that has a
> dependency on the RCEC. Reuse and extend that RCH topology support to
> handle reported errors in the VH topology case. The implementation is
> composed of:
> * Provide a new interface on the RCEC side to support walking all devices
> under the RCEC and the RCEC-associated bus range. The PCIe AER core uses
> this interface to walk all CXL endpoints and all CXL root ports under the
> bus ranges.
> * Update the PCIe AER core to enable Uncorrectable Internal Error and
> Correctable Internal Error reporting for root ports.

Thanks for the above background.

> * Invoke the cxl_pci error handler for RCEC reported errors.

So what do you expect to happen when a switch is involved? In the RCH case
the handler knows that the only thing that can fire the RCEC is a root
complex integrated endpoint implementation driven by cxl_pci. In the VH
case the error source could be a switch port.

> * Handle root-port errors in the cxl_pci handler when the device is
> direct attached.

I do expect direct-attach to be a predominant use case, but I want to
make sure that the implementation at least does not make the switch port
error handling case more difficult to implement.
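
To make the concern concrete, here is a rough userspace sketch of the kind
of dispatch I have in mind. It is a model only, not kernel code: every
name in it (model_dev, model_rcec_walk(), model_dispatch(), and so on) is
hypothetical and invented for illustration, and it assumes the walk can
tell a switch port apart from a root port or endpoint. The idea is that
devices cxl_pci can service get handled and cleared, while switch ports
are counted and left untouched so a future port-level handler still sees
the logged status.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; none of these are kernel definitions. */
enum model_port_type {
	MODEL_RCIEP,		/* RCH: root complex integrated endpoint */
	MODEL_ROOT_PORT,	/* VH: CXL root port */
	MODEL_SWITCH_PORT,	/* VH: switch upstream or downstream port */
	MODEL_ENDPOINT,		/* VH: endpoint below a root or switch port */
};

struct model_dev {
	uint8_t bus;
	enum model_port_type type;
	uint32_t ras_status;	/* nonzero means an error is logged */
};

struct model_stats {
	unsigned int cxl_pci_handled;	/* serviced by the endpoint-driver path */
	unsigned int switch_deferred;	/* left for a future port handler */
};

/* Route one device with a logged error to the appropriate path. */
static void model_dispatch(struct model_dev *dev, struct model_stats *st)
{
	switch (dev->type) {
	case MODEL_SWITCH_PORT:
		/* Do not clear: preserve the status for a port-level handler. */
		st->switch_deferred++;
		break;
	default:
		/* RCiEP, direct-attached root port, or endpoint. */
		st->cxl_pci_handled++;
		dev->ras_status = 0;
		break;
	}
}

/* Walk every device in the RCEC-associated bus range [first, last]. */
void model_rcec_walk(struct model_dev *devs, size_t n,
		     uint8_t first, uint8_t last, struct model_stats *st)
{
	assert(first <= last);

	for (size_t i = 0; i < n; i++) {
		struct model_dev *dev = &devs[i];

		if (dev->bus < first || dev->bus > last)
			continue;	/* outside the associated bus range */
		if (!dev->ras_status)
			continue;	/* nothing logged on this device */
		model_dispatch(dev, st);
	}
}
```

The only point of the sketch is that the type check sits in one place, so
adding switch port handling later means filling in one case rather than
reworking the walk.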