Re: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging

From: Terry Bowman
Date: Mon Apr 24 2023 - 14:39:18 EST


Hi Dan,

I added comments inline below.

On 4/17/23 19:06, Dan Williams wrote:
> Terry Bowman wrote:
>> RCH downstream port error logging is missing in the current CXL driver. The
>> missing AER and RAS error logging is needed for communicating driver error
>> details to userspace. Update the driver to include PCIe AER and CXL RAS
>> error logging.
>>
>> Add RCH downstream port error handling into the existing RCiEP handler.
>> The downstream port error handler is added to the RCiEP error handler
>> because the downstream port is implemented in a RCRB, is not PCI
>> enumerable, and as a result is not directly accessible to the PCI AER
>> root port driver. The AER root port driver calls the RCiEP handler for
>> handling RCD errors and RCH downstream port protocol errors.
>>
>> Update mem.c to include RAS and AER setup. This includes AER and RAS
>> capability discovery and mapping for later use in the error handler.
>>
>> Disable RCH downstream port's root port cmd interrupts.[1]
>>
>> Update existing RCiEP correctable and uncorrectable handlers to also call
>> the RCH handler. The RCH handler will read the RCH AER registers, check for
>> error severity, and if an error exists will log using an existing kernel
>> AER trace routine. The RCH handler will also log downstream port RAS errors
>> if they exist.
>
> I think this patch wants a lead in refactoring to move the existing
> probe of the CXL RAS capability into the cxl_port driver so that the RCH
> path and the VH path can be unified for register mapping and error
> handling invocation. I do not see a compelling rationale to have 2
> separate ways to map the RAS capability. The timing of when
> cxl_setup_ras() is called looks problematic relative to when the first
> error handler callback might happen.
>

With respect to timing, I see this works for probing AER and RAS. Will it
work for caching the mapped AER and RAS addresses? I ask because the mapped
AER and RAS addresses are stored in cxlds and cxlds is created in cxl_pci
and isn't necessarily available during RCH dport discovery. RCH dport is
discovered within cxl_acpi context (beginning from cxl_acpi_probe()). Also,
port.c code shows cxlds is not typically used.

If you like I can change RCH RAS mapping to use cxl_map_component_regs()? This
was in cxl_rch_map_ras() to handle the RCH odd case for AER and RAS mapping.
The RAS can be moved out but RCH AER would still need to be mapped presumably still
in cxl_rch_map_ras().

> For example what happens when an error fires after cxl_pci has
> registered its error handlers, but before the component registers have
> been mapped out of the RCRB?
>

The RCiEP ISR would execute but the RCH AER and RAS would not be logged because
neither are mapped and are instead NULL. The AER and RAS register status would stay
resident and be logged in the next ISR entry.

> This implies the need for a callback for cxl_pci to notify the cxl_port
> driver of CXL errors to handle relative to a PCI AER event.

Along similar lines, could the RCH AER and RAS status be checked immediately after
mapping and logged if status is present?

Regards,
Terry