RE: Questions: Should kernel panic when PCIe fatal error occurs?

From: David Laight
Date: Mon Sep 25 2023 - 04:07:52 EST


From: Shuai Xue
> Sent: 25 September 2023 02:44
>
> On 2023/9/21 21:20, David Laight wrote:
> > ...
> > I've got a target to generate AER errors by generating read cycles
> > that are inside the address range that the bridge forwards but
> > outside of any BAR because there are 2 different sized BARs.
> > (Pretty easy to setup.)
> > On the system I was using they didn't get propagated all the way
> > to the root bridge - but were visible in the lower bridge.
>
> So how did you observe it? If the error message does not propagate
> to the root bridge, I think no AER interrupt will be triggered.

I looked at the internal registers (IIRC in PCIe config space)
of the intermediate bridge.
I don't think the root bridge on that system supported AER.
(I was testing the generation of AER indications by our fpga.)

>
> > It would be nice for a driver to be able to detect/clear such
> > a flag if it gets an unexpected ~0u read value.
> > (I'm not sure an error callback helps.)
>
> IMHO, a general model is that an error detected at an endpoint should be
> routed to the upstream port (for example, an RCiEP routes its error message
> to the RCEC), so that the AER port service can handle the error; the device
> driver only has to implement the error handler callback.

The problem is that an error callback is too late for something
triggered by a PCIe read.
The driver has to detect that the value is 'dubious' and wants
a method of detecting whether there was an associated AER (or other)
error.
If the AER indication is routed through some external entity (like
board management hardware) there will be additional latency that
means that the associated interrupt (even if an NMI) may not have
been processed when the driver code is trying to determine what
happened.
This can only be made worse by the interrupt coming in on a
different cpu.
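
As a rough sketch of the synchronous check described above (not any
existing kernel API): when a read returns all-ones, look at the
Uncorrectable Error Status of the bridge above the device right away,
instead of waiting for an interrupt. The bit values match the AER
register layout (PCI_ERR_UNC_* in pci_regs.h); struct bridge_aer and
the three functions are hypothetical stand-ins for real config space
accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits of the AER Uncorrectable Error Status register. */
#define AER_UNC_SURPRISE_DOWN 0x00000020u
#define AER_UNC_COMP_TIMEOUT  0x00004000u
#define AER_UNC_COMP_ABORT    0x00008000u
#define AER_UNC_UNSUP_REQ     0x00100000u

/* Errors that can plausibly explain a read returning all-ones. */
#define AER_UNC_READ_FAILURE_MASK \
	(AER_UNC_SURPRISE_DOWN | AER_UNC_COMP_TIMEOUT | \
	 AER_UNC_COMP_ABORT | AER_UNC_UNSUP_REQ)

enum read_verdict {
	READ_OK,                /* value is not all-ones */
	READ_ALL_ONES_NO_ERROR, /* all-ones, but no error latched */
	READ_FAILED             /* all-ones and an error was latched */
};

/* Hypothetical stand-in for the bridge's AER capability. */
struct bridge_aer {
	uint32_t uncor_status;
};

static uint32_t bridge_read_uncor_status(const struct bridge_aer *br)
{
	return br->uncor_status;
}

/* The real register is write-1-to-clear; model that on the mock. */
static void bridge_clear_uncor_status(struct bridge_aer *br, uint32_t bits)
{
	br->uncor_status &= ~bits;
}

/* Classify a 32-bit read, clearing any error bits that explain it. */
static enum read_verdict check_read(uint32_t value, struct bridge_aer *br)
{
	uint32_t status;

	assert(br != NULL);

	if (value != UINT32_MAX)
		return READ_OK;

	status = bridge_read_uncor_status(br) & AER_UNC_READ_FAILURE_MASK;
	if (status == 0)
		return READ_ALL_ONES_NO_ERROR;

	bridge_clear_uncor_status(br, status);
	return READ_FAILED;
}
```

A driver would call something like check_read() only on the slow path,
after seeing ~0u, so the extra config cycle costs nothing in the
normal case.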

> > OTOH a 'nebs compliant' server routed any kind of PCIe link error
> > through to some 'system management' logic that then raised an NMI.
> > I'm not sure who thought an NMI was a good idea - they are pretty
> > impossible to handle in the kernel and too late to be of use to
> > the code performing the access.
>
> I think it is the responsibility of the device to prevent the spread of
> errors while reporting that errors have been detected. For example, drop
> the current, (drain submit queue) and report error in completion record.

Eh?
I can generate two types of PCIe error:
- Read/write requests for addresses that aren't inside a BAR.
- Link failures that cause retraining and might need config
  space reconfiguring.
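
To illustrate the first case: a bridge forwards its whole memory
window, so two BARs of different sizes can leave addresses that are
forwarded but claimed by nothing. A minimal sketch of that test
follows; struct range and both helpers are hypothetical, and any
addresses fed to it are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A [base, base + size) address range. */
struct range {
	uint64_t base;
	uint64_t size;
};

static bool in_range(const struct range *r, uint64_t addr)
{
	return addr >= r->base && addr - r->base < r->size;
}

/*
 * True when the bridge forwards addr downstream (it lies inside the
 * bridge window) but no BAR behind the bridge claims it.
 */
static bool forwarded_but_unclaimed(const struct range *window,
				    const struct range *bars, size_t nbars,
				    uint64_t addr)
{
	size_t i;

	if (!in_range(window, addr))
		return false;

	for (i = 0; i < nbars; i++) {
		if (in_range(&bars[i], addr))
			return false;
	}
	return true;
}
```

With, say, a 2MB window holding a 1MB BAR followed by a 64KB BAR,
everything after the end of the 64KB BAR is forwarded but unclaimed.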

> Both NMI and MSI are asynchronous interrupts.

Indeed, which makes neither of them suitable for any indication
relating to a bus cycle failure.

> > In any case we were getting one after 'echo 1 >xxx/remove' and
> > then taking the PCIe link down by reprogramming the fpga.
> > So the link going down was entirely expected, but there seemed
> > to be nothing we could do to stop the kernel crashing.
> >
> > I'm sure 'nebs compliant' ought to contain some requirements for
> > resilience to hardware failures!
>
> How did the kernel crash after a link down? Did the system detect a
> surprise down error?

It was a couple of years ago.
IIRC the 'link down' caused the hub to generate an AER error.
The root hub forwarded it to some 'board management hardware/software'
that then raised an NMI.
The kernel crashed because of an unexpected NMI.

David
