Re: [PATCH] acpi: apei: call into AER handling regardless of severity

From: Baicar, Tyler
Date: Wed Aug 30 2017 - 11:42:18 EST


On 8/30/2017 9:31 AM, Sinan Kaya wrote:
On 8/30/2017 11:16 AM, Borislav Petkov wrote:
On Wed, Aug 30, 2017 at 10:05:44AM -0400, Sinan Kaya wrote:
Link reset is not the only recovery mechanism. In the case of nonfatal
errors, it is assumed that the endpoint CSR is still reachable.
Error is propagated the PCIe endpoint driver. Endpoint driver does a
re-initialization, we are back in business.
I'm assuming that's broadcast_error_message()'s job.

That's right. Each driver provides an err_handler hook. broadcast function
calls these.

static struct pci_driver e1000_driver = {
..
.err_handler = &e1000_err_handler
};

struct pci_error_handlers {
...
pci_ers_result_t (*error_detected)(struct pci_dev *dev,
enum pci_channel_state error);
}


That's not true. The GHES code is changing the severity here before posting
to the AER driver in ghes_do_proc().

if (gdata->flags & CPER_SEC_RESET)
aer_severity = AER_FATAL;
You're missing the point that we would walk into that if branch *only* for

if (sev == GHES_SEV_RECOVERABLE &&
sec_sev == GHES_SEV_RECOVERABLE

severities. So if you have an AER_FATAL error but ghes severities are
not GHES_SEV_RECOVERABLE, nothing happens.
I see. We should probably try to do something only if GHES_SEV_CORRECTED or
GHES_SEV_RECOVERABLE.

If somebody wants to crash the system with GHES_SEV_PANIC, there is no point
in doing additional work.
See below.
No, AER ISR is not set up if firmware first is enabled.
So then this is a major suckage. We do AER recovery on FF systems only
for GHES_SEV_RECOVERABLE severity.

The behavior should match non firmware-first case ideally.

1. Print all correctable errors.
2. Go to do_recovery for all uncorrectable errors including fatal and
non-fatal.

This is also what AER driver does in the absence of firmware first via
handle_error_source().
Yes, that makes sense.

Which would mean that we'd call aer_recover_queue() regardless of GHES
severity but we'd do recovery only if GHES_SEV_RECOVERABLE is set
or CPER_SEC_RESET. I.e., we can communicate all that by setting the
correct AER severity before calling aer_recover_queue(). And then call
do_recovery() based on AER severity.

Hmmm?

Sounds good. Do you still want to do PCIe recovery in the case of
GHES_SEV_PANIC or if some FW returns GHES_SEV_NO?

We do not need to worry about the GHES_SEV_PANIC case. Those get sent to __ghes_panic() in ghes_proc() without even making it to ghes_do_proc(). Those errors are just printed and then the kernel panics.

I think with my two patches we will have the desired functionality:

GHES_SEV_CORRECTABLE -> AER_CORRECTABLE -> Print AER info, but do not call do_recovery

GHES_SEV_RECOVERABLE -> AER_NONFATAL -> Print AER info and do_recovery

GHES_RECOVERABLE and CPER_SEC_RESET -> AER_FATAL -> Print AER info and do_recover

Thanks,
Tyler

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.