[RFC PATCH 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

From: Shuai Xue
Date: Tue Dec 06 2022 - 10:34:18 EST


There are two major types of uncorrected errors (UC):

- Action Required: The error is detected and the processor has already
consumed the corrupted memory. The OS must take action (for example,
offline the failing page or kill the failing thread) to recover from
this uncorrectable error.

- Action Optional: The error is detected outside of the processor's
execution context. Some data in memory is corrupted, but it has not
been consumed. The OS may optionally take action to recover from this
uncorrectable error.

On x86 platforms, we can easily distinguish between these two types based
on the MCA bank. On arm64 platforms, however, the memory failure flags for
all UCs whose severity is GHES_SEV_RECOVERABLE are currently set to 0,
a.k.a. Action Optional. Set the memory failure flags to MF_ACTION_REQUIRED
for synchronous events.

Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
---
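For reviewers, the flag selection after this patch can be sketched as a
standalone function. This is only an illustration of the decision logic in
ghes_handle_memory_failure(), not the kernel code itself: pick_mf_flags()
and the SEV_*/SEC_*/FLAG_* names are hypothetical, and their numeric values
are stand-ins (only 0x0100 for the sync bit is taken from this patch).

```c
/*
 * Illustrative stand-ins; the kernel headers are authoritative for the
 * real GHES_SEV_*, CPER_SEC_* and MF_* definitions.
 */
enum { SEV_CORRECTED = 1, SEV_RECOVERABLE = 2 };

#define SEC_ERROR_THRESHOLD_EXCEEDED	0x0008	/* assumed value */
#define SEC_SYNC			0x0100	/* CPER_SEC_SYNC from this patch */
#define FLAG_ACTION_REQUIRED		(1 << 1)	/* assumed value */
#define FLAG_SOFT_OFFLINE		(1 << 3)	/* assumed value */

/*
 * Hypothetical helper mirroring the decision made in
 * ghes_handle_memory_failure(). Returns -1 when the event is not handled,
 * otherwise the flags that would be passed on to memory failure handling.
 */
int pick_mf_flags(int sev, int sec_sev, unsigned int sec_flags)
{
	int flags = -1;

	/* Corrected error that crossed its threshold: soft-offline the page. */
	if (sec_sev == SEV_CORRECTED &&
	    (sec_flags & SEC_ERROR_THRESHOLD_EXCEEDED))
		flags = FLAG_SOFT_OFFLINE;

	/*
	 * Recoverable uncorrected error: synchronous events become Action
	 * Required, everything else stays Action Optional (flags == 0).
	 */
	if (sev == SEV_RECOVERABLE && sec_sev == SEV_RECOVERABLE)
		flags = (sec_flags & SEC_SYNC) ? FLAG_ACTION_REQUIRED : 0;

	return flags;
}
```

With these stand-ins, a recoverable event carrying the sync bit yields
FLAG_ACTION_REQUIRED, the same event without the bit yields 0, and a
corrected event below its threshold yields -1 (not handled).
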
drivers/acpi/apei/ghes.c | 2 +-
include/linux/cper.h | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9952f3a792ba..a420759fce2d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -475,7 +475,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
- flags = 0;
+ flags = (gdata->flags & CPER_SEC_SYNC) ? MF_ACTION_REQUIRED : 0;

if (flags != -1)
return ghes_do_memory_failure(mem_err->physical_addr, flags);
diff --git a/include/linux/cper.h b/include/linux/cper.h
index eacb7dd7b3af..a3571fa8a73d 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -144,6 +144,28 @@ enum {
* corrective action before the data is consumed
*/
#define CPER_SEC_LATENT_ERROR 0x0020
+/*
+ * If set, the section is to be associated with an error that has been
+ * propagated due to hardware poisoning. This implies the error is a symptom of
+ * another error. It is not always possible to ascertain whether this is the
+ * case for an error, therefore if the flag is not set, it is unknown whether
+ * the error was propagated. This helps determine the FRU when dealing with HW
+ * failures.
+ */
+#define CPER_SEC_PROPAGATED 0x0040
+/*
+ * If set, this flag indicates the firmware has detected an overflow of
+ * buffers/queues that are used to accumulate, collect, or report errors (e.g.
+ * the error status control block exposed to the OS). When this occurs, some
+ * error records may be lost.
+ */
+#define CPER_SEC_OVERFLOW 0x0080
+/*
+ * If set, this event record is synchronous (e.g. a CPU core consumes poisoned
+ * data, which then causes an instruction/data abort); if not set, this event
+ * record is asynchronous.
+ */
+#define CPER_SEC_SYNC 0x0100

/*
* Section type definitions, used in section_type field in struct
--
2.20.1.12.g72788fdb