Re: [PATCH 2/2] RAS: Introduce the FRU Memory Poison Manager

From: Yazen Ghannam
Date: Wed Feb 14 2024 - 09:29:06 EST


On 2/14/2024 4:28 AM, Borislav Petkov wrote:
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
+config RAS_FMPM
+ tristate "FRU Memory Poison Manager"
+ default m
+ depends on X86_MCE

I know this is generic-ish but it needs to be enabled only on AMD for
now. Whoever wants it somewhere else, then whoever needs to test it
there first and then enable it there.


Ack.
+ imply AMD_ATL
+ help
+ Support saving and restoring memory error information across reboot
+ cycles using ACPI ERST as persistent storage. Error information is

s/cycles//


Ack.
+ saved with the UEFI CPER "FRU Memory Poison" section format.
+
+ Memory may be retired during boot time and run time depending on

s/may/is/

Please check all your text - too many "may"s for something which is not
a vendor doc. :)


Ack.
+ platform-specific policies.
+
endif
diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
index 3fac80f58005..11f95d59d397 100644
--- a/drivers/ras/Makefile
+++ b/drivers/ras/Makefile
@@ -3,4 +3,5 @@ obj-$(CONFIG_RAS) += ras.o
obj-$(CONFIG_DEBUG_FS) += debugfs.o
obj-$(CONFIG_RAS_CEC) += cec.o
+obj-$(CONFIG_RAS_FMPM) += amd/fmpm.o
obj-y += amd/atl/
diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
new file mode 100644
index 000000000000..077d9f35cc7d
--- /dev/null
+++ b/drivers/ras/amd/fmpm.c
@@ -0,0 +1,776 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * FRU (Field-Replaceable Unit) Memory Poison Manager
+ *
+ * Copyright (c) 2024, Advanced Micro Devices, Inc.
+ * All Rights Reserved.
+ *
+ * Authors:
+ * Naveen Krishna Chatradhi <naveenkrishna.chatradhi@xxxxxxx>
+ * Muralidhara M K <muralidhara.mk@xxxxxxx>
+ * Yazen Ghannam <Yazen.Ghannam@xxxxxxx>
+ *
+ * Implementation notes, assumptions, and limitations:
+ *
+ * - FRU Memory Poison Section and Memory Poison Descriptor definitions are not yet
+ * included in the UEFI specification. So they are defined here. Afterwards, they
+ * may be moved to linux/cper.h, if appropriate.
+ *
+ * - Platforms based on AMD MI300 systems will be the first to use these structures.
+ * There are a number of assumptions made here that will need to be generalized
+ * to support other platforms.
+ *
+ * AMD MI300-based platform(s) assumptions:
+ * - Memory errors are reported through x86 MCA.
+ * - The entire DRAM row containing a memory error should be retired.
+ * - There will be (1) FRU Memory Poison Section per CPER.
+ * - The FRU will be the CPU Package (Processor Socket).
+ * - The default number of Memory Poison Descriptor entries should be (8).
+ * - The Platform will use ACPI ERST for persistent storage.
+ * - All FRU records should be saved to persistent storage. Module init will
+ * fail if any FRU record is not successfully written.

Please drop all that capitalized spelling.


For which parts? The acronyms, structure names, or general things like package/socket?
Or all the above?
+ * - Source code will be under 'drivers/ras/amd/' unless and until there is interest
+ * to use this module for other vendors.

This is not needed.


Ack.
+ * - Boot time memory retirement may occur later than ideal due to dependencies
+ * on other libraries and drivers. This leaves a gap where bad memory may be
+ * accessed during early boot stages.
+ *
+ * - Enough memory should be pre-allocated for each FRU record to be able to hold
+ * the expected number of descriptor entries. This, mostly empty, record is
+ * written to storage during init time. Subsequent writes to the same record
+ * should allow the Platform to update the stored record in-place. Otherwise,
+ * if the record is extended, then the Platform may need to perform costly memory
+ * management operations on the storage. For example, the Platform may spend time
+ * in Firmware copying and invalidating memory on a relatively slow SPI ROM.

That's a good thing to have here.


Okay.
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cper.h>
+#include <linux/ras.h>
+
+#include <acpi/apei.h>
+
+#include <asm/cpu_device_id.h>
+#include <asm/mce.h>
+
+#pragma pack(1)

Is that some ugly thing to avoid adding __packed annotation to the
structure definitions below?


Yes, but __packed could be used too.

"GCC supports several types of pragmas, primarily in order to compile
code originally written for other compilers. Note that in general we do
not recommend the use of pragmas; See Declaring Attributes of Functions,
for further explanation. "

Oh, that 1 is something else:

-fpack-struct[=n]

Without a value specified, pack all structure members together
without holes. When a value is specified (which must be a small
power of two), pack structure members according to this value,
representing the maximum alignment (that is, objects with default
alignment requirements larger than this are output potentially
unaligned at the next fitting location.

So do I understand it correctly that struct members should be aligned to
2^1 bytes?


Yes, no padding and no reordering too, I think.
Grepping the tree, this looks like something BIOS does...


The BIOS does do this on its side. We need to make sure to do it on the kernel
side.

See <linux/cper.h> and <acpi/actbl1.h> for examples.
Thanks,
Yazen