Re: [PATCH 2/2] RAS: Introduce the FRU Memory Poison Manager

From: Yazen Ghannam
Date: Wed Feb 14 2024 - 09:22:03 EST


On 2/14/2024 4:06 AM, Borislav Petkov wrote:
On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
Memory errors are an expected occurrence on systems with high memory
density. Generally, errors within a small number of unique physical
locations are acceptable, based on manufacturer and/or admin policy.
During run time, memory with errors may be retired so it is no longer
used by the system. This is done in the kernel memory manager, and the
effect will remain until the system is restarted.

If a memory location is consistently faulty, then the same run time
error handling may occur in the next reboot cycle. Running jobs may be
terminated due to previously known bad memory. This could be prevented
if information from the previous boot was not lost.

Some add-in cards with driver-managed memory have on-board persistent
storage. Their driver may save memory error information to the
persistent storage during run time. The information may then be restored
after reset, and known bad memory may be retired before use. A running
log of bad memory locations is kept across multiple resets.

Too many "may"s above, please tone them down.


Will try :)

A similar solution is desirable for CPUs. However, this solution should

GPUs you mean?


I mean CPUs. GPUs would fall under the "add-in" card scenario.

leverage industry-standard components, as much as possible, rather than
a bespoke platform driver.

Two components are needed: a record format and a persistent storage
interface.

A UEFI CPER "FRU Memory Poison Section" is being proposed, along with a
"Memory Poison Descriptor", to use for this purpose. These new structures
are minimal, saving space on limited non-volatile memory, and extensible.

CPER-aware persistent storage interfaces, like ACPI ERST and EFI Runtime
Variables, can be used. A new interface is not required.

I don't think stuff which is being proposed belongs here.


Do you mean this should be left out of the commit message?

Implement a new module to manage the record formats on persistent
storage. Use the requirements for an AMD MI300-based system to start.
Vendor- and platform-specific details can be abstracted later as needed.

This is a big diff so I'm splitting mails.


Okay.

Thanks,
Yazen