[PATCH 4/4] RAS/fmp: Add Documentation on Persistence of FRU memory poisons

From: Muralidhara M K
Date: Wed Nov 29 2023 - 02:51:30 EST


From: Muralidhara M K <muralidhara.mk@xxxxxxx>

On data center servers with on-chip HBM3 memory, FRU identification
needs a mechanism that persists bad page information in non-volatile
storage across reboots. Reading the records back during boot makes it
possible to check the number of poisoned pages.

Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@xxxxxxx>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@xxxxxxx>
Co-developed-by: Sathya Priya Kumar <sathyapriya.k@xxxxxxx>
Signed-off-by: Sathya Priya Kumar <sathyapriya.k@xxxxxxx>
Signed-off-by: Muralidhara M K <muralidhara.mk@xxxxxxx>
---
Documentation/RAS/ras.rst | 122 ++++++++++++++++++++++++++++++++++++++
1 file changed, 122 insertions(+)

diff --git a/Documentation/RAS/ras.rst b/Documentation/RAS/ras.rst
index 2556b397cd27..2f86bf02655a 100644
--- a/Documentation/RAS/ras.rst
+++ b/Documentation/RAS/ras.rst
@@ -24,3 +24,125 @@ Also, the user can pass particular family and model to decode the error
string::

$ rasdaemon -p --status <STATUS> --ipid <IPID> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM>
+
+=============================================================
+Persist FRU (Field Replaceable Unit) Memory Poison
+=============================================================
+
+Large-scale data center processors such as the MI300A have on-chip stacked
+High Bandwidth Memory v3 (HBM3).
+ - Example: MI300A has 8 stacks of HBM/die, a total of 128GB per socket.
+The host operating system is responsible for memory management, including allocating HBM3 pages.
+
+Many memory errors tend to be persistent or intermittent and may recur. Upon
+reaching a certain threshold of these errors, the specific memory area is deemed
+faulty and should be replaced. In the case of on-die High Bandwidth Memory (HBM),
+any returns due to these issues will likely be directed to the socket vendor.
+
+Define criteria to identify the Field Replaceable Unit (FRU) by evaluating the
+count of "poisoned" pages within the socket, and log these poisoned pages persistently
+in non-volatile storage. This process retains information about defective
+memory pages within the socket for potential replacement.
+
+Linux supports retiring pages by marking the page HW_POISON. However, it doesn't
+persist these marked pages across reboots.
+To address this, a potential solution is to persist bad page details in non-volatile
+storage (ERST). This prevents compromised memory regions from being reused
+after a reboot.
+
+ERST to persist Bad page information
+====================================
+
+ERST (Error Record Serialization Table), defined by ACPI/APEI, provides a mechanism for
+storing hardware error information in, and retrieving it from, persistent storage.
+
+Platform firmware (BIOS) with ERST support reserves space for ERST records, usually 64KB,
+in non-volatile (NV) storage. Configure Linux to select ERST as the backend for pstore (read/write from NV storage).
+
+Upon specific MCE errors, Linux calls pstore with a CPER-format record per FRU, and platform
+firmware stores it in NV storage. On the next boot, Linux queries the bad page information
+from ERST and retires the pages again.
+
+FRU memory poison Common Platform Error Record (CPER) definition
+================================================================
+
+There is one CPER per FRU, identified by its Protected Processor Inventory Number (PPIN).
+That is one CPER record per MI300A socket, so a system with four MI300As has four CPERs,
+each containing the poison list offset for the given PPIN.
+
+The FRU poison CPER record size is (BIOS ERST memory) / (Number of FRUs).
+Each erst_write() or erst_read() will write/read this entire structure as one record.
+
+The maximum number of poison entries that fit in one record is given by
+"(size - sizeof(struct cper_poison_record)) / sizeof(struct cper_fru_poison_data)"
+
+The FRU poison CPER used to store the error record is defined as follows:
+
+ struct cper_poison_record {
+     struct cper_record_header hdr;
+     struct cper_section_descriptor sec_hdr;
+     struct cper_sec_fru_mem_poisons fmpl;
+ } __packed;
+
+Use 'struct cper_record_header' and 'struct cper_section_descriptor' as defined
+in 'include/linux/cper.h'.
+
+ * Section body follows the description of a “non-standard section body” and is defined below.
+
+ * Per-FRU poison section data
+ struct cper_sec_fru_mem_poisons {
+     char signature[4];
+     u64 checksum;
+     u32 model_id_type;
+     u32 model_id;
+     u32 fru_id_type;
+     u64 fru_id;
+     u32 poison_count;
+     u64 p_list_off;   /* offset to the contiguous array of poison data entries */
+ };
+
+ * FRU poison data structure
+ struct cper_fru_poison_data {
+     u32 hw_id_type;
+     u32 addr_type;
+     u64 hw_id;
+     u64 addr;
+ };
+
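The entry-count formula above can be checked with a small user-space sketch. It is not part of the patch. It assumes that all three structures are packed, that the generic CPER record header and section descriptor occupy 128 and 72 bytes (their sizes in the UEFI CPER layout), and that the 64KB ERST area is split evenly between four FRUs. The names `fru_mem_poisons`, `fru_poison_data`, `fmp_record_size()` and `fmp_max_entries()` are hypothetical stand-ins, not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes of the generic CPER headers (UEFI CPER layout); in the
 * kernel these would come from the structs in include/linux/cper.h. */
#define CPER_RECORD_HEADER_SIZE      128u
#define CPER_SECTION_DESCRIPTOR_SIZE  72u

/* User-space stand-in for struct cper_sec_fru_mem_poisons (assumed packed). */
struct fru_mem_poisons {
    char     signature[4];
    uint64_t checksum;
    uint32_t model_id_type;
    uint32_t model_id;
    uint32_t fru_id_type;
    uint64_t fru_id;
    uint32_t poison_count;
    uint64_t p_list_off;
} __attribute__((packed));

/* User-space stand-in for struct cper_fru_poison_data (assumed packed). */
struct fru_poison_data {
    uint32_t hw_id_type;
    uint32_t addr_type;
    uint64_t hw_id;
    uint64_t addr;
} __attribute__((packed));

/* Record size: (BIOS ERST memory) / (number of FRUs). */
static inline size_t fmp_record_size(size_t erst_bytes, unsigned int nr_frus)
{
    return nr_frus ? erst_bytes / nr_frus : 0;
}

/* Entries that fit after the fixed part of the record. */
static inline size_t fmp_max_entries(size_t record_size)
{
    size_t fixed = CPER_RECORD_HEADER_SIZE + CPER_SECTION_DESCRIPTOR_SIZE +
                   sizeof(struct fru_mem_poisons);

    if (record_size < fixed)
        return 0;
    return (record_size - fixed) / sizeof(struct fru_poison_data);
}
```

Under these assumptions the fixed part is 128 + 72 + 44 = 244 bytes and each entry is 24 bytes, so a 16KB record (64KB / 4 FRUs) holds (16384 - 244) / 24 = 672 entries. If the structures are not packed, the sizes and the resulting count differ.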
+
+Implementation Notes on FRU Identification
+==========================================
+
+ * HBM is assumed to have a total of 8 DRAM rows.
+ * When an MCE error occurs, offline all the pages in that range in the affected row (8 columns per row).
+ If all 8 rows become bad, the entire socket has to be replaced.
+ * Persist the error information described in "struct cper_fru_poison_data" to ERST storage.
+
+ * Don't delete the FMP records once they are saved in the persistent store. Keep them in ERST
+ permanently, appending entries until all the poison_data entries are full.
+ * Once the entries are full, do not save further error information to ERST.
+
+At OS boot
+==========
+ * One CPER per FRU (Protected Processor Inventory Number (PPIN)) has been created.
+ * The size of each CPER will not exceed one quarter of the available space.
+ * The node controller should make sure there is a CPER for each PPIN in the node. If this is a
+ new processor that has never been seen before, create a CPER with N=0.
+ * Read the CPERs through the Error Record Serialization Table (ERST).
+ * If the OS matches a PPIN to a socket and identifies an MCE address, it re-creates the SPA for all
+ pages on the HBM row of the poisoned DA and retires all pages mapped to that row.
+ * If a CPER is found for a PPIN that is not in the node, the OS prints a warning.
+ * If the OS tries to persist more errors than fit in the CPER, it refuses to update the CPER
+ and prints a message.
+ * The OS creates a sysfs file for each FRU_ID listing the retired DRAM addresses and their MCA_IPID.
+ $ ls /sys/devices/system/edac/mc/mc0/fmpl
+ * Example: mc0 for socket 0 and mc3 for socket 3.
+ * To read the CPER record information at any time while the system is up, use:
+ $ cat /sys/devices/system/edac/mc/mc<socket_index>/fmpl
+ $ dmesg
+
+At Mission mode
+===============
+ * A notifier is registered to handle the FRU memory poison errors.
+ * When an error is injected on a particular PPIN and the OS matches the MCE PPIN to the PPIN
+ of a socket in the system, it appends the poison data until the maximum number of poison entries is reached.
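The append-until-full behaviour described above can be sketched in user space as follows. It is an illustration only, not the code from this series. `struct fmp_record`, `fmp_append()` and the choice of `-ENODEV` / `-ENOSPC` as return codes are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in for struct cper_fru_poison_data. */
struct fru_poison_data {
    uint32_t hw_id_type;
    uint32_t addr_type;
    uint64_t hw_id;
    uint64_t addr;
};

/* Hypothetical in-memory view of one per-FRU record. */
struct fmp_record {
    uint64_t fru_id;                  /* PPIN of the socket this record belongs to */
    uint32_t poison_count;            /* entries currently stored */
    uint32_t max_entries;             /* capacity derived from the record size */
    struct fru_poison_data *entries;  /* contiguous poison list */
};

/*
 * Append one poison entry if the MCE PPIN matches this record and there is
 * still room. Existing entries are never removed or overwritten.
 */
static inline int fmp_append(struct fmp_record *rec, uint64_t mce_ppin,
                             const struct fru_poison_data *entry)
{
    if (rec->fru_id != mce_ppin)
        return -ENODEV;   /* error belongs to a different FRU */
    if (rec->poison_count >= rec->max_entries)
        return -ENOSPC;   /* record is full: refuse the update */

    rec->entries[rec->poison_count++] = *entry;
    return 0;
}
```

A caller would write the updated record back to ERST only when `fmp_append()` returns 0, and print a message when it returns `-ENOSPC`, matching the "refuse to update and print a message" rule in the boot notes.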
--
2.25.1