[RFC PATCH 0/9] Use ERST for persistent storage of MCE and APEI errors

From: Shuai Xue
Date: Sat Sep 16 2023 - 09:04:37 EST


In certain scenarios (e.g. hosts/guests with root filesystems on NFS/iSCSI
where networking software and/or hardware fails, and thus kdump fails), it
is necessary to serialize hardware error information so that it is
available for post-mortem debugging. Saving the hardware error log into
flash via ERST before panicking allows the log to be retrieved from flash
after the system boots successfully again, which is very useful in
production.

On x86 platforms, the kernel has supported serializing and deserializing
MCE error records since commit 482908b49ebf ("ACPI, APEI, Use ERST for
persistent storage of MCE"). The process involves two steps:

- MCE Producer: When a hardware error is detected, an MCE is raised and its
  handler writes the MCE error record into flash via ERST before panic.
- MCE Consumer: After the system reboots, /sbin/mcelog runs and reads
  /dev/mcelog to check flash, via ERST, for error records from the previous
  boot.
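
The two steps above can be illustrated with a small userspace model. This is
only a sketch under stated assumptions: the in-memory store[] array stands in
for the ERST-backed flash, and store_write() / store_read_and_clear() are
hypothetical names invented for this example. In the kernel, the
corresponding MCE entry points are apei_write_mce(), apei_read_mce() and
apei_clear_mce().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ERST-backed flash: a few fixed-size slots. */
#define STORE_SLOTS 4
#define PAYLOAD_MAX 64

struct err_record {
	uint64_t record_id;            /* 0 marks an empty slot */
	uint32_t payload_len;
	uint8_t  payload[PAYLOAD_MAX];
};

static struct err_record store[STORE_SLOTS];
static uint64_t next_id = 1;

/*
 * Producer side: called by the error handler before panic.
 * Returns the new record id, or 0 if the record cannot be stored.
 */
uint64_t store_write(const void *payload, uint32_t len)
{
	if (len > PAYLOAD_MAX)
		return 0;

	for (int i = 0; i < STORE_SLOTS; i++) {
		if (store[i].record_id == 0) {
			store[i].record_id = next_id++;
			store[i].payload_len = len;
			memcpy(store[i].payload, payload, len);
			return store[i].record_id;
		}
	}
	return 0; /* store is full */
}

/*
 * Consumer side: called after reboot. Copies the first saved record into
 * *out and clears its slot so it is not reported twice.
 * Returns 1 if a record was found, 0 if the store is empty.
 */
int store_read_and_clear(struct err_record *out)
{
	for (int i = 0; i < STORE_SLOTS; i++) {
		if (store[i].record_id != 0) {
			*out = store[i];
			memset(&store[i], 0, sizeof(store[i]));
			return 1;
		}
	}
	return 0;
}
```

The consumer clears each record after reading it, so a record left by the
previous boot is reported once and the slot becomes free for later errors.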

After the /dev/mcelog character device was deprecated by commit
5de97c9f6d85 ("x86/mce: Factor out and deprecate the /dev/mcelog driver"),
the serialized MCE error record of the previous boot in persistent storage
is no longer collected via APEI ERST.

This patch set consists of two parts:

- PATCH 1-3: rework apei_{read,write}_mce to use the pstore data structure
  and emit the mce_record tracepoint, enabling the collection of MCE records
  by the rasdaemon tool.
- PATCH 4-9: use ERST for persistent storage of APEI errors, and emit
  tracepoints for CPER sections, enabling the collection of these records by
  the rasdaemon tool.
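
As a rough illustration of how the two parts fit together on the read side,
the sketch below dispatches a saved record to a reporting path based on its
section type. It is a reading of the patch titles in this series, not the
actual implementation: enum section_kind and report_path() are hypothetical,
and the enum values stand in for the CPER section-type GUIDs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tags standing in for CPER section-type GUIDs. */
enum section_kind {
	SEC_MCE,
	SEC_MEMORY,
	SEC_PCIE,
	SEC_ARM_PROC,
	SEC_UNKNOWN,
};

/*
 * Map a saved record's section type to the reporting path named in the
 * patch titles. Unrecognized sections are skipped in this model.
 */
const char *report_path(enum section_kind kind)
{
	switch (kind) {
	case SEC_MCE:
		return "mce_record tracepoint";
	case SEC_MEMORY:
		return "ghes_report_chain notifier";
	case SEC_PCIE:
		return "AER print";
	case SEC_ARM_PROC:
		return "ARM processor error log";
	default:
		return "skipped";
	}
}
```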

Shuai Xue (9):
  pstore: move pstore creator id, section type and record struct to
    common header
  ACPI: APEI: Use common ERST struct to read/write serialized MCE record
  ACPI: APEI: ERST: Emit the mce_record tracepoint
  ACPI: tables: change section_type of generic error data as guid_t
  ACPI: APEI: GHES: Use ERST to serialize APEI generic error before
    panic
  ACPI: APEI: GHES: export ghes_report_chain
  ACPI: APEI: ESRT: kick ghes_report_chain notifier to report serialized
    memory errors
  ACPI: APEI: ESRT: print AER to report serialized PCIe errors
  ACPI: APEI: ESRT: log ARM processor error

arch/x86/kernel/cpu/mce/apei.c | 82 +++++++++++++++-------------------
drivers/acpi/acpi_extlog.c | 2 +-
drivers/acpi/apei/erst.c | 51 ++++++++++++---------
drivers/acpi/apei/ghes.c | 48 +++++++++++++++++++-
drivers/firmware/efi/cper.c | 2 +-
fs/pstore/platform.c | 3 ++
include/acpi/actbl1.h | 5 ++-
include/acpi/ghes.h | 2 +-
include/linux/pstore.h | 29 ++++++++++++
9 files changed, 150 insertions(+), 74 deletions(-)

--
2.41.0