Re: [RFC PATCH v2 0/9] Use ERST for persistent storage of MCE and APEI errors

From: Shuai Xue
Date: Sat Oct 07 2023 - 03:16:01 EST




On 2023/9/28 22:43, Borislav Petkov wrote:
> On Mon, Sep 25, 2023 at 03:44:17PM +0800, Shuai Xue wrote:
>> After /dev/mcelog character device deprecated by commit 5de97c9f6d85
>> ("x86/mce: Factor out and deprecate the /dev/mcelog driver"), the
>> serialized MCE error record, of previous boot in persistent storage is not
>> collected via APEI ERST.
>
> You lost me here. /dev/mcelog is deprecated but you can still use it and
> apei_write_mce() still happens.

Yes, you are right. apei_write_mce() still happens so that MCE records are
written to persistent storage and the MCE records can be retrieved by
apei_read_mce(). Previously, the task was performed by the mcelog package.
However, it has been deprecated, some distributions like Arch kernels are
not even compiled with the necessary configuration option
CONFIG_X86_MCELOG_LEGACY.[1]

So, IMHO, it's better to add a way to retrieve MCE records through switching
to the new generation rasdaemon solution.

>
> Looking at your patches, you're adding this to ghes so how about you sit
> down first and explain your exact use case and what exactly you wanna
> do?
>
> Thx.
>

Sorry for the poor cover letter. I hope the following response can clarify
the matter.

Q1: What is the exact problem?

Traditionally, fatal hardware errors will cause Linux print error log to
console, e.g. print_mce() or __ghes_print_estatus(), then reboot. With
Linux, the primary method for obtaining debugging information of a serious
error or fault is via the kdump mechanism. Kdump captures a wealth of
kernel and machine state and writes it to a file for post-mortem debugging.

In certain scenarios, ie. hosts/guests with root filesystems on NFS/iSCSI
where networking software and/or hardware fails, and thus kdump fails to
collect the hardware error context, leaving us unaware of what actually
occurred. In the public cloud scenario, multiple virtual machines run on a
single physical server, and if that server experiences a failure, it can
potentially impact multiple tenants. It is crucial for us to thoroughly
analyze the root causes of each instance failure in order to:

- Provide customers with a detailed explanation of the outage to reassure them.
- Collect the characteristics of the failures, such as ECC syndrome, to enable fault prediction.
- Explore potential solutions to prevent widespread outages.

In short, it is necessary to serialize hardware error information available
for post-mortem debugging.

Q2: What exactly I wanna do:

The MCE handler, do_machine_check(), saves the MCE record to persistent
storage and it is retrieved by mcelog. Mcelog has been deprecated when
kernel 4.12 released in 2017, and the help of the configuration option
CONFIG_X86_MCELOG_LEGACY suggest to consider switching to the new
generation rasdaemon solution. The GHES handler does not support APEI error
record now.

To serialize hardware error information available for post-mortem
debugging:
- add support to save APEI error record into flash via ERST before go panic,
- add support to retrieve MCE or APEI error record from the flash and emit
the related tracepoint after system boot successful again so that rasdaemon
can collect them


Best Regards,
Shuai


[1] https://wiki.archlinux.org/title/Machine-check_exception