[PATCH v4] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

From: Zhiquan Li
Date: Mon Oct 23 2023 - 00:03:49 EST


Memory errors are rare, and fatal ones are rarer still. At large scale,
however, such as in data centers, they do still occur. When there is a
fatal machine check, Linux calls mce_panic() without checking whether
the machine check banks reported bad data at some memory address.

If a kexec crash kernel is loaded, check for a reported memory error
and mark the affected page as poisoned so that the kexec'ed kernel can
avoid accessing it.

Co-developed-by: Youquan Song <youquan.song@xxxxxxxxx>
Signed-off-by: Youquan Song <youquan.song@xxxxxxxxx>
Signed-off-by: Zhiquan Li <zhiquan1.li@xxxxxxxxx>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>

---

V3: https://lore.kernel.org/all/20231014051754.3759099-1-zhiquan1.li@xxxxxxxxx/

Changes since V3:
- Rebased to v6.6-rc7.
- Added a check for whether kexec is enabled, as highlighted by Boris.
- Rewrote the commit message as suggested by Tony.

V2: https://lore.kernel.org/all/20230914030539.1622477-1-zhiquan1.li@xxxxxxxxx/

Changes since V2:
- Rebased to v6.6-rc5.
- Explained full scenario in commit message per Boris's suggestion.
- Included Ingo's fixes.
Link: https://lore.kernel.org/all/ZRsUpM%2FXtPAE50Rm@xxxxxxxxx/

V1: https://lore.kernel.org/all/20230127015030.30074-1-tony.luck@xxxxxxxxx/

Changes since V1:
- Revised the commit message as per Naoya's suggestion.
- Replaced "TODO" comment in code with comments based on mailing list
discussion on the lack of value in covering other page types.
- Added the tag from Naoya.
Link: https://lore.kernel.org/all/20230327083739.GA956278@xxxxxxxxxxxxxxxxxxxxxxxxxxx/
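
For reviewers, a minimal userspace sketch of the address check that the
hunk below performs before setting the flag. It is not the kernel code:
struct mce_sketch and mce_sketch_bad_pfn() are hypothetical names used
only for illustration, and PAGE_SHIFT and MCI_STATUS_ADDRV are restated
with their x86 values so the snippet builds outside the kernel tree.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* x86 values, restated here so the sketch builds in userspace. */
#define PAGE_SHIFT        12
#define MCI_STATUS_ADDRV  (1ULL << 58)  /* MCi_ADDR holds a valid address */

/* Hypothetical stand-in for the two struct mce fields the check reads. */
struct mce_sketch {
	uint64_t status;
	uint64_t addr;
};

/*
 * Hypothetical helper: returns true and stores the page frame number when
 * the record exists and reports a valid address. The NULL test comes first
 * so that no field is read through a NULL pointer.
 */
static bool mce_sketch_bad_pfn(const struct mce_sketch *m, uint64_t *pfn)
{
	if (!m || !(m->status & MCI_STATUS_ADDRV))
		return false;

	*pfn = m->addr >> PAGE_SHIFT;
	return true;
}
```

In the kernel, the resulting PFN is then passed to pfn_to_online_page(),
which returns NULL when no online struct page backs that frame, so the
flag is only set on a page that actually exists.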
---
arch/x86/kernel/cpu/mce/core.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..930b1120009b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -44,6 +44,7 @@
#include <linux/sync_core.h>
#include <linux/task_work.h>
#include <linux/hardirq.h>
+#include <linux/kexec.h>

#include <asm/intel-family.h>
#include <asm/processor.h>
@@ -233,6 +234,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
struct llist_node *pending;
struct mce_evt_llist *l;
int apei_err = 0;
+ struct page *p;

/*
* Allow instrumentation around external facilities usage. Not that it
@@ -286,6 +288,19 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (!fake_panic) {
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
+ if (kexec_crash_loaded()) {
+ /*
+ * Kdump can skip a poisoned page so that the capture kernel never
+ * touches the bad memory again, but only if the PG_hwpoison page
+ * flag is set. For some fatal MCE cases there is no opportunity
+ * to queue a task that calls memory_failure(), so the flag stays
+ * clear and the capture kernel panics when it reads the page.
+ * Mark the page as poisoned here, before panic() is called.
+ */
+ p = final ? pfn_to_online_page(final->addr >> PAGE_SHIFT) : NULL;
+ if (p && (final->status & MCI_STATUS_ADDRV))
+ SetPageHWPoison(p);
+ }
panic(msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
--
2.25.1