RE: [PATCH] x86/mce: Add workaround for SKX/CLX/CPX spurious machine checks

From: Luck, Tony
Date: Wed Feb 16 2022 - 13:42:08 EST


> Well, we could try to decode the instructions around rIP when the #MC
> is raised and see what caused the MCE and perhaps pick apart which insn
> caused it, is it accessing behind the buffer boundaries, etc.

Is this a case of "perfect is the enemy of good enough"?

It is a rare scenario (only a pain point for Jue because Google has billions and billions
of cores running this code). You need:

1) An uncorrected error
2) That error must be in the first cache line of a page
3) The kernel must execute page_copy from the page immediately before that page

When all three happen, the kernel crashes because we don't
have a recovery path from kernel page_copy.
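
To make conditions 2 and 3 concrete, here is a rough sketch of the address
check they imply. It is illustrative only and not code from the patch: the
helper name is hypothetical, 4 KiB pages and 64-byte cache lines are assumed,
and both addresses are assumed to be in the same address space.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE       4096u  /* assumption: 4 KiB pages */
#define CACHE_LINE_SIZE 64u    /* assumption: 64-byte cache lines */

/*
 * Hypothetical helper: returns true when err_addr falls in the first
 * cache line of the page immediately after the page containing src_addr,
 * i.e. the copy source is "the page immediately before" the bad line.
 */
static bool hits_first_line_of_next_page(uint64_t src_addr, uint64_t err_addr)
{
	/* Round the source down to its page base, then step to the next page. */
	uint64_t next_page = (src_addr & ~(uint64_t)(PAGE_SIZE - 1)) + PAGE_SIZE;

	return err_addr >= next_page && err_addr < next_page + CACHE_LINE_SIZE;
}
```

Only 64 bytes out of every 4096 satisfy the check, which is part of why the
scenario is so rare.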

-Tony