Re: [PATCH] x86/mce: Check that memory address is usable for recovery

From: Tony Luck
Date: Tue Apr 18 2023 - 13:27:57 EST


On Tue, Apr 18, 2023 at 12:41:17PM -0400, Yazen Ghannam wrote:
> On 3/21/23 20:51, Tony Luck wrote:
> > uc_decode_notifier() includes a check that "struct mce"
> > contains a valid address for recovery. But the machine
> > check recovery code does not include a similar check.
> >
> > Use mce_usable_address() to check that there is a valid
> > address.
> >
> > Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> > ---
> > arch/x86/kernel/cpu/mce/core.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index 2eec60f50057..fa28b3f7d945 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -1533,7 +1533,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
> >  		/* If this triggers there is no way to recover. Die hard. */
> >  		BUG_ON(!on_thread_stack() || !user_mode(regs));
> >
> > -		if (kill_current_task)
> > +		if (kill_current_task || !mce_usable_address(&m))
> >  			queue_task_work(&m, msg, kill_me_now);
> >  		else
> >  			queue_task_work(&m, msg, kill_me_maybe);
>
> I think it should be like this:
>
> 	if (mce_usable_address(&m))
> 		queue_task_work(&m, msg, kill_me_maybe);
> 	else
> 		queue_task_work(&m, msg, kill_me_now);
>
> A usable address should always go through memory_failure() so that the page is
> marked as poison. If !RIPV, then memory_failure() will get the MF_MUST_KILL
> flag and try to kill all processes after the page is poisoned.
>
> I had a similar patch a while back:
> https://lore.kernel.org/linux-edac/20210504174712.27675-3-Yazen.Ghannam@xxxxxxx/
>
> We could also get rid of kill_me_now() like you had suggested.

Can we also get rid of "kill_current_task"? It is only set when there is
no valid return address:

	if (!(m.mcgstatus & MCG_STATUS_RIPV))
		kill_current_task = 1;

kill_me_maybe() does an equivalent check:

	if (!p->mce_ripv)
		flags |= MF_MUST_KILL;
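
To make the equivalence concrete, here is a small user-space model of the two snippets. It is only a sketch: "struct mce_model" and the model_*() helpers are made up for illustration, the #define values are stand-ins rather than copies of the kernel headers, and it assumes that p->mce_ripv is simply the RIPV bit of mcgstatus saved when the task work is queued.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in values for this sketch; the kernel headers are authoritative. */
#define MCG_STATUS_RIPV		(1ULL << 0)
#define MF_ACTION_REQUIRED	(1 << 1)
#define MF_MUST_KILL		(1 << 2)

/* Hypothetical, reduced stand-in for the one field of struct mce used here. */
struct mce_model {
	uint64_t mcgstatus;
};

/* Mirrors the do_machine_check() snippet: the flag is set when RIPV is clear. */
static bool model_kill_current_task(const struct mce_model *m)
{
	return !(m->mcgstatus & MCG_STATUS_RIPV);
}

/*
 * Mirrors the kill_me_maybe() snippet. Assumption: mce_ripv is just the
 * RIPV bit that was recorded when the work was queued.
 */
static int model_memory_failure_flags(const struct mce_model *m)
{
	bool mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
	int flags = MF_ACTION_REQUIRED;

	if (!mce_ripv)
		flags |= MF_MUST_KILL;

	return flags;
}
```

In this model MF_MUST_KILL is set exactly when model_kill_current_task() returns true, which is the reason the separate flag looks redundant.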

That leaves only this check to resolve in some way:

	if (worst != MCE_AR_SEVERITY && !kill_current_task)
		goto out;
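
One possible way to resolve it, untested and offered only as a sketch, is to test RIPV directly in place of the flag. The user-space model below compares the two forms. The enum and both helper names are hypothetical (only MCE_AR_SEVERITY and MCG_STATUS_RIPV come from the code above), and the bit value is a stand-in.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in value for this sketch; the kernel headers are authoritative. */
#define MCG_STATUS_RIPV	(1ULL << 0)

/* Hypothetical severity grades; only the name MCE_AR_SEVERITY is real. */
enum model_severity { MODEL_OTHER_SEVERITY, MCE_AR_SEVERITY };

/* The check as written above, with the flag computed separately. */
static bool skip_with_flag(enum model_severity worst, uint64_t mcgstatus)
{
	int kill_current_task = 0;

	if (!(mcgstatus & MCG_STATUS_RIPV))
		kill_current_task = 1;

	return worst != MCE_AR_SEVERITY && !kill_current_task;
}

/* The same check with the flag folded away and RIPV tested directly. */
static bool skip_without_flag(enum model_severity worst, uint64_t mcgstatus)
{
	return worst != MCE_AR_SEVERITY && (mcgstatus & MCG_STATUS_RIPV);
}
```

Both helpers return the same answer for every combination of severity and RIPV, so the substitution should not change when the "goto out" is taken. Whether that is the cleanest form for the real code is a separate question.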

-Tony