Re: [PATCH RESEND] x86/entry: Don't write to CR3 when restoring to kernel CR3

From: Andy Lutomirski
Date: Mon Sep 18 2023 - 23:28:30 EST


On Thu, Aug 17, 2023, at 5:15 AM, Brendan Jackman wrote:
> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>
> Skip resuming KERNEL pages since it is already KERNEL CR3
>
> Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Brendan Jackman <jackmanb@xxxxxxxxxx>
> ---
>
> While staring at paranoid_exit I was confused about why we had this CR3
> write, avoiding it seems like a free optimisation. The original commit
> 21e94459110252 ("x86/mm: Optimize RESTORE_CR3") says "Most NMI/paranoid
> exceptions will not in fact change pagetables" but I didn't't understand
> what the "most" was referring to. I then discovered this patch on the
> mailing list, Andy said[1] that it looks correct so maybe now is the
> time to merge it?

I did?

Looking at the link, I think I was saying that the opposite patch (*always* flush) looked okay.

>
> Note there's another patch in [1] as well, the benefit of that one is
> not obvious to me though.
>
> We've tested an equivalent patch in our internal kernel.
>
> [1] https://lore.kernel.org/lkml/20200526043507.51977-3-laijs@xxxxxxxxxxxxxxxxx/
> -- >8 --
> arch/x86/entry/calling.h | 13 ++++---------
> 1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index f6907627172b..b2458685d56e 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -236,14 +236,13 @@ For 32-bit we have the following conventions -
> kernel is built with
> .macro RESTORE_CR3 scratch_reg:req save_reg:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>
> - ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
> -
> /*
> - * KERNEL pages can always resume with NOFLUSH as we do
> - * explicit flushes.
> + * Skip resuming KERNEL pages since it is already KERNEL CR3.
> */
> bt $PTI_USER_PGTABLE_BIT, \save_reg
> - jnc .Lnoflush_\@
> + jnc .Lend_\@

I don't get it. How do you know that the actual loaded CR3 is correct?

I'm willing to believe that there is some constraint in the way the kernel works such that every paranoid entry will, as part of its execution, switch CR3 to kernel *and leave it like that* *and that this will be the _same_ kernel CR3 that was saved*. But I'm not really convinced that's true. (Can we schedule in a paranoid entry? Probably not. What about the weird NMI paths? What if we do something that switches to init mm? Hmm -- doing that in a paranoid context is probably not a brilliant idea.)

Maybe it is true, and maybe a convincing argument could be made. But that seems like a lot of thinking and fragility for an optimization that seems pretty minor.

--Andy