Re: [PATCH RESEND] x86/entry: Don't write to CR3 when restoring to kernel CR3

From: Brendan Jackman
Date: Wed Aug 23 2023 - 14:43:18 EST


Oops, just noticed I didn't put +Peter in the recipients.

On Thu, 17 Aug 2023 at 05:15, Brendan Jackman <jackmanb@xxxxxxxxxx> wrote:
>
> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>
> Skip resuming KERNEL pages since it is already KERNEL CR3
>
> Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Brendan Jackman <jackmanb@xxxxxxxxxx>
> ---
>
> While staring at paranoid_exit I was confused about why we had this CR3
> write, avoiding it seems like a free optimisation. The original commit
> 21e94459110252 ("x86/mm: Optimize RESTORE_CR3") says "Most NMI/paranoid
> exceptions will not in fact change pagetables" but I didn't't understand
> what the "most" was referring to. I then discovered this patch on the
> mailing list, Andy said[1] that it looks correct so maybe now is the
> time to merge it?
>
> Note there's another patch in [1] as well, the benefit of that one is
> not obvious to me though.

Also expanding on this a bit: the main purpose of this code is to
switch back to the user address space after handling one of these
"paranoid" exceptions. When one of those exceptions interrupts kernel
mode, we didn't switch from user to kernel address space so the
restore isn't needed.

There's no other reason to change CR3 here; context switches don't
happen in these exceptions but even if they did we would restore CR3
from the returning context_switch path. In fact in that case doing it
in paranoid_exit would potentially use the wrong PCID if it had been
reallocated by choose_new_asid. (And on the other side of the coin if
the restore was needed, it would be needed under !KPTI too).

> We've tested an equivalent patch in our internal kernel.
>
> [1] https://lore.kernel.org/lkml/20200526043507.51977-3-laijs@xxxxxxxxxxxxxxxxx/
> -- >8 --
> arch/x86/entry/calling.h | 13 ++++---------
> 1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index f6907627172b..b2458685d56e 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -236,14 +236,13 @@ For 32-bit we have the following conventions - kernel is built with
> .macro RESTORE_CR3 scratch_reg:req save_reg:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>
> - ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
> -
> /*
> - * KERNEL pages can always resume with NOFLUSH as we do
> - * explicit flushes.
> + * Skip resuming KERNEL pages since it is already KERNEL CR3.
> */
> bt $PTI_USER_PGTABLE_BIT, \save_reg
> - jnc .Lnoflush_\@
> + jnc .Lend_\@
> +
> + ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
>
> /*
> * Check if there's a pending flush for the user ASID we're
> @@ -261,10 +260,6 @@ For 32-bit we have the following conventions - kernel is built with
> SET_NOFLUSH_BIT \save_reg
>
> .Lwrcr3_\@:
> - /*
> - * The CR3 write could be avoided when not changing its value,
> - * but would require a CR3 read *and* a scratch register.
> - */
> movq \save_reg, %cr3
> .Lend_\@:
> .endm
> --
> 2.41.0.694.ge786442a9b-goog
>