Re: [PATCH RESEND] x86/entry: Don't write to CR3 when restoring to kernel CR3

From: Thomas Gleixner
Date: Tue Sep 19 2023 - 05:07:28 EST


On Mon, Sep 18 2023 at 20:28, Andy Lutomirski wrote:
> On Thu, Aug 17, 2023, at 5:15 AM, Brendan Jackman wrote:
>> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>>
>> Skip resuming KERNEL pages since it is already KERNEL CR3
>>
>> Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>> Signed-off-by: Brendan Jackman <jackmanb@xxxxxxxxxx>
>> ---
>>
>> While staring at paranoid_exit I was confused about why we had this CR3
>> write, avoiding it seems like a free optimisation. The original commit
>> 21e94459110252 ("x86/mm: Optimize RESTORE_CR3") says "Most NMI/paranoid
>> exceptions will not in fact change pagetables" but I didn't't understand
>> what the "most" was referring to. I then discovered this patch on the
>> mailing list, Andy said[1] that it looks correct so maybe now is the
>> time to merge it?
>
> I did?
>
> Looking at the link, I think I was saying that the opposite patch
> (*always* flush) looked okay.

That always flush part was solely for the user CR3 restore path.

>> @@ -236,14 +236,13 @@ For 32-bit we have the following conventions -
>> kernel is built with
>> .macro RESTORE_CR3 scratch_reg:req save_reg:req
>> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>>
>> - ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
>> -
>> /*
>> - * KERNEL pages can always resume with NOFLUSH as we do
>> - * explicit flushes.
>> + * Skip resuming KERNEL pages since it is already KERNEL CR3.
>> */
>> bt $PTI_USER_PGTABLE_BIT, \save_reg
>> - jnc .Lnoflush_\@
>> + jnc .Lend_\@
>
> I don't get it. How do you know that the actual loaded CR3 is correct?
>
> I'm willing to believe that there is some constraint in the way the
> kernel works such that every paranoid entry will, as part of its
> execution, switch CR3 to kernel *and leave it like that* *and that
> this will be the _same_ kernel CR3 that was saved*. But I'm not
> really convinced that's true. (Can we schedule in a paranoid entry?
> Probably not. What about the weird NMI paths? What if we do
> something that switches to init mm? Hmm -- doing that in a paranoid
> context is probably not a brilliant idea.)

You have to differentiate between entry from kernel and entry from user.

Entry from user switches to the task stack, while entry from kernel
always runs on IST.

Entry from user cannot have kernel CR3 obviously, while entry from
kernel can have either kernel CR3 or user CR3. Entry from user does not
use the paranoid entry/exit paths at all, so that's a non-issue.

IST prevents that the exception can schedule, which in turn guarantees
that CR3 stays the same until it returns. Unless some completely stupid
code path would trie to switch to a different mm from an exception which
can hit into the middle of an mm switch. Then the failed restore is
probably the least of our problems.

> Maybe it is true, and maybe a convincing argument could be made. But
> that seems like a lot of thinking and fragility for an optimization
> that seems pretty minor.

I don't think its pretty minor. CR3 writes even with the noflush bit set
are not necessarily cheap.

While most exceptions which go through the paranoid path are not
hotpath, the one which matters is #NMI due to perf. So I think it's
worth to spare the redundant CR3 switch in that case.

Thanks,

tglx