Re: [RFC PATCH v2 20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

From: Dave Hansen
Date: Tue Jul 25 2023 - 13:12:37 EST


On 7/25/23 09:37, Marcelo Tosatti wrote:
>> TLB flushes for freed page tables are another game entirely. The CPU is
>> free to cache any part of the paging hierarchy it wants at any time.
> Depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes TLB (and page
> table caches) on user->kernel and kernel->user context switches ?

Well, first of all, CONFIG_PAGE_TABLE_ISOLATION doesn't flush the TLB at
all on user<->kernel switches when PCIDs are enabled.
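To make the PCID point concrete, here is a rough userspace sketch of the CR3 arithmetic involved. It is not the kernel's entry code. The helper names (make_cr3, cr3_write_flushes) are made up for illustration. The only architectural facts assumed are that, with CR4.PCIDE=1, CR3 bits 11:0 carry the PCID and bit 63 of the MOV-to-CR3 source operand suppresses invalidation for that PCID.

```c
#include <assert.h>
#include <stdint.h>

/* With CR4.PCIDE=1, the low 12 bits of CR3 select the PCID. */
#define CR3_PCID_MASK 0xFFFull
/* Bit 63 of the value written to CR3: keep cached translations. */
#define CR3_NOFLUSH   (1ull << 63)

/*
 * Hypothetical helper: compose the value that would be written to CR3
 * for a page-aligned top-level page table at pgd_phys.
 */
static uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid, int noflush)
{
	uint64_t cr3;

	assert((pgd_phys & CR3_PCID_MASK) == 0); /* must be page-aligned */
	assert(pcid <= CR3_PCID_MASK);           /* PCID is 12 bits wide */

	cr3 = pgd_phys | pcid;
	if (noflush)
		cr3 |= CR3_NOFLUSH;
	return cr3;
}

/*
 * Hypothetical helper: nonzero if writing this value to CR3 would
 * invalidate the non-global TLB entries tagged with its PCID.
 */
static int cr3_write_flushes(uint64_t cr3)
{
	return !(cr3 & CR3_NOFLUSH);
}
```

When the no-flush bit is set on the entry/exit CR3 writes, the switch between the kernel and user page tables leaves the cached translations for both PCIDs in place, so nothing stale gets cleaned up as a side effect.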

Second, even if it did, the CPU is still free to cache any portion of
the paging hierarchy at any time. Without LASS[1], userspace can even
_compel_ walks of the kernel portion of the address space, and we don't
have any infrastructure to tell if a freed kernel page is exposed in the
user copy of the page tables with PTI.

Third, (also ignoring PCIDs) there are plenty of instructions between
kernel entry and the MOV-to-CR3 that can flush the TLB. All those
instructions are architecturally permitted to speculatively set
Accessed or Dirty bits in any part of the address space. If they run
into a free page table page, things get ugly.

These accesses are not _likely_. There probably isn't a predictor out
there that's going to see a:

movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)

and go off trying to dirty memory in the vmalloc() area. But we'd need
some backward *and* forward-looking guarantees from our intrepid CPU
designers to promise that this kind of thing is safe yesterday, today
and tomorrow. I suspect such a guarantee is going to be hard to obtain.

1. https://lkml.kernel.org/r/20230110055204.3227669-1-yian.chen@xxxxxxxxx