Re: [PATCH 1/1] riscv: Implement arch_sync_kernel_mappings() for "preventive" TLB flush

From: Alexandre Ghiti
Date: Tue Aug 08 2023 - 15:50:53 EST


Hi Dylan,

On 07/08/2023 10:23, Dylan Jhong wrote:
Since the RISC-V architecture allows invalid entries to be cached in the TLB,
it is necessary to issue a "preventive" SFENCE.VMA to ensure that each core obtains
the correct kernel mapping.

This patch implements TLB flushing in arch_sync_kernel_mappings(), ensuring that kernel
page table mappings created via vmap()/vmalloc() are made visible before switching MM.

Signed-off-by: Dylan Jhong <dylan@xxxxxxxxxxxxx>
---
arch/riscv/include/asm/page.h | 2 ++
arch/riscv/mm/tlbflush.c | 12 ++++++++++++
2 files changed, 14 insertions(+)

diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index b55ba20903ec..6c86ab69687e 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -21,6 +21,8 @@
#define HPAGE_MASK (~(HPAGE_SIZE - 1))
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
+#define ARCH_PAGE_TABLE_SYNC_MASK PGTBL_PTE_MODIFIED
+
/*
* PAGE_OFFSET -- the first address of the first page of memory.
* When not using MMU this corresponds to the first free page in
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 77be59aadc73..d63364948c85 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -149,3 +149,15 @@ void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
__flush_tlb_range(vma->vm_mm, start, end - start, PMD_SIZE);
}
#endif
+
+/*
+ * Since the RISC-V architecture allows invalid entries to be cached in the TLB,
+ * it is necessary to issue a "preventive" SFENCE.VMA to ensure that each core
+ * obtains the correct kernel mapping. arch_sync_kernel_mappings() ensures that
+ * kernel page table mappings created via vmap()/vmalloc() are made visible before switching MM.
+ */
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
+{
+ if (start < VMALLOC_END && end > VMALLOC_START)


This test is too restrictive: it should also catch the range [MODULES_VADDR, MODULES_END). Sorry I did not notice that at first.
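Something like this (untested, and assuming 64-bit where the modules area lives outside the vmalloc range) should cover both:

	if ((start < VMALLOC_END && end > VMALLOC_START) ||
	    (start < MODULES_END && end > MODULES_VADDR))
		flush_tlb_all();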


+ flush_tlb_all();
+}
\ No newline at end of file


I have to admit that I *think* both your patch and mine are wrong: one of the problems that led to the removal of vmalloc_fault() was the possibility of tracing functions actually allocating vmalloc regions in the vmalloc page fault path, which could give rise to nested exceptions (see https://lore.kernel.org/lkml/20200508144043.13893-1-joro@xxxxxxxxxx/).

Here, every time we allocate a vmalloc region, we send an IPI. If a vmalloc allocation happens in this path (if it is traced, for example), it will give rise to another IPI... and so on.
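Roughly, the recursion would look like this (the tracing hook here is hypothetical, just to illustrate the loop):

	vmalloc()
	  -> arch_sync_kernel_mappings()
	       -> flush_tlb_all()                  /* sends IPIs */
	            -> traced function -> vmalloc()
	                 -> arch_sync_kernel_mappings() -> ...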

So I came to the conclusion that the only way to actually fix this issue is to resolve vmalloc faults very early in the page fault path (by emitting an sfence.vma on uarchs that cache invalid entries), before the kernel stack is even accessed. That's the best solution since it would completely remove all the preventive sfence.vma in flush_cache_vmap()/arch_sync_kernel_mappings(); we would instead rely on faulting, which I assume should not happen a lot (?).
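To give an idea, the fixup would be something along these lines, though the real check has to happen in the low-level assembly entry code before any C code (and thus the stack) is touched, and the function name below is just illustrative:

	/*
	 * Sketch only: on a fault in the vmalloc range, the kernel page
	 * table may already contain a valid mapping and the fault may only
	 * be caused by a stale invalid entry cached in this hart's TLB, so
	 * a local sfence.vma is enough to make progress. A real
	 * implementation must also make sure a genuinely bad access does
	 * not loop here forever.
	 */
	static bool fixup_stale_kernel_tlb(unsigned long addr)
	{
		if (addr < VMALLOC_START || addr >= VMALLOC_END)
			return false;

		/* Local flush only: no IPIs, no memory allocation. */
		local_flush_tlb_all();
		return true;
	}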

I'm implementing this solution, but I'm pretty sure it won't be ready for 6.5. In the meantime, we need either your patch or mine to fix your issue...