Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

From: Nanyong Sun
Date: Thu Feb 08 2024 - 04:45:01 EST



On 2024/2/7 20:20, Catalin Marinas wrote:
> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
>> On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
>>> On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
>>>> On 2024/1/26 2:06, Catalin Marinas wrote:
>>>>> On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
>>>>>> HVO was previously disabled on arm64 [1] due to the lack of the
>>>>>> necessary BBM (break-before-make) logic when changing page tables.
>>>>>> This set of patches fixes this by adding the necessary BBM sequence
>>>>>> when changing page tables, and by supporting vmemmap page fault
>>>>>> handling to fix up kernel address translation faults if the vmemmap
>>>>>> is concurrently accessed.
>>>>> I'm not keen on this approach. I'm not even sure it's safe. In the
>>>>> second patch, you take the init_mm.page_table_lock on the fault path
>>>>> but are we sure this is unlocked when the fault was taken?
>>>> I think this situation is impossible. In the implementation of the
>>>> second patch, while the page table entry is temporarily invalid (the
>>>> time window in which a page fault may occur), vmemmap_update_pte()
>>>> already holds the init_mm.page_table_lock and does not release it
>>>> until the page table update is done. Another thread could not hold
>>>> the init_mm.page_table_lock and also trigger a page fault at the
>>>> same time.
>>>> If I have missed any points in my thinking, please correct me. Thank you.
>>> It still strikes me as incredibly fragile to handle the fault, and
>>> trying to reason about all the users of 'struct page' is impossible.
>>> For example, can the fault happen from irq context?
>> The pte lock cannot be taken in irq context (which I think is what
>> you're asking?)
> With this patchset, I think it can: IRQ -> interrupt handler accesses
> vmemmap -> faults -> fault handler in patch 2 takes the
> init_mm.page_table_lock to wait for the vmemmap rewriting to complete.
> Maybe it works if the hugetlb code disabled the IRQs but, as Will said,
> such a fault in any kernel context looks fragile.
How about taking a new lock with IRQs disabled during the BBM window, like:

+/* Serialize the break-before-make window against concurrent vmemmap access. */
+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+    /* NEW_LOCK: placeholder for a new dedicated spinlock for vmemmap updates */
+    spin_lock_irq(NEW_LOCK);
+    pte_clear(&init_mm, addr, ptep);                /* break: clear the old entry */
+    flush_tlb_kernel_range(addr, addr + PAGE_SIZE); /* flush stale translations */
+    set_pte_at(&init_mm, addr, ptep, pte);          /* make: install the new entry */
+    spin_unlock_irq(NEW_LOCK);
+}
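
For this to close the race, the vmemmap page fault path from patch 2 would
have to serialize on the same lock before retrying the access. A minimal
sketch of that side follows, assuming a hypothetical entry point
vmemmap_handle_page_fault() called from the kernel fault path, a hypothetical
is_vmemmap_address() range check, and the same NEW_LOCK placeholder as above:

static bool vmemmap_handle_page_fault(unsigned long addr)
{
    unsigned long flags;

    /* Only faults on vmemmap addresses are handled here (hypothetical check). */
    if (!is_vmemmap_address(addr))
        return false;

    /*
     * Wait for vmemmap_update_pte() to finish the BBM sequence. IRQs are
     * disabled across the critical section so an interrupt handler that
     * also faults on the vmemmap cannot spin on a lock this CPU holds.
     */
    spin_lock_irqsave(NEW_LOCK, flags);
    spin_unlock_irqrestore(NEW_LOCK, flags);

    /* The new PTE is installed; report the fault as handled and retry. */
    return true;
}

Whether spin_lock_irq() is enough in vmemmap_update_pte(), or
spin_lock_irqsave() is needed there as well, depends on whether it can ever
be called with IRQs already disabled.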