Re: [PATCH 0/2] Allow nesting of lazy MMU mode

From: Aneesh Kumar K.V
Date: Tue Oct 17 2023 - 02:05:05 EST


Erhard Furtner <erhard_f@xxxxxxxxxxx> writes:

> On Thu, 12 Oct 2023 20:54:13 +0100
> "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx> wrote:
>
>> Dave Woodhouse reported that we now nest calls to
>> arch_enter_lazy_mmu_mode(). That was inadvertent, but in principle we
>> should allow it. On further investigation, Juergen already fixed it
>> for Xen, but didn't tell anyone. Fix it for Sparc & PowerPC too.
>> This may or may not help fix the problem that Erhard reported.
>>
>> Matthew Wilcox (Oracle) (2):
>> powerpc: Allow nesting of lazy MMU mode
>> sparc: Allow nesting of lazy MMU mode
>>
>> arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 5 ++---
>> arch/sparc/mm/tlb.c | 5 ++---
>> 2 files changed, 4 insertions(+), 6 deletions(-)
>>
>> --
>> 2.40.1
>
> Applied the patch on top of v6.6-rc5 but unfortunately it did not fix my reported issue.
>
> Regards,
> Erhard
>

With the problem reported, my guess is that we are finding a wrong
page->compound_head and hence the PG_dcache_clean check on folio->flags
crashes. I still don't know why page->compound_head ends up wrong.
Michael noted we are using FLATMEM. Since we are hitting this on an
ioremap address, that implies struct page should have been initialized
correctly via init_unavailable_range(). We need to instrument the kernel
to track the initialization of the struct pages backing the pfns on which
we are crashing.
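
Something along the lines of the helper below is the kind of
instrumentation I have in mind (untested sketch, the function name and
call site are made up; it could be called from set_pte_filter() before we
look at the folio flags):

/*
 * Untested debug helper, name is made up: dump the state of the
 * struct page backing a pfn so we can see whether
 * init_unavailable_range() ever initialized it.
 */
static void dump_pfn_struct_page(unsigned long pfn)
{
	struct page *page;

	if (!pfn_valid(pfn)) {
		pr_err("pfn %#lx is not valid\n", pfn);
		return;
	}

	page = pfn_to_page(pfn);
	pr_err("pfn %#lx: page %px flags %#lx compound_head %#lx reserved %d\n",
	       pfn, page, page->flags, READ_ONCE(page->compound_head),
	       PageReserved(page));
}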

W.r.t. arch_enter_lazy_mmu_mode(), we can skip it completely on powerpc
because we don't allow set_pte() to be used on valid pte entries. Updates
to already-valid ptes are not done via the set_pte() interface, and hence
no TLB invalidate is required while using set_pte().
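
To illustrate the convention (sketch only, not code for the tree; the
function name and arguments are placeholders): a not-present slot is
filled with set_pte()/set_ptes(), while a pte that is already valid in
hardware must be changed through an updating interface that handles the
translation cache itself, e.g. ptep_set_access_flags():

/* Illustration only, not for the tree: the two different paths. */
static void example_pte_paths(struct vm_area_struct *vma, unsigned long addr,
			      pte_t *ptep, pte_t pte, pte_t entry, int dirty)
{
	/* not-present slot: a plain store, nothing to batch or flush */
	set_pte_at(vma->vm_mm, addr, ptep, pte);

	/*
	 * already hw-valid pte: go through an interface that also takes
	 * care of the translation cache, e.g. when relaxing protection
	 * on a fault.
	 */
	if (ptep_set_access_flags(vma, addr, ptep, entry, dirty))
		update_mmu_cache(vma, addr, ptep);
}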

i.e., we can do something like the below. The change also makes sure we
call set_pte_filter() on all the ptes we are setting via set_ptes(). I
haven't sent this as a proper patch because we still haven't been able to
fix the issue Erhard reported.

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 3ba9fe411604..95ab20cca2da 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -191,28 +191,35 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte, unsigned int nr)
{
/*
- * Make sure hardware valid bit is not set. We don't do
- * tlb flush for this update.
+ * We don't need to call arch_enter/leave_lazy_mmu_mode()
+ * because we expect set_ptes() to only be used on not-present
+ * and not hw_valid ptes. Hence there is no translation cache flush
+ * involved that needs to be batched.
*/
- VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
+ for (;;) {

- /* Note: mm->context.id might not yet have been assigned as
- * this context might not have been activated yet when this
- * is called.
- */
- pte = set_pte_filter(pte);
+ /*
+ * Make sure hardware valid bit is not set. We don't do
+ * tlb flush for this update.
+ */
+ VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));

- /* Perform the setting of the PTE */
- arch_enter_lazy_mmu_mode();
- for (;;) {
+ /* Note: mm->context.id might not yet have been assigned as
+ * this context might not have been activated yet when this
+ * is called.
+ */
+ pte = set_pte_filter(pte);
+
+ /* Perform the setting of the PTE */
__set_pte_at(mm, addr, ptep, pte, 0);
if (--nr == 0)
break;
ptep++;
- pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
addr += PAGE_SIZE;
+ /* increment the pfn */
+ pte = __pte(pte_val(pte) + PAGE_SIZE);
+
}
- arch_leave_lazy_mmu_mode();
}

void unmap_kernel_page(unsigned long va)