Re: [PATCH 04/31] mm/pgtable: allow pte_offset_map[_lock]() to fail

From: Hugh Dickins
Date: Wed Jul 05 2023 - 18:26:31 EST


Hi Aneesh,

On Wed, 5 Jul 2023, Aneesh Kumar K.V wrote:
>
> Hi Hugh,
>
> Sorry for not checking about this before.

I was about to say "No problem" - but it appears I would be lying!

> I am looking at a kernel
> crash (BUG_ON()) on ppc64 with 4K page size. The reason we hit
> BUG_ON() is beause we have pmd_same calling BUG_ON on 4K with hash
> translation. We don't support THP with 4k page size and hash
> translation.

I misunderstood you at first. I was trying to work out what in that
context might lead to *pmd changing suddenly, was going to ask for
stack backtrace (in faulting? or in copying mm? or?), whether you
have PER_VMA_LOCK enabled, etc. etc.

Then I looked at the source: oh, that is gross, and not something
I had expected at all.

> > +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> > + unsigned long addr, spinlock_t **ptlp)
> > +{
> > + spinlock_t *ptl;
> > + pmd_t pmdval;
> > + pte_t *pte;
> > +again:
> > + pte = __pte_offset_map(pmd, addr, &pmdval);
> > + if (unlikely(!pte))
> > + return pte;
> > + ptl = pte_lockptr(mm, &pmdval);
> > + spin_lock(ptl);
> > + if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> > + *ptlp = ptl;
> > + return pte;
> > + }
> > + pte_unmap_unlock(pte, ptl);
> > + goto again;
> > +}
>
> What is expected by that pmd_same check? We are holding pte lock
> and not pmd lock. So contents of pmd can change.

And you don't need me to answer that question: the answer is in the
"likely". We do not expect *pmd to change there (though maybe some
ancillary bits of it, like "accessed"), unless the page table is on
its way to being freed; and other locking higher up (mmap_lock or
rmap lock) prevent all possibilities of that at present. Later, we
arrange to hold pte lock as well as pmd lock when removing page table.

So the obvious quick fix is:

--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -138,8 +138,7 @@ static inline int hash__pmd_trans_huge(p

static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b)
{
- BUG();
- return 0;
+ return 1;
}

static inline pmd_t hash__pmd_mkhuge(pmd_t pmd)

But I hope you will reject that as almost as gross, and instead commit
a patch which makes hash__pmd_same() ... check whether the pmd_ts are the
same - as in ppc64's other implementations. That will save having to change
it again, when/if someone extends the current replace-page-table-by-THP work
by non-THP work to remove empty page tables without mmap_lock or rmap lock.

Thanks for finding this, Aneesh, and I'm sorry I didn't notice it before,
and I'm sorry to have given you trouble... but really, that BUG(),

(H)ugh!