Re: [PATCH] mm/pgtable: return null if no ptl in __pte_offset_map_lock

From: Hugh Dickins
Date: Thu Nov 16 2023 - 03:26:56 EST


On Wed, 15 Nov 2023, Hugh Dickins wrote:
> On Wed, 15 Nov 2023, Matthew Wilcox wrote:
> > On Wed, Nov 15, 2023 at 06:05:30PM +0200, José Pekkarinen wrote:
> >
> > > > I don't think we should be changing ptlock_ptr().
> > >
> > > This is where the null ptr dereference originates, so the only
> > > alternative I can think of is to protect the lifecycle of the ptdesc,
> > > to prevent it from dying between the pte check and the spin_unlock of
> > > __pte_offset_map_lock. Would that work for you?
>
> Thanks for pursuing this, José, but I agree with Matthew: I don't
> think your patch is right at all. The change in ptlock_ptr() did not
> make sense to me, and the change in __pte_offset_map_lock() leaves us
> still wondering what has gone wrong (and misses an rcu_read_unlock()).
>
> You mentioned "I tested the syzbot reproducer in x86 and it doesn't
> produce this kasan report anymore": are you saying that you were able
> to reproduce the issue on x86 (without your patch)? That would be very
> interesting (and I think would disprove my hypothesis below). I ought
> to try on x86 if you managed to reproduce on it, but it's not worth
> the effort if you did not. If you have an x86 stack trace and registers,
> please post them (though I'm uncertain how much KASAN spoils the output).
>
> >
> > Ah! I think I found the problem.
> >
> > If ALLOC_SPLIT_PTLOCKS is not set, there is no problem as ->ptl
> > is embedded in the struct page. But if ALLOC_SPLIT_PTLOCKS is set
> > (eg you have LOCKDEP enabled), we can _return_ a NULL pointer from
> > ptlock_ptr. The NULL pointer dereference isn't in ptlock_ptr(), it's
> > in __pte_offset_map_lock().
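> >
> > For reference, the two variants look roughly like this (sketched from
> > memory, not copied from the tree, so treat the details loosely):
> >
> > #if ALLOC_SPLIT_PTLOCKS
> > static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
> > {
> > 	/* separately allocated: this pointer can be NULL or stale */
> > 	return ptdesc->ptl;
> > }
> > #else
> > static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
> > {
> > 	/* embedded in the ptdesc: the address itself is always valid */
> > 	return &ptdesc->ptl;
> > }
> > #endif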
> >
> > So, how to solve this? We can't just check the ptl against NULL; the
> > memory that ptl points to may have been freed. We could grab a reference
> > to the pmd_page, possibly conditionally on ALLOC_SPLIT_PTLOCKS being set,
> > but that will slow down everything. We could make page_ptl_cachep
> > SLAB_TYPESAFE_BY_RCU, preventing the memory from being freed (even if
> > the lock might not be associated with this page any more).
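> >
> > Untested, but IIUC that last option would amount to little more than
> > something like this in ptlock_cache_init():
> >
> > void __init ptlock_cache_init(void)
> > {
> > 	page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t),
> > 				0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
> > }
> >
> > A racing reader's stale ptl would then still point at a spinlock, just
> > possibly one now belonging to a different page table, so the pmd_same()
> > recheck after spin_lock() remains essential.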
>
> But I don't think that's quite right either: or, I hope it's not right,
> because it would say that all my business of rcu_read_lock() and
> pte_free_defer() etc was a failure and a waste of everyone's time: if the
> page table (and/or its lock) can be freed while someone is in that RCU-
> protected area, then I've simply got it wrong.
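>
> (Recapping the intended protocol, roughly: __pte_offset_map() does
>
> 	rcu_read_lock();
> 	pmdval = pmdp_get_lockless(pmd);
> 	/* validate pmdval, then map and return the pte */
>
> and pte_free_defer() frees the page table only after an RCU grace
> period, so neither the table nor its ptl should be able to disappear
> under a reader inside that critical section.)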
>
> What I thought, when yesterday's flurry of syzbot messages came in,
> was that perhaps the problem is at the other end - when the new page
> table is allocated, rather than when it's being freed: is a barrier
> missing there?
>
> But pmd_install() does have an smp_wmb(), with a (suspiciously) long
> comment about why no smp_rmb()s are needed, except on alpha.
>
> I'm not much good on barriers, but the thought now in my mind is that
> that comment is not quite accurate: that at the bottom level an smp_rmb()
> is (very seldom!) needed - not just to avoid a NULL (or previous garbage)
> ptl pointer in the ALLOC_SPLIT_PTLOCKS case, but to make sure that the ptl
> is visibly correctly initialized (or would spin_lock on it achieve that?);
> and that what pte_offset_map() points to is visibly empty. (I'm imagining
> stale cache lines left behind on the oopsing CPU, which need to be
> refetched after pmdval has been read.)
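>
> That is, the window I'm imagining is something like:
>
> 	CPU0 (pmd_install)		CPU1 (__pte_offset_map_lock)
> 	allocate and zero page table
> 	initialize its ptl
> 	smp_wmb()
> 	pmd_populate()
> 					pmdval = pmdp_get_lockless(pmd)
> 					/* no smp_rmb() here today */
> 					ptl = pte_lockptr(mm, &pmdval)
> 					spin_lock(ptl) <- stale ptl?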
>
> And that this issue has, strictly speaking, always been there, but in
> practice never a problem, because of mmap_lock held for write while (or
> prior to) freeing a page table, and racers holding mmap_lock for read
> (vma lock not changing the calculus); but collapse_pte_mapped_thp()
> can now be done with mmap_lock for read, letting those racers in
> (and maybe your filemap_map_pages() has helped speed these races up -
> the more blame I can shift on to you, the better ;)
>
> But I may well be spouting nonsense. And even if I'm right, is adding
> an smp_rmb() in there too high a price to pay on some architectures?
>
> I did make an attempt to reproduce on an arm64, but there's too much
> more I'd need to muck around with to get it working right. Let's ask for
> a syz test on top of v6.7-rc1 - hmm, how do I tell syzbot that it's arm64
> that it needs to try on? I hope it gets that from the Cc'ed syz number.

Syzbot couldn't do it from this mail, but did it from the original thread:
https://lore.kernel.org/linux-mm/000000000000ba0007060a40644f@xxxxxxxxxx/
tells us that I was spouting nonsense: this patch does not fix it.
I have no further idea yet.

>
> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git b85ea95d086471afb4ad062012a4d73cd328fa86
>
> [PATCH] mm/pgtable: smp_rmb() to match smp_wmb() in pmd_install()
>
> Not-Yet-Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> ---
>  mm/memory.c          | 2 ++
>  mm/pgtable-generic.c | 5 +++++
>  2 files changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..8939357f1509 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -425,6 +425,8 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
>  		 * being the notable exception) will already guarantee loads are
>  		 * seen in-order. See the alpha page table accessors for the
>  		 * smp_rmb() barriers in page table walking code.
> +		 *
> +		 * See __pte_offset_map() for the smp_rmb() at the pte level.
>  		 */
>  		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>  		pmd_populate(mm, pmd, *pte);
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4fcd959dcc4d..3330b666e9c3 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -297,6 +297,11 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>  		pmd_clear_bad(pmd);
>  		goto nomap;
>  	}
> +	/*
> +	 * Pair with the smp_wmb() in pmd_install(): make sure that the
> +	 * page table lock and page table contents are visibly initialized.
> +	 */
> +	smp_rmb();
>  	return __pte_map(&pmdval, addr);
>  nomap:
>  	rcu_read_unlock();
> --
> 2.35.3