Re: [PATCH 03/11] mm: page_vma_mapped_walk(): use pmd_read_atomic()

From: Jason Gunthorpe
Date: Fri Jun 11 2021 - 11:36:22 EST


On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > >
> > > > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> > > > Cc: <stable@xxxxxxxxxxxxxxx>
> > > > mm/page_vma_mapped.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > --- a/mm/page_vma_mapped.c
> > > > +++ b/mm/page_vma_mapped.c
> > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > > pud = pud_offset(p4d, pvmw->address);
> > > > if (!pud_present(*pud))
> > > > return false;
> > > > +
> > > > pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > > /*
> > > > * Make sure the pmd value isn't cached in a register by the
> > > > * compiler and used as a stale value after we've observed a
> > > > * subsequent update.
> > > > */
> > > > - pmde = READ_ONCE(*pvmw->pmd);
> > > > + pmde = pmd_read_atomic(pvmw->pmd);
> > > > + barrier();
> > > > +
> > >
> > > Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> > > mm/hmm.c uses the same pattern as you do, and I tend to think that the
> > > rest of the pmd_read_atomic() users may be broken.
> > >
> > > Am I wrong?
> >
> > I agree with you, something called _atomic should not require the
> > caller to provide barriers.
> >
> > I think the issue is simply that the two implementations of
> > pmd_read_atomic() should use READ_ONCE() internally, no?
>
> I've had great difficulty coming up with answers for you.
>
> This patch was based on two notions I've had lodged in my mind
> for several years:
>
> 1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
> atomically (even if the pmd_t spans multiple words); but reading again
> after all this time the comment above it, it seems to be more specialized
> than I'd thought (biggest selling point being for when you want to check
> pmd_none(), which we don't). And has no READ_ONCE() or barrier() inside,
> so really needs that barrier() to be as safe as the previous READ_ONCE().

IMHO it is a simple bug that the 64 bit version of this is not marked
with READ_ONCE() internally. It is reading data without a lock; AFAIK
our modern understanding of the memory model is that this requires
READ_ONCE().

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24cb..0bf1fdec928e71 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
{
pmdval_t ret;
u32 *tmp = (u32 *)pmdp;

- ret = (pmdval_t) (*tmp);
+ ret = (pmdval_t)READ_ONCE(*tmp);
if (ret) {
@@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
* or we can end up with a partial pmd.
*/
smp_rmb();
- ret |= ((pmdval_t)*(tmp + 1)) << 32;
+ ret |= ((pmdval_t)READ_ONCE(*(tmp + 1))) << 32;
}

return (pmd_t) { ret };
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 46b13780c2c8c9..c8c7a3307d2773 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
* only going to work, if the pmdval_t isn't larger than
* an unsigned long.
*/
- return *pmdp;
+ return READ_ONCE(*pmdp);
}
#endif


> 2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
> because READ_ONCE() did not work (did not give the necessary guarantee? or
> did not build?) on architectures with multiple word pmd_ts e.g. i386 PAE.

This is really interesting, the git history e37c69827063 ("mm: replace
ACCESS_ONCE with READ_ONCE or barriers") says the READ_ONCE was
dropped here "because it doesn't work on non-scalar types" due to a
(now 8 year old) gcc bug.

According to the gcc bug, READ_ONCE() on anything that is a scalar
sized struct triggers GCC to ignore the READ_ONCEyness. To work around
this bug, READ_ONCE() could never be used on any of the struct-wrapped
page table elements. While I am not 100% sure, it looks like this is a
pre gcc 4.9 bug, and since gcc 4.9 is now the minimum required
compiler this is all just cruft. Use READ_ONCE() here too...

> But I've now come across some changes that Will Deacon made last year:
> the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
> native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
> "a strong prevailing wind" anyway :) And 8e958839e4b9 ("sparc32: mm:
> Restructure sparc32 MMU page-table layout") put an end to sparc32's
> typedef struct { unsigned long pmdv[16]; } pmd_t.

Indeed, that is good news

> As to your questions about pmd_read_atomic() usage elsewhere:
> please don't force me to think so hard! (And you've set me half-
> wondering, whether there are sneaky THP transitions, perhaps of the
> "unstable" kind, that page_vma_mapped_walk() should be paying more
> attention to: but for sanity's sake I won't go there, not now.)

If I recall (and I didn't recheck this right now), the only thing
pmd_read_atomic() provides is the special property that if the pmd's
flags are observed to point to a pte table then pmd_read_atomic() will
reliably return the pte table pointer.

Otherwise it returns the flags and a garbage pointer, because under
the THP protocol a PMD pointing at a page can be converted to a PTE
table if you hold the mmap sem in read mode. Upgrading a PTE table to
a PMD page requires the mmap sem in write mode, so once a PTE table is
observed 'locklessly' the value is stable.. Or at least so says the
documentation

pmd_read_atomic() is badly named, tricky to use, and missing the
READ_ONCE() because it is so old..

As far as whether page_vma_mapped_walk() is correct: everything except
is_pmd_migration_entry() is fine to my eye, and I simply don't know
the rules around migration entries well enough to say right or wrong.

I suspect manipulating a migration entry probably requires holding the
mmap_sem in write mode??

There is some set of changes to the page tables that require holding
the mmap sem in write mode, which I've never seen documented anywhere..

Jason