Re: [PATCH 04/47] LoongArch: Set _PAGE_DIRTY only if _PAGE_WRITE is set in {pmd,pte}_mkdirty()

From: Peter Xu
Date: Mon Nov 21 2022 - 13:33:06 EST


On Sat, Nov 19, 2022 at 10:20:20PM +0800, hev wrote:
> Hi, Peter,

Hev,

>
> On Fri, Nov 18, 2022 at 2:53 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Thu, Nov 17, 2022 at 10:12:07AM -0500, Peter Xu wrote:
> > > Hi, Huacai,
> > >
> > > On Thu, Nov 17, 2022 at 12:25:32PM +0800, Huacai Chen wrote:
> > > > Now {pmd,pte}_mkdirty() set _PAGE_DIRTY bit unconditionally, this causes
> > > > random segmentation fault after commit 0ccf7f168e17bb7e ("mm/thp: carry
> > > > over dirty bit when thp splits on pmd").
> > > >
> > > > The reason is: when fork(), parent process use pmd_wrprotect() to clear
> > > > huge page's _PAGE_WRITE and _PAGE_DIRTY (for COW);
> > >
> > > Is it safe to drop dirty bit when wr-protect? It means the mm can reclaim
> > > the page directly assuming the page contains rubbish.
> > >
> > > Consider after fork() and memory pressure kicks the kswapd, I don't see
> > > anything stops the kswapd from recycling the pages and lose the data in
> > > both processes.
> >
> > Feel free to ignore this question.. I think I got an answer from Hev (and
> > I then got a follow up question):
> >
> > https://lore.kernel.org/all/Y3Z9Zf0jARMOkFBq@x1n/
> >
> > >
> > > > then pte_mkdirty() set
> > > > _PAGE_DIRTY as well as _PAGE_MODIFIED while splitting dirty huge pages;
> > > > once _PAGE_DIRTY is set, there will be no tlb modify exception so the COW
> > > > machanism fails; and at last memory corruption occurred between parent
> > > > and child processes.
> > > >
> > > > So, we should set _PAGE_DIRTY only when _PAGE_WRITE is set in {pmd,pte}_
> > > > mkdirty().
> > > >
> > > > Cc: stable@xxxxxxxxxxxxxxx
> > > > Cc: Peter Xu <peterx@xxxxxxxxxx>
> > > > Signed-off-by: Huacai Chen <chenhuacai@xxxxxxxxxxx>
> > > > ---
> > > > Note: CC sparc maillist because they have similar issues.
> > >
> > > I also had a look on sparc64, it seems to not do the same as loongarch
> > > here (not removing dirty in wr-protect):
> > >
> > > static inline pmd_t pmd_wrprotect(pmd_t pmd)
> > > {
> > > pte_t pte = __pte(pmd_val(pmd));
> > >
> > > pte = pte_wrprotect(pte);
> > >
> > > return __pmd(pte_val(pte));
> > > }
> > >
> > > static inline pte_t pte_wrprotect(pte_t pte)
> > > {
> > > unsigned long val = pte_val(pte), tmp;
> > >
> > > __asm__ __volatile__(
> > > "\n661: andn %0, %3, %0\n"
> > > " nop\n"
> > > "\n662: nop\n"
> > > " nop\n"
> > > " .section .sun4v_2insn_patch, \"ax\"\n"
> > > " .word 661b\n"
> > > " sethi %%uhi(%4), %1\n"
> > > " sllx %1, 32, %1\n"
> > > " .word 662b\n"
> > > " or %1, %%lo(%4), %1\n"
> > > " andn %0, %1, %0\n"
> > > " .previous\n"
> > > : "=r" (val), "=r" (tmp)
> > > : "0" (val), "i" (_PAGE_WRITE_4U | _PAGE_W_4U),
> > > "i" (_PAGE_WRITE_4V | _PAGE_W_4V));
> > >
> > > return __pte(val);
> > > }
> >
> > (Same here; I just overlooked what does _PAGE_W_4U meant..)
> >
> > >
> > > >
> > > > arch/loongarch/include/asm/pgtable.h | 8 ++++++--
> > > > 1 file changed, 6 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
> > > > index 946704bee599..debbe116f105 100644
> > > > --- a/arch/loongarch/include/asm/pgtable.h
> > > > +++ b/arch/loongarch/include/asm/pgtable.h
> > > > @@ -349,7 +349,9 @@ static inline pte_t pte_mkclean(pte_t pte)
> > > >
> > > > static inline pte_t pte_mkdirty(pte_t pte)
> > > > {
> > > > - pte_val(pte) |= (_PAGE_DIRTY | _PAGE_MODIFIED);
> > > > + pte_val(pte) |= _PAGE_MODIFIED;
> > > > + if (pte_val(pte) & _PAGE_WRITE)
> > > > + pte_val(pte) |= _PAGE_DIRTY;
> > >
> > > I'm not sure whether mm has rule to always set write bit then set dirty
> > > bit, need to be careful here because the outcome may differ when use:
> > >
> > > pte_mkdirty(pte_mkwrite(pte))
> > > (expected)
> > >
> > > VS:
> > >
> > > pte_mkwrite(pte_mkdirty(pte))
> > > (dirty not set)
> > >
> > > I had a feeling I miss some arch-specific details here on why loongarch
> > > needs such implementation, but I can't quickly tell.
> >
> > After a closer look I think it's fine for loongarch as pte_mkwrite will
> > also set the dirty bit unconditionally, so at least the two ways will still
> > generate the same pte (DIRTY+MODIFIED+WRITE).
> >
> > But this whole thing is still confusing to me. It'll still be great if
> > anyone can help explain why the _DIRTY cannot be set only in pte_mkwrite()
> > if that's the solo place in charge of "whether the pte is writable".
> >
> > The other follow up question is: how do we mark "this pte contains valid
> > data" (the common definition of "dirty bit"), while "this pte is not
> > writable" on loongarch?
> >
> > It can happen when we're installing a page with non-zero data meanwhile
> > wr-protected. That's actually a valid case for userfaultfd wr-protect mode
> > where user specified UFFDIO_COPY ioctl with flag UFFDIO_COPY_MODE_WP, where
> > we'll install a non-zero page from user buffer but don't grant write bit.
> >
> > From code-wise, I think it can be done currently with this on loongarch:
> >
> > pte_wrprotect(pte_mkwrite(pte_mkdirty(pte)))
> >
> > Where pte_wrprotect(pte_mkwrite(pte)) is not a no-op but applying MODIFIED.
>
> We would like to note that on LoongArch (for misunderstanding naming):
> * _PAGE_DIRTY meaning hardware writable.
> * _PAGE_WRITE meaning software writable.
> * _PAGE_MODIFIED meaning software dirty, this page contains updated valid data.
>
> PTE APIs:
> * pte_mkwrite: Allow to write, only needs set _PAGE_WRITE.
> * pte_mkdirty: Mark as dirty, only needs set _PAGE_MODIFIED.
> * pte_dirty: Test is dirty, only test _PAGE_MODIFIED.
> * pte_wrprotect: Clear both writable, force to raise exception to
> handle_mm_fault.
>
> If a pte is only set _PAGE_WRITE without _PAGE_DIRTY by pte_mkwrite,
> then a write memory access will cause mmu exception, and the
> (_PAGE_DIRTY|_PAGE_MODIFIED) will be set in this exception handler. I
> think the _PAGE_DIRTY is also possible to set in pte_mkwrite for
> speedup, then _PAGE_MODIFIED must be set at the same time. To avoid
> the page data being modified but not detected by pte_dirty. (Current
> code may needs to fix
>
> pte_mkdirty mark pte as dirty is the main function, It can also make
> pte writeable by hardware(_PAGE_DIRTY) for speedup (too) if and only
> if the pte is writable(_PAGE_WRITE). (mkdirty sets _PAGE_DIRTY
> unconditionally is the root cause of the huge page COW issue.

Yes, and that's why Huacai's patch can fix this issue for Loongarch, but
sparc64 still suffers from it so far.

>
> For write-protection, pte_wrprotect will clear both writable(software
> and hardware) in pte to force a MMU exception to handle_mm_fault.
>
> So yeah, the pte marked as dirty(_PAGE_MODIFIED) and without any
> writable in the following code:
>
> pte_wrprotect(pte_mkwrite(pte_mkdirty(pte)))

Thanks for the further explanation, Hev.

I think so far I keep the questioning on whether it's a good optimization
to apply _DIRTY in pte_mkdirty here.

Since we have pte_mkwrite() API after all, even if there's an optimization
IMHO it should be done by the kernel generic code already, I don't
immediately see why it needs to be arch-specific. IOW, I'm not sure
whether such optimization for loongarch will bring any real performance
benefit?

The thing is, generic mm code should already have called pte_mkwrite()
along with pte_mkdirty() when proper, so _DIRTY will be properly applied
even if pte_mkdirty() only applies _MODIFIED in loongarch code without
extra mmu faults - if you see the code base that's really the major cases,
except a few very special conditions where we want to only set _MODIFIED
but may want to keep the pte read-only, but in that case it's never wise to
set _DIRTY in pte_mkdirty anyway.

I just don't know whether there's anything else I could have overlooked, so
maybe removing "set _DIRTY" in pte_mkdirty() will regress in some other way.

For now, I think I'll go ahead and try to propose another trivial fix for
the pmd split generic code so we'll have the pte_mkdirty back (I think I
need to reorder pte_wrprotect() so that it needs to appear after
pte_mkdirty()) but hopefully also work for sparc64; loongarch will also
benefit if before this patch lands.

(Sorry to be off-topic on this thread)

--
Peter Xu