Re: [PATCH v4] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE behavior

From: Peter Xu
Date: Mon Apr 12 2021 - 21:22:03 EST


On Mon, Apr 12, 2021 at 05:51:14PM -0700, Hugh Dickins wrote:
> On Mon, 12 Apr 2021, Peter Xu wrote:
> > On Tue, Apr 06, 2021 at 11:14:30PM -0700, Hugh Dickins wrote:
> > > > +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> > > > + struct vm_area_struct *dst_vma,
> > > > + unsigned long dst_addr, struct page *page,
> > > > + enum mcopy_atomic_mode mode, bool wp_copy)
> > > > +{
> >
> > [...]
> >
> > > > + if (writable) {
> > > > + _dst_pte = pte_mkdirty(_dst_pte);
> > > > + if (wp_copy)
> > > > + _dst_pte = pte_mkuffd_wp(_dst_pte);
> > > > + else
> > > > + _dst_pte = pte_mkwrite(_dst_pte);
> > > > + } else if (vm_shared) {
> > > > + /*
> > > > + * Since we didn't pte_mkdirty(), mark the page dirty or it
> > > > + * could be freed from under us. We could do this
> > > > + * unconditionally, but doing it only if !writable is faster.
> > > > + */
> > > > + set_page_dirty(page);
> > >
> > > I do not remember why Andrea or I preferred set_page_dirty() here to
> > > pte_mkdirty(); but I suppose there might somewhere be a BUG_ON(pte_dirty)
> > > which this would avoid. Risky to change it, though it does look odd.
> >
> > Is any of the possible BUG_ON(pte_dirty) going to trigger because the pte has
> > write bit cleared? That's one question I was not very sure, e.g., whether one
> > pte is allowed to be "dirty" if it's not writable.
> >
> > To me it's okay, it's actually very suitable for UFFDIO_COPY case, where it is
> > definitely dirty data (so we must never drop it) even if it's installed as RO,
> > however to achieve that we can still set the dirty on the page rather than the
> > pte as what we do here. It's just a bit awkward as you said.
> >
> > Meanwhile today I just noticed this in arm64 code:
> >
> > static inline pte_t pte_wrprotect(pte_t pte)
> > {
> > /*
> > * If hardware-dirty (PTE_WRITE/DBM bit set and PTE_RDONLY
> > * clear), set the PTE_DIRTY bit.
> > */
> > if (pte_hw_dirty(pte))
> > pte = pte_mkdirty(pte);
> >
> > pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
> > pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
> > return pte;
> > }
> >
> > So arm64 will explicitly set the dirty bit (from the HW dirty bit) when
> > wr-protect. It seems to prove that at least for arm64 it's very valid to have
> > !write && dirty pte.
>
> I did not mean to imply that it's wrong to have pte_dirty without
> pte_write: no, I agree with you, I believe that there are accepted
> and generic ways in which we can have pte_dirty without pte_write
> (and we could each probably add a warning somewhere which would
> very quickly prove that - but those would not prove that there
> are not BUG_ONs on some other path, which had been my fear).
>
> I wanted now to demonstrate that by pointing to change_pte_range() in
> mm/mprotect.c, showing that it does not clear pte_dirty when it clears
> pte_write. But alarmingly found rather the reverse: that it appears to
> set pte_write when it finds pte_dirty - if dirty_accountable.
>
> That looks very wrong, but if I spent long enough following up
> dirty_accountable in detail, I think I would be reassured to find that
> it is only adding the pte_write there when it had removed it from the
> prot passed down, for dirty accounting reasons (which apply !VM_SHARED
> protections in the VM_SHARED case, so that page_mkwrite() is called
> and dirty accounting done when necessary).
>
> What I did mean to imply is that changing set_page_dirty to pte_mkdirty,
> to make that userfaultfd code block look nicer, is not a change to be
> done lightly: by all means try it out, test it, and send a patch after
> Axel's series is in, but please do not ask Axel to make that change as
> a part of his series - it would be taking a risk, just for a cleanup.

Fair enough. Sorry if I wasn't clear when asking, the reason to ask was that I
wanted to be clear on these differences. For example, in my uffd-wp shmem
series I'll have to make sure dirty bit always there, I used pte_mkdirty()
unconditionally but I wanted to make sure I'm not overlooking anything..

Though this case is slightly special here, since if without the cleaning up the
logic will definitely be harder to review (with the cleanup, it'll be as simple
as one unconditional pte_mkdirty() and that's all). However you're definitely
right, it's not a reason to overload Axel with this patchset, especially if
such a cleanup is seen to be risky.

>
> Now, I have also looked up the mail exchange with Andrea which led to
> his dcf7fe9d8976 ("userfaultfd: shmem: UFFDIO_COPY: set the page dirty
> if VM_WRITE is not set") - it had to be off-list at the time. And he
> was rather led to that set_page_dirty by following old patterns left
> over in shmem_getpage_gfp(); but when I said "or it could be done with
> pte_mkdirty without pte_mkwrite", he answered "I explicitly avoided
> that because pte_dirty then has side effects on mprotect to decide
> pte_write. It looks safer to do set_page_dirty and not set dirty bits
> in not writable ptes unnecessarily".
>
> Haha: I think Andrea is referring to exactly the dirty_accountable code
> in change_pte_protection() which worried me above. Now, I think that
> will turn out okay (shmem does not have a page_mkwrite(), and does not
> participate in dirty accounting), but you will have to do some work to
> assure us all of that, before sending in a cleanup patch.

OK. Maybe I can clean this after Axel's work.

Thanks a lot for all these details, Hugh!

--
Peter Xu