Re: [PATCH] mm: Add an explicit smp_wmb() to UFFDIO_CONTINUE

From: James Houghton
Date: Wed Mar 06 2024 - 12:58:04 EST


On Tue, Mar 5, 2024 at 7:41 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Wed, Mar 06, 2024 at 12:15:10AM +0000, James Houghton wrote:
> > I've tried to see if I can legitimately get a user to read stale data,
> > and a few attempts with this test[2] have been unsuccessful.
>
> AFAICT that won't easily reproduce even if the problem existed, as there
> are so many implicit memory barriers here and there. E.g. right at the
> entry of ioctl(), mmget_not_zero() already contains a full ordering
> constraint:
>
> /**
> * atomic_inc_not_zero() - atomic increment unless zero with full ordering
> * @v: pointer to atomic_t

Oh yes, of course. Thanks for pointing this out. So we definitely
don't need a Fixes.
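
Right -- mmget_not_zero() is just a thin wrapper around
atomic_inc_not_zero() (include/linux/sched/mm.h), so the full ordering
really does come for free on the ioctl() entry path:

  static inline bool mmget_not_zero(struct mm_struct *mm)
  {
          return atomic_inc_not_zero(&mm->mm_users);
  }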

>
> I was expecting the syscall routine to already guarantee an ordering, but
> indeed I can't find any. I also checked Intel's spec, and the SYSCALL
> instruction documentation only has one paragraph on ordering:
>
> Instruction ordering. Instructions following a SYSCALL may be
> fetched from memory before earlier instructions complete execution,
> but they will not execute (even speculatively) until all
> instructions prior to the SYSCALL have completed execution (the
> later instructions may execute before data stored by the earlier
> instructions have become globally visible).
>
> I guess it implies a hardware reordering is indeed possible in this case?

That's my understanding as well.
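
Just to spell out the interleaving I was worried about (assuming no
other barriers in between):

  uffd handler thread (CPU 0)        faulting thread (CPU 1)
  ---------------------------        -----------------------
  write page contents
  ioctl(UFFDIO_CONTINUE)
    set_pte_at()                     page fault sees the new PTE
                                     reads the page contents

Without a wmb() between the page-content stores and the PTE install,
CPU 1 could observe the new PTE yet still read stale page contents.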

>
> >
> > [1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
> > [2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9
> >
> > mm/hugetlb.c | 15 +++++++++------
> > mm/userfaultfd.c | 18 ++++++++++++++++++
> > 2 files changed, 27 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bb17e5c22759..533bf6b2d94d 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> > }
> > }
> >
> > - /*
> > - * The memory barrier inside __folio_mark_uptodate makes sure that
> > - * preceding stores to the page contents become visible before
> > - * the set_pte_at() write.
> > - */
> > - __folio_mark_uptodate(folio);
> > + if (!is_continue) {
> > + /*
> > + * The memory barrier inside __folio_mark_uptodate makes sure
> > + * that preceding stores to the page contents become visible
> > + * before the set_pte_at() write.
> > + */
> > + __folio_mark_uptodate(folio);
>
> Can we move the comment above the "if", explaining both conditions?

Yes, I'll do that. I think the explanation for
WARN_ON_ONCE(!folio_test_uptodate(folio)) is:

We only need to call __folio_mark_uptodate(folio) if we have
allocated a new folio; HugeTLB folios will always be Uptodate if
they are already in the page cache.

We could even drop the WARN_ON_ONCE.
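
So the end result could look something like this (untested):

	/*
	 * If we are doing a CONTINUE, the folio came from the page cache
	 * and must already be Uptodate. Otherwise, the memory barrier
	 * inside __folio_mark_uptodate makes sure that preceding stores
	 * to the page contents become visible before the set_pte_at()
	 * write.
	 */
	if (!is_continue)
		__folio_mark_uptodate(folio);
	else
		WARN_ON_ONCE(!folio_test_uptodate(folio));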

>
> > + } else
> > + WARN_ON_ONCE(!folio_test_uptodate(folio));
> >
> > /* Add shared, newly allocated pages to the page cache. */
> > if (vm_shared && !is_continue) {
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 503ea77c81aa..d515b640ca48 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -531,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> > goto out_unlock;
> > }
> >
> > + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > + /* See the comment in mfill_atomic. */
> > + smp_wmb();
> > +
> > while (src_addr < src_start + len) {
> > BUG_ON(dst_addr >= dst_start + len);
> >
> > @@ -743,6 +747,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> > uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > goto out_unlock;
> >
> > + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > + /*
> > + * A caller might reasonably assume that UFFDIO_CONTINUE
> > + * contains a wmb() to ensure that any writes to the
> > + * about-to-be-mapped page by the thread doing the
> > + * UFFDIO_CONTINUE are guaranteed to be visible to subsequent
> > + * reads of the page through the newly mapped address.
> > + *
> > + * For MFILL_ATOMIC_COPY, the wmb() is done for each COPYed
> > + * page. We can do the wmb() now for CONTINUE as the user has
> > + * already prepared the page contents.
> > + */
> > + smp_wmb();
> > +
>
> Why did you do it twice separately? Can we still share the code?
>
> I'm wildly guessing: I don't worry about an extra wmb() in failure paths, as
> making a failure path slightly slower is never a performance concern, IMHO.

You're right, I shouldn't care so much about making the slow paths a
little bit slower. I'll move the wmb() earlier so that we only need it
in one place.
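
Something like this, early in mfill_atomic() before we branch off to
mfill_atomic_hugetlb(), so a single smp_wmb() covers both paths
(untested, hand-waving the surrounding code):

	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
		/*
		 * Do this once, before the hugetlb path is dispatched,
		 * so that writes to the about-to-be-mapped page by this
		 * thread are visible before the page is mapped.
		 */
		smp_wmb();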

>
> Thanks,

Thanks Peter!