[PATCH] mm: Add an explicit smp_wmb() to UFFDIO_CONTINUE

From: James Houghton
Date: Tue Mar 05 2024 - 19:15:31 EST


Users of UFFDIO_CONTINUE may reasonably assume that a write memory
barrier is included as part of UFFDIO_CONTINUE. That is, a user may
(mistakenly) believe that all writes it has done to a page that it is
now UFFDIO_CONTINUE'ing are guaranteed to be visible to anyone
subsequently reading the page through the newly mapped virtual memory
region.

Include only a single smp_wmb() for each UFFDIO_CONTINUE, as that is all
that is necessary. While we're at it, optimize the smp_wmb() that is
already incidentally present for the HugeTLB case.

Documentation doesn't specify if the kernel does a wmb(), so it's not
wrong not to include it. But by not including it, we are making is easy
for a user to have a very hard-to-detect bug. Include it now to be safe.

A user that decides to update the contents of the page in one thread and
UFFDIO_CONTINUE that page in another must already take additional steps
to synchronize properly.

Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx>
---

I'm not sure if this patch should be merged. I think it's the right
thing to do, as it is very easy for a user to get this wrong. (I have
been using UFFDIO_CONTINUE for >2 years and only realized this problem
recently.) Given that it's not a "bug" strictly speaking, even if this
patch is a good idea, I'm unsure if it needs to be backported.

This quirk has existed since minor fault support was added for shmem[1].

I've tried to see if I can legitimately get a user to read stale data,
and a few attempts with this test[2] have been unsuccessful.

[1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
[2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9

mm/hugetlb.c | 15 +++++++++------
mm/userfaultfd.c | 18 ++++++++++++++++++
2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb17e5c22759..533bf6b2d94d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
}
}

- /*
- * The memory barrier inside __folio_mark_uptodate makes sure that
- * preceding stores to the page contents become visible before
- * the set_pte_at() write.
- */
- __folio_mark_uptodate(folio);
+ if (!is_continue) {
+ /*
+ * The memory barrier inside __folio_mark_uptodate makes sure
+ * that preceding stores to the page contents become visible
+ * before the set_pte_at() write.
+ */
+ __folio_mark_uptodate(folio);
+ } else
+ WARN_ON_ONCE(!folio_test_uptodate(folio));

/* Add shared, newly allocated pages to the page cache. */
if (vm_shared && !is_continue) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 503ea77c81aa..d515b640ca48 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -531,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
goto out_unlock;
}

+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+ /* See the comment in mfill_atomic. */
+ smp_wmb();
+
while (src_addr < src_start + len) {
BUG_ON(dst_addr >= dst_start + len);

@@ -743,6 +747,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
goto out_unlock;

+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+ /*
+ * A caller might reasonably assume that UFFDIO_CONTINUE
+ * contains a wmb() to ensure that any writes to the
+ * about-to-be-mapped page by the thread doing the
+ * UFFDIO_CONTINUE are guaranteed to be visible to subsequent
+ * reads of the page through the newly mapped address.
+ *
+ * For MFILL_ATOMIC_COPY, the wmb() is done for each COPYed
+ * page. We can do the wmb() now for CONTINUE as the user has
+ * already prepared the page contents.
+ */
+ smp_wmb();
+
while (src_addr < src_start + len) {
pmd_t dst_pmdval;


base-commit: a7f399ae964e1d2a11d88d863a1d64392678ccaf
--
2.44.0.278.ge034bb2e1d-goog