[PATCH] mm: rework PageAnonExclusive() interaction with GUP-fast

From: David Hildenbrand
Date: Sat Aug 27 2022 - 04:44:13 EST


commit 6c287605fd56 (mm: remember exclusively mapped anonymous pages with
PG_anon_exclusive) made sure that when PageAnonExclusive() has to be
cleared during temporary unmapping of a page, that the PTE is
cleared/invalidated and that the TLB is flushed.

That handling was inspired by an outdated comment in
mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
the past to synchronize with GUP-fast. However, ever since general RCU GUP
fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU
get_user_pages_fast()"), a TLB flush is no longer sufficient and
required to synchronize with concurrent GUP-fast

Peter pointed out, that TLB flush is not required, and looking into
details it turns out that he's right. To synchronize with GUP-fast, it's
sufficient to clear the PTE only: GUP-fast will either detect that the PTE
changed or that PageAnonExclusive is not set and back off. However, we
rely on a given memory order and should make sure that that order is
always respected.

Conceptually, GUP-fast pinning code of anon pages consists of:
(1) Read the PTE
(2) Pin the mapped page
(3) Check if the PTE changed by re-reading it; back off if so.
(4) Check if PageAnonExclusive is not set; back off if so.

Conceptually, PageAnonExclusive clearing code consists of:
(1) Clear PTE
(2) Check if the page is pinned; back off if so.
(3) Clear PageAnonExclusive
(4) Restore PTE (optional)

As GUP-fast temporarily pins the page before validating whether the PTE
changed, and PageAnonExclusive clearing code clears the PTE before
checking if the page is pinned, GUP-fast cannot end up pinning an anon
page that is not exclusive.

One corner case to consider is when we restore the PTE to the same value
after PageAnonExclusive was cleared, as it can happen in
mm/ksm.c:write_protect_page(). In that case, GUP-fast might not detect
a PTE change (because there was none). However, as restoring the PTE
happens after clearing PageAnonExclusive, GUP-fast would detect that
PageAnonExclusive was cleared in that case and would properly back off.

Let's document that, avoid the TLB flush where possible and use proper
explicit memory barriers where required. We shouldn't really care about the
additional memory barriers here, as we're not on extremely hot paths.

The possible issues due to reordering are of theoretical nature so far,
but it better be addressed.

Note that we don't need a memory barrier between checking if the page is
pinned and clearing PageAnonExclusive, because stores are not
speculated.

Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
---
include/linux/mm.h | 9 +++++--
include/linux/rmap.h | 58 ++++++++++++++++++++++++++++++++++++++++----
mm/huge_memory.c | 3 +++
mm/ksm.c | 1 +
mm/migrate_device.c | 22 +++++++----------
mm/rmap.c | 11 +++++----
6 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..f7e8f4b34fb5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2975,8 +2975,8 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
* PageAnonExclusive() has to protect against concurrent GUP:
* * Ordinary GUP: Using the PT lock
* * GUP-fast and fork(): mm->write_protect_seq
- * * GUP-fast and KSM or temporary unmapping (swap, migration):
- * clear/invalidate+flush of the page table entry
+ * * GUP-fast and KSM or temporary unmapping (swap, migration): see
+ * page_try_share_anon_rmap()
*
* Must be called with the (sub)page that's actually referenced via the
* page table entry, which might not necessarily be the head page for a
@@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
*/
if (!PageAnon(page))
return false;
+
+ /* See page_try_share_anon_rmap() for GUP-fast details. */
+ if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
+ smp_rmb();
+
/*
* Note that PageKsm() pages cannot be exclusive, and consequently,
* cannot get pinned.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bf80adca980b..454c159f2aae 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
* @page: the exclusive anonymous page to try marking possibly shared
*
* The caller needs to hold the PT lock and has to have the page table entry
- * cleared/invalidated+flushed, to properly sync against GUP-fast.
+ * cleared/invalidated.
*
* This is similar to page_try_dup_anon_rmap(), however, not used during fork()
* to duplicate a mapping, but instead to prepare for KSM or temporarily
@@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
{
VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);

- /* See page_try_dup_anon_rmap(). */
- if (likely(!is_device_private_page(page) &&
- unlikely(page_maybe_dma_pinned(page))))
- return -EBUSY;
+ /* device private pages cannot get pinned via GUP. */
+ if (unlikely(is_device_private_page(page))) {
+ ClearPageAnonExclusive(page);
+ return 0;
+ }

+ /*
+ * We have to make sure that while we clear PageAnonExclusive, that
+ * the page is not pinned and that concurrent GUP-fast won't succeed in
+ * concurrently pinning the page.
+ *
+ * Conceptually, GUP-fast pinning code of anon pages consists of:
+ * (1) Read the PTE
+ * (2) Pin the mapped page
+ * (3) Check if the PTE changed by re-reading it; back off if so.
+ * (4) Check if PageAnonExclusive is not set; back off if so.
+ *
+ * Conceptually, PageAnonExclusive clearing code consists of:
+ * (1) Clear PTE
+ * (2) Check if the page is pinned; back off if so.
+ * (3) Clear PageAnonExclusive
+ * (4) Restore PTE (optional)
+ *
+ * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
+ * the right order. Memory order between (2) and (3) is handled by
+ * GUP-fast, independent of PageAnonExclusive.
+ *
+ * When clearing PageAnonExclusive(), we have to make sure that (1),
+ * (2), (3) and (4) happen in the right order.
+ *
+ * Note that (4) has to happen after (3) in both cases to handle the
+ * corner case whereby the PTE is restored to the original value after
+ * clearing PageAnonExclusive and while GUP-fast might not detect the
+ * PTE change, it will detect the PageAnonExclusive change.
+ *
+ * We assume that there might not be a memory barrier after
+ * clearing/invalidating the PTE (1) and before restoring the PTE (4),
+ * so we use explicit ones here.
+ *
+ * These memory barriers are paired with memory barriers in GUP-fast
+ * code, including gup_must_unshare().
+ */
+
+ /* Clear/invalidate the PTE before checking for PINs. */
+ if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
+ smp_mb();
+
+ if (unlikely(page_maybe_dma_pinned(page)))
+ return -EBUSY;
ClearPageAnonExclusive(page);
+
+ /* Clear PageAnonExclusive() before eventually restoring the PTE. */
+ if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
+ smp_mb__after_atomic();
return 0;
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9414ee57c5b..2aef8d76fcf2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2140,6 +2140,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*
* In case we cannot clear PageAnonExclusive(), split the PMD
* only and let try_to_migrate_one() fail later.
+ *
+ * See page_try_share_anon_rmap(): invalidate PMD first.
*/
anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
@@ -3177,6 +3179,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
pmdval = pmdp_invalidate(vma, address, pvmw->pmd);

+ /* See page_try_share_anon_rmap(): invalidate PMD first. */
anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
if (anon_exclusive && page_try_share_anon_rmap(page)) {
set_pmd_at(mm, address, pvmw->pmd, pmdval);
diff --git a/mm/ksm.c b/mm/ksm.c
index d7526c705081..971cf923c0eb 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1091,6 +1091,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
goto out_unlock;
}

+ /* See page_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive && page_try_share_anon_rmap(page)) {
set_pte_at(mm, pvmw.address, pvmw.pte, entry);
goto out_unlock;
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d65476..47e955212f15 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -193,20 +193,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
bool anon_exclusive;
pte_t swp_pte;

+ ptep_get_and_clear(mm, addr, ptep);
+
+ /* See page_try_share_anon_rmap(): clear PTE first. */
anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
- if (anon_exclusive) {
- flush_cache_page(vma, addr, pte_pfn(*ptep));
- ptep_clear_flush(vma, addr, ptep);
-
- if (page_try_share_anon_rmap(page)) {
- set_pte_at(mm, addr, ptep, pte);
- unlock_page(page);
- put_page(page);
- mpfn = 0;
- goto next;
- }
- } else {
- ptep_get_and_clear(mm, addr, ptep);
+ if (anon_exclusive && page_try_share_anon_rmap(page)) {
+ set_pte_at(mm, addr, ptep, pte);
+ unlock_page(page);
+ put_page(page);
+ mpfn = 0;
+ goto next;
}

migrate->cpages++;
diff --git a/mm/rmap.c b/mm/rmap.c
index edc06c52bc82..b3a21ea9f3d0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1579,11 +1579,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
} else {
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
- /*
- * Nuke the page table entry. When having to clear
- * PageAnonExclusive(), we always have to flush.
- */
- if (should_defer_flush(mm, flags) && !anon_exclusive) {
+ /* Nuke the page table entry. */
+ if (should_defer_flush(mm, flags)) {
/*
* We clear the PTE but do not flush so potentially
* a remote CPU could still be writing to the folio.
@@ -1714,6 +1711,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
page_vma_mapped_walk_done(&pvmw);
break;
}
+
+ /* See page_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive &&
page_try_share_anon_rmap(subpage)) {
swap_free(entry);
@@ -2045,6 +2044,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
}
VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
!anon_exclusive, subpage);
+
+ /* See page_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive &&
page_try_share_anon_rmap(subpage)) {
if (folio_test_hugetlb(folio))
--
2.37.1


--
Thanks,

David / dhildenb