[PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

From: Hugh Dickins
Date: Tue Jun 20 2023 - 03:52:27 EST


Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is not linked to mm_context_t::pgtable_list while either half is
pending free, by either tlb_remove_table() or pte_free_defer(); but is
afterwards either relinked to the list (if other half is allocated), or
freed (if other half is free): by __tlb_remove_table() in both cases.

This rule ensures that page->lru is no longer in use while page->rcu_head
may be needed for use by pte_free_defer(). And a fortuitous byproduct of
following this rule is that page_table_free() no longer needs its curious
two-step manipulation of _refcount - read commit c2c224932fd0 ("s390/mm:
fix 2KB pgtable release race") for what to think of there. But it does
not solve the problem that two halves may need rcu_head at the same time.

For that, add HHead bits between s390's AAllocated and PPending bits in
the upper byte of page->_refcount: then the second pte_free_defer() can
see that rcu_head is already in use, and the RCU callee pte_free_half()
can see that it needs to make a further call_rcu() for that other half.

page_table_alloc() set the page->pt_mm field, so __tlb_remove_table()
knows where to link the freed half while its other half is allocated.
But linking to the list needs mm->context.lock: and although AA bit set
guarantees that pt_mm must still be valid, it does not guarantee that mm
is still valid an instant later: so acquiring mm->context.lock would not
be safe. For now, use a static global mm_pgtable_list_lock instead:
then a soon-to-follow commit will split it per-mm as before (probably by
using a SLAB_TYPESAFE_BY_RCU structure for the list head and its lock);
and update the commentary on the pgtable_list.

Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
---
arch/s390/include/asm/pgalloc.h | 4 +
arch/s390/mm/pgalloc.c | 205 +++++++++++++++++++++++---------
include/linux/mm_types.h | 2 +-
3 files changed, 154 insertions(+), 57 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
#define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
#define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)

+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
void vmem_map_init(void);
void *vmem_crst_alloc(unsigned long val);
pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..11983a3ff95a 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -159,6 +159,11 @@ void page_table_free_pgste(struct page *page)

#endif /* CONFIG_PGSTE */

+/*
+ * Temporarily use a global spinlock instead of mm->context.lock.
+ * This will be replaced by a per-mm spinlock in a followup commit.
+ */
+static DEFINE_SPINLOCK(mm_pgtable_list_lock);
/*
* A 2KB-pgtable is either upper or lower half of a normal page.
* The second half of the page may be unused or used as another
@@ -172,7 +177,7 @@ void page_table_free_pgste(struct page *page)
* When a parent page gets fully allocated it contains 2KB-pgtables in both
* upper and lower halves and is removed from mm_context_t::pgtable_list.
*
- * When 2KB-pgtable is freed from to fully allocated parent page that
+ * When 2KB-pgtable is freed from the fully allocated parent page that
* page turns partially allocated and added to mm_context_t::pgtable_list.
*
* If 2KB-pgtable is freed from the partially allocated parent page that
@@ -182,16 +187,24 @@ void page_table_free_pgste(struct page *page)
* As follows from the above, no unallocated or fully allocated parent
* pages are contained in mm_context_t::pgtable_list.
*
+ * NOTE NOTE NOTE: The commentary above and below has not yet been updated:
+ * the new rule is that a page is not linked to mm_context_t::pgtable_list
+ * while either half is pending free by any method; but afterwards is
+ * either relinked to it, or freed, by __tlb_remove_table(). This allows
+ * pte_free_defer() to use the page->rcu_head (which overlays page->lru).
+ *
* The upper byte (bits 24-31) of the parent page _refcount is used
* for tracking contained 2KB-pgtables and has the following format:
*
- * PP AA
- * 01234567 upper byte (bits 24-31) of struct page::_refcount
- * || ||
- * || |+--- upper 2KB-pgtable is allocated
- * || +---- lower 2KB-pgtable is allocated
- * |+------- upper 2KB-pgtable is pending for removal
- * +-------- lower 2KB-pgtable is pending for removal
+ * PPHHAA
+ * 76543210 upper byte (bits 24-31) of struct page::_refcount
+ * ||||||
+ * |||||+--- lower 2KB-pgtable is allocated
+ * ||||+---- upper 2KB-pgtable is allocated
+ * |||+----- lower 2KB-pgtable is pending free by page->rcu_head
+ * ||+------ upper 2KB-pgtable is pending free by page->rcu_head
+ * |+------- lower 2KB-pgtable is pending free by any method
+ * +-------- upper 2KB-pgtable is pending free by any method
*
* (See commit 620b4e903179 ("s390: use _refcount for pgtables") on why
* using _refcount is possible).
@@ -200,7 +213,7 @@ void page_table_free_pgste(struct page *page)
* The parent page is either:
* - added to mm_context_t::pgtable_list in case the second half of the
* parent page is still unallocated;
- * - removed from mm_context_t::pgtable_list in case both hales of the
+ * - removed from mm_context_t::pgtable_list in case both halves of the
* parent page are allocated;
* These operations are protected with mm_context_t::lock.
*
@@ -239,32 +252,22 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
/* Try to get a fragment of a 4K page as a 2K page table */
if (!mm_alloc_pgste(mm)) {
table = NULL;
- spin_lock_bh(&mm->context.lock);
+ spin_lock_bh(&mm_pgtable_list_lock);
if (!list_empty(&mm->context.pgtable_list)) {
page = list_first_entry(&mm->context.pgtable_list,
struct page, lru);
mask = atomic_read(&page->_refcount) >> 24;
- /*
- * The pending removal bits must also be checked.
- * Failure to do so might lead to an impossible
- * value of (i.e 0x13 or 0x23) written to _refcount.
- * Such values violate the assumption that pending and
- * allocation bits are mutually exclusive, and the rest
- * of the code unrails as result. That could lead to
- * a whole bunch of races and corruptions.
- */
- mask = (mask | (mask >> 4)) & 0x03U;
- if (mask != 0x03U) {
- table = (unsigned long *) page_to_virt(page);
- bit = mask & 1; /* =1 -> second 2K */
- if (bit)
- table += PTRS_PER_PTE;
- atomic_xor_bits(&page->_refcount,
- 0x01U << (bit + 24));
- list_del(&page->lru);
- }
+ /* Cannot be on this list if either half pending free */
+ WARN_ON_ONCE(mask & ~0x03U);
+ /* One or other half must be available, but not both */
+ WARN_ON_ONCE(mask == 0x00U || mask == 0x03U);
+ table = (unsigned long *)page_to_virt(page);
+ bit = mask & 0x01U; /* =1 -> second 2K available */
+ table += bit * PTRS_PER_PTE;
+ atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
+ list_del(&page->lru);
}
- spin_unlock_bh(&mm->context.lock);
+ spin_unlock_bh(&mm_pgtable_list_lock);
if (table)
return table;
}
@@ -278,6 +281,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
}
arch_set_page_dat(page, 0);
/* Initialize page table */
+ page->pt_mm = mm;
table = (unsigned long *) page_to_virt(page);
if (mm_alloc_pgste(mm)) {
/* Return 4K page table with PGSTEs */
@@ -288,14 +292,14 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
/* Return the first 2K fragment of the page */
atomic_xor_bits(&page->_refcount, 0x01U << 24);
memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
- spin_lock_bh(&mm->context.lock);
+ spin_lock_bh(&mm_pgtable_list_lock);
list_add(&page->lru, &mm->context.pgtable_list);
- spin_unlock_bh(&mm->context.lock);
+ spin_unlock_bh(&mm_pgtable_list_lock);
}
return table;
}

-static void page_table_release_check(struct page *page, void *table,
+static void page_table_release_check(struct page *page, unsigned long *table,
unsigned int half, unsigned int mask)
{
char msg[128];
@@ -317,21 +321,18 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
if (!mm_alloc_pgste(mm)) {
/* Free 2K page table fragment of a 4K page */
bit = ((unsigned long) table & ~PAGE_MASK)/(PTRS_PER_PTE*sizeof(pte_t));
- spin_lock_bh(&mm->context.lock);
+ spin_lock_bh(&mm_pgtable_list_lock);
/*
- * Mark the page for delayed release. The actual release
- * will happen outside of the critical section from this
- * function or from __tlb_remove_table()
+ * Mark the page for release. The actual release will happen
+ * below from this function, or later from __tlb_remove_table().
*/
- mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
+ mask = atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
mask >>= 24;
- if (mask & 0x03U)
+ if (mask & 0x03U) /* other half is allocated */
list_add(&page->lru, &mm->context.pgtable_list);
- else
+ else if (!(mask & 0x30U)) /* other half not pending */
list_del(&page->lru);
- spin_unlock_bh(&mm->context.lock);
- mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
- mask >>= 24;
+ spin_unlock_bh(&mm_pgtable_list_lock);
if (mask != 0x00U)
return;
half = 0x01U << bit;
@@ -362,19 +363,17 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
return;
}
bit = ((unsigned long) table & ~PAGE_MASK) / (PTRS_PER_PTE*sizeof(pte_t));
- spin_lock_bh(&mm->context.lock);
+ spin_lock_bh(&mm_pgtable_list_lock);
/*
- * Mark the page for delayed release. The actual release will happen
- * outside of the critical section from __tlb_remove_table() or from
- * page_table_free()
+ * Mark the page for delayed release.
+ * The actual release will happen later, from __tlb_remove_table().
*/
mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
mask >>= 24;
- if (mask & 0x03U)
- list_add_tail(&page->lru, &mm->context.pgtable_list);
- else
+ /* Other half not allocated? Other half not already pending free? */
+ if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
list_del(&page->lru);
- spin_unlock_bh(&mm->context.lock);
+ spin_unlock_bh(&mm_pgtable_list_lock);
table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
tlb_remove_table(tlb, table);
}
@@ -382,17 +381,40 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
void __tlb_remove_table(void *_table)
{
unsigned int mask = (unsigned long) _table & 0x03U, half = mask;
- void *table = (void *)((unsigned long) _table ^ mask);
+ unsigned long *table = (unsigned long *)((unsigned long) _table ^ mask);
struct page *page = virt_to_page(table);

switch (half) {
case 0x00U: /* pmd, pud, or p4d */
- free_pages((unsigned long)table, CRST_ALLOC_ORDER);
+ __free_pages(page, CRST_ALLOC_ORDER);
return;
case 0x01U: /* lower 2K of a 4K page table */
- case 0x02U: /* higher 2K of a 4K page table */
- mask = atomic_xor_bits(&page->_refcount, mask << (4 + 24));
- mask >>= 24;
+ case 0x02U: /* upper 2K of a 4K page table */
+ /*
+ * If the other half is marked as allocated, page->pt_mm must
+ * still be valid, page->rcu_head no longer in use so page->lru
+ * good for use, so now make the freed half available for reuse.
+ * But be wary of races with that other half being freed.
+ */
+ if (atomic_read(&page->_refcount) & (0x03U << 24)) {
+ struct mm_struct *mm = page->pt_mm;
+ /*
+ * It is safe to use page->pt_mm when the other half
+ * is seen allocated while holding pgtable_list lock;
+ * but how will it be safe to acquire that spinlock?
+ * Global mm_pgtable_list_lock is safe and easy for
+ * now, then a followup commit will split it per-mm.
+ */
+ spin_lock_bh(&mm_pgtable_list_lock);
+ mask = atomic_xor_bits(&page->_refcount, mask << 28);
+ mask >>= 24;
+ if (mask & 0x03U)
+ list_add(&page->lru, &mm->context.pgtable_list);
+ spin_unlock_bh(&mm_pgtable_list_lock);
+ } else {
+ mask = atomic_xor_bits(&page->_refcount, mask << 28);
+ mask >>= 24;
+ }
if (mask != 0x00U)
return;
break;
@@ -407,6 +429,77 @@ void __tlb_remove_table(void *_table)
__free_page(page);
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now0(struct rcu_head *head);
+static void pte_free_now1(struct rcu_head *head);
+
+static void pte_free_pgste(struct rcu_head *head)
+{
+ unsigned long *table;
+ struct page *page;
+
+ page = container_of(head, struct page, rcu_head);
+ table = (unsigned long *)page_to_virt(page);
+ table = (unsigned long *)((unsigned long)table | 0x03U);
+ __tlb_remove_table(table);
+}
+
+static void pte_free_half(struct rcu_head *head, unsigned int bit)
+{
+ unsigned long *table;
+ struct page *page;
+ unsigned int mask;
+
+ page = container_of(head, struct page, rcu_head);
+ mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
+
+ table = (unsigned long *)page_to_virt(page);
+ table += bit * PTRS_PER_PTE;
+ table = (unsigned long *)((unsigned long)table | (0x01U << bit));
+ __tlb_remove_table(table);
+
+ /* If pte_free_defer() of the other half came in, queue it now */
+ if (mask & 0x0CU)
+ call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
+}
+
+static void pte_free_now0(struct rcu_head *head)
+{
+ pte_free_half(head, 0);
+}
+
+static void pte_free_now1(struct rcu_head *head)
+{
+ pte_free_half(head, 1);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+ unsigned int bit, mask;
+ struct page *page;
+
+ page = virt_to_page(pgtable);
+ if (mm_alloc_pgste(mm)) {
+ call_rcu(&page->rcu_head, pte_free_pgste);
+ return;
+ }
+ bit = ((unsigned long)pgtable & ~PAGE_MASK) /
+ (PTRS_PER_PTE * sizeof(pte_t));
+
+ spin_lock_bh(&mm_pgtable_list_lock);
+ mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
+ mask >>= 24;
+ /* Other half not allocated? Other half not already pending free? */
+ if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
+ list_del(&page->lru);
+ spin_unlock_bh(&mm_pgtable_list_lock);
+
+ /* Do not relink on rcu_head if other half already linked on rcu_head */
+ if ((mask & 0x0CU) != 0x0CU)
+ call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/*
* Base infrastructure required to generate basic asces, region, segment,
* and page tables that do not make use of enhanced features like EDAT1.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..1667a1bdb8a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -146,7 +146,7 @@ struct page {
pgtable_t pmd_huge_pte; /* protected by page->ptl */
unsigned long _pt_pad_2; /* mapping */
union {
- struct mm_struct *pt_mm; /* x86 pgds only */
+ struct mm_struct *pt_mm; /* x86 pgd, s390 */
atomic_t pt_frag_refcount; /* powerpc */
};
#if ALLOC_SPLIT_PTLOCKS
--
2.35.3