[PATCH v6 00/18] Transparent Contiguous PTEs for User Mappings

From: Ryan Roberts
Date: Thu Feb 15 2024 - 06:10:44 EST


Hi All,

This is a series to opportunistically and transparently use contpte mappings
(set the contiguous bit in ptes) for user memory when those mappings meet the
requirements. The change benefits arm64, but there is some (very) minor
refactoring for x86 to enable its integration with core-mm.

It is part of a wider effort to improve performance by allocating and mapping
variable-sized blocks of memory (folios). One aim is for the 4K kernel to
approach the performance of the 16K kernel, but without breaking compatibility
and without the associated increase in memory. Another aim is to benefit the 16K
and 64K kernels by enabling 2M THP, since this is the contpte size for those
kernels. We have good performance data that demonstrates both aims are being met
(see below).

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And for anonymous memory, "multi-size THP" is now upstream.


Patch Layout
============

In this version, I've split the patches to better show each optimization:

- 1-2: mm prep: misc code and docs cleanups
- 3-6: mm,arm64,x86 prep: Add pte_advance_pfn() and make pte_next_pfn() a
generic wrapper around it
- 7-11: arm64 prep: Refactor ptep helpers into new layer
- 12: functional contpte implementation
- 23-18: various optimizations on top of the contpte implementation


Testing
=======

I've tested this series on both Ampere Altra (bare metal) and Apple M2 (VM):
- mm selftests (inc new tests written for multi-size THP); no regressions
- Speedometer Java script benchmark in Chromium web browser; no issues
- Kernel compilation; no issues
- Various tests under high memory pressure with swap enabled; no issues


Performance
===========

High Level Use Cases
~~~~~~~~~~~~~~~~~~~~

First some high level use cases (kernel compilation and speedometer JavaScript
benchmarks). These are running on Ampere Altra (I've seen similar improvements
on Android/Pixel 6).

baseline: mm-unstable (mTHP switched off)
mTHP: + enable 16K, 32K, 64K mTHP sizes "always"
mTHP + contpte: + this series
mTHP + contpte + exefolio: + patch at [6], which series supports

Kernel Compilation with -j8 (negative is faster):

| kernel | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline | 0.0% | 0.0% | 0.0% |
| mTHP | -5.0% | -39.1% | -0.7% |
| mTHP + contpte | -6.0% | -41.4% | -1.5% |
| mTHP + contpte + exefolio | -7.8% | -43.1% | -3.4% |

Kernel Compilation with -j80 (negative is faster):

| kernel | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline | 0.0% | 0.0% | 0.0% |
| mTHP | -5.0% | -36.6% | -0.6% |
| mTHP + contpte | -6.1% | -38.2% | -1.6% |
| mTHP + contpte + exefolio | -7.4% | -39.2% | -3.2% |

Speedometer (positive is faster):

| kernel | runs_per_min |
|:--------------------------|--------------|
| baseline | 0.0% |
| mTHP | 1.5% |
| mTHP + contpte | 3.2% |
| mTHP + contpte + exefolio | 4.5% |


Micro Benchmarks
~~~~~~~~~~~~~~~~

The following microbenchmarks are intended to demonstrate the performance of
fork() and munmap() do not regress. I'm showing results for order-0 (4K)
mappings, and for order-9 (2M) PTE-mapped THP. Thanks to David for sharing his
benchmarks.

baseline: mm-unstable + batch zap [7] series
contpte-basic: + patches 0-19; functional contpte implementation
contpte-batch: + patches 20-23; implement new batched APIs
contpte-inline: + patch 24; __always_inline to help compiler
contpte-fold: + patch 25; fold contpte mapping when sensible

Primary platform is Ampere Altra bare metal. I'm also showing results for M2 VM
(on top of MacOS) for reference, although experience suggests this might not be
the most reliable for performance numbers of this sort:

| FORK | order-0 | order-9 |
| Ampere Altra |------------------------|------------------------|
| (pte-map) | mean | stdev | mean | stdev |
|----------------|------------|-----------|------------|-----------|
| baseline | 0.0% | 2.7% | 0.0% | 0.2% |
| contpte-basic | 6.3% | 1.4% | 1948.7% | 0.2% |
| contpte-batch | 7.6% | 2.0% | -1.9% | 0.4% |
| contpte-inline | 3.6% | 1.5% | -1.0% | 0.2% |
| contpte-fold | 4.6% | 2.1% | -1.8% | 0.2% |

| MUNMAP | order-0 | order-9 |
| Ampere Altra |------------------------|------------------------|
| (pte-map) | mean | stdev | mean | stdev |
|----------------|------------|-----------|------------|-----------|
| baseline | 0.0% | 0.5% | 0.0% | 0.3% |
| contpte-basic | 1.8% | 0.3% | 1104.8% | 0.1% |
| contpte-batch | -0.3% | 0.4% | 2.7% | 0.1% |
| contpte-inline | -0.1% | 0.6% | 0.9% | 0.1% |
| contpte-fold | 0.1% | 0.6% | 0.8% | 0.1% |

| FORK | order-0 | order-9 |
| Apple M2 VM |------------------------|------------------------|
| (pte-map) | mean | stdev | mean | stdev |
|----------------|------------|-----------|------------|-----------|
| baseline | 0.0% | 1.4% | 0.0% | 0.8% |
| contpte-basic | 6.8% | 1.2% | 469.4% | 1.4% |
| contpte-batch | -7.7% | 2.0% | -8.9% | 0.7% |
| contpte-inline | -6.0% | 2.1% | -6.0% | 2.0% |
| contpte-fold | 5.9% | 1.4% | -6.4% | 1.4% |

| MUNMAP | order-0 | order-9 |
| Apple M2 VM |------------------------|------------------------|
| (pte-map) | mean | stdev | mean | stdev |
|----------------|------------|-----------|------------|-----------|
| baseline | 0.0% | 0.6% | 0.0% | 0.4% |
| contpte-basic | 1.6% | 0.6% | 233.6% | 0.7% |
| contpte-batch | 1.9% | 0.3% | -3.9% | 0.4% |
| contpte-inline | 2.2% | 0.8% | -1.6% | 0.9% |
| contpte-fold | 1.5% | 0.7% | -1.7% | 0.7% |

Misc
~~~~

John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
some workloads at [8], when using 64K base page kernel.


---
This series applies on top of [7], which in turn applies on top of mm-unstable
(649936c3db47). A branch is available at [9].

I believe this is ready to go into mm-unstable as soon as [7] is in (which I
also believe to be ready). Catalin has said he is happy for this to go via the
mm-untable branch, once suitably acked by arm64 folks; Mark has reviewed v5 and
I've made all of his suggested minor changes - hopefully he is comfortable
acking the remaining patches now.

Note: the quoted perf numbers are against v5, but I've rerun a subset of the
tests against this version and results are consistent. Code changes are all
minor and are not expected to make any difference anyway.


Changes since v5 [5]
====================

- Keep pte_next_pfn() as generic wrapper around pte_advance_pfn(). Allowed
dropping arm and powerpc changes (per David)
- Reshaped "Refactor ptep helpers into new layer" into 4 patches to better
show the conversion process (no change to code diff) (per Mark)
- Added comment to justify pte_mknoncont() at public API interface (per Mark)
- Added check for efi_mm in mm_is_user() (per Mark)
- Simplified a couple of pointer alignment statements (per Mark)
- Added braces for if in contpte_try_unfold_partial() (per Mark)
- Renamed some variables in __contpte_try_fold() (per Mark)
- Fixed couple of comment typos (per Mark)
- Enhanced docs for pte_batch_hint() (per David)
- Minor tidy up in folio_pte_batch() (per David)
- Picked up RBs/Acks (thanks to David, Mark, Ard)


Changes since v4 [4]
====================

- Rebased onto David's generic fork and zap batching work
- I had an implementation similar to this prior to v4, but ditched it
because I couldn't make it reliably provide a speedup; David succeeded.
- roughly speaking, a few functions get renamed compared to v4:
- pte_batch_remaining() -> pte_batch_hint()
- set_wrprotects() -> wrprotect_ptes()
- clear_ptes() -> [get_and_]clear_full_ptes()
- Had to convert pte_next_pfn() to pte_advance_pfn()
- Integration into core-mm is simpler because most has been done by
David's work
- Reworked patches to better show the progression from basic implementation to
the various optimizations.
- Removed the 'full' flag that I added to set_ptes() and set_wrprotects() in
v4: I've been able to make up most of the performance in other ways, so this
keeps the interface simpler.
- Simplified contpte_set_ptes(nr > 1): Observed that set_ptes(nr > 1) is only
called for ptes that are initially not present. So updated the spec to
require that, and no longer need to check if any ptes are initially present
when applying a contpte mapping.


Changes since v3 [3]
====================

- Added v3#1 to batch set_ptes() when splitting a huge pmd to ptes; avoids
need to fold contpte blocks for perf improvement
- Separated the clear_ptes() fast path into its own inline function (Alistair)
- Reworked core-mm changes to copy_present_ptes() and zap_pte_range() to
remove overhead when memory is all order-0 folios (for arm64 and !arm64)
- Significant optimization of arm64 backend fork operations (set_ptes_full()
and set_wrprotects()) to ensure no regression when memory is order-0 folios.
- fixed local variable declarations to be reverse xmas tree. - Added
documentation for the new backend APIs (pte_batch_remaining(),
set_ptes_full(), clear_ptes(), ptep_set_wrprotects())
- Renamed tlb_get_guaranteed_space() -> tlb_reserve_space() and pass requested
number of slots. Avoids allocating memory when not needed; perf improvement.


Changes since v2 [2]
====================

- Removed contpte_ptep_get_and_clear_full() optimisation for exit() (v2#14),
and replaced with a batch-clearing approach using a new arch helper,
clear_ptes() (v3#2 and v3#15) (Alistair and Barry)
- (v2#1 / v3#1)
- Fixed folio refcounting so that refcount >= mapcount always (DavidH)
- Reworked batch demarcation to avoid pte_pgprot() (DavidH)
- Reverted return semantic of copy_present_page() and instead fix it up in
copy_present_ptes() (Alistair)
- Removed page_cont_mapped_vaddr() and replaced with simpler logic
(Alistair)
- Made batch accounting clearer in copy_pte_range() (Alistair)
- (v2#12 / v3#13)
- Renamed contpte_fold() -> contpte_convert() and hoisted setting/
clearing CONT_PTE bit to higher level (Alistair)


Changes since v1 [1]
====================

- Export contpte_* symbols so that modules can continue to call inline
functions (e.g. ptep_get) which may now call the contpte_* functions (thanks
to JohnH)
- Use pte_valid() instead of pte_present() where sensible (thanks to Catalin)
- Factor out (pte_valid() && pte_cont()) into new pte_valid_cont() helper
(thanks to Catalin)
- Fixed bug in contpte_ptep_set_access_flags() where TLBIs were missed (thanks
to Catalin)
- Added ARM64_CONTPTE expert Kconfig (enabled by default) (thanks to Anshuman)
- Simplified contpte_ptep_get_and_clear_full()
- Improved various code comments


[1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@xxxxxxx/
[2] https://lore.kernel.org/linux-arm-kernel/20231115163018.1303287-1-ryan.roberts@xxxxxxx/
[3] https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-1-ryan.roberts@xxxxxxx/
[4] https://lore.kernel.org/lkml/20231218105100.172635-1-ryan.roberts@xxxxxxx/
[5] https://lore.kernel.org/linux-mm/633af0a7-0823-424f-b6ef-374d99483f05@xxxxxxx/
[6] https://lore.kernel.org/lkml/08c16f7d-f3b3-4f22-9acc-da943f647dc3@xxxxxxx/
[7] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@xxxxxxxxxx/
[8] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@xxxxxxxxxx/
[9] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v6


Thanks,
Ryan

Ryan Roberts (18):
mm: Clarify the spec for set_ptes()
mm: thp: Batch-collapse PMD with set_ptes()
mm: Introduce pte_advance_pfn() and use for pte_next_pfn()
arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()
x86/mm: Convert pte_next_pfn() to pte_advance_pfn()
mm: Tidy up pte_next_pfn() definition
arm64/mm: Convert READ_ONCE(*ptep) to ptep_get(ptep)
arm64/mm: Convert set_pte_at() to set_ptes(..., 1)
arm64/mm: Convert ptep_clear() to ptep_get_and_clear()
arm64/mm: New ptep layer to manage contig bit
arm64/mm: Split __flush_tlb_range() to elide trailing DSB
arm64/mm: Wire up PTE_CONT for user mappings
arm64/mm: Implement new wrprotect_ptes() batch API
arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs
mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()
arm64/mm: Implement pte_batch_hint()
arm64/mm: __always_inline to improve fork() perf
arm64/mm: Automatically fold contpte mappings

arch/arm64/Kconfig | 9 +
arch/arm64/include/asm/pgtable.h | 411 ++++++++++++++++++++++++++----
arch/arm64/include/asm/tlbflush.h | 13 +-
arch/arm64/kernel/efi.c | 4 +-
arch/arm64/kernel/mte.c | 2 +-
arch/arm64/kvm/guest.c | 2 +-
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/contpte.c | 404 +++++++++++++++++++++++++++++
arch/arm64/mm/fault.c | 12 +-
arch/arm64/mm/fixmap.c | 4 +-
arch/arm64/mm/hugetlbpage.c | 40 +--
arch/arm64/mm/kasan_init.c | 6 +-
arch/arm64/mm/mmu.c | 16 +-
arch/arm64/mm/pageattr.c | 6 +-
arch/arm64/mm/trans_pgd.c | 6 +-
arch/x86/include/asm/pgtable.h | 8 +-
include/linux/efi.h | 5 +
include/linux/pgtable.h | 32 ++-
mm/huge_memory.c | 58 +++--
mm/memory.c | 19 +-
20 files changed, 924 insertions(+), 134 deletions(-)
create mode 100644 arch/arm64/mm/contpte.c

--
2.25.1