[RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table

From: Chih-En Lin
Date: Tue Sep 27 2022 - 12:27:47 EST


Currently, copy-on-write is only used for the mapped memory; the child
process still needs to copy the entire page table from the parent
process during fork. Copying the page table can cost the parent a lot
of time and memory when it has a large page table allocated. For
example, the memory usage of a process after forking with 1 GB of
mapped memory is as follows:

DEFAULT FORK
             parent        child
VmRSS:   1049688 kB   1048688 kB
VmPTE:      2096 kB      2096 kB

This patch introduces copy-on-write (COW) for the PTE-level page
tables. COW PTE improves performance in situations where the user needs
copies of a program to run in isolated environments. Feedback-based
fuzzers (e.g., AFL) and serverless/microservice frameworks are two
major examples. For instance, COW PTE achieves a 9.3x throughput
increase when running SQLite on a fuzzer (AFL). Since COW PTE only
boosts performance in some cases, the patch adds a new sysctl,
vm.cow_pte, which takes a process ID (PID) as input to allow the user
to enable COW PTE for a specific process.

To handle the page table state of each process that shares a PTE table,
the patch introduces the concept of COW PTE table ownership. The
implementation uses the address of the PMD entry that indexes the PTE
table to track its ownership. This helps maintain the state of the COW
PTE tables, such as the RSS and pgtable_bytes counters. Some PTE tables
(e.g., tables containing pinned pages) still need to be copied
immediately for consistency with the current COW logic. As a result, a
flag, COW_PTE_OWNER_EXCLUSIVE, is added to the table's owner pointer to
indicate whether the PTE table is exclusive (i.e., owned by only one
task at a time). Every time a PTE table is copied during fork, the
owner pointer (and thus the exclusive flag) is checked to determine
whether the PTE table can be shared across processes.

This patch uses a refcount to track the shared page table's lifetime.
Invoking fork with COW PTE increases the refcount. A refcount of 1
means that the page table is not currently shared with another process,
but may be shared later. When someone writes to a shared PTE table, the
write fault breaks COW PTE. If the shared PTE table's refcount is one,
the process that triggered the fault reuses the shared PTE table.
Otherwise, the process decreases the refcount and either copies the
information to a new PTE table or, if it owns the shared PTE table,
dereferences all the information and changes the owner.

After applying COW to PTE, the memory usage after forking is as follows:

COW PTE
             parent        child
VmRSS:   1049968 kB      2576 kB
VmPTE:      2096 kB        44 kB

The results show that this patch significantly decreases memory usage.
Other improvements, such as lower fork latency and page fault latency,
which are the major benefits, are discussed below.

Real-world applications
=======================

We ran fuzzing and VM-cloning benchmarks, comparing the normal fork
against the fork with COW PTE.

With AFL (LLVM mode) and SQLite, COW PTE (503.67 execs/sec) achieves a
9.3x throughput increase over the normal fork version (53.86 execs/sec).

fork
       execs_per_sec     unix_time        time
count      26.000000  2.600000e+01   26.000000
mean       53.861538  1.663145e+09   84.423077
std         3.715063  5.911357e+01   59.113567
min        35.980000  1.663145e+09    0.000000
25%        54.440000  1.663145e+09   32.250000
50%        54.610000  1.663145e+09   82.000000
75%        54.837500  1.663145e+09  140.750000
max        55.600000  1.663145e+09  178.000000

COW PTE
       execs_per_sec     unix_time        time
count      36.000000  3.600000e+01   36.000000
mean      503.674444  1.663146e+09   88.916667
std        81.805271  5.369191e+01   53.691912
min        84.910000  1.663146e+09    0.000000
25%       472.952500  1.663146e+09   44.500000
50%       504.700000  1.663146e+09   89.000000
75%       553.367500  1.663146e+09  133.250000
max       568.270000  1.663146e+09  178.000000

With TriforceAFL, a QEMU-based kernel fuzzer, COW PTE
(124.31 execs/sec) achieves a 1.3x throughput increase over the
normal fork version (96.44 execs/sec).

fork
       execs_per_sec     unix_time        time
count      18.000000  1.800000e+01   18.000000
mean       96.436667  1.663146e+09   84.388889
std        25.260184  6.601795e+01   66.017947
min         6.590000  1.663146e+09    0.000000
25%        91.025000  1.663146e+09   21.250000
50%       100.350000  1.663146e+09   92.000000
75%       111.247500  1.663146e+09  146.750000
max       122.260000  1.663146e+09  169.000000

COW PTE
       execs_per_sec     unix_time        time
count      22.000000  2.200000e+01   22.000000
mean      124.305455  1.663147e+09   90.409091
std        32.508728  6.033846e+01   60.338457
min         6.590000  1.663146e+09    0.000000
25%       113.227500  1.663146e+09   26.250000
50%       122.435000  1.663147e+09  112.000000
75%       145.792500  1.663147e+09  141.500000
max       161.280000  1.663147e+09  168.000000

Comparison with uffd
====================

For RFC v1, David Hildenbrand mentioned that uffd-wp is a new way of
snapshotting in QEMU. There is some overlap between uffd and fork use
cases, such as database snapshotting. So the following microbenchmarks
also measure the overhead of uffd-wp and uffd-copy-page.

To be fair in terms of CPU usage, the uffd handlers are pinned to the
same core as the main thread. The uffd-wp handler simulates the work
QEMU does with uffd-wp: it stores the page that caused the fault into a
memory buffer and then removes write protection from that page. The
uffd-copy-page handler allocates memory and replaces the original page
that caused the fault.

Microbenchmark - syscall/registering latency
============================================

We ran microbenchmarks to measure the latency of the fork syscall or of
registering uffd, with mapped memory sizes ranging from 0 to 512 MB,
for use cases that focus on lowering startup time (e.g., serverless
frameworks). The results show that the latency of a normal fork and of
registering uffd-wp reaches 10 ms and 3.9 ms respectively, while the
latency of registering uffd-copy-page is around 0.007 ms. The latency
of a fork with COW PTE stays around 0.625 ms beyond 200 MB, which is
significantly lower than the normal fork and uffd-wp. In short, with
512 MB of mapped memory, COW PTE decreases latency by 93% relative to
the normal fork and by 83% relative to uffd-wp.

Microbenchmark - page fault latency
===================================

We conducted some microbenchmarks to measure page fault latency with
different patterns of accesses to a 512 MB memory buffer after forking
or registering uffd.

In the first experiment, the program writes to all pages of the 512 MB
memory consecutively. The experiment measures the average latency of a
single access for the normal fork, fork with COW PTE, uffd-wp, and
uffd-copy-page. The results show that the page fault latency of COW PTE
(0.000045 ms) is 59.5x lower than that of uffd-wp (0.002676 ms). The
low uffd-wp performance is probably due to the cost of switching
between kernel and user mode. More interestingly, COW PTE also improves
the average page fault latency over the normal fork: 0.000045 ms versus
0.000742 ms, a 16.5x reduction. Here are the raw numbers:

Page fault - Access to the entire 512 MB memory
fork             mean: 0.000742 ms
COW PTE          mean: 0.000045 ms
uffd (wp)        mean: 0.002676 ms
uffd (copy-page) mean: 0.008667 ms

The second experiment simulates real-world applications with sparse
accesses. The program writes to one random page 1 million times and
calculates the average access time. Since the numbers for the normal
fork and COW PTE are too close to each other to conclude which one is
faster, we ran both 100 times and took the averages. The results show
that COW PTE (0.000027 ms) is similar to the normal fork (0.000028 ms)
and is 2.3x faster than uffd-wp (0.000060 ms).

Page fault - Random access
fork             mean: 0.000028 ms
COW PTE          mean: 0.000027 ms
uffd (wp)        mean: 0.000060 ms
uffd (copy-page) mean: 0.002363 ms

All the tests were run with QEMU and the kernel was built with the
x86_64 default config.

Summary
=======

In summary, COW PTE reduces the memory footprint of processes and
improves the initialization and page fault latency of various
applications. This should matter to frameworks that require very low
startup latency (e.g., serverless frameworks) or high-throughput
short-lived child processes (e.g., testing).

This patch is based on the paper "On-demand-fork: a microsecond fork
for memory-intensive and latency-sensitive applications" [1] from
Purdue University.

Any comments and suggestions are welcome.

Thanks,
Chih-En Lin

---

TODO list:
- Handle the file-backed and shmem with reclaim.
- Handle OOM, KSM, page table walker, and migration.
- Deal with TLB flush in the break COW PTE handler.

RFC v1 -> RFC v2
- Change the clone flag method to a sysctl taking a PID.
- Split the MMF_COW_PGTABLE flag into two flags, MMF_COW_PTE and
  MMF_COW_PTE_READY, for the sysctl.
- Change the owner pointer to use the folio padding.
- Handle all the VMAs that cover the PTE table when breaking COW PTE.
- Remove the self-defined refcount and use the page table page's
  _refcount instead.
- Add the exclusive flag to let a page table be owned by only one task
  in some situations.
- Invalidate the address range with the MMU notifier and start the
  write_seqcount when breaking COW PTE.
- Handle the swap cache and swapoff.

RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@xxxxxxxxx/

[1] https://dl.acm.org/doi/10.1145/3447786.3456258

This patch is based on v6.0-rc5.

---

Chih-En Lin (9):
mm: Add new mm flags for Copy-On-Write PTE table
mm: pgtable: Add sysctl to enable COW PTE
mm, pgtable: Add ownership to PTE table
mm: Add COW PTE fallback functions
mm, pgtable: Add a refcount to PTE table
mm, pgtable: Add COW_PTE_OWNER_EXCLUSIVE flag
mm: Add the break COW PTE handler
mm: Handle COW PTE with reclaim algorithm
mm: Introduce Copy-On-Write PTE table

 include/linux/mm.h             |   2 +
 include/linux/mm_types.h       |   5 +-
 include/linux/pgtable.h        | 140 +++++++++++++
 include/linux/rmap.h           |   2 +
 include/linux/sched/coredump.h |   8 +-
 kernel/fork.c                  |   5 +
 kernel/sysctl.c                |   8 +
 mm/Makefile                    |   2 +-
 mm/cow_pte.c                   |  39 ++++
 mm/gup.c                       |  13 +-
 mm/memory.c                    | 360 ++++++++++++++++++++++++++++++++-
 mm/mmap.c                      |   3 +
 mm/mremap.c                    |   3 +
 mm/page_vma_mapped.c           |   5 +
 mm/rmap.c                      |   2 +-
 mm/swapfile.c                  |   1 +
 mm/vmscan.c                    |   1 +
 17 files changed, 587 insertions(+), 12 deletions(-)
create mode 100644 mm/cow_pte.c

--
2.37.3