[PATCH 0/2] mm: Remember young bit for migration entries

From: Peter Xu
Date: Tue Aug 02 2022 - 21:22:12 EST


rfc->v1:
- Fix build for arch/um where MAX_PHYSMEM_BITS not defined [syzbot]
- Add VM_BUG_ON() in swp_offset_pfn() to check swap entry type [Ying]
- Use max_swapfile_size() to detect swp offset size [Ying]
- Posted patch 3 separately, dropped patch 4

rfc: https://lore.kernel.org/all/20220729014041.21292-1-peterx@xxxxxxxxxx/

Problem
=======

When migrating a page, we currently always mark the migrated page as old.
The reason is probably that we don't really know whether the page is hot or
cold, so old was taken as the default on the assumption that it is safer.

However, that leads to at least two problems:

(1) We lose the real hot/cold information even though we could have
preserved it. That information shouldn't change even if the backing page
changes after the migration.

(2) There is always extra overhead on the first access to any migrated
page, because the hardware MMU needs cycles to set the young bit again
(on MMUs that support setting it in hardware).

Much recent upstream work has shown that (2) is not trivial and is actually
very measurable. In my test case, reading a 1G chunk of memory, jumping in
page-size intervals, can take 99ms on a generic x86_64 system just because
of the extra work of setting the young bit, compared to 4ms if the young
bit is already set.

This issue was originally reported by Andrea Arcangeli.

Solution
========

To solve this problem, this patchset tries to remember the young bit in the
migration entries and carry it over when recovering the ptes.

We have the chance to do so because on many systems the swap offset is not
fully used. Migration entries use the swp offset to store only the PFN,
and the PFN normally needs fewer bits than the swp offset provides. That
means we have some free bits in the swp offset that we can use to store
things like the young bit, and that is how this series approaches the
problem.
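
As a rough illustration of the idea (a user-space sketch, not code from this
series), the block below packs a PFN and a young flag into one 64-bit
offset value. The demo_* names and the 40-bit PFN width are assumptions
made up for the example; the real layout is whatever the patches define in
include/linux/swapops.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout for illustration only, not the kernel's definitions. */
#define DEMO_PFN_BITS   40u  /* assumed: 52-bit physical addresses, 4K pages */
#define DEMO_PFN_MASK   ((UINT64_C(1) << DEMO_PFN_BITS) - 1)
#define DEMO_YOUNG_BIT  (UINT64_C(1) << DEMO_PFN_BITS)

/* Build an offset that carries the PFN in the low bits and young above it. */
static inline uint64_t demo_make_offset(uint64_t pfn, bool young)
{
	assert((pfn & ~DEMO_PFN_MASK) == 0);	/* PFN must fit in its field */
	return pfn | (young ? DEMO_YOUNG_BIT : 0);
}

/* Extract only the PFN; the raw offset can no longer be used as a PFN. */
static inline uint64_t demo_offset_pfn(uint64_t offset)
{
	return offset & DEMO_PFN_MASK;
}

/* Report whether the young flag was recorded in the offset. */
static inline bool demo_offset_young(uint64_t offset)
{
	return (offset & DEMO_YOUNG_BIT) != 0;
}
```

Once extra bits live in the offset, any reader that wants the PFN has to
mask them off first, which is why a dedicated PFN accessor is needed.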

max_swapfile_size() is used here to detect the per-arch offset length in
swp entries. We'll automatically remember the young bit when we find that
the swp offset field is wide enough to keep both the PFN and the young bit
for a migration entry.
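
The check can be pictured roughly as follows. This is a simplified sketch
under stated assumptions, not the check used in the patch:
demo_young_bit_fits() is a hypothetical helper, max_offsets stands in for
the number of distinct offsets the architecture can encode, and pfn_bits
stands in for the width needed to hold any valid PFN.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true when the offset field can hold a PFN of
 * pfn_bits bits plus one extra bit for the young flag.
 */
static bool demo_young_bit_fits(uint64_t max_offsets, unsigned int pfn_bits)
{
	unsigned int needed = pfn_bits + 1;	/* PFN field + young bit */

	if (needed >= 64)			/* avoid an undefined shift */
		return false;

	return max_offsets >= (UINT64_C(1) << needed);
}
```

With a 40-bit PFN, for example, an offset field offering 2^50 values has
room for the young bit, while one offering only 2^40 values does not.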

Tests
=====

With the patchset applied, the immediate read access test [1] on the above
1G chunk after migration shrinks from 99ms to 4ms. The test is done by
moving 1G of pages from node 0->1->0 and then reading them in page-size
jumps. The test was run on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
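
For readers who do not want to open [1], the measurement phase looks
roughly like the sketch below. It is not the test program itself:
demo_stride_read_ns() is a hypothetical helper that only times a read of
one byte per stride, and it does not perform the node 0->1->0 migration
that precedes the read in the real test.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: read one byte every `stride` bytes of `buf` and
 * return the elapsed time in nanoseconds. The number of reads is stored in
 * *touched when the pointer is non-NULL.
 */
static uint64_t demo_stride_read_ns(const volatile unsigned char *buf,
				    size_t len, size_t stride,
				    size_t *touched)
{
	struct timespec start, end;
	unsigned char sink = 0;
	size_t count = 0;
	int64_t ns;

	if (touched)
		*touched = 0;
	if (stride == 0)		/* a zero stride would never advance */
		return 0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (size_t off = 0; off < len; off += stride) {
		sink ^= buf[off];	/* volatile read, one per page */
		count++;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	(void)sink;

	if (touched)
		*touched = count;

	ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL +
	     (end.tv_nsec - start.tv_nsec);
	return (uint64_t)ns;
}
```

Calling it with the page size as the stride right after migration gives the
first-touch cost per page, which is where a cleared young bit shows up.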

Patch Layout
============

Patch 1: Add swp_offset_pfn() and apply it to all PFN swap entries. We
         should also stop treating swp_offset() as a PFN, because it can
         contain more information starting from the next patch.
Patch 2: The core patch to remember the young bit in swap offsets.

Please review, thanks.

[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c

Peter Xu (2):
mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
mm: Remember young bit for page migrations

arch/arm64/mm/hugetlbpage.c | 2 +-
include/linux/swapops.h | 84 ++++++++++++++++++++++++++++++++++---
mm/hmm.c | 2 +-
mm/huge_memory.c | 10 ++++-
mm/memory-failure.c | 2 +-
mm/migrate.c | 4 +-
mm/migrate_device.c | 2 +
mm/page_vma_mapped.c | 6 +--
mm/rmap.c | 3 +-
9 files changed, 99 insertions(+), 16 deletions(-)

--
2.32.0