[RFC PATCH v1 0/7] PAGE_SIZE Unmapping in Memory Failure Recovery for HugeTLB Pages

From: Jiaqi Yan
Date: Thu Apr 27 2023 - 20:41:51 EST


Goal
====
Currently once a byte in a HugeTLB hugepage becomes HWPOISON, the whole
hugepage will be unmapped from the page table because that is the finest
granularity of the mapping.

High granularity mapping (HGM) [1], the functionality to map memory
addresses at finer granularities (in the extreme case, PAGE_SIZE), was
recently proposed upstream. It provides the opportunity to handle memory
errors more efficiently: instead of unmapping the whole hugepage, only the
raw HWPOISON subpage in the hugepage needs to be thrown away, and all the
healthy subpages can still be kept available for users.

Idea
====
Today memory failure recovery for HugeTLB pages (hugepages) is different
from that for raw and THP pages. We are only interested in in-use hugepages,
which are dealt with in these simplified steps:
1. Increment the refcount on the compound head of the hugepage.
2. Insert the raw HWPOISON page to the compound head’s raw_hwp_list
(_hugetlb_hwpoison) if it is not already in the list.
3. Unmap the entire hugepage from HugeTLB’s page table.
4. Kill the processes that are accessing the poisoned hugepage.

HGM can greatly improve this recovery mechanism. Step #3 (unmapping the
entire hugepage) can be replaced by:
3.1 Map the entire hugepage at finer granularity, so that the exact
HWPOISON address is mapped by a PAGE_SIZE PTE, and the rest of the
address range is optimally mapped by either smaller P*Ds or PTEs. In
other words, the original HugeTLB PTE is split into smaller P*Ds
and PTEs.
3.2 Only unmap the newly mapped PTE that maps the HWPOISON address.

For shared mappings, the current HGM patches are already a solid basis for
the splitting functionality in step #3.1. This RFC drafts a complete
solution for shared mappings. The splitting-based idea can be applied to
private mappings as well, but additional subtle complexity needs to be
dealt with. We defer the private mapping case as future work.

Splitting HugeTLB PTEs (Step #3.1)
==================================
The general process of splitting a present leaf HugeTLB PTE is:
1. Get and clear the original HugeTLB PTE old_pte.
2. Initialize curr with the start of the address range corresponding to
old_pte.
3. Find the optimal level we should map curr at.
4. Perform HGM walk on curr with the optimal level found in step 3,
potentially allocating a new PTE at the optimal level.
5. Populate the newly allocated PTE with bits from old_pte, including
dirty, write, and UFFD_WP.
6. Advance curr by the size of the newly created PTE and repeat from step 3
until the entire VMA is covered.
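
The walk in steps 2-6 can be illustrated with a small userspace model. This
is only a sketch under stated assumptions, not the code in this series: it
assumes the x86-64 sizes (1 GiB, 2 MiB, 4 KiB), a single poisoned address,
and a walk bounded by the range old_pte covered. The names level_size,
split_stats, optimal_level and split_hugepage are hypothetical, and it only
counts the entries a split would create instead of touching page tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed x86-64 mapping sizes, largest first: 1 GiB, 2 MiB, 4 KiB. */
static const uint64_t level_size[] = { 1ULL << 30, 1ULL << 21, 1ULL << 12 };
#define NR_LEVELS (sizeof(level_size) / sizeof(level_size[0]))

/* Hypothetical result type: how many new entries each level received. */
struct split_stats {
	size_t nr_entries[NR_LEVELS];
	uint64_t poison_entry_size;	/* size of the entry mapping the poison */
};

static int contains(uint64_t base, uint64_t size, uint64_t addr)
{
	return addr >= base && addr - base < size;
}

/*
 * Step 3: pick the largest size that is aligned at curr, fits before end,
 * and does not cover the poisoned address. Only the smallest level is
 * allowed to map the poisoned address.
 */
static size_t optimal_level(uint64_t curr, uint64_t end, uint64_t poison)
{
	size_t i;

	for (i = 0; i < NR_LEVELS - 1; i++) {
		uint64_t sz = level_size[i];

		if (curr % sz == 0 && end - curr >= sz &&
		    !contains(curr, sz, poison))
			return i;
	}
	return NR_LEVELS - 1;
}

static struct split_stats split_hugepage(uint64_t start, uint64_t hsize,
					 uint64_t poison)
{
	struct split_stats st = { { 0 }, 0 };
	uint64_t end = start + hsize;
	uint64_t curr = start;				/* step 2 */

	assert(hsize >= level_size[NR_LEVELS - 1] && start % hsize == 0);
	assert(contains(start, hsize, poison));
	while (curr < end) {
		size_t lvl = optimal_level(curr, end, poison);	/* step 3 */

		st.nr_entries[lvl]++;			/* steps 4-5 */
		if (contains(curr, level_size[lvl], poison))
			st.poison_entry_size = level_size[lvl];
		curr += level_size[lvl];		/* step 6 */
	}
	return st;
}
```

In this model a poisoned 1 GiB hugepage ends up as 511 entries of 2 MiB plus
512 entries of 4 KiB, and a poisoned 2 MiB hugepage as 512 entries of 4 KiB;
in both cases the poisoned address is covered by a 4 KiB entry.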

Splitting a hugepage mapping is mostly not meaningful for none PTEs. We
handle none or userfaultfd write protect (UFFD_WP) marker HugeTLB PTEs at
the time of page faulting. Migration and HWPOISON PTEs are better left
untouched.

Memory Failure Recovery and Unmapping (Step #3.2)
=================================================
A few changes are made in memory_failure and rmap to only unmap raw
HWPOISON pages:
1. As long as HGM is turned on in CONFIG, memory_failure attempts to enable
HGM on the VMA containing the poisoned hugepage.
2. memory_failure attempts to split the HugeTLB PTE so that poisoned
address is mapped by a PAGE_SIZE PTE, for all the VMAs containing the
poisoned hugepage.
3. get_huge_page_for_hwpoison only returns -EHWPOISON if the raw page is
already in the compound head’s raw_hwp_list. This makes unmapping work
correctly when multiple raw pages in the same hugepage become HWPOISON.
4. rmap utilizes the compound head’s raw_hwp_list to 1) avoid unmapping raw
pages not in the list, and 2) keep track of whether the raw pages in the
list are already unmapped.
5. The page refcount check in me_huge_page is skipped.
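
The bookkeeping in items 3 and 4 can be modelled in userspace as follows.
This is a hedged sketch, not the kernel data structure: the in-kernel
raw_hwp_page records a struct page on an llist, whereas this model keys
entries by a plain pfn on a singly linked list, and the unmapped flag is an
assumption about how "already unmapped" could be tracked. The names
raw_hwp_entry, raw_hwp_add and raw_hwp_should_unmap are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* asm-generic value of EHWPOISON; defined locally to stay portable. */
#define SKETCH_EHWPOISON 133

/* Hypothetical userspace stand-in for one raw_hwp_list entry. */
struct raw_hwp_entry {
	struct raw_hwp_entry *next;
	unsigned long pfn;	/* raw poisoned page, identified by pfn */
	bool unmapped;		/* assumed flag: already unmapped by rmap */
};

static struct raw_hwp_entry *raw_hwp_find(struct raw_hwp_entry *head,
					  unsigned long pfn)
{
	for (; head; head = head->next)
		if (head->pfn == pfn)
			return head;
	return NULL;
}

/*
 * Item 3: report -EHWPOISON only when this raw page is already listed,
 * so a second raw page in the same hugepage is still recorded.
 */
static int raw_hwp_add(struct raw_hwp_entry **head, unsigned long pfn)
{
	struct raw_hwp_entry *e;

	if (raw_hwp_find(*head, pfn))
		return -SKETCH_EHWPOISON;
	e = malloc(sizeof(*e));
	if (!e)
		return -ENOMEM;
	e->pfn = pfn;
	e->unmapped = false;
	e->next = *head;
	*head = e;
	return 0;
}

/*
 * Item 4: unmap a raw page only if it is in the list and has not been
 * unmapped yet; remember that it has been handled.
 */
static bool raw_hwp_should_unmap(struct raw_hwp_entry *head,
				 unsigned long pfn)
{
	struct raw_hwp_entry *e = raw_hwp_find(head, pfn);

	if (!e || e->unmapped)
		return false;
	e->unmapped = true;
	return true;
}
```

With this model, a second error on a different raw page of the same hugepage
is added and unmapped independently, while a repeated report for the same raw
page is rejected with -EHWPOISON.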

Between mmap() and Page Fault
=============================
A memory error can occur between the time when userspace maps a hugepage
and the time when userspace faults in the mapped hugepage. The general idea
is to not create any raw-page-size page table entry for HWPOISON memory,
and to keep memory in healthy raw pages available to userspace (via
normal fault handling). At the time of hugetlb_no_page:
- If the entire hugepage doesn’t contain any HWPOISON page, the normal
page fault handler continues.
- If the memory address being faulted is within a HWPOISON raw page,
hugetlb_no_page returns VM_FAULT_HWPOISON_LARGE (so that page fault
handler sends a BUS_MCEERR_AR SIGBUS to the faulting process).
- If the memory address being faulted is within a healthy raw page,
hugetlb_no_page utilizes HGM to create a new HugeTLB PTE whose
hugetlb_pte_size is as large as possible while not mapping any
HWPOISON address. Then the normal page fault handler continues.
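
The three cases above can be sketched as a pure decision function. This is a
userspace illustration under stated assumptions, not the hugetlb_no_page
change itself: it assumes the x86-64 sizes (1 GiB, 2 MiB, 4 KiB) and takes
the poisoned addresses as a plain array. The names fault_action,
fault_decision, has_poison and decide_fault are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed x86-64 mapping sizes, largest first: 1 GiB, 2 MiB, 4 KiB. */
static const uint64_t map_size[] = { 1ULL << 30, 1ULL << 21, 1ULL << 12 };
#define NR_SIZES (sizeof(map_size) / sizeof(map_size[0]))

enum fault_action { FAULT_NORMAL, FAULT_HWPOISON, FAULT_HGM };

struct fault_decision {
	enum fault_action action;
	uint64_t size;		/* size of the mapping to create, 0 if none */
};

static bool has_poison(uint64_t base, uint64_t size,
		       const uint64_t *poison, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (poison[i] >= base && poison[i] - base < size)
			return true;
	return false;
}

static struct fault_decision decide_fault(uint64_t hstart, uint64_t hsize,
					  uint64_t addr,
					  const uint64_t *poison, size_t n)
{
	/* Default: map a single raw page through HGM. */
	struct fault_decision d = { FAULT_HGM, map_size[NR_SIZES - 1] };

	assert(addr >= hstart && addr - hstart < hsize);

	/* Case 1: no poison anywhere, take the normal fault path. */
	if (!has_poison(hstart, hsize, poison, n)) {
		d.action = FAULT_NORMAL;
		d.size = hsize;
		return d;
	}
	/* Case 2: the faulting raw page itself is poisoned. */
	if (has_poison(addr & ~(d.size - 1), d.size, poison, n)) {
		d.action = FAULT_HWPOISON;
		d.size = 0;
		return d;
	}
	/* Case 3: largest aligned block around addr that avoids poison. */
	for (size_t i = 0; i < NR_SIZES; i++) {
		uint64_t sz = map_size[i];

		if (sz < hsize && !has_poison(addr & ~(sz - 1), sz, poison, n)) {
			d.size = sz;
			break;
		}
	}
	return d;
}
```

For a 1 GiB hugepage with one poisoned raw page, a fault in an unaffected
2 MiB region yields a 2 MiB mapping in this model, while a fault on a healthy
raw page next to the poisoned one yields a 4 KiB mapping.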

Failure Handling
================
- If the kernel still fails to allocate a new raw_hwp_page after a retry,
memory_failure returns MF_IGNORED with MF_MSG_UNKNOWN.
- For each VMA that maps the HWPOISON hugepage:
  - If the VMA is not eligible for HGM, the old behavior is taken: unmap
    the entire hugepage from that VMA.
  - If memory_failure fails to enable HGM on the VMA, or if memory_failure
    fails to split any VMA that mapped the HWPOISON page, the recovery
    returns MF_IGNORED with MF_MSG_UNMAP_FAILED.
- For a particular VMA, if splitting HugeTLB PTE fails, the original PTE
will be restored to the page table.

Code Changes
============
The code patches in this RFC are based on HGM patchset V2 [1] and are
composed of two parts. The first part implements the idea laid out in the
cover letter; the second part tests two major scenarios: HWPOISON on already
faulted pages and HWPOISON between mapped and faulted.

Future Changes
==============
There is a pending improvement to hugetlbfs_read_iter. If a hugepage is
found in the page cache and it contains HWPOISON subpages, today the kernel
returns -EIO immediately. With the new splitting-then-unmap behavior, the
kernel can return to userspace every byte up to the first raw HWPOISON
byte. If userspace wants the read to start within a raw HWPOISON page, the
kernel will have to return -EIO. This improvement and its selftest will be
done in a future patch series.
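
The intended read semantics can be sketched as a length computation. This is
a userspace illustration of the planned behavior, not existing
hugetlbfs_read_iter code: offsets are relative to the start of the hugepage,
the raw page size is assumed to be 4 KiB, and the names RAW_PAGE_SIZE and
hugepage_readable are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define RAW_PAGE_SIZE 4096ULL	/* assumed raw page size */

/*
 * Returns how many bytes a read of len bytes starting at off may return,
 * or -EIO if the read starts inside a poisoned raw page. poison_off holds
 * the offsets of poisoned bytes within the hugepage.
 */
static int64_t hugepage_readable(uint64_t off, uint64_t len, uint64_t hsize,
				 const uint64_t *poison_off, size_t n)
{
	uint64_t limit = hsize;	/* first offset that must not be read */

	assert(off < hsize);
	for (size_t i = 0; i < n; i++) {
		/* Start of the raw page containing this poisoned byte. */
		uint64_t p = poison_off[i] & ~(RAW_PAGE_SIZE - 1);

		if (off >= p && off - p < RAW_PAGE_SIZE)
			return -EIO;
		if (p > off && p < limit)
			limit = p;
	}
	return (int64_t)(len < limit - off ? len : limit - off);
}
```

In this model, a read that starts before a poisoned raw page is truncated at
the start of that page, and a read that starts after it is unaffected unless
another poisoned raw page follows.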

[1] https://lore.kernel.org/all/20230218002819.1486479-1-jthoughton@xxxxxxxxxx/

Jiaqi Yan (7):
hugetlb: add HugeTLB splitting functionality
hugetlb: create PTE level mapping when possible
mm: publish raw_hwp_page in mm.h
mm/memory_failure: unmap raw HWPoison PTEs when possible
hugetlb: only VM_FAULT_HWPOISON_LARGE raw page
selftest/mm: test PAGESIZE unmapping HWPOISON pages
selftest/mm: test PAGESIZE unmapping UFFD WP marker HWPOISON pages

include/linux/hugetlb.h | 14 +
include/linux/mm.h | 36 ++
mm/hugetlb.c | 405 ++++++++++++++++++++++-
mm/memory-failure.c | 206 ++++++++++--
mm/rmap.c | 38 ++-
tools/testing/selftests/mm/hugetlb-hgm.c | 364 ++++++++++++++++++--
6 files changed, 1004 insertions(+), 59 deletions(-)

--
2.40.1.495.gc816e09b53d-goog