[PATCH 1/2] mm/madvise: help MADV_PAGEOUT to find swap cache pages

From: Dave Hansen
Date: Mon Mar 23 2020 - 19:43:02 EST

From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>

tl;dr: MADV_PAGEOUT ignores unmapped swap cache pages. Enable
MADV_PAGEOUT to find and reclaim swap cache.

The long story:

While looking into another issue, I wrote a simple test with two
processes: a parent and a fork()'d child. The parent reads a
memory buffer that both processes share after the fork(), and
the child calls madvise(MADV_PAGEOUT) on the same buffer.

The first call to MADV_PAGEOUT does what is expected: it pages
the memory out and causes faults in the parent. Subsequent
calls, however, do not cause any faults in the parent.
MADV_PAGEOUT only works once! This was a surprise.
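
The test program itself is not part of this posting. A minimal
sketch of such a test, under stated assumptions, might look like
the following. The names pageout_test() and major_faults() are
hypothetical, the pipe-based handshake is just one way to order
the two processes, and the fallback value for MADV_PAGEOUT is the
asm-generic one. It only reports per-round major-fault deltas in
the parent; what those deltas turn out to be depends on the
kernel and on whether swap is configured.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* asm-generic value (Linux 5.4+); assumed if libc lacks it */
#endif

/* Major faults taken so far by the calling process, or -1 on error. */
static long major_faults(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru))
		return -1;
	return ru.ru_majflt;
}

/*
 * Hypothetical harness: in each round the child calls
 * madvise(MADV_PAGEOUT) on the buffer, then the parent reads it.
 * faults[i] receives the parent's major-fault delta for round i,
 * or -1 if the child's madvise() failed in that round.
 * Returns 0 if all rounds ran, -1 on a setup or I/O error.
 */
static int pageout_test(size_t len, int rounds, long *faults)
{
	long psize = sysconf(_SC_PAGESIZE);
	int to_child[2], to_parent[2], ret = 0;
	volatile unsigned long sum = 0;	/* keeps the reads from being elided */
	unsigned char *buf;
	char token = 0;
	pid_t pid;

	assert(len > 0 && rounds > 0 && faults);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	memset(buf, 0xaa, len);		/* populate before fork() */

	if (pipe(to_child) || pipe(to_parent))
		return -1;
	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: never touches the buffer, only advises on it. */
		close(to_child[1]);
		close(to_parent[0]);
		while (read(to_child[0], &token, 1) == 1) {
			token = madvise(buf, len, MADV_PAGEOUT) ? 1 : 0;
			if (write(to_parent[1], &token, 1) != 1)
				break;
		}
		_exit(0);
	}
	close(to_child[0]);
	close(to_parent[1]);

	for (int i = 0; i < rounds; i++) {
		long before;

		/* Ask the child for one MADV_PAGEOUT and wait for it. */
		if (write(to_child[1], &token, 1) != 1 ||
		    read(to_parent[0], &token, 1) != 1) {
			ret = -1;
			break;
		}
		before = major_faults();
		for (size_t off = 0; off < len; off += (size_t)psize)
			sum += buf[off];
		faults[i] = token ? -1 : major_faults() - before;
	}

	close(to_child[1]);		/* EOF ends the child's loop */
	close(to_parent[0]);
	waitpid(pid, NULL, 0);
	munmap(buf, len);
	return ret;
}
```

Calling pageout_test(64 << 20, 4, faults) and printing the four
entries of faults[] is enough to see whether later rounds still
fault in the parent.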

The PTEs in the shared buffer start out pte_present()==1 in
both parent and child. The first MADV_PAGEOUT operation replaces
those with pte_present()==0 swap PTEs. The parent process
quickly faults and recreates pte_present()==1. However, the
child process (the one calling MADV_PAGEOUT) never touches the
memory and has retained the non-present swap PTEs.
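
For reference, the present/swap state described above can be
observed roughly from userspace through /proc/self/pagemap, where
bit 63 of each 64-bit entry means "page present" and bit 62 means
"page swapped". The sketch below assumes that interface is
available; page_state() and the enum names are hypothetical
helpers, and the result is only an approximation of what
pte_present() sees in the kernel (non-swap "swap PTEs" such as
migration entries are not distinguished).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

enum {
	PAGE_UNKNOWN = -1,	/* pagemap could not be read */
	PAGE_NONE    = 0,	/* neither present nor swapped */
	PAGE_PRESENT = 1,	/* pagemap bit 63 set */
	PAGE_SWAPPED = 2,	/* pagemap bit 62 set */
};

/* Classify the page backing 'addr' in the calling process. */
static int page_state(const void *addr)
{
	long psize = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	off_t off;
	int state = PAGE_UNKNOWN;
	int fd;

	if (psize <= 0)
		return PAGE_UNKNOWN;

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return PAGE_UNKNOWN;

	/* One 64-bit entry per virtual page, indexed by page number. */
	off = (off_t)((uintptr_t)addr / (uintptr_t)psize) * (off_t)sizeof(entry);
	if (pread(fd, &entry, sizeof(entry), off) == (ssize_t)sizeof(entry)) {
		if (entry & (1ULL << 63))
			state = PAGE_PRESENT;
		else if (entry & (1ULL << 62))
			state = PAGE_SWAPPED;
		else
			state = PAGE_NONE;
	}
	close(fd);
	return state;
}
```

In the scenario above, the child would be expected to keep seeing
PAGE_SWAPPED for the buffer after its first MADV_PAGEOUT, while
the parent goes back to PAGE_PRESENT once it has faulted the
pages in again.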

This situation could also happen in the case where a single
process had some of its data placed in the swap cache but where
the memory has not yet been reclaimed.

The MADV_PAGEOUT code has a pte_present()==0 check, so it skips
any page whose PTE is not present. This makes unmapped swap
cache immune to MADV_PAGEOUT, which is not very friendly
behavior.

Enable MADV_PAGEOUT to find and reclaim swap cache. Because
swap cache is not pinned by holding the PTE lock, a reference
must be held until the page is isolated, at which point a
second reference is obtained.

Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Jann Horn <jannh@xxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: Daniel Colascione <dancol@xxxxxxxxxx>
Cc: "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

b/mm/madvise.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 57 insertions(+), 11 deletions(-)

diff -puN mm/madvise.c~madv-pageout-find-swap-cache mm/madvise.c
--- a/mm/madvise.c~madv-pageout-find-swap-cache 2020-03-23 16:30:48.505385896 -0700
+++ b/mm/madvise.c 2020-03-23 16:30:48.509385896 -0700
@@ -250,6 +250,52 @@ static void force_shm_swapin_readahead(s
 #endif /* CONFIG_SWAP */

 /*
+ * Given a PTE, find the corresponding 'struct page'
+ * and acquire a reference. Also handles non-present
+ * swap PTEs.
+ *
+ * Returns NULL when there is no page to reclaim.
+ */
+static struct page *pte_get_reclaim_page(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t ptent)
+{
+	swp_entry_t entry;
+	struct page *page;
+
+	/* Totally empty PTE: */
+	if (pte_none(ptent))
+		return NULL;
+
+	/* Handle present or PROT_NONE ptes: */
+	if (!is_swap_pte(ptent)) {
+		page = vm_normal_page(vma, addr, ptent);
+		if (page)
+			get_page(page);
+		return page;
+	}
+
+	/*
+	 * 'ptent' is now definitely a (non-present) swap
+	 * PTE in this process. Go look for additional
+	 * references to the swap cache.
+	 */
+
+	/*
+	 * Is it one of the "swap PTEs" that's not really
+	 * swap? Do not try to reclaim those.
+	 */
+	entry = pte_to_swp_entry(ptent);
+	if (non_swap_entry(entry))
+		return NULL;
+
+	/*
+	 * The PTE was a true swap entry. The page may be in
+	 * the swap cache.
+	 */
+	return lookup_swap_cache(entry, vma, addr);
+}
+
+/*
  * Schedule all required I/O operations. Do not wait for completion.
  */
 static long madvise_willneed(struct vm_area_struct *vma,
@@ -398,13 +444,8 @@ regular_page:
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;

-		if (pte_none(ptent))
-			continue;
-
-		if (!pte_present(ptent))
-			continue;
-
-		page = vm_normal_page(vma, addr, ptent);
+		/* 'page' can be mapped, in the swap cache or both */
+		page = pte_get_reclaim_page(vma, addr, ptent);
 		if (!page)
 			continue;

@@ -413,9 +454,10 @@ regular_page:
 		 * are sure it's worth. Split it if we are only owner.
 		 */
 		if (PageTransCompound(page)) {
-			if (page_mapcount(page) != 1)
+			if (page_mapcount(page) != 1) {
+				put_page(page);
 				break;
-			get_page(page);
+			}
 			if (!trylock_page(page)) {
 				put_page(page);
 				break;
@@ -436,12 +478,14 @@ regular_page:
 		}

 		/* Do not interfere with other mappings of this page */
-		if (page_mapcount(page) != 1)
+		if (page_mapcount(page) != 1) {
+			put_page(page);
 			continue;
+		}

 		VM_BUG_ON_PAGE(PageTransCompound(page), page);

-		if (pte_young(ptent)) {
+		if (!is_swap_pte(ptent) && pte_young(ptent)) {
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			ptent = pte_mkold(ptent);
@@ -466,6 +510,8 @@ regular_page:
 			}
 		} else
 			deactivate_page(page);
+		/* drop ref acquired in pte_get_reclaim_page() */
+		put_page(page);
 	}

 	arch_leave_lazy_mmu_mode();
_