Re: [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages

From: Tiberiu Georgescu
Date: Fri Aug 13 2021 - 11:22:11 EST


Hello Peter,

Sorry for commenting so late.

> On 7 Aug 2021, at 04:25, Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> When shmem pages are swapped out, instead of clearing the pte entry, we leave a
> marker pte showing that this page is swapped out as a hint for pagemap. A new
> TTU flag is introduced to identify this case.
>
> This can be useful for detecting swapped out cold shmem pages. Then after some
> memory background scanning work (which will fault in the shmem page and
> confusing page reclaim), we can do MADV_PAGEOUT explicitly on this page to swap
> it out again as we know it was cold.
>
> For pagemap, we don't need to explicitly set PM_SWAP bit, because by nature
> SWP_PTE_MARKER ptes are already counted as PM_SWAP due to it's format as swap.
>
> Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> ---
> fs/proc/task_mmu.c | 1 +
> include/linux/rmap.h | 1 +
> mm/rmap.c | 19 +++++++++++++++++++
> mm/vmscan.c | 2 +-
> 4 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index eb97468dfe4c..21b8594abc1d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1384,6 +1384,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> if (pm->show_pfn)
> frame = swp_type(entry) |
> (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> + /* NOTE: this covers PTE_MARKER_PAGEOUT too */
> flags |= PM_SWAP;
> if (is_pfn_swap_entry(entry))
> page = pfn_swap_entry_to_page(entry);
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index c976cc6de257..318a0e95c7fb 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -95,6 +95,7 @@ enum ttu_flags {
> * do a final flush if necessary */
> TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
> * caller holds it */
> + TTU_HINT_PAGEOUT = 0x100, /* Hint for pageout operation */
> };
>
> #ifdef CONFIG_MMU
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b9eb5c12f3fe..24a70b36b6da 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1384,6 +1384,22 @@ void page_remove_rmap(struct page *page, bool compound)
> unlock_page_memcg(page);
> }
>
> +static inline void
> +pte_marker_install(struct vm_area_struct *vma, pte_t *pte,
> + struct page *page, unsigned long address)
> +{
> +#ifdef CONFIG_PTE_MARKER_PAGEOUT
> + swp_entry_t entry;
> + pte_t pteval;
> +
> + if (vma_is_shmem(vma) && !PageAnon(page) && pte_none(*pte)) {
> + entry = make_pte_marker_entry(PTE_MARKER_PAGEOUT);
> + pteval = swp_entry_to_pte(entry);
> + set_pte_at(vma->vm_mm, address, pte, pteval);
> + }
> +#endif
> +}
> +
> /*
> * @arg: enum ttu_flags will be passed to this argument
> */
> @@ -1628,6 +1644,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> */
> dec_mm_counter(mm, mm_counter_file(page));
> }
> +
> + if (flags & TTU_HINT_PAGEOUT)
> + pte_marker_install(vma, pvmw.pte, page, address);
> discard:
> /*
> * No need to call mmu_notifier_invalidate_range() it has be
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4620df62f0ff..4754af6fa24b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1493,7 +1493,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> * processes. Try to unmap it here.
> */
> if (page_mapped(page)) {
> - enum ttu_flags flags = TTU_BATCH_FLUSH;
> + enum ttu_flags flags = TTU_BATCH_FLUSH | TTU_HINT_PAGEOUT;
> bool was_swapbacked = PageSwapBacked(page);
>
> if (unlikely(PageTransHuge(page)))
> --
> 2.32.0
>
The RFC looks good to me. It makes it much cleaner to verify whether the page
is in swap or not, and there is great performance benefit compared to my patch.

Currently, your patch does not have a way of putting the swap offset onto the
entry when a shared page is swapped, so it does not cover all use cases
IMO.

To counter that, I added my patch on top of yours. By making use of your
mechanism, we don't need to read the page cache xarray when we pages
are definitely none. This reduces much unnecessary overhead.

Main diff in fs/proc/task_mmu.c:pte_to_pagemap_entry():
} else if (is_swap_pte(pte)) {
swp_entry_t entry;
...
entry = pte_to_swp_entry(pte);
+ if (is_pte_marker_entry(entry)) {
+ void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
...

For reference: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@xxxxxxxxxxx/.

After some preliminary performance tests on VMs, I noticed a significant
performance improvement in my patch, when yours is added. Here is the
most interesting result:

Program I tested it on: Same as https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@xxxxxxxxxxx/

Specs:
Memory: 32GB memory, 64GB swap
Params: 16GB allocated shmem, 50% dirty, 4GB cgroup, no batching

In short, the pagemap read deals with
* ~4GB of physical memory (present pages),
* ~4GB in swap (swapped pages)
* 8GB non-dirty (none pages).

Results:
+------------+-------+-------+
| Peter\Tibi | Pre | Post | (in seconds)
+------------+-------+-------+
| Pre | 11.82 | 38.44 |
| Post | 13.28 | 22.34 |
+------------+-------+-------+

With your patch, mine yields the results in twice the time, compared to 4x.
I am aware it is not ideal, but until it is decided whether a whole different
system is preferred to PTE markers, our solutions together currently deliver
the best performance for correctness. Thank you for finding this solution.

I am happy to introduce a configuration variable to enable my pagemap
patch only if necessary. Either way, if performance is a must, batching is still
the best way to access multiple pagemap entries.

Looking forward to updates,
Tibi