Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap

From: Henry Huang
Date: Fri Dec 15 2023 - 08:00:26 EST


On Fri, Dec 15, 2023 at 15:23 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>Regarding the change itself, it'd cause a slight regression to other
>use cases (details below).
>
> > @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > unsigned long pfn;
> > struct folio *folio;
> > pte_t ptent = ptep_get(pte + i);
> > + bool is_pte_young;
> >
> > total++;
> > walk->mm_stats[MM_LEAF_TOTAL]++;
> > @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > if (pfn == -1)
> > continue;
> >
> > - if (!pte_young(ptent)) {
> > - walk->mm_stats[MM_LEAF_OLD]++;
>
> Most overhead from page table scanning normally comes from
> get_pfn_folio() because it almost always causes a cache miss. This is
> like a pointer dereference, whereas scanning PTEs is like streaming an
> array (bad vs good cache performance).
>
> pte_young() is here to avoid an unnecessary cache miss from
> get_pfn_folio(). Also see the first comment in get_pfn_folio(). It
> should be easy to verify the regression -- FlameGraph from the
> memcached benchmark in the original commit message should do it.
>
> Would a tracepoint here work for you?
>
>
>
> > + is_pte_young = !!pte_young(ptent);
> > + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young);
> > + if (!folio) {
> > + if (!is_pte_young)
> > + walk->mm_stats[MM_LEAF_OLD]++;
> > continue;
> > }
> >
> > - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> > - if (!folio)
> > + if (!folio_test_clear_young(folio) && !is_pte_young) {
> > + walk->mm_stats[MM_LEAF_OLD]++;
> > continue;
> > + }
> >
> > - if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> > + if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i))
> > VM_WARN_ON_ONCE(true);
> >
> > young++;

Thanks for replying.

For avoiding below:
1. confict between page_idle/bitmap and mglru scan
2. performance downgrade in mglru page-table scan if call get_pfn_folio for each pte.

We have a new idea:
1. Include a new api under /sys/kernel/mm/page_idle, support mark idle flag only, without
rmap walking or clearing pte young.
2. Use mglru proactive scan to clear page idle flag.

workflows:
t1 t2
mark pages idle mglru scan and check pages idle

It's easy for us to know that whether a page is accessed during t1~t2.

Some code changes like these:

We clear idle flags in get_pfn_folio, and in walk_pte_range we still follow
original design.

static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
struct pglist_data *pgdat, bool can_swap, bool clear_idle)
{
struct folio *folio;

/* try to avoid unnecessary memory loads */
if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
return NULL;

folio = pfn_folio(pfn);
+
+ if (clear_idle && folio_test_idle(folio))
+ folio_clear_idle(folio);
+
if (folio_nid(folio) != pgdat->node_id)
return NULL;

if (folio_memcg_rcu(folio) != memcg)
return NULL;

/* file VMAs can contain anon pages from COW */
if (!folio_is_file_lru(folio) && !can_swap)
return NULL;

return folio;
}

static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct mm_walk *args)
{
...
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
unsigned long pfn;
struct folio *folio;
pte_t ptent = ptep_get(pte + i);

total++;
walk->mm_stats[MM_LEAF_TOTAL]++;

pfn = get_pte_pfn(ptent, args->vma, addr);
if (pfn == -1)
continue;

if (!pte_young(ptent)) {
walk->mm_stats[MM_LEAF_OLD]++;
continue;
}

+ folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, true);
- folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
if (!folio)
continue;
...
}

Is it a good idea or not ?

--
2.43.0