Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap

From: Yuanchu Xie
Date: Wed Jan 10 2024 - 14:24:24 EST


On Fri, Dec 22, 2023 at 7:40 AM Henry Huang <henry.hj@xxxxxxxxxxxx> wrote:
> > - are pages ever shared between different memcg hierarchies? You
> > mentioned sharing between processes in A and A/B, but I'm wondering
> > if there is sharing between two different memcg hierarchies where root
> > is the only common ancestor?
>
> Yes, there is another really common case:
> If the Docker graph driver is overlayfs, different containers that use the
> same image, or share the same lower layers, share the file cache of common
> binaries and libraries (e.g. libc.so).
Does this present a problem with setting memcg limits or OOMs? It
seems like deterministically charging shared pages would be highly
desirable. Mina Almasry previously proposed a memcg= mount option to
implement deterministic charging[1], but it wasn't a generic sharing
mechanism. Nonetheless, the problem remains, and it would be
interesting to learn if this presents any issues for you.

[1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@xxxxxxxxxx/
>
> > - do you anticipate a shorter scan period at some point? Proactively
> > reclaiming all memory colder than one hour is a long time :) Are you
> > concerned at all about the cost of doing your current idle bit
> > harvesting approach becoming too expensive if you significantly reduce
> > the scan period?
>
> We don't want the owner of the application to feel a significant
> performance degradation when using swap. Reclaiming pages whose idle age
> is less than 1 hour carries a high risk. We have internal tests and
> data analysis to support this.
>
> We disabled global swappiness and memcg swappiness.
> Only proactive reclaim can swap anon pages.
>
> What's more, we see that MGLRU has a more efficient way to scan the PTE accessed bits.
> We would prefer to use the MGLRU scan to help us find and select idle pages.
I'm working on a kernel driver/per-memcg interface to perform aging
with MGLRU, including configuration for the MGLRU page scanning
optimizations. I suspect scanning the PTE accessed bits for pages
charged to a foreign memcg ad-hoc has some performance implications,
and the more general solution is to charge in a predetermined way,
which makes the scanning on behalf of the foreign memcg a bit cleaner.
Ad-hoc scanning is possible nonetheless, but a bit hacky. Let me know if
you have any ideas.
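
For anyone following along, the idle bit harvesting discussed above goes
through /sys/kernel/mm/page_idle/bitmap, where the page at PFN i maps to
bit i % 64 of the 8-byte, native-endian array element i / 64. Below is a
rough sketch of that indexing, not anyone's production scanner. The helper
names bitmap_location and is_idle are made up for illustration, and the
caller is assumed to pass an already opened, seekable binary file object.

```python
import struct

WORD_BYTES = 8       # the bitmap is only accessed in 8-byte units
BITS_PER_WORD = 64   # each 8-byte word covers 64 consecutive PFNs


def bitmap_location(pfn):
    """Return (byte_offset, bit_mask) locating a PFN in the bitmap.

    Hypothetical helper: the name is illustrative, the arithmetic follows
    the documented layout (PFN i -> bit i % 64 of word i / 64).
    """
    if pfn < 0:
        raise ValueError("pfn must be non-negative")
    return (pfn // BITS_PER_WORD) * WORD_BYTES, 1 << (pfn % BITS_PER_WORD)


def is_idle(bitmap, pfn):
    """Return True if the bit for `pfn` is set in `bitmap`.

    `bitmap` is any seekable binary file object, for example the sysfs
    bitmap file opened in binary mode (an assumption of this sketch).
    """
    offset, mask = bitmap_location(pfn)
    bitmap.seek(offset)
    raw = bitmap.read(WORD_BYTES)
    if len(raw) != WORD_BYTES:
        raise EOFError("PFN is beyond the end of the bitmap")
    (word,) = struct.unpack("=Q", raw)  # "=Q": native byte order, u64
    return bool(word & mask)
```

Reading one word per PFN like this is the expensive part when the scan
period shrinks; a real harvester would read large aligned chunks and test
many bits per read.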
>
> > - is proactive reclaim being driven by writing to memory.reclaim, by
> > enforcing a smaller memory.high, or something else?
>
> Because all page info and idle ages are stored in userspace, the kernel can't get
> this information directly. We have a private patch that adds a new reclaim interface
> to support reclaiming pages with specific PFNs.
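
For comparison, the upstream route asked about above is cgroup v2
memory.reclaim, which takes a byte count and fails the write with EAGAIN
when less than the requested amount was reclaimed. A minimal sketch of
driving it from userspace follows. This is not the private PFN-based
interface described in the quote; request_reclaim and the cgroup directory
argument are hypothetical names chosen for illustration.

```python
import errno
import os


def request_reclaim(cgroup_dir, nr_bytes):
    """Ask the kernel to reclaim `nr_bytes` from the memcg at `cgroup_dir`.

    Returns True if the write succeeded, False if the kernel reported
    EAGAIN (fewer bytes reclaimed than requested). Hypothetical helper.
    """
    if nr_bytes <= 0:
        raise ValueError("nr_bytes must be positive")
    path = os.path.join(cgroup_dir, "memory.reclaim")
    fd = os.open(path, os.O_WRONLY)
    try:
        # Unbuffered write so any error surfaces right here.
        os.write(fd, str(nr_bytes).encode("ascii"))
    except OSError as err:
        if err.errno == errno.EAGAIN:
            return False
        raise
    finally:
        os.close(fd)
    return True
```

Unlike a PFN-based interface, this only says how much to reclaim, so the
kernel rather than userspace picks the victim pages.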
Thanks for sharing! It's been enlightening to learn about different
prod environments.