Re: [PATCH v5 0/6] workload-specific and memory pressure-driven zswap writeback

From: Chris Li
Date: Sun Nov 19 2023 - 03:55:04 EST


On Fri, Nov 17, 2023 at 8:23 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Thu, Nov 16, 2023 at 4:57 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > Hi Nhat,
> >
> > I want to share the high-level feedback we discussed here on the
> > mailing list as well.
> >
> > It is my observation that a per-memcg LRU list can't compare page
> > age with the lists of other memcgs.
> > It works great when a leaf-level memcg hits its memory limit and you
> > want to reclaim from that memcg.
> > It works less well under global memory pressure, when you need to
> > reclaim from all memcgs. You kind of have to
> > scan every child memcg to find the best page to shrink. That makes it
> > harder to get to the most desirable page quickly.
> >
> > This can benefit from a design similar to MGLRU. This idea is
> > suggested by Yu Zhao, credit goes to him not me.
> > In other words, the current patch is similar to the memcg page list
> > in the pre-MGLRU world. We can have an MGLRU-like
> > per-memcg zswap shrink list.
>
> I was gonna summarize the points myself :P But thanks for doing this.
> It's your idea so you're more qualified to explain this anyway ;)

The MGLRU-like shrinker was Yu Zhao's idea. I just observed the problem.

>
> I absolutely agree that having a generation-aware cgroup-aware
> NUMA-aware LRU is the future way to go. Currently, IIUC, the reclaim logic
> selects cgroups in a round-robin-ish manner. It's "fair" from this perspective,
> but I also think it's not ideal. As we have discussed, the current list_lru
> infrastructure only takes into account intra-cgroup relative recency, not
> inter-cgroup relative recency. The recently proposed time-based zswap
> reclaim mechanism will provide us with a source of information, but the
> overhead of using this might be too high - and it's very zswap-specific.

I don't mind that it is zswap-specific, as long as it is effective.
The overhead is twofold:
1) memory overhead of storing a timestamp per compressed page.
2) CPU overhead of reading timestamps.
Using MGLRU likely has an advantage over timestamps on both memory and
CPU. The generation can use fewer bits and doesn't require reading the
time on every page.
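
A rough user-space sketch of the two costs, not kernel code. Every name
here (entry_with_timestamp, entry_with_generation, GEN_BITS, gen_age) is
hypothetical, the 2-bit generation width is an assumption, and the
generation layout assumes the existing entry state has spare bits to
lend, which may not hold for the real zswap entry:

```c
#include <assert.h>
#include <stdint.h>

#define GEN_BITS 2                      /* assumed width: 4 generations */
#define NR_GENS  (1u << GEN_BITS)

/* Hypothetical layout: a full timestamp stored next to existing state. */
struct entry_with_timestamp {
	uint32_t state;                 /* stand-in for what the entry already holds */
	uint64_t stored_ns;             /* timestamp recorded at store time */
};

/* Hypothetical layout: the generation borrows spare bits of existing state. */
struct entry_with_generation {
	unsigned int state : 32 - GEN_BITS;
	unsigned int gen   : GEN_BITS;  /* generation number, wraps modulo NR_GENS */
};

/* Memory side: the generation variant should not grow the entry. */
static_assert(sizeof(struct entry_with_generation) <
	      sizeof(struct entry_with_timestamp),
	      "generation layout expected to be smaller on common ABIs");

/*
 * CPU side: relative age is a subtraction and a mask against a counter
 * the caller already holds, with no clock read per entry. The result is
 * only meaningful while fewer than NR_GENS generations are live.
 */
static unsigned int gen_age(unsigned int current_gen, unsigned int entry_gen)
{
	return (current_gen - entry_gen) & (NR_GENS - 1);
}
```

With these definitions, an entry tagged with generation 3 has age 2 once
the current generation has wrapped around to 1.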

> Maybe after this, we should improve zswap reclaim (and perhaps all
> list_lru users) by adding generations to list_lru then take generations
> into account in the vmscan code. This patch series could be merged

One high-level idea is that we can take the page's generation from
MGLRU before the page goes into zswap, and somehow retain that
generation in the zpool LRU.
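
A rough user-space sketch of that idea, not a proposal for the actual
data structures. All names (memcg_zswap_lru, zswap_lru_add,
oldest_populated_seq, pick_victim) are hypothetical. It also assumes,
purely for illustration, that generation sequence numbers are comparable
across memcgs (for example, taken from one shared counter), which this
thread does not settle:

```c
#include <assert.h>
#include <stddef.h>

#define NR_GENS 4 /* assumed number of live generations */

/* Hypothetical per-memcg zswap state: entry counts bucketed by generation. */
struct memcg_zswap_lru {
	unsigned long nr_entries[NR_GENS]; /* bucket index = seq % NR_GENS */
	unsigned long min_seq;             /* oldest live generation */
	unsigned long max_seq;             /* youngest live generation */
};

/* On store: the entry inherits the generation the page had before zswap. */
static void zswap_lru_add(struct memcg_zswap_lru *lru, unsigned long page_seq)
{
	/* Buckets are only unambiguous while the live window fits NR_GENS. */
	assert(lru->max_seq - lru->min_seq < NR_GENS);

	if (page_seq < lru->min_seq)
		page_seq = lru->min_seq;
	if (page_seq > lru->max_seq)
		page_seq = lru->max_seq;
	lru->nr_entries[page_seq % NR_GENS]++;
}

/* Oldest generation that still holds entries, or -1 if the memcg is empty. */
static long oldest_populated_seq(const struct memcg_zswap_lru *lru)
{
	for (unsigned long seq = lru->min_seq; seq <= lru->max_seq; seq++)
		if (lru->nr_entries[seq % NR_GENS])
			return (long)seq;
	return -1;
}

/*
 * Global pressure: choose the memcg holding the oldest generation by
 * checking at most NR_GENS counters per memcg, instead of walking
 * entries. Returns the index of the chosen memcg, or -1 if all are empty.
 */
static int pick_victim(const struct memcg_zswap_lru *lrus, size_t n)
{
	int victim = -1;
	long best = -1;

	for (size_t i = 0; i < n; i++) {
		long seq = oldest_populated_seq(&lrus[i]);

		if (seq < 0)
			continue;
		if (victim < 0 || seq < best) {
			victim = (int)i;
			best = seq;
		}
	}
	return victim;
}
```

The point of the sketch is only that a retained generation gives global
reclaim something to compare across memcgs, which the intra-memcg list
order alone does not.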

> as-is, and once we make list_lru generation-aware, zswap shrinker
> will automagically be improved (along with all other list_lru/shrinker
> users).

I don't think it will improve automatically. You will need to rewrite
a lot of code in the shrinker as well to make the best use of an MGLRU zpool.

>
> I don't know enough about the current design of MGLRU to comment
> too much further, but let me know if this makes sense, and if you have
> objections/other ideas.

Taking a step-by-step approach is fine by me as long as we are
making steady progress towards the better end goal.

>
> And if you have documentation for MGLRU other than its code, could
> you please let me know? I'm struggling to find more details about this.

I would need to learn MGLRU myself. We can share and compare notes
when we get to it.

Chris