Re: [PATCHv4 0/9] zsmalloc/zram: configurable zspage size

From: Minchan Kim
Date: Fri Nov 11 2022 - 12:03:48 EST


On Fri, Nov 11, 2022 at 09:56:36AM +0900, Sergey Senozhatsky wrote:
> Hi,
>
> On (22/11/10 14:44), Minchan Kim wrote:
> > On Mon, Oct 31, 2022 at 02:40:59PM +0900, Sergey Senozhatsky wrote:
> > > Hello,
> > >
> > > Some use-cases and/or data patterns may benefit from
> > > larger zspages. Currently the limit on the number of physical
> > > pages that are linked into a zspage is hardcoded to 4. A higher
> > > limit changes the key characteristics of a number of the size
> > > classes, improving the compactness of the pool and reducing the
> > > amount of memory the zsmalloc pool uses. More on this in the 0002
> > > commit message.
> >
> > Hi Sergey,
> >
> > I think the idea of breaking the fixed number of subpages in a
> > zspage is a really good starting point for further optimization.
> > However, I am worried about introducing a per-pool config at this
> > stage. How about introducing just one golden value for the zspage
> > size? Say, order-3 or 4 in Kconfig, keeping the default of 2?
>
> Sorry, not sure I'm following. So you want a .config value
> for the zspage limit? I really like the sysfs knob, because then
> one may set values on a per-device basis (if they have multiple
> zram devices in a system with different data patterns):

Yes, I wanted to have just a global policy that drives zsmalloc smarter
without requiring a big effort from the user to decide the right tuning
value (I thought the decision process would be quite painful for normal
users who don't have enough resources), since zsmalloc's design makes that
possible. But as an interim solution, until we prove there is no regression,
we could just provide a Kconfig option and then remove it later once we add
aggressive zspage compaction (if necessary, please see below), since
deprecating a Kconfig option is easier than deprecating a sysfs knob.

>
> zram0, which is used as a swap device, uses, say, 4
> zram1, which is a vfat block device, uses, say, 6
> zram2, which is an ext4 block device, uses, say, 8
>
> The whole point of the series is that one single value does
> not fit all purposes. There is no silver bullet.

I understand what you want to achieve with a per-pool config that exposes
the knob to the user, but my worry is still how a user could decide the
best fit, since workloads are so dynamic. Some groups have enough resources
to run fleet experiments while many others don't, so if we really need the
per-pool config step, I'd at least like to provide default guidance to
users in the documentation along with the tunable knobs for experimentation.
Maybe we can suggest 4 for the swap case and 8 for the fs case.

I don't disagree with the sysfs knobs for those use cases, but can't we
deal with the issue in a better way?

In general, the bigger pages_per_zspage is, the more memory we save. It
would be the same with slab_order in the slab allocator, but slab has a
limit because of high-order allocation cost and the internal fragmentation
of bigger-order slabs. However, zsmalloc is different in that it doesn't
expose memory addresses directly and it knows when an object is accessed
by the user. It doesn't need high-order allocations, either. That's how
zsmalloc can support object migration and page migration. With those
features, zsmalloc theoretically doesn't need a limit on pages_per_zspage,
so I am looking forward to seeing zsmalloc handle the memory fragmentation
problem in a better way.
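
To make the compactness point concrete, here is a small standalone sketch
(my own illustration, not zsmalloc's actual size-class code; it assumes 4K
pages and the class sizes are arbitrary examples) that picks the
pages_per_zspage in [1, limit] with the least waste per page:

/*
 * Standalone illustration, not zsmalloc code: for a given object size,
 * pick the pages_per_zspage in [1, limit] that wastes the fewest bytes
 * per page, and show how a larger limit helps some size classes.
 * The class sizes below are arbitrary examples.
 */
#include <stdio.h>

#define PAGE_SIZE	4096UL

static void best_fit(unsigned long size, unsigned long limit)
{
	unsigned long best_pages = 1;
	unsigned long best_waste = PAGE_SIZE % size;

	for (unsigned long pages = 2; pages <= limit; pages++) {
		unsigned long waste = (pages * PAGE_SIZE) % size;

		/* compare waste per page so bigger zspages don't cheat */
		if (waste * best_pages < best_waste * pages) {
			best_waste = waste;
			best_pages = pages;
		}
	}
	printf("size %4lu limit %2lu -> pages_per_zspage %2lu, %4lu bytes wasted per zspage\n",
	       size, limit, best_pages, best_waste);
}

int main(void)
{
	unsigned long sizes[] = { 2720, 3264, 3632 };
	unsigned long limits[] = { 4, 8, 16 };

	for (int i = 0; i < 3; i++)
		for (int j = 0; j < 3; j++)
			best_fit(sizes[i], limits[j]);

	return 0;
}

With these example sizes, a 3632-byte class wastes 464 bytes per page under
the current limit of 4 but only about 10 bytes per page once the limit is 8,
while a 2720-byte class is already well packed with 2 pages. That is exactly
why one value doesn't fit every class.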

My only concern with a bigger pages_per_zspage (e.g., 8 or 16) is exhausting
memory when zram is used for swap. That use case aims to help under memory
pressure, but in the worst case, the bigger pages_per_zspage is, the higher
the chance of running out of memory. However, we could bound the worst-case
memory waste to

for each class in pool->size_class:
	wasted_bytes += class->pages_per_zspage * PAGE_SIZE - class->size

with *aggressive zspage compaction*. Right now we rely on the shrinker to
trigger compaction (that might already be enough), but we could change the
policy so that a class is compacted once its wasted memory crosses a
threshold we define for the zram fs use case, since that use case runs
without memory pressure.
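
To make that concrete, a rough sketch of what such a trigger could look
like (purely hypothetical structure and names, my own illustration rather
than existing zsmalloc code):

#include <stdbool.h>

/*
 * Hypothetical sketch, not existing zsmalloc code: trigger compaction of
 * a size class once its wasted bytes cross a threshold, instead of only
 * when the shrinker runs under memory pressure.
 */
struct class_stats {
	unsigned long objs_allocated;	/* slots carved out of zspages */
	unsigned long objs_used;	/* slots actually holding data */
	unsigned long size;		/* object size of this class */
};

/* bytes sitting in allocated-but-unused slots of this class */
static unsigned long class_wasted_bytes(const struct class_stats *c)
{
	return (c->objs_allocated - c->objs_used) * c->size;
}

/*
 * The threshold would be the thing we tune (or document a default for)
 * in the fs use case, where the shrinker rarely fires.
 */
static bool class_should_compact(const struct class_stats *c,
				 unsigned long threshold_bytes)
{
	return class_wasted_bytes(c) > threshold_bytes;
}

The point is only that the trigger becomes "this class wastes more than N
bytes" rather than "the shrinker ran", which keeps the worst case bounded
by the formula above even when there is no memory pressure.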

What do you think?