Re: [PATCH v7 12/12] mm: multigenerational LRU: documentation

From: Yu Zhao
Date: Mon Feb 21 2022 - 20:47:47 EST


On Mon, Feb 21, 2022 at 2:02 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
>
> On Tue, Feb 15, 2022 at 08:22:10PM -0700, Yu Zhao wrote:
> > On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:
> >
> > > > +====== ========
> > > > +Values Features
> > > > +====== ========
> > > > +0x0001 the multigenerational LRU
> > >
> > > The multigenerational LRU what?
> >
> > Itself? This depends on the POV, and I'm trying to determine what would
> > be the natural way to present it.
> >
> > MGLRU itself could be seen as an add-on atop the existing page reclaim
> > or an alternative in parallel. The latter would be similar to sl[aou]b,
> > and that's how I personally see it.
> >
> > But here I presented it more like the former, since I feel that is
> > more natural to users: the features are like switches on a single panel.
>
> Then I think it should be described as "enable multigenerational LRU" or
> something like this.

Will do.

> > > What will happen if I write 0x2 to this file?
> >
> > Just like turning on a branch breaker while leaving the main breaker
> > off in a circuit breaker box. This is how I see it, and I'm totally
> > fine with changing it to whatever you'd recommend.
>
> That was my guess that when bit 0 is clear the rest do not matter :)
> What's important, IMO, is that it is stated explicitly in the description.

Will do.
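
To make the breaker analogy concrete, here is a minimal sketch of the
semantics as discussed above, assuming the branch bits are simply
ignored while bit 0 is clear. Only the bit values come from the table
in the patch; the constant and function names are made up for
illustration.

```python
# Bit values come from the table in the patch; the names are hypothetical.
MAIN = 0x0001        # the multigenerational LRU itself (main breaker)
LEAF_BATCH = 0x0002  # clear the accessed bit in leaf entries in large batches
NONLEAF = 0x0004     # clear the accessed bit in non-leaf entries as well


def effective_features(value):
    """Return the bits that take effect, assuming bit 0 gates the rest."""
    if not value & MAIN:
        return 0  # main breaker off: the branch breakers have no effect
    return value & (MAIN | LEAF_BATCH | NONLEAF)


print(hex(effective_features(0x2)))  # 0x0
print(hex(effective_features(0x7)))  # 0x7
```

Under this assumption, writing 0x2 alone changes nothing until bit 0 is
set as well.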

> > > Please consider splitting "enable" and "features" attributes.
> >
> > How about s/Features/Components/?
>
> I meant to use two attributes:
>
> /sys/kernel/mm/lru_gen/enable for the main breaker, and
> /sys/kernel/mm/lru_gen/features (or components) for the branch breakers

It's a bit superfluous for my taste. I generally consider multiple
items to fall into the same category if they can be expressed by a
type of array, and I usually pack an array into a single file.

From your last review, I gauged this would be too overloaded for your
taste. So I'd be happy to make the change if you think two files look
more intuitive from a user's perspective.

> > > > +0x0002 clear the accessed bit in leaf page table entries **in large
> > > > + batches**, when MMU sets it (e.g., on x86)
> > >
> > > Is extra markup really needed here...
> > >
> > > > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > > > + well**, when MMU sets it (e.g., on x86)
> > >
> > > ... and here?
> >
> > Will do.
> >
> > > As for the descriptions, what is the user-visible effect of these features?
> > > How different modes of clearing the access bit are reflected in, say, GUI
> > > responsiveness, database TPS, or probability of OOM?
> >
> > These remain to be seen :) I just added these switches in v7, per Mel's
> > request from the meeting we had. These were never tested in the field.
>
> I see :)
>
> It would be nice to have a description and/or examples of user-visible
> effects once there is some insight into what these features do.

How does the following sound?

Clearing the accessed bit in large batches can theoretically cause
lock contention (mmap_lock). If that happens, the 0x0002 switch can
disable this feature, in which case the multigenerational LRU suffers
a minor performance degradation.

Clearing the accessed bit in non-leaf page table entries was only
verified on Intel and AMD. If it causes problems on other x86
varieties, the 0x0004 switch can disable this feature, in which case
the multigenerational LRU suffers a negligible performance
degradation.
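
As a rough illustration of turning off one switch while leaving the
others alone, the value to write could be derived as below. This is
only a sketch of the bit arithmetic; the names are hypothetical and it
does not touch the actual file.

```python
# Bit values come from the table in the patch; the names are hypothetical.
LEAF_BATCH = 0x0002  # batched clearing of the accessed bit in leaf entries
NONLEAF = 0x0004     # clearing the accessed bit in non-leaf entries


def without_switch(current, switch):
    """Return `current` with the given switch bit cleared."""
    return current & ~switch


# Everything on (0x0007), then batched clearing turned off.
print(format(without_switch(0x0007, LEAF_BATCH), "#06x"))  # 0x0005
```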

> > > > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
> > >
> > > Is debugfs interface relevant only for datacenters?
> >
> > For the moment, yes.
>
> And what will happen if somebody uses these interfaces outside
> datacenters? As soon as there is a sysfs interface, somebody will surely
> play with it.
>
> I think the job schedulers might be the most important user of that
> interface, but the documentation should not presume it is the only user.

Other ideas are more like brainstorming than concrete use cases, e.g.,
for desktop users, these interfaces can in theory speed up hibernation
(suspend to disk); for VM users, they can again in theory support auto
ballooning. These niches are really minor and less explored compared
with the data center use cases, which have been dominant.

I was hoping we could focus on the essentials and take one step at a
time. Later on, if there is additional demand and resources, we can
expand to cover more use cases.

> > > > + job scheduler writes to this file at a certain time interval to
> > > > + create new generations, and it ranks available servers based on the
> > > > + sizes of their cold memory defined by this time interval. For
> > > > + proactive reclaim, a job scheduler writes to this file before it
> > > > + tries to land a new job, and if it fails to materialize the cold
> > > > + memory without impacting the existing jobs, it retries on the next
> > > > + server according to the ranking result.
> > >
> > > Is this knob only relevant for a job scheduler? Or can it be used in
> > > other use-cases as well?
> >
> > There are other concrete use cases but I'm not ready to discuss them
> > yet.
>
> Here as well, as soon as there is an interface it's not necessarily "job
> scheduler" that will "write to this file", anybody can write to that file.
> Please adjust the documentation to be more neutral regarding the use-cases.

Will do.
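
For reference, the job scheduler flow in the quoted text (rank servers
by cold memory, then retry on the next server if proactive reclaim
fails) could be sketched roughly as follows. Everything here is
hypothetical: the function names, the per-server cold memory numbers,
and the try_reclaim callback stand in for whatever a real scheduler
would use.

```python
def rank_servers(cold_bytes):
    """Sort server names by cold memory size, largest first."""
    return sorted(cold_bytes, key=cold_bytes.get, reverse=True)


def place_job(ranking, try_reclaim):
    """Return the first server where proactive reclaim succeeds, else None.

    `try_reclaim` is a hypothetical callback that returns True if the
    cold memory could be reclaimed without impacting existing jobs.
    """
    for server in ranking:
        if try_reclaim(server):
            return server
    return None


# Made-up cold memory sizes, in bytes, per server.
ranking = rank_servers({"a": 10, "b": 30, "c": 20})
print(ranking)
# Pretend reclaim fails on "b", so the next server in the ranking is tried.
print(place_job(ranking, lambda server: server != "b"))
```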