Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

From: Ying Huang
Date: Wed Jun 08 2022 - 00:10:09 EST


On Mon, 2022-06-06 at 14:32 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 2:22 PM, Ying Huang wrote:
> ....
> > > > > I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> > > > > MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> > > > > with two memory tiers (DRAM and pmem) the demotion continues to work
> > > > > as expected after patch 3 ("mm/demotion: Build demotion targets based on
> > > > > explicit memory tiers"). With that, there will not be any regression in
> > > > > between the patch series.
> > > > >
> > > >
> > > > Thanks! Please do that. And I think you can add sysfs interface after
> > > > that patch too. That is, in [1/7]
> > > >
> > >
> > > I am not sure why you insist on moving sysfs interfaces later. They are
> > > introduced based on the helper added. It make patch review easier to
> > > look at both the helpers and the user of the helper together in a patch.
> >
> > Yes. We should introduce a function and its user in one patch for
> > review. But this doesn't mean that we should introduce the user space
> > interface as the first step. I think the user space interface should
> > output correct information when we expose it.
> >
>
> If you look at this patchset we are not exposing any wrong information.
>
> patch 1 -> adds ability to register the memory tiers and expose details
> of registered memory tier. At this point the patchset only support DRAM
> tier and hence only one tier is shown

But inside kernel, we actually work with 2 tiers and demote/prmote pages
between them. With the information from your interface, users would
think that there is no any demotion/promotion in kernel because there's
only 1 tier.

> patch 2 -> adds per node memtier attribute. So only DRAM nodes shows the
> details, because the patchset yet has not introduced a slower memory
> tier like PMEM.
>
> patch 4 -> introducing demotion. Will make that patch 5
>
> patch 5 -> add dax kmem numa nodes as slower memory tier. Now this
> becomes patch 4 at which point we will correctly show two memory tiers
> in the system.
>
>
> > > > +struct memory_tier {
> > > > + nodemask_t nodelist;
> > > > +};
> > > >
> > > > And struct device can be added after the kernel has switched the
> > > > implementation based on explicit memory tiers.
> > > >
> > > > +struct memory_tier {
> > > > + struct device dev;
> > > > + nodemask_t nodelist;
> > > > +};
> > > >
> > >
> > >
> > > Can you elaborate on this? or possibly review the v5 series indicating
> > > what change you are suggesting here?
> > >
> > >
> > > > But I don't think it's a good idea to have "struct device" embedded in
> > > > "struct memory_tier". We don't have "struct device" embedded in "struct
> > > > pgdata_list"...
> > > >
> > >
> > > I avoided creating an array for memory_tier (memory_tier[]) so that we
> > > can keep it dynamic. Keeping dev embedded in struct memory_tier simplify
> > > the life cycle management of that dynamic list. We free the struct
> > > memory_tier allocation via device release function (memtier->dev.release
> > > = memory_tier_device_release )
> > >
> > > Why do you think it is not a good idea?
> >
> > I think that we shouldn't bind our kernel internal implementation with
> > user space interface too much. Yes. We can expose kernel internal
> > implementation to user space in a direct way. I suggest you to follow
> > the style of "struct pglist_data" and "struct node". If we decouple
> > "struct memory_tier" and "struct memory_tier_dev" (or some other name),
> > we can refer to "struct memory_tier" without depending on all device
> > core. Memory tier should be accessible inside the kernel even without a
> > user interface. And memory tier isn't a device in concept.
> >
>
> memory_tiers are different from pglist_data and struct node in that we
> also allow the creation of them from userspace.

I don't think that there's much difference. struct pglist_data and
struct node can be created/destroyed dynamically too. Please take a
look at

__try_online_node()
register_one_node()
try_offline_node()
unregister_one_node()

> That is the life time of
> a memory tier is driven from userspace and it is much easier to manage
> them via sysfs file lifetime mechanism rather than inventing an
> independent and more complex way of doing the same.

You needs to manage the lifetime of struct memory_tier in kernel too.
Because there are kernel users. And even if you use device core
lifetime mechanism, you don't need to embed struct device in struct
memory_tier too, you can free "separate" struct memory_tier in "release"
callback of struct device.

> > For life cycle management, I think that we can do that without sysfs
> > too.
> >
>
> unless there are specific details that you think will be broken by
> embedding struct device inside struct memory_tier, IMHO I still consider
> the embedded implementation much simpler and in accordance with other
> kernel design patterns.

In concept, struct memory_tier isn't a device. Although we expose it as
a device in sysfs. That's just an implementation detail. So I think
it's better to make struct memory_tier independent of struct device if
possible.

Via not embeding struct device in struct memory_tier, it's much easier
to dereference struct memory_tier directly in inline function in ".h".
We don't need to introduce one accessor function for each field of
struct memory_tier for that.

Best Regards,
Huang, Ying