Re: [RFC PATCH 2/2] mm/vmscan.c: Prevent allocating shrinker_info on offlined nodes

From: Michal Hocko
Date: Mon Dec 06 2021 - 06:22:17 EST


On Mon 06-12-21 12:00:50, David Hildenbrand wrote:
> On 06.12.21 11:54, Michal Hocko wrote:
> > On Mon 06-12-21 11:45:54, David Hildenbrand wrote:
> >>> This doesn't seen complete. Slab shrinkers are used in the reclaim
> >>> context. Previously offline nodes could be onlined later and this would
> >>> lead to NULL ptr because there is no hook to allocate new shrinker
> >>> infos. This would be also really impractical because this would have to
> >>> update all existing memcgs...
> >>
> >> Instead of going through the trouble of updating...
> >>
> >> ... maybe just keep for_each_node() and check if the target node is
> >> offline. If it's offline, just allocate from the first online node.
> >> After all, we're not using __GFP_THISNODE, so there are no guarantees
> >> either way ...
> >
> > This looks like another way to paper over a deeper underlying problem
> > IMHO. Fundamentally we have a problem that some pgdata are not allocated
> > and that causes a lot of headache. Not to mention that node_online
> > is just adding to a confusion because it doesn't really tell anything
> > about the logical state of the node.
> >
> > I think we really should get rid of this approach rather than play a
> > whack-a-mole. We should really drop all notion of node_online and
> > instead allocate pgdat for each possible node. Arch specific code should
> > make sure that zone lists are properly initialized.
> >
>
> I'm not sure if it's rally whack-a-mole really applies. It's just the
> for_each_node_* calls that need love. In other cases, we shouldn't
> really stumble over an offline node.
>
> Someone deliberately decided to use "for_each_node()" instead of
> for_each_online_node() without taking care of online vs. offline
> semantics. That's just a BUG and needs fixing IMHO.

Well, I would argue this is too much to ask for the API users. It is
also a trap that is just too easy to fall into. Turning for_each_node
into for_each_online_node will not solve the problem just by itself.
As you have pointed out an offline node can become online and without a
hotplug notifier there is no way for_each_online_node would be a proper
fit for anything that allocates memory only for online nodes. See the
trap? People think they are doing the right thing but they are not in
fact.

Now practically speaking !node_online should not apear node_online (note
I am attentionally avoiding to say offline and online as that has a
completely different semantic) shouldn't really happen for some
architectures. x86 should allocate pgdat for each possible node. I do
not know what was the architecture in this case but we already have
another report for x86 that remains unexplained.

My existing experience tells me that we have two major problems in this
area. Arch inconsistent behavior and really confusing APIs. We should be
addressing those rather than adding more hacks.

> After all, we do need patches to backport, reworking pgdat init isn't
> really something feasible for that I think. And I heard of PPC that can
> hotplug thousands of nodes ...

If this turns out to be a big problem them we can think of putting pgdat
on diet.
--
Michal Hocko
SUSE Labs