Re: [RFC 0/6] mm: improve page allocator scalability via splitting zones

From: David Hildenbrand
Date: Wed May 17 2023 - 04:10:27 EST


If we could avoid instantiating more zones and rather improve existing
mechanisms (PCP), that would be much more preferred IMHO. I'm sure
it's not easy, but that shouldn't stop us from trying ;)

I do think improving PCP or adding another level of cache will help
performance and scalability.

And I think there is also value in improving the performance of the
zone itself, because there will always be some cases where the zone
lock itself is contended.

That is, PCP and zone work at different levels, and both deserve to be
improved. Do you agree?

Spoiler: my humble opinion


Well, the zone is kind-of your "global" memory provider, and PCPs cache a fraction of that to avoid exactly having to mess with that global datastructure and lock contention.

One benefit I can see of such a "global" memory provider with caches on top is that it is nicely integrated: for example, the concept of memory pressure exists for the zone as a whole. All memory is of the same kind and managed in a single entity, but free memory is cached for performance.

As soon as you manage the memory in multiple zones of the same kind, you lose that "global" view of your memory that is of the same kind, but managed in different buckets. You might end up with a lot of memory pressure in a single such zone, but still have plenty in another zone.

As one example, hot(un)plug of memory is easy: there is only a single zone. No need to make smart decisions or deal with having memory we're hotunplugging be stranded in multiple zones.


I did not look into the details of this proposal, but seeing the
change in include/linux/page-flags-layout.h scares me.

It's possible for us to use 1 more bit in page->flags. Do you think
that will cause severe issues? Or do you think something else isn't
acceptable?

The issue is, everybody wants to consume more bits in page->flags, so if we can get away without it that would be much better :)

The more bits you want to consume, the more people will ask for making this a compile-time option and eventually compile it out on distro kernels (e.g., with many NUMA nodes). So we end up with more code and complexity and eventually not get the benefits where we really want them.
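
To make the bit-budget concern concrete, here is a rough Python sketch of how encoding more zones eats into page->flags. It is a toy model, not the layout logic from include/linux/page-flags-layout.h: the helper names (bits_for, remaining_flag_bits), the 64-bit word, and the node/zone counts are all assumptions picked for illustration.

```python
def bits_for(count: int) -> int:
    """Bits needed to encode an index in the range [0, count)."""
    return (count - 1).bit_length() if count > 1 else 0


def remaining_flag_bits(word_bits: int, nodes: int, zones: int) -> int:
    """Bits left for everything else once node and zone indices are packed in.

    Toy model: real kernels pack further fields (e.g. section, last_cpupid)
    into the same word, which this sketch ignores.
    """
    return word_bits - bits_for(nodes) - bits_for(zones)


# Hypothetical configuration: 64-bit flags word, 1024 NUMA nodes.
before = remaining_flag_bits(64, nodes=1024, zones=4)  # 4 zone slots, 2 bits
after = remaining_flag_bits(64, nodes=1024, zones=8)   # 8 zone slots, 3 bits
print(before, after)
```

Under these made-up numbers, doubling the number of encodable zones costs exactly one bit, which is one bit less for every other user of page->flags.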


Further, I'm not so sure how that change really interacts with
hot(un)plug of memory ... at a quick glance I feel like this series
hacks the code such that the split works based on the boot
memory size ...

Em..., the zone stuff is kind of static now. It's hard to add a zone at
run-time. So, in this series, we determine the number of zones per zone
type based on boot memory size. This may be improved in the future by
pre-allocating some empty zone instances during boot and hot-adding
memory to these zones.

Just to give you some idea: with virtio-mem, hyper-v, daxctl, and upcoming cxl dynamic memory pooling (some day I'm sure ;) ) you might see quite a small boot memory (e.g., 4 GiB) but a significant amount of memory getting hotplugged incrementally (e.g., up to 1 TiB) -- well, and hotunplugged. With multiple zone instances you really have to be careful and might have to re-balance between the multiple zones to keep the scalability, to not create imbalances between the zones ...

Something like PCP auto-tuning would be able to handle that mostly automatically, as there is only a single memory pool.
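
A toy model of the imbalance described above, assuming (hypothetically) that the number of zone instances is fixed at boot and that all hotplugged memory lands in a single instance with no re-balancing. The function name and the instance count are made up for illustration and are not taken from the series; the 4 GiB and 1 TiB figures are the ones from the example above.

```python
def largest_zone_share(boot_gib: int, hotplugged_gib: int, instances: int) -> float:
    """Fraction of total memory held by the largest zone instance.

    Assumptions (not taken from the series): boot memory is split evenly
    across `instances`, and every hotplugged GiB goes to the last instance.
    """
    sizes = [boot_gib / instances] * instances
    sizes[-1] += hotplugged_gib
    return max(sizes) / sum(sizes)


# 4 GiB at boot, grown to 1 TiB (1024 GiB) via hotplug, 2 hypothetical instances.
share = largest_zone_share(boot_gib=4, hotplugged_gib=1020, instances=2)
print(f"{share:.3f}")
```

Under these assumptions almost all memory ends up behind a single zone lock again, so the split would stop helping unless memory gets re-balanced between the instances.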


I agree with Michal that looking into auto-tuning PCP would be
preferred. If that can't be done, adding another layer might end up
cleaner and eventually cover more use cases.

I do agree that it's valuable to make PCP etc. cover more use cases. I
just think that this should not prevent us from optimizing zone itself
to cover remaining use cases.

I really don't like the concept of replicating zones of the same kind for the same NUMA node. But that's just my personal opinion maintaining some memory hot(un)plug code :)

Having said that, some kind of a sub-zone concept (additional layer) as outlined by Michal IIUC, for example, indexed by core id/hash/whatsoever could eventually be worth exploring. Yes, such a design raises various questions ... :)

--
Thanks,

David / dhildenb