Re: collision between ZONE_MOVABLE and memblock allocations

From: David Hildenbrand
Date: Wed Jul 19 2023 - 04:15:42 EST


On 19.07.23 10:06, Michal Hocko wrote:
On Wed 19-07-23 10:59:52, Mike Rapoport wrote:
On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
[...]
I do think that we need to fix this collision between ZONE_MOVABLE and memmap
allocations, because this issue essentially makes the movablecore= kernel
command line parameter useless in many cases, as the ZONE_MOVABLE region it
creates will often actually be unmovable.

movablecore is kind of a hack and I would be more inclined to get rid of it
rather than build more into it. Could you be more specific about your
use case?

Here are the options I currently see for resolution:

1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from
the beginning of the NUMA node instead of the end. This should fix my use case,
but it is again prone to breakage in other configurations (number of NUMA nodes,
other architectures) where ZONE_MOVABLE and memblock allocations might overlap.
I think this should be relatively straightforward and low risk, though.

2. Make the code which processes the movablecore= command line option aware of
the memblock allocations, and have it choose a region for ZONE_MOVABLE that
does not contain these allocations. This might be done by checking for
PageReserved() as we do when offlining memory, though that will take some
boot-time reordering, or we'll have to figure out the overlap in another way.
This may also result in us having two ZONE_NORMAL zones for a given NUMA node,
with a ZONE_MOVABLE section in between them. I'm not sure if that is allowed?

Yes, this is no problem. Zones are allowed to be sparse.
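
A sparse zone is also easy to observe from userspace: in /proc/zoneinfo a
zone's "spanned" page count can exceed its "present" count. Below is a minimal
sketch of such a check, assuming the usual /proc/zoneinfo field layout
("spanned" printed before "present" for each zone):

/*
 * Print spanned vs. present pages per zone from /proc/zoneinfo.
 * A zone whose spanned count exceeds its present count has holes,
 * i.e. the zone is sparse.
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/zoneinfo", "r");
        char line[256], zone[32] = "?";
        int node = -1;
        unsigned long spanned = 0, present = 0;

        if (!f) {
                perror("/proc/zoneinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "Node %d, zone %31s", &node, zone) == 2)
                        continue;
                if (sscanf(line, " spanned %lu", &spanned) == 1)
                        continue;
                if (sscanf(line, " present %lu", &present) == 1)
                        printf("node %d zone %-8s spanned %lu present %lu%s\n",
                               node, zone, spanned, present,
                               spanned > present ? " (sparse)" : "");
        }
        fclose(f);
        return 0;
}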

The current initialization order is roughly:

* very early initialization with some memblock allocations
* determine zone locations and sizes
* initialize memory map
  - memblock_alloc(lots of memory)
* lots of unrelated initializations that may allocate memory
* release free pages from memblock to the buddy allocator

With 2) we can make sure the memory map and early allocations won't be in
ZONE_MOVABLE, but we may still have reserved pages there.
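
For illustration, the overlap check in 2) could presumably lean on the
existing memblock API. A rough, untested sketch of such a helper
(movable_range_is_clear() is a made-up name, and this is kernel code, so it
only builds in-tree):

/*
 * Hypothetical helper for the movablecore= placement code: reject a
 * candidate PFN range if any part of it overlaps an early memblock
 * reservation (memory map, initrd, etc.).
 */
#include <linux/init.h>
#include <linux/memblock.h>
#include <linux/pfn.h>

static bool __init movable_range_is_clear(unsigned long start_pfn,
                                          unsigned long end_pfn)
{
        phys_addr_t base = PFN_PHYS(start_pfn);
        phys_addr_t size = PFN_PHYS(end_pfn - start_pfn);

        /* memblock_is_region_reserved() is true on any intersection */
        return !memblock_is_region_reserved(base, size);
}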

Yes, this will always be fragile. If the specific placement of the movable
memory is not important and the only thing that matters is the size and NUMA
locality, then an easier-to-maintain solution would be to simply offline enough
memory blocks very early in the userspace bring-up and online them back as
movable. If offlining fails, just try another memory block. This doesn't
require any kernel code change.
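
A minimal userspace sketch of that approach, using the memory block sysfs
interface under /sys/devices/system/memory (must run as root, error handling
kept to a minimum; a real tool would stop once enough blocks have been
converted, and whether online_movable succeeds still depends on the kernel's
movable-zone placement rules):

/*
 * Walk /sys/devices/system/memory/memoryN, try to offline each block,
 * and online every block that could be offlined as ZONE_MOVABLE.
 */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Write "value" to the block's state file; returns 0 on success. */
static int set_state(const char *block, const char *value)
{
        char path[256];
        FILE *f;
        int ret = 0;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/memory/%s/state", block);
        f = fopen(path, "w");
        if (!f)
                return -1;
        if (fputs(value, f) < 0)
                ret = -1;
        if (fclose(f) != 0)     /* offline/online failures surface here */
                ret = -1;
        return ret;
}

int main(void)
{
        DIR *dir = opendir("/sys/devices/system/memory");
        struct dirent *d;

        if (!dir) {
                perror("opendir");
                return 1;
        }
        while ((d = readdir(dir)) != NULL) {
                /* only the memoryN block directories are interesting */
                if (strncmp(d->d_name, "memory", 6) != 0 ||
                    !isdigit((unsigned char)d->d_name[6]))
                        continue;
                /* if the block cannot be offlined, just move on */
                if (set_state(d->d_name, "offline") != 0)
                        continue;
                /* bring it back as ZONE_MOVABLE */
                if (set_state(d->d_name, "online_movable") != 0)
                        fprintf(stderr, "%s: failed to online as movable\n",
                                d->d_name);
                else
                        printf("%s onlined movable\n", d->d_name);
        }
        closedir(dir);
        return 0;
}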

As an alternative, we might use the "memmap=nn[KMG]!ss[KMG]" [1] parameter to mark some memory as protected.

That memory can then be configured as a devdax device and onlined to ZONE_MOVABLE (dev/dax).

[1] https://docs.pmem.io/persistent-memory/getting-started-guide/creating-development-environments/linux-environments/linux-memmap

--
Cheers,

David / dhildenb