collision between ZONE_MOVABLE and memblock allocations

From: Ross Zwisler
Date: Tue Jul 18 2023 - 18:01:20 EST


Hello,

I've been trying to use the 'movablecore=' kernel command line option to create
a ZONE_MOVABLE memory zone on my x86_64 systems, and have noticed that
offlining the resulting ZONE_MOVABLE area consistently fails in my setups
because that zone contains unmovable pages. My testing has been in an x86_64
QEMU VM with a single NUMA node and 4G, 8G or 16G of memory, all of which fail
100% of the time.

Digging into it a bit, these unmovable pages are Reserved pages which were
allocated in early boot by the memblock allocator. Many of these
allocations are for data structures for the SPARSEMEM memory model, including
'struct mem_section' objects. These memblock allocations can be tracked by
setting the 'memblock=debug' kernel command line parameter, and are marked as
reserved in:

memmap_init_reserved_pages()
reserve_bootmem_region()

With the command line params 'movablecore=256M memblock=debug' and a v6.5.0-rc2
kernel I get the following on my 4G system:

# lsmem --split ZONES --output-all
RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE ZONES
0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0 None
0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0 DMA32
0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0 Normal
0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable

Memory block size: 128M
Total online memory: 4G
Total offline memory: 0B

And when I try to offline memory block 39, I get:

# echo 0 > /sys/devices/system/memory/memory39/online
bash: echo: write error: Device or resource busy

with dmesg saying:

[ 57.439849] page:0000000076a3e320 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ff00
[ 57.444073] flags: 0x1fffffc0001000(reserved|node=0|zone=3|lastcpupid=0x1fffff)
[ 57.447301] page_type: 0xffffffff()
[ 57.448754] raw: 001fffffc0001000 ffffdd6384ffc008 ffffdd6384ffc008 0000000000000000
[ 57.450383] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[ 57.452011] page dumped because: unmovable page
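For anyone who wants to script the same experiment, here is a minimal Python sketch of the offlining attempt. It is only an illustration: the helper name offline_memory_block() and its root parameter are hypothetical, and the only kernel interface it relies on is the sysfs 'online' file used in the echo command above. It treats the "Device or resource busy" error (EBUSY) as "this block could not be offlined".

```python
import errno
from pathlib import Path

# sysfs directory used by the echo command above
SYSFS_MEMORY = Path("/sys/devices/system/memory")


def offline_memory_block(block: int, root: Path = SYSFS_MEMORY) -> bool:
    """Try to offline one memory block.

    Returns True on success and False if the kernel refuses with EBUSY
    ("Device or resource busy"). Any other error is re-raised.
    The `root` parameter is a hypothetical hook so the helper can be
    pointed at a scratch directory instead of the real sysfs tree.
    """
    try:
        (root / f"memory{block}" / "online").write_text("0")
    except OSError as exc:
        if exc.errno == errno.EBUSY:
            return False
        raise
    return True
```

On the 4G VM described here, offline_memory_block(39) would be expected to return False, and dmesg would then show the "unmovable page" dump quoted above.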

Looking back at the memblock allocations, I can see that the physical address for
pfn:0x13ff00 was used in a memblock allocation:

[ 0.395180] memblock_reserve: [0x000000013ff00000-0x000000013ffbffff] memblock_alloc_range_nid+0xe0/0x150
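The arithmetic behind that match can be checked with a short Python sketch. It assumes 4 KiB pages (PAGE_SHIFT = 12, the usual x86_64 base page size) and uses the 128M memory block size reported by lsmem; the helper names are made up for illustration.

```python
PAGE_SHIFT = 12                  # assumption: 4 KiB base pages on x86_64
MEMORY_BLOCK_SIZE = 128 << 20    # "Memory block size: 128M" from lsmem

# memblock_reserve range from the boot log, inclusive on both ends
RESERVED_START = 0x13ff00000
RESERVED_END = 0x13ffbffff


def pfn_to_phys(pfn: int) -> int:
    """Physical address of the first byte of page frame `pfn`."""
    return pfn << PAGE_SHIFT


def phys_to_block(phys: int) -> int:
    """Index of the memory block that contains physical address `phys`."""
    return phys // MEMORY_BLOCK_SIZE


phys = pfn_to_phys(0x13ff00)
print(hex(phys))                                  # 0x13ff00000
print(phys_to_block(phys))                        # 39
print(RESERVED_START <= phys <= RESERVED_END)     # True
```

So the page that blocks offlining is the first page of the memblock reservation, and it sits inside memory block 39.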

The full dmesg output can be found here: https://pastebin.com/cNztqa4u

The 'movablecore=' command line parameter is handled in
'find_zone_movable_pfns_for_nodes()', which decides where ZONE_MOVABLE should
start and end. Currently ZONE_MOVABLE is always located at the end of a NUMA
node.

The issue is that the memblock allocator and the processing of the movablecore=
command line parameter don't know about one another, and in my x86_64 testing
they both always use memory at the end of the NUMA node, so they collide.
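To make the collision concrete, here is a small Python sketch that intersects the ZONE_MOVABLE range from the lsmem output with the one memblock_reserve range quoted earlier. It is plain interval arithmetic, not kernel code, and overlap() is a hypothetical helper.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Return the inclusive (start, end) overlap of two inclusive
    address ranges, or None if they do not intersect."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return (start, end) if start <= end else None


ZONE_MOVABLE = (0x130000000, 0x13fffffff)       # from lsmem
MEMBLOCK_RESERVED = (0x13ff00000, 0x13ffbffff)  # from memblock=debug

hit = overlap(*ZONE_MOVABLE, *MEMBLOCK_RESERVED)
if hit is not None:
    size_kib = (hit[1] - hit[0] + 1) // 1024
    # prints: overlap 0x13ff00000-0x13ffbffff (768 KiB)
    print(f"overlap {hit[0]:#x}-{hit[1]:#x} ({size_kib} KiB)")
```

This single 768 KiB reservation lies entirely inside ZONE_MOVABLE. The full dmesg contains further memblock_reserve entries that could be checked the same way.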

From several comments in the code I believe that this is a known issue:

https://elixir.bootlin.com/linux/v6.5-rc2/source/mm/page_isolation.c#L59
/*
 * Both, bootmem allocations and memory holes are marked
 * PG_reserved and are unmovable. We can even have unmovable
 * allocations inside ZONE_MOVABLE, for example when
 * specifying "movablecore".
 */

https://elixir.bootlin.com/linux/v6.5-rc2/source/include/linux/mmzone.h#L765
 * 2. memblock allocations: kernelcore/movablecore setups might create
 *    situations where ZONE_MOVABLE contains unmovable allocations
 *    after boot. Memory offlining and allocations fail early.

We check for these unmovable pages by scanning for 'PageReserved()' in the area
we are trying to offline, which happens in has_unmovable_pages().
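As a rough user-space model of that scan (a toy, not the kernel's has_unmovable_pages() implementation), the Python sketch below walks a PFN range and reports the first page whose flags word has the reserved bit set. The mask 0x1000 is inferred from the page dump above, where flags 0x1fffffc0001000 is decoded as "reserved"; the bit position may differ on other kernel versions or configs. The dictionary of flags and the helper name are assumptions made for illustration.

```python
# Inferred from the dmesg dump: the only low flag bit set in
# 0x1fffffc0001000 is 0x1000, and it is decoded as "reserved".
PG_RESERVED_MASK = 0x1000


def first_reserved_pfn(page_flags, start_pfn, end_pfn):
    """Return the first PFN in [start_pfn, end_pfn) whose flags carry
    the reserved bit, or None if there is none.

    `page_flags` is a toy stand-in for the memmap: a dict mapping
    pfn -> flags word. PFNs that are not listed have no flags set.
    """
    for pfn in range(start_pfn, end_pfn):
        if page_flags.get(pfn, 0) & PG_RESERVED_MASK:
            return pfn
    return None


# Memory block 39 covers 0x138000000-0x13fffffff, which with 4 KiB
# pages (assumed) is PFNs 0x138000 up to, but not including, 0x140000.
toy_flags = {0x13ff00: 0x1fffffc0001000}   # the page from the dump
blocker = first_reserved_pfn(toy_flags, 0x138000, 0x140000)
print(hex(blocker))                        # 0x13ff00
```

A single reserved page anywhere in the block is enough for the scan to report the block as unmovable.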

Interestingly, the boot timing works out like this:

1. Allocate memblock areas to set up the SPARSEMEM model.
[ 0.369990] Call Trace:
[ 0.370404] <TASK>
[ 0.370759] ? dump_stack_lvl+0x43/0x60
[ 0.371410] ? sparse_init_nid+0x2dc/0x560
[ 0.372116] ? sparse_init+0x346/0x450
[ 0.372755] ? paging_init+0xa/0x20
[ 0.373349] ? setup_arch+0xa6a/0xfc0
[ 0.373970] ? slab_is_available+0x5/0x20
[ 0.374651] ? start_kernel+0x5e/0x770
[ 0.375290] ? x86_64_start_reservations+0x14/0x30
[ 0.376109] ? x86_64_start_kernel+0x71/0x80
[ 0.376835] ? secondary_startup_64_no_verify+0x167/0x16b
[ 0.377755] </TASK>

2. Process the movablecore= kernel command line parameter and set up memory zones.
[ 0.489382] Call Trace:
[ 0.489818] <TASK>
[ 0.490187] ? dump_stack_lvl+0x43/0x60
[ 0.490873] ? free_area_init+0x115/0xc80
[ 0.491588] ? __printk_cpu_sync_put+0x5/0x30
[ 0.492354] ? dump_stack_lvl+0x48/0x60
[ 0.493002] ? sparse_init_nid+0x2dc/0x560
[ 0.493697] ? zone_sizes_init+0x60/0x80
[ 0.494361] ? setup_arch+0xa6a/0xfc0
[ 0.494981] ? slab_is_available+0x5/0x20
[ 0.495674] ? start_kernel+0x5e/0x770
[ 0.496312] ? x86_64_start_reservations+0x14/0x30
[ 0.497123] ? x86_64_start_kernel+0x71/0x80
[ 0.497847] ? secondary_startup_64_no_verify+0x167/0x16b
[ 0.498768] </TASK>

3. Mark memblock areas as Reserved.
[ 0.761136] Call Trace:
[ 0.761534] <TASK>
[ 0.761876] dump_stack_lvl+0x43/0x60
[ 0.762474] reserve_bootmem_region+0x1e/0x170
[ 0.763201] memblock_free_all+0xe3/0x250
[ 0.763862] ? swiotlb_init_io_tlb_mem.constprop.0+0x11a/0x130
[ 0.764812] ? swiotlb_init_remap+0x195/0x2c0
[ 0.765519] mem_init+0x19/0x1b0
[ 0.766047] mm_core_init+0x9c/0x3d0
[ 0.766630] start_kernel+0x264/0x770
[ 0.767229] x86_64_start_reservations+0x14/0x30
[ 0.767987] x86_64_start_kernel+0x71/0x80
[ 0.768666] secondary_startup_64_no_verify+0x167/0x16b
[ 0.769534] </TASK>

So, during ZONE_MOVABLE setup we currently can't do the same
has_unmovable_pages() scan looking for PageReserved() to check for overlap
because the pages have not yet been marked as Reserved.

I do think that we need to fix this collision between ZONE_MOVABLE and memblock
allocations, because this issue essentially makes the movablecore= kernel
command line parameter useless in many cases, as the ZONE_MOVABLE region it
creates will often actually be unmovable.

Here are the options I currently see for resolution:

1. Change where ZONE_MOVABLE is placed so that it is taken from the beginning
of the NUMA node instead of the end. This should fix my use case, but it is
still prone to breakage in other configurations (# of NUMA nodes, other
architectures) where ZONE_MOVABLE and memblock allocations might overlap. I
think that this should be relatively straightforward and low risk, though.

2. Make the code which processes the movablecore= command line option aware of
the memblock allocations, and have it choose a region for ZONE_MOVABLE which
does not have these allocations. This might be done by checking for
PageReserved() as we do with offlining memory, though that will take some boot
time reordering, or we'll have to figure out the overlap in another way. This
may also result in us having two ZONE_NORMAL zones for a given NUMA node, with
a ZONE_MOVABLE section in between them. I'm not sure whether this is allowed. If
we can get it working, this seems like the most correct solution to me, but
also the most difficult and risky because it involves significant changes in
the code for memory setup at early boot.
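To illustrate what option 2 would have to compute, here is a toy Python sketch that picks the highest aligned region of the requested size that does not intersect any reserved range. It works on address ranges rather than PageReserved(), since the memblock reservations are already known as ranges before the pages are marked. Everything here is hypothetical: the function name, the half-open range convention, the use of the memory block size as alignment, and the choice to look only at the 1G of memory above 4G from the lsmem output. The real code in find_zone_movable_pfns_for_nodes() works differently.

```python
def find_movable_region(node_start, node_end, size, reserved, align):
    """Return (start, end) of the highest `align`-aligned region of
    `size` bytes inside [node_start, node_end) that overlaps none of
    the half-open `reserved` ranges, or None if no such region exists.
    """
    start = (node_end - size) // align * align
    while start >= node_start:
        end = start + size
        if not any(r_start < end and start < r_end
                   for r_start, r_end in reserved):
            return (start, end)
        start -= align       # slide the candidate window down
    return None


MiB = 1 << 20
# Only the single reservation quoted in this mail; a real boot has more.
reserved = [(0x13ff00000, 0x13ffc0000)]
region = find_movable_region(0x100000000, 0x140000000,
                             256 * MiB, reserved, 128 * MiB)
print([hex(x) for x in region])   # ['0x128000000', '0x138000000']
```

With this input the chosen region ends at 0x138000000, which leaves 0x138000000-0x13fffffff above it as non-movable memory. That is exactly the "ZONE_NORMAL on both sides of ZONE_MOVABLE" layout I asked about above.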

Am I missing anything? Are there other solutions we should consider, or do you
have an opinion on which solution we should pursue?

Thanks,
- Ross