Re: [PATCH 1/1] arch/x86/mm/numa: Do not initialize nodes twice

From: Michal Hocko
Date: Mon Feb 21 2022 - 04:52:17 EST


On Fri 18-02-22 23:43:02, Oscar Salvador wrote:
> On x86, prior to ("mm: handle uninitialized numa nodes gracecully"),
> NUMA nodes could be allocated at three different places.
>
> - numa_register_memblks
> - init_cpu_to_node
> - init_gi_nodes
>
> All these calls happen at setup_arch, and have the following order:
>
> setup_arch
> ...
> x86_numa_init
> numa_init
> numa_register_memblks
> ...
> init_cpu_to_node
> init_memory_less_node
> alloc_node_data
> free_area_init_memoryless_node
> init_gi_nodes
> init_memory_less_node
> alloc_node_data
> free_area_init_memoryless_node
>
> numa_register_memblks() is only interested in those nodes which have memory,
> so it skips over any memoryless node it founds.
> Later on, when we have read ACPI's SRAT table, we call init_cpu_to_node()
> and init_gi_nodes(), which initialize any memoryless node we might have
> that have either CPU or Initiator affinity, meaning we allocate pg_data_t
> struct for them and we mark them as ONLINE.
>
> So far so good, but the thing is that after ("mm: handle uninitialized numa
> nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(),
> meaning we have a picture like the following:
>
> setup_arch
> x86_numa_init
> numa_init
> numa_register_memblks <-- allocate non-memoryless node
> x86_init.paging.pagetable_init
> ...
> free_area_init
> free_area_init_memoryless <-- allocate memoryless node
> init_cpu_to_node
> alloc_node_data <-- allocate memoryless node with CPU
> free_area_init_memoryless_node
> init_gi_nodes
> alloc_node_data <-- allocate memoryless node with Initiator
> free_area_init_memoryless_node

Thanks for diving into this and double checking after me. I misread the
ordering and thought free_area_init is the last one to be called.

> free_area_init() already allocates all possible NUMA nodes, but
> init_cpu_to_node() and init_gi_nodes() are clueless about that,
> so they go ahead and allocate a new pg_data_t struct without
> checking anything, meaning we end up allocating twice.
>
> It should be mad clear that this only happens in the case where
> memoryless NUMA node happens to have a CPU/Initiator affinity.
>
> So get rid of init_memory_less_node() and just set the node online.
>
> Note that setting the node online is needed, otherwise we choke
> down the chain when bringup_nonboot_cpus() ends up calling
> __try_online_node()->register_one_node()->... and we blow up in
> bus_add_device(). Like can be seen here:
>
> =========
> [ 0.585060] BUG: kernel NULL pointer dereference, address: 0000000000000060
> [ 0.586091] #PF: supervisor read access in kernel mode
> [ 0.586831] #PF: error_code(0x0000) - not-present page
> [ 0.586930] PGD 0 P4D 0
> [ 0.586930] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
> [ 0.586930] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
> [ 0.586930] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
> [ 0.586930] RIP: 0010:bus_add_device+0x5a/0x140
> [ 0.586930] Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
> [ 0.586930] RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
> [ 0.586930] RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
> [ 0.586930] RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
> [ 0.586930] RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
> [ 0.586930] R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
> [ 0.586930] R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
> [ 0.586930] FS: 0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
> [ 0.586930] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.586930] CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
> [ 0.586930] Call Trace:
> [ 0.586930] <TASK>
> [ 0.586930] device_add+0x4c0/0x910
> [ 0.586930] __register_one_node+0x97/0x2d0
> [ 0.586930] __try_online_node+0x85/0xc0
> [ 0.586930] try_online_node+0x25/0x40
> [ 0.586930] cpu_up+0x4f/0x100
> [ 0.586930] bringup_nonboot_cpus+0x4f/0x60
> [ 0.586930] smp_init+0x26/0x79
> [ 0.586930] kernel_init_freeable+0x130/0x2f1
> [ 0.586930] ? rest_init+0x100/0x100
> [ 0.586930] kernel_init+0x17/0x150
> [ 0.586930] ? rest_init+0x100/0x100
> [ 0.586930] ret_from_fork+0x22/0x30
> [ 0.586930] </TASK>
> [ 0.586930] Modules linked in:
> [ 0.586930] CR2: 0000000000000060
> [ 0.586930] ---[ end trace 0000000000000000 ]---
> =========
>
> The reason is simple, by the time bringup_nonboot_cpus() gets called,
> we did not register the node_subsys bus yet, so we crash when bus_add_device()
> tries to dereference bus()->p.
>
> The following shows the order of the calls:
>
> kernel_init_freeable
> smp_init
> bringup_nonboot_cpus
> ...
> bus_add_device() <- we did not register node_subsys yet
> do_basic_setup
> do_initcalls
> postcore_initcall(register_node_type);
> register_node_type
> subsys_system_register
> subsys_register
> bus_register <- register node_subsys bus
>
> Why setting the node online saves us then? Well, simply because
> __try_online_node() backs off when the node is online, meaning
> we do not end up calling register_one_node() in the first place.

This is really a mess and a house built on sand. Thanks for looking into
it and hopefully this can get cleaned up to a saner state.

> This is subtle, broken and deserves a deep analysis and thought
> about how to put this into shape, but for now let us have this
> easy fix for the leaking memory issue.
>
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>
> Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")

This sha1 is from linux-next very likely so it won't be persistent.
Please drop it.

Other than that
Acked-by: Michal Hocko <mhocko@xxxxxxxx>

One nit below

Thanks!

> ---
> arch/x86/mm/numa.c | 15 ++-------------
> include/linux/mm.h | 1 -
> mm/page_alloc.c | 2 +-
> 3 files changed, 3 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index c6b1213086d6..37039a6af8ae 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -738,17 +738,6 @@ void __init x86_numa_init(void)
> numa_init(dummy_numa_init);
> }
>
> -static void __init init_memory_less_node(int nid)
> -{
> - /* Allocate and initialize node data. Memory-less node is now online.*/
> - alloc_node_data(nid);
> - free_area_init_memoryless_node(nid);
> -
> - /*
> - * All zonelists will be built later in start_kernel() after per cpu
> - * areas are initialized.
> - */
> -}
>
> /*
> * A node may exist which has one or more Generic Initiators but no CPUs and no
> @@ -768,7 +757,7 @@ void __init init_gi_nodes(void)
>
> for_each_node_state(nid, N_GENERIC_INITIATOR)
> if (!node_online(nid))
> - init_memory_less_node(nid);
> + node_set_online(nid);

I would stick a TODO here.
/*
* Exclude this node from
* bringup_nonboot_cpus
* cpu_up
* __try_online_node
* register_one_node
* because node_subsys is not initialized yet
* TODO remove dependency on node_online()
*/
> }
>
> /*
> @@ -799,7 +788,7 @@ void __init init_cpu_to_node(void)
> continue;
>
> if (!node_online(node))
> - init_memory_less_node(node);
> + node_set_online(node);

and here as well.
--
Michal Hocko
SUSE Labs