Re: [PATCH] x86/numa: Map NUMA node to CPUs as per DeviceTree

From: Thomas Gleixner
Date: Sun Mar 24 2024 - 10:59:46 EST


On Tue, Mar 12 2024 at 10:43, Saurabh Sengar wrote:
> Currently for DeviceTree bootup, x86 code does the default mapping of
> CPUs to NUMA, which is wrong. This can cause incorrect mapping and WARN
> on a SMT enabled system like below:
>
> [0.417551] ------------[ cut here ]------------
> [0.417551] Saurabh sched: CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
> [0.417551] WARNING: CPU: 1 PID: 0 at topology_sane.isra.0+0x5c/0x6d
> [0.417551] Modules linked in:
> [0.417551] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.1.71-microsoft-hcl+ #4
> [0.417551] RIP: 0010:topology_sane.isra.0+0x5c/0x6d
> [0.417551] Code: 41 39 dc 74 27 80 3d 32 ae 2d 00 00 75 1e 41 89 d9 45 89 e0 44 89 d6 48 c7 c7 00 a6 4a 88 c6 05 19 ae 2d 00 01 e8 6e 1f cb ff <0f> 0b 41 39 dc 5b 41 5c 0f 94 c0 5d c3 cc cc cc cc 55 48 8b 05 05
> [0.417551] RSP: 0000:ffffc9000013feb0 EFLAGS: 00010086
> [0.417551] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [0.417551] RDX: 0000000000000003 RSI: 0000000000000086 RDI: 00000000ffffffff
> [0.417551] RBP: ffffc9000013fec0 R08: ffffffff88778160 R09: ffffffff88778160
> [0.417551] R10: ffff888227fe26da R11: ffff888227fe26c1 R12: 0000000000000001
> [0.417551] R13: 0000000000000000 R14: ffff888216415040 R15: 0000000000000000
> [0.417551] FS: 0000000000000000(0000) GS:ffff888216400000(0000) knlGS:0000000000000000
> [0.417551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [0.417551] CR2: 0000000000000000 CR3: 0000000208809001 CR4: 0000000000330ea0
> [0.417551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [0.417551] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [0.417551] Call Trace:
> [0.417551] <TASK>
> [0.417551] ? show_regs.cold+0x1a/0x1f
> [0.417551] ? __warn+0x6e/0xc0
> [0.417551] ? report_bug+0x101/0x1a0
> [0.417551] ? handle_bug+0x40/0x70
> [0.417551] ? exc_invalid_op+0x19/0x70
> [0.417551] ? asm_exc_invalid_op+0x1b/0x20
> [0.417551] ? topology_sane.isra.0+0x5c/0x6d
> [0.417551] match_smt+0xf6/0xfc
> [0.417551] set_cpu_sibling_map.cold+0x24f/0x512
> [0.417551] start_secondary+0x5c/0x110
> [0.417551] secondary_startup_64_no_verify+0xcd/0xdb
> [0.417551] </TASK>
> [0.417551] ---[ end trace 0000000000000000 ]---

Can you please trim the backtrace like documented. 95% of the pasted
information above is completely irrelevant to understand the issue.

CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
WARNING: CPU: 1 PID: 0 at topology_sane.isra.0+0x5c/0x6d
match_smt+0xf6/0xfc
set_cpu_sibling_map.cold+0x24f/0x512
start_secondary+0x5c/0x110

is more than sufficient, no?

> +static void __init of_parse_and_init_cpus(void)
> +{
> + struct device_node *dn;
> + int cpuid = 0;
> + int nid;
> +
> + for_each_of_cpu_node(dn) {
> + if (cpuid >= NR_CPUS) {

This condition is wrong. nr_cpu_ids != NR_CPUS.

> + pr_warn("NR_CPUS too small for %d cpuid\n", cpuid);

What's this for? The APIC enumeration code warns about this already.

> + return;
> + }
> + nid = of_node_to_nid(dn);
> + numa_set_node(cpuid, nid);
> + cpuid++;
> + }
> +}
> +
> static int __init numa_init(int (*init_func)(void))
> {
> int i;
> @@ -645,6 +662,9 @@ static int __init numa_init(int (*init_func)(void))
> if (ret < 0)
> return ret;
>
> + if (acpi_disabled)
> + of_parse_and_init_cpus();

numa_init() is invoked from x86_numa_init() with the various NUMA init
functions as argument and x86_numa_init() already has OF NUMA logic:

#ifdef CONFIG_ACPI_NUMA
if (!numa_init(x86_acpi_numa_init))
return;
#endif
#ifdef CONFIG_AMD_NUMA
if (!numa_init(amd_numa_init))
return;
#endif
if (acpi_disabled && !numa_init(of_numa_init))
return;

of_numa_init() does not do the numa_set_node() part, but that's not a
justification to glue this into numa_init() which is firmware
independent except for the firmware specific callback argument.

Also x86_numa_init() is invoked _before_ the APIC ID to Linux CPU number
association happens, so doing the CPU number to node mapping at this
point is just wrong for any CPU number != 0.

It "works" for OF just by chance as the actual APIC enumeration which
associates Linux CPU numbers works in the same order, but that does not
make it correct in any way.

x86_acpi_numa_init() and amd_numa_init() set up the nodes like
of_numa_init() does and aside of that save the APIC ID to node mapping.

Aside of that the numa_set_node() variant happens to "work" for Intel
CPUs as srat_detect_node() falls back to cpu_to_node() when
numa_cpu_node() returns NO_NUMA_NODE.

Though the AMD variant falls back to cpu_info.topo.llc_id which is not
necessarily the same result as what the device tree enumerated.

I can see why you glued it into numa_init():

of_node_to_nid() requires node_possible_map to be initialized, which
happens in numa_register_memblks() invoked from numa_init() if the
firmware specific callback which enumerates the nodes was successful.

Of course the change log is silent about this.

But there is no reason to scan this right in numa_init() as nothing
needs the information to be set at this point, unless I'm missing
something obscure, which might be the case when staring at this NUMA
enumeration maze.

The CPU to node mapping based on the APIC ID to node mapping happens in
init_cpu_to_node() which is after the APIC enumeration and the
finalizing of cpu_possible_map completed. At this point the CPU number
to APIC ID mapping is stable.

So the uncompiled below should just work, no?

Thanks,

tglx
---
--- a/arch/x86/kernel/devicetree.c
+++ b/arch/x86/kernel/devicetree.c
@@ -24,6 +24,7 @@
#include <asm/pci_x86.h>
#include <asm/setup.h>
#include <asm/i8259.h>
+#include <asm/numa.h>
#include <asm/prom.h>

__initdata u64 initial_dtb;
@@ -137,6 +138,7 @@ static void __init dtb_cpu_setup(void)
continue;
}
topology_register_apic(apic_id, CPU_ACPIID_INVALID, true);
+ set_apicid_to_node(apic_id, of_node_to_nid(dn));
}
}