Re: [PATCH] x86/tsc: Extend the watchdog check exemption to 4S/8S machine

From: Feng Tang
Date: Tue Oct 11 2022 - 03:52:35 EST


On Tue, Oct 11, 2022 at 09:09:12AM +0800, Feng Tang wrote:
> On Mon, Oct 10, 2022 at 07:23:10AM -0700, Dave Hansen wrote:
> > On 10/9/22 18:23, Feng Tang wrote:
> > >>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > >>> index cafacb2e58cc..b4ea79cb1d1a 100644
> > >>> --- a/arch/x86/kernel/tsc.c
> > >>> +++ b/arch/x86/kernel/tsc.c
> > >>> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> > >>> if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > >>> boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > >>> boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > >>> - nr_online_nodes <= 2)
> > >>> + nr_online_nodes <= 8)
> > >> So you're saying all 8 socket systems since Broadwell (?) are TSC
> > >> sync'ed ?
> > > No, I didn't mean that. I haven't had access to any 8-socket
> > > machine, and I got a report last month that on one 8S machine,
> > > the TSC was judged 'unstable' with HPET as the watchdog.
> >
> > That's not a great check. Think about numa=fake=4U, for instance. Or a
> > single-socket system with persistent memory and high bandwidth memory.
> >
> > Basically 'nr_online_nodes' is a software construct. It's going to be
> > really hard to infer anything from it about what the _hardware_ is.
>
> You are right! Getting the socket count was indeed a problem when
> I worked on commit b50db7095fe0, and it comes down to the
> initialization order. This tsc check needs to be done in tsc_init(),
> while node_states[] only gets initialized later, in smp_init().
>
> For the case you mentioned above, I dug out some old logs which showed
> its init order:
>
> numa=fake=4 on a SKL desktop
> ================
> [ 0.000066] [tsc_early_init()]: nr_online_nodes = 1
> [ 0.000068] [tsc_early_init()]: nr_cpu_nodes = 0
> [ 0.000070] [tsc_early_init()]: nr_mem_nodes = 0
> [ 0.104015] [tsc_init()]: nr_online_nodes = 4
> [ 0.104019] [tsc_init()]: nr_cpu_nodes = 0
> [ 0.104022] [tsc_init()]: nr_mem_nodes = 4
> [ 0.124778] smp: Brought up 4 nodes, 4 CPUs
> [ 0.760915] [init_tsc_clocksource()]: nr_online_nodes = 4
> [ 0.760919] [init_tsc_clocksource()]: nr_cpu_nodes = 4
> [ 0.760922] [init_tsc_clocksource()]: nr_mem_nodes = 4
>
> QEMU with 2 CPU-DRAM nodes + 2 Persistent memory nodes
> ========================================================
> [ 0.066651] [tsc_early_init()]: nr_online_nodes = 1
> [ 0.067494] [tsc_early_init()]: nr_cpu_nodes = 0
> [ 0.068288] [tsc_early_init()]: nr_mem_nodes = 0
> [ 0.677694] [tsc_init()]: nr_online_nodes = 4
> [ 0.678862] [tsc_init()]: nr_cpu_nodes = 0
> [ 0.679962] [tsc_init()]: nr_mem_nodes = 4
> [ 1.139240] [init_tsc_clocksource()]: nr_online_nodes = 4
> [ 1.140576] [init_tsc_clocksource()]: nr_cpu_nodes = 2
> [ 1.141823] [init_tsc_clocksource()]: nr_mem_nodes = 4
> [ 1.660100] [kernel_init()]: nr_online_nodes = 4
> [ 1.661234] [kernel_init()]: nr_cpu_nodes = 2
> [ 1.662300] [kernel_init()]: nr_mem_nodes = 4
>
> 'nr_online_nodes' was chosen in the hope that, in the worst case,
> the patch is just a nop and won't wrongly lift the check.
>
> One possible solution for this problem is to leverage the SRAT table
> early init which is called before tsc_init(), and can provide CPU
> nodes info. Will try this way.

The simple patch below adds a dedicated CPU nodemask and sets it during
early SRAT CPU parsing. It still has a problem when sub-NUMA is enabled
in the BIOS, since the SRAT table then contains more NUMA nodes than
there are sockets. (Also, I'm not sure the change to amdtopology.c is
right.)
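
To make the idea and the sub-NUMA caveat concrete, here is a minimal
user-space sketch, not kernel code. It assumes at most 64 nodes and uses
a plain 64-bit mask as a stand-in for nodemask_t; cpu_nodemask_t,
cpu_node_set(), cpu_nodes_weight(), mask_from_cpu_entries() and
watchdog_exempt() are hypothetical names that only mimic what
node_set()/nodes_weight() do in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's nodemask_t; holds up to 64 nodes. */
typedef uint64_t cpu_nodemask_t;

/* Models node_set(node, numa_cpu_nodes) in the SRAT CPU affinity parsers. */
static void cpu_node_set(cpu_nodemask_t *mask, unsigned int node)
{
	assert(node < 64);
	*mask |= (uint64_t)1 << node;
}

/* Models nodes_weight(): counts the distinct nodes set in the mask. */
static unsigned int cpu_nodes_weight(cpu_nodemask_t mask)
{
	unsigned int count = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		count++;
	}
	return count;
}

/* Builds a mask from the node ID carried by each CPU affinity entry. */
static cpu_nodemask_t mask_from_cpu_entries(const unsigned int *nodes, size_t n)
{
	cpu_nodemask_t mask = 0;

	for (size_t i = 0; i < n; i++)
		cpu_node_set(&mask, nodes[i]);
	return mask;
}

/* Models only the node-count condition of check_system_tsc_reliable(). */
static bool watchdog_exempt(cpu_nodemask_t mask)
{
	return cpu_nodes_weight(mask) <= 2;
}
```

With CPU entries on nodes {0, 1, 0, 1} (like the QEMU case above, where
the two persistent memory nodes carry no CPU entries) the weight is 2
and the exemption applies. With CPU entries on nodes {0, 1, 2, 3}, as a
2-socket machine with sub-NUMA enabled might report, the weight is 4 and
the exemption is lost even though the socket count is still 2.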

Thanks,
Feng

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..e745053a5f9a 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -31,6 +31,7 @@ extern int numa_off;
*/
extern s16 __apicid_to_node[MAX_LOCAL_APIC];
extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_cpu_nodes __initdata;

extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
extern void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 179e0b1ba5cc..a2a7fc5aa15c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -29,6 +29,7 @@
#include <asm/intel-family.h>
#include <asm/i8259.h>
#include <asm/uv/uv.h>
+#include <asm/numa.h>

unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
@@ -1218,7 +1219,7 @@ first_dump();
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
- nr_online_nodes <= 2)
+ nodes_weight(numa_cpu_nodes) <= 2)
tsc_disable_clocksource_watchdog();
}

diff --git a/arch/x86/mm/amdtopology.c b/arch/x86/mm/amdtopology.c
index b3ca7d23e4b0..6b982a16cc38 100644
--- a/arch/x86/mm/amdtopology.c
+++ b/arch/x86/mm/amdtopology.c
@@ -152,6 +152,7 @@ int __init amd_numa_init(void)
prevbase = base;
numa_add_memblk(nodeid, base, limit);
node_set(nodeid, numa_nodes_parsed);
+ node_set(nodeid, numa_cpu_nodes);
}

if (nodes_empty(numa_nodes_parsed))
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 090125b3ee1f..82798fee97a2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -21,6 +21,7 @@

int numa_off;
nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_cpu_nodes __initdata;

struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
EXPORT_SYMBOL(node_data);
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 7688117ac2f4..11b08b317306 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -59,6 +59,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
}
set_apicid_to_node(apic_id, node);
node_set(node, numa_nodes_parsed);
+ node_set(node, numa_cpu_nodes);

printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%04x -> Node %u\n",
@@ -106,6 +107,7 @@ acpi_numa_processor_affinity_init(struct acpi_srat_cpu_affinity *pa)

set_apicid_to_node(apic_id, node);
node_set(node, numa_nodes_parsed);
+ node_set(node, numa_cpu_nodes);

printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%02x -> Node %u\n",