[PATCH 00/10] Add support for Sub-NUMA cluster (SNC) systems

From: Tony Luck
Date: Wed Mar 27 2024 - 16:05:43 EST


This series applies on top of v6.9-rc1 plus these two patches:

Link: https://lore.kernel.org/all/20240308213846.77075-1-tony.luck@xxxxxxxxx/

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
series, Intel advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters in
the same way. This allows the monitoring features to be used, with the
caveat that users must be aware that Linux may migrate tasks more
frequently between SNC nodes than between "regular" NUMA nodes, so reading
counters from all SNC nodes may be needed to get a complete picture of
activity for tasks.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

This is a new approach triggered by the discussions that started with
"How can users tell that SNC is enabled?" but then drifted into
whether users of the legacy interface would really get what they
expected when reading from monitor files in the mon_L3_* directories.

During that discussion I'd mentioned providing monitor values for both
the L3 level, and also for each SNC node. That would provide full ABI
compatibility while also giving the finer grained reporting from each
SNC node.

The implementation sets up a new rdt_resource to track monitor resources
with domains for each SNC node. This resource is only used when SNC
mode is detected.

On SNC systems there is a parent-child relationship between the
old L3 resource and the new SUBL3 resource. Reading from legacy
files like mon_data/mon_L3_00/llc_occupancy reads and sums the RMID
counters from all "child" domains in the SUBL3 resource. E.g. on
an SNC3 system:

$ grep . mon_L3_01/llc_occupancy mon_L3_01/*/llc_occupancy
mon_L3_01/llc_occupancy:413097984
mon_L3_01/mon_SUBL3_03/llc_occupancy:141484032
mon_L3_01/mon_SUBL3_04/llc_occupancy:135659520
mon_L3_01/mon_SUBL3_05/llc_occupancy:135954432

So the legacy L3 file reports the total occupancy, which is
the sum of the cache occupancy on each of the SNC nodes
that share that L3 cache instance.
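
A user-space check of this relationship could look like the sketch below.
It is only an illustration, not part of the series: the helper name
snc_occupancy() is made up here, and the directory layout is assumed to
match the example output above (one llc_occupancy file directly under
mon_L3_XX plus one under each mon_SUBL3_YY child).

```python
from pathlib import Path


def snc_occupancy(l3_dir):
    """Read one mon_L3_* directory.

    Returns (l3_total, per_node), where per_node maps each
    mon_SUBL3_* child directory name to its llc_occupancy in bytes.
    Hypothetical helper; the layout is assumed from the cover letter.
    """
    l3_dir = Path(l3_dir)
    # Legacy file: the kernel reports the sum over all child domains.
    l3_total = int((l3_dir / "llc_occupancy").read_text())
    # Per-SNC-node files, one directory per child domain.
    per_node = {
        child.name: int((child / "llc_occupancy").read_text())
        for child in sorted(l3_dir.glob("mon_SUBL3_*"))
        if child.is_dir()
    }
    return l3_total, per_node


if __name__ == "__main__":
    # Assumes resctrl is mounted at the usual /sys/fs/resctrl location.
    total, nodes = snc_occupancy("/sys/fs/resctrl/mon_data/mon_L3_01")
    for name, value in nodes.items():
        print(f"{name}: {value}")
    print(f"sum of SNC nodes: {sum(nodes.values())}, L3 file: {total}")
```

On a live system the two numbers may differ slightly, since the files
are read at different instants.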

Patch 0001 has been salvaged from the previous postings.
All the rest are new.

Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>

Tony Luck (10):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Add new rdt_resource for sub-node monitoring
  x86/resctrl: Add new "enabled" state for monitor resources
  x86/resctrl: Add pointer to enabled monitor resource
  x86/resctrl: Add parent/child information to rdt_resource and
    rdt_domain
  x86/resctrl: Update mkdir_mondata_subdir() for Sub-NUMA domains
  x86/resctrl: Update rmdir_mondata_subdir_allrdtgrp() for Sub-NUMA
    domains
  x86/resctrl: Mark L3 monitor files with summation flag.
  x86/resctrl: Update __mon_event_count() for Sub-NUMA domains
  x86/resctrl: Determine Sub-NUMA configuration

 include/linux/resctrl.h                   |  20 ++-
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  23 ++-
 arch/x86/kernel/cpu/resctrl/core.c        |  76 +++++++---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |   3 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     | 136 +++++++++++++++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 170 +++++++++++++++++-----
 8 files changed, 364 insertions(+), 71 deletions(-)

--
2.44.0