Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

From: Peter Zijlstra
Date: Wed Jul 05 2023 - 07:57:26 EST


On Fri, Jun 16, 2023 at 12:04:48PM +0530, K Prateek Nayak wrote:

> [2] https://lore.kernel.org/all/3402dcc4-d52f-d99f-e6ce-b435478a5a59@xxxxxxx/

With the crucial bit being:

per_cpu(cpu_llc_id, cpu) = c->apicid >> 3;
+ per_cpu(cpu_mc_id, cpu) = c->apicid >> 4;

Would need some adjustments for <Zen3 I would think, because this simply
groups two LLCs, but those chips have a 4 core LLC and might be better
off with something like >> 5 instead.

> Conclusion seems to be that most workloads would like to run on an idle
> thread as quickly as possible, however, once the system becomes
> overloaded, even iterating over the groups to find an idle CPU outside
> of the target group can affect the workload performance. TOPOEXT is a
> clean way to limit search (as long as marking the boundaries can be
> done in a clean way) but there are concerns about the load balancing
> jitters the new domain will introduce. There will also be an increase
> in amount of C2C transfers as some of the shared data structures are
> accessed and modified (for example sched_domain_shared->has_idle_cores
> updates).

So per the parent of all this, I do think we want something like
SIS_NODE, at the very least for the desktop parts, doubly so for the
<Zen3 parts that have super dinky LLCs (like TJs desktop).

It's just that your big-ass chips need a little 'help' and in that
regard the TOPOEXT thing does look the most reasonable of the bunch.

One variant I did consider was to make SIS_NODE a domain flag, that
way the architecture can decide and we run less risk of randomly
regressing other archs that might not want this.

(did not yet test the SD flag version below)

---
Subject: sched/fair: Multi-LLC select_idle_sibling()

Tejun reported that when he targets workqueues towards a specific LLC
on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
significant idle time.

This is, of course, because of how select_idle_sibling() will not
consider anything outside of the local LLC, and since all these tasks
are short running the periodic idle load balancer is ineffective.

And while it is good to keep work cache local, it is better to not
have significant idle time. Therefore, have select_idle_sibling() try
other LLCs inside the same node when the local one comes up empty.

Reported-by: Tejun Heo <tj@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Link: https://lkml.kernel.org/r/168560901866.404.8439301702539997013.tip-bot2@tip-bot2
---
arch/x86/kernel/smpboot.c | 2 +-
include/linux/sched/sd_flags.h | 7 +++++++
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -596,7 +596,7 @@ static inline int x86_sched_itmt_flags(v
#ifdef CONFIG_SCHED_MC
static int x86_core_flags(void)
{
- return cpu_core_flags() | x86_sched_itmt_flags();
+ return cpu_core_flags() | x86_sched_itmt_flags() | SD_IDLE_SIBLING;
}
#endif
#ifdef CONFIG_SCHED_SMT
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -161,3 +161,10 @@ SD_FLAG(SD_OVERLAP, SDF_SHARED_PARENT |
* NEEDS_GROUPS: No point in preserving domain if it has a single group.
*/
SD_FLAG(SD_NUMA, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
+
+/*
+ * Search for idle CPUs in sibling groups
+ *
+ * NEEDS_GROUPS: Load balancing flag.
+ */
+SD_FLAG(SD_IDLE_SIBLING, SDF_NEEDS_GROUPS)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7046,6 +7046,38 @@ static int select_idle_cpu(struct task_s
}

/*
+ * For the multiple-LLC per node case, make sure to try the other LLC's if the
+ * local LLC comes up empty.
+ */
+static int
+select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct sched_domain *parent = sd->parent;
+ struct sched_group *sg;
+
+ /* Make sure to not cross nodes. */
+ if (!parent || parent->flags & SD_NUMA)
+ return -1;
+
+ sg = parent->groups;
+ do {
+ int cpu = cpumask_first(sched_group_span(sg));
+ struct sched_domain *sd_child = per_cpu(sd_llc, cpu);
+
+ if (!cpus_share_cache(cpu, target) && sd_child) {
+ int i = select_idle_cpu(p, sd_child,
+ test_idle_cores(cpu), cpu);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
+ sg = sg->next;
+ } while (sg != parent->groups);
+
+ return -1;
+}
+
+/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
* maximize capacity.
@@ -7217,6 +7249,12 @@ static int select_idle_sibling(struct ta
if ((unsigned)i < nr_cpumask_bits)
return i;

+ if (sd->flags & SD_IDLE_SIBLING) {
+ i = select_idle_node(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
return target;
}