Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality

From: Libo Chen
Date: Wed Jun 07 2023 - 10:42:51 EST

On 6/4/23 9:46 PM, Gautham R. Shenoy wrote:
Hello Vincent,

On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
So wakeup based placement is mostly all about LLC, and given this thing
has dinky small LLCs it will pile up on the one LLC you target and leave
the others idle until the regular idle balancer decides to make an
appearance and move some around.

But if these are fairly short running tasks, I can well see that not
going to help much.


Much of this was tuned back when there was 1 L3 per Node; something
which is still more or less true for Intel but clearly not for these
things.


The below is a bit crude and completely untested, but it might help. The
flip side of that coin is of course that people are going to complain
about how selecting a CPU is more expensive now and how this hurts their
performance :/

Basically it will try and iterate all L3s in a node; wakeup will still
refuse to cross node boundaries.
That reminds me of some discussion about systems with a fast on-die
interconnect where we would like to run wider than the LLC at wakeup
(i.e. at the DIE level), something like the CLUSTER level but on the
other side of MC.

Adding Libo Chen, who was a part of this discussion. IIRC, the problem was
that there was no MC domain on that system, which would have made the
SMT domain the sd_llc. But since the cores are single threaded,
the SMT domain gets degenerated, thus leaving no domain which has
the SD_SHARE_PKG_RESOURCES flag.

If I understand correctly, Peter's patch won't help in such a
situation.

Yes, there are some ARM platforms that have no L3 cache (they do have SLCs, which are memory-side caches)
and no hyperthreading, so the lowest domain level is DIE, and every wakee task goes back to its previous
CPU no matter what because an LLC domain doesn't exist. For such platforms, we would want to expand the
search range, similar to AMD, to take advantage of idle cores on the same DIE.
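
(For reference, sd_llc comes from the highest domain with SD_SHARE_PKG_RESOURCES set, roughly as below; this is paraphrased from update_top_cache_domain() in kernel/sched/topology.c, so treat it as a sketch rather than the exact code. With no cache-sharing domain at all, sd_llc stays NULL and select_idle_sibling() bails out without scanning for an idle CPU.)

static void update_top_cache_domain(int cpu)
{
        struct sched_domain_shared *sds = NULL;
        struct sched_domain *sd;
        int id = cpu;
        int size = 1;

        /* The highest cache-sharing domain becomes the "LLC" domain. */
        sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
        if (sd) {
                id = cpumask_first(sched_domain_span(sd));
                size = cpumask_weight(sched_domain_span(sd));
                sds = sd->shared;
        }

        /* No SD_SHARE_PKG_RESOURCES domain anywhere => sd is NULL here. */
        rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
        per_cpu(sd_llc_size, cpu) = size;
        per_cpu(sd_llc_id, cpu) = id;
        rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);

        /* (the rest sets up sd_numa, sd_asym_*, and, with Peter's patch below, sd_node) */
}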

I think we need a new abstraction here to accommodate all these different cache topologies:
replace SD_LLC with, for example, SD_WAKEUP, and allow a wider search range independent of the LLC.
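
Something along these lines, purely as an illustration: sd_wakeup and update_wakeup_domain() are made-up names, and the fallback rule (widest domain below NUMA) is just one possible policy, not a worked-out proposal.

/* Hypothetical per-CPU pointer naming the domain to scan at wakeup. */
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_wakeup);

static void update_wakeup_domain(int cpu)
{
        /* Default: same as today, the highest cache-sharing domain. */
        struct sched_domain *sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);

        /*
         * On topologies with no shared-cache domain at all (no L3, no SMT),
         * fall back to the widest domain below NUMA, e.g. DIE, so the idle
         * scan still has a sensible span.
         */
        if (!sd) {
                struct sched_domain *tmp;

                for_each_domain(cpu, tmp) {
                        if (tmp->flags & SD_NUMA)
                                break;
                        sd = tmp;
                }
        }

        rcu_assign_pointer(per_cpu(sd_wakeup, cpu), sd);
}

select_idle_sibling() would then dereference sd_wakeup instead of sd_llc when deciding where to scan, so platforms without an LLC domain still get an idle-CPU search at wakeup.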


Libo


However, it should help POWER10, which has the SMT domain as the LLC;
previously it was observed that moving the wakeup search to the
parent domain was helpful (Ref:
https://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@xxxxxxxxxxxxxxxxxx/ )


--
Thanks and Regards
gautham.


Another possibility to investigate would be that each wakeup of a
worker is mostly unrelated to the previous one and only the waker
matters, so we should use -1 for the prev_cpu.
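
Just to make that concrete, a minimal sketch of the idea (wakee_unrelated is a made-up per-task hint, and the helper substitutes the waker's CPU rather than a literal -1 so the rest of the wakeup path keeps seeing a valid CPU number):

/*
 * Sketch only: if a task's wakeups carry no locality with respect to
 * where it last ran (e.g. unbound workqueue workers), ignore prev_cpu
 * and let the waker's CPU drive placement.  "wakee_unrelated" is a
 * hypothetical hint, not an existing task_struct field.
 */
static inline int wakeup_prev_cpu(struct task_struct *p, int prev_cpu)
{
        if (p->wakee_unrelated)
                return smp_processor_id();      /* waker's CPU, not history */

        return prev_cpu;
}

select_task_rq_fair() could call this at the top of the WF_TTWU path, so wake_affine() and select_idle_sibling() never see the stale previous CPU.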

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..ddb7f16a07a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
         return idle_cpu;
 }
 
+static int
+select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
+{
+        struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
+        struct sched_group *sg;
+
+        if (!sd_node || sd_node == sd)
+                return -1;
+
+        sg = sd_node->groups;
+        do {
+                int cpu = cpumask_first(sched_group_span(sg));
+                struct sched_domain *sd_child;
+
+                sd_child = per_cpu(sd_llc, cpu);
+                if (sd_child != sd) {
+                        int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
+                        if ((unsigned int)i < nr_cpumask_bits)
+                                return i;
+                }
+
+                sg = sg->next;
+        } while (sg != sd_node->groups);
+
+        return -1;
+}
+
 /*
  * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
  * the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
         if ((unsigned)i < nr_cpumask_bits)
                 return i;
 
+        if (sched_feat(SIS_NODE)) {
+                i = select_idle_node(p, sd, target);
+                if ((unsigned)i < nr_cpumask_bits)
+                        return i;
+        }
+
         return target;
 }

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..f965cd4a981e 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_NODE, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 678446251c35..d2e0e2e496a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca4472281c28..d94cbc2164ca 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
         per_cpu(sd_llc_id, cpu) = id;
         rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
+        while (sd && sd->parent) {
+                /*
+                 * SD_NUMA is the first domain spanning nodes, therefore @sd
+                 * must be the domain that spans a single node.
+                 */
+                if (sd->parent->flags & SD_NUMA)
+                        break;
+
+                sd = sd->parent;
+        }
+        rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
+
         sd = lowest_flag_domain(cpu, SD_NUMA);
         rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);