Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality

From: Gautham R. Shenoy
Date: Mon Jun 05 2023 - 00:46:40 EST


Hello Vincent,

On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
> On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > So wakeup based placement is mostly all about LLC, and given this thing
> > has dinky small LLCs it will pile up on the one LLC you target and leave
> > the others idle until the regular idle balancer decides to make an
> > appearance and move some around.
> >
> > But if these are fairly short running tasks, I can well see that not
> > going to help much.
> >
> >
> > Much of this was tuned back when there was 1 L3 per Node; something
> > which is still more or less true for Intel but clearly not for these
> > things.
> >
> >
> > The below is a bit crude and completely untested, but it might help. The
> > flip side of that coin is of course that people are going to complain
> > about how selecting a CPU is more expensive now and how this hurts their
> > performance :/
> >
> > Basically it will try and iterate all L3s in a node; wakeup will still
> > refuse to cross node boundaries.
>
> That remember me some discussion about system with fast on die
> interconnect where we would like to run wider than llc at wakeup (i.e.
> DIE level) something like the CLUSTER level but on the other side of
> MC
>

Adding Libo Chen who was a part of this discussion. IIRC, the problem was
that there was no MC domain on that system, which would have made the
SMT domain to be the sd_llc. But since the core is single threaded,
the SMT domain would be degnerated thus leaving no domain which has
the SD_SHARE_PKG_RESOURCES flag.

If I understand correctly, Peter's patch won't help in such a
situation.

However, it should help POWER10 which has the SMT domain as the LLC
and previously it was observed that moving the wakeup search to the
parent domain was helpful (Ref:
https://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@xxxxxxxxxxxxxxxxxx/)


--
Thanks and Regards
gautham.


> Another possibility to investigate would be that each wakeup of a
> worker is mostly unrelated to the previous one and it cares only
> waker. so we should use -1 for the prev_cpu
>
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..ddb7f16a07a9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > return idle_cpu;
> > }
> >
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
> > + struct sched_group *sg;
> > +
> > + if (!sd_node || sd_node == sd)
> > + return -1;
> > +
> > + sg = sd_node->groups;
> > + do {
> > + int cpu = cpumask_first(sched_group_span(sg));
> > + struct sched_domain *sd_child;
> > +
> > + sd_child = per_cpu(sd_llc, cpu);
> > + if (sd_child != sd) {
> > + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> > + if ((unsigned int)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > + sg = sg->next;
> > + } while (sg != sd_node->groups);
> > +
> > + return -1;
> > +}
> > +
> > /*
> > * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> > * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > if ((unsigned)i < nr_cpumask_bits)
> > return i;
> >
> > + if (sched_feat(SIS_NODE)) {
> > + i = select_idle_node(p, sd, target);
> > + if ((unsigned)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > return target;
> > }
> >
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index ee7f23c76bd3..f965cd4a981e 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> > */
> > SCHED_FEAT(SIS_PROP, false)
> > SCHED_FEAT(SIS_UTIL, true)
> > +SCHED_FEAT(SIS_NODE, true)
> >
> > /*
> > * Issue a WARN when we do multiple update_rq_clock() calls
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 678446251c35..d2e0e2e496a6 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > DECLARE_PER_CPU(int, sd_llc_size);
> > DECLARE_PER_CPU(int, sd_llc_id);
> > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index ca4472281c28..d94cbc2164ca 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > DEFINE_PER_CPU(int, sd_llc_size);
> > DEFINE_PER_CPU(int, sd_llc_id);
> > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
> > per_cpu(sd_llc_id, cpu) = id;
> > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> > + while (sd && sd->parent) {
> > + /*
> > + * SD_NUMA is the first domain spanning nodes, therefore @sd
> > + * must be the domain that spans a single node.
> > + */
> > + if (sd->parent->flags & SD_NUMA)
> > + break;
> > +
> > + sd = sd->parent;
> > + }
> > + rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
> > +
> > sd = lowest_flag_domain(cpu, SD_NUMA);
> > rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> >