Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

From: Mel Gorman
Date: Fri Feb 04 2022 - 04:04:17 EST


On Fri, Feb 04, 2022 at 12:36:54PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> [2022-02-03 14:46:52]:
>
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..e6cd55951304 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  		}
> >  	}
> >
> > +	/*
> > +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +	 * imbalanced.
> > +	 */
>
> We seem to be adding this hunk before the sched_domains may be degenerated.
> Wondering if we really want to do it before degeneration.
>

There was no obvious advantage to doing it later versus doing it at the
same time that characteristics like groups are being determined.

> Let's say we have 3 sched domains and we calculated the sd->imb_numa_nr for
> all the 3 domains, then let's say the middle sched_domain gets degenerated.
> Would the sd->imb_numa_nr's still be relevant?
>

It's expected that it is still relevant as the ratios with respect to
SD_SHARE_PKG_RESOURCES should still be consistent.

>
> > +	for_each_cpu(i, cpu_map) {
> > +		unsigned int imb = 0;
> > +		unsigned int imb_span = 1;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				struct sched_domain *top, *top_p;
> > +				unsigned int nr_llcs;
> > +
> > +				/*
> > +				 * For a single LLC per node, allow an
> > +				 * imbalance up to 25% of the node. This is an
> > +				 * arbitrary cutoff based on SMT-2 to balance
> > +				 * between memory bandwidth and avoiding
> > +				 * premature sharing of HT resources and SMT-4
> > +				 * or SMT-8 *may* benefit from a different
> > +				 * cutoff.
> > +				 *
> > +				 * For multiple LLCs, allow an imbalance
> > +				 * until multiple tasks would share an LLC
> > +				 * on one node while LLCs on another node
> > +				 * remain idle.
> > +				 */
> > +				nr_llcs = sd->span_weight / child->span_weight;
> > +				if (nr_llcs == 1)
> > +					imb = sd->span_weight >> 2;
> > +				else
> > +					imb = nr_llcs;
> > +				sd->imb_numa_nr = imb;
> > +
> > +				/* Set span based on the first NUMA domain. */
> > +				top = sd;
> > +				top_p = top->parent;
> > +				while (top_p && !(top_p->flags & SD_NUMA)) {
> > +					top = top->parent;
> > +					top_p = top->parent;
> > +				}
> > +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
>
> I am getting confused by imb_span.
> Let's say we have a topology of SMT -> MC -> DIE -> NUMA -> NUMA, with the
> SMT and MC domains having the SD_SHARE_PKG_RESOURCES flag set.
> We come here only for the DIE domain.
>
> The imb_span set here is then used for both of the subsequent sched
> domains; most likely they will be NUMA domains. Right?
>

Right.
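
To make the imb_span selection concrete, here is a minimal user-space
sketch of the walk in the hunk above. It is an illustration only: the
toy_domain struct, the TOY_* flag values and the CPU counts (DIE spanning
64 CPUs, NUMA domains spanning 128 and 256) are assumptions made up for
this example and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; not the kernel's definitions. */
#define TOY_SD_SHARE_PKG_RESOURCES	0x1
#define TOY_SD_NUMA			0x2

/* Toy stand-in for struct sched_domain with only the fields used here. */
struct toy_domain {
	unsigned int flags;
	unsigned int span_weight;
	struct toy_domain *parent;
};

/*
 * Mirror of the quoted walk: climb from sd until the parent is a NUMA
 * domain, then use that parent's span. If no NUMA parent exists, fall
 * back to the span of sd itself.
 */
unsigned int toy_imb_span(struct toy_domain *sd)
{
	struct toy_domain *top = sd;
	struct toy_domain *top_p = top->parent;

	assert(sd != NULL);

	while (top_p && !(top_p->flags & TOY_SD_NUMA)) {
		top = top->parent;
		top_p = top->parent;
	}
	return top_p ? top_p->span_weight : sd->span_weight;
}

/*
 * Build DIE -> NUMA -> NUMA with made-up CPU counts and return the
 * imb_span seen from the DIE domain.
 */
unsigned int toy_example_imb_span(void)
{
	struct toy_domain numa2 = { TOY_SD_NUMA, 256, NULL };
	struct toy_domain numa1 = { TOY_SD_NUMA, 128, &numa2 };
	struct toy_domain die = { 0, 64, &numa1 };

	return toy_imb_span(&die);
}
```

With these assumed spans the walk stops immediately because the parent of
DIE is already a NUMA domain, so imb_span is the span of the first NUMA
domain and both NUMA levels are later scaled against it.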

> > +			} else {
> > +				int factor = max(1U, (sd->span_weight / imb_span));
> > +
> > +				sd->imb_numa_nr = imb * factor;
>
> For SMT (or any sched domain below the LLCs), factor would be
> sd->span_weight, but imb_numa_nr and imb would be 0.

Yes.

> For NUMA (or any sched domain just above DIE), factor would be 1 and
> sd->imb_numa_nr would be nr_llcs.
> For subsequent sched_domains, the sd->imb_numa_nr would be some multiple of
> nr_llcs. Right?
>

Right.
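
As a worked example of the whole calculation, the sketch below replays
the loop from the hunk in user space for one CPU's domain hierarchy. It
is a simplification under stated assumptions: the hierarchy is a flat
array ordered from lowest to highest level rather than a parent/child
linked list, and sim_domain, the SIM_* flags and the spans used in the
usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_SHARE_PKG	0x1	/* stands in for SD_SHARE_PKG_RESOURCES */
#define SIM_NUMA	0x2	/* stands in for SD_NUMA */

/* Toy stand-in for struct sched_domain; levels[0] is the lowest level. */
struct sim_domain {
	unsigned int flags;
	unsigned int span_weight;
	unsigned int imb_numa_nr;
};

void sim_set_imb_numa_nr(struct sim_domain *levels, size_t n)
{
	unsigned int imb = 0;
	unsigned int imb_span = 1;
	size_t i, j;

	for (i = 0; i < n; i++) {
		struct sim_domain *sd = &levels[i];
		struct sim_domain *child = i ? &levels[i - 1] : NULL;

		if (!(sd->flags & SIM_SHARE_PKG) && child &&
		    (child->flags & SIM_SHARE_PKG)) {
			unsigned int nr_llcs;

			/* First domain above the LLC: count LLCs below it. */
			assert(child->span_weight > 0);
			nr_llcs = sd->span_weight / child->span_weight;
			if (nr_llcs == 1)
				imb = sd->span_weight >> 2;	/* 25% of the node */
			else
				imb = nr_llcs;
			sd->imb_numa_nr = imb;

			/* Span of the first NUMA level above, else our own span. */
			imb_span = sd->span_weight;
			for (j = i + 1; j < n; j++) {
				if (levels[j].flags & SIM_NUMA) {
					imb_span = levels[j].span_weight;
					break;
				}
			}
		} else {
			unsigned int factor = sd->span_weight / imb_span;

			if (factor < 1)
				factor = 1;
			sd->imb_numa_nr = imb * factor;
		}
	}
}
```

Feeding it SMT(2) -> MC(8) -> DIE(64) -> NUMA(128) -> NUMA(256), with
SMT and MC carrying SIM_SHARE_PKG, gives nr_llcs = 64 / 8 = 8 at DIE.
SMT and MC stay at 0 because imb is still 0 when they are visited, the
first NUMA level gets factor 1 and the second gets factor 2.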

--
Mel Gorman
SUSE Labs