Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

From: Vincent Guittot
Date: Wed Jul 05 2023 - 03:52:38 EST


Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit :
> On 2023-05-16 15:36, Vincent Guittot wrote:
> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@xxxxxxxxxxxxx>
> > wrote:
> > >
> > > The current load balancer implementation implies that scheduler
> > > groups,
> > > within the same domain, all host the same number of CPUs. This is
> > > reflected in the condition, that a scheduler group, which is load
> > > balancing and classified as having spare capacity, should pull work
> > > from the busiest group, if the local group runs less processes than
> > > the busiest one. This implies that these two groups should run the
> > > same number of processes, which is problematic if the groups are not
> > > of the same size.
> > >
> > > The assumption that scheduler groups within the same scheduler domain
> > > host the same number of CPUs appears to be true for non-s390
> > > architectures. Nevertheless, s390 can have scheduler groups of unequal
> > > size.
> > >
> > > This introduces a performance degredation in the following scenario:
> > >
> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> > > the remaining 2 are located on another socket:
> > >
> > > Socket -----1----- -2-
> > > CPU 1 2 3 4 5 6 7 8
> > >
> > > Placing some workload ( x = one task ) yields the following
> > > scenarios:
> > >
> > > The first 5 tasks are distributed evenly across the two groups.
> > >
> > > Socket -----1----- -2-
> > > CPU 1 2 3 4 5 6 7 8
> > > x x x x x
> > >
> > > Adding a 6th task yields the following distribution:
> > >
> > > Socket -----1----- -2-
> > > CPU 1 2 3 4 5 6 7 8
> > > SMT1 x x x x x
> > > SMT2 x
> >
> > Your description is a bit confusing for me. What you name CPU above
> > should be named Core, doesn' it ?
> >
> > Could you share with us your scheduler topology ?
> >
>
> You are correct, it should say core instead of CPU.
>
> One actual configuration from one of my test machines (lscpu -e):
>

[...]

>
> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.

Thaks for the details

>
> If I run stress-ng with 8 cpu stressors on the original code I get a
> distribution
> like this:
>
> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
> x x x x x x x x
>
> Which means that the two cores in the smaller group are running into SMT
> while two
> cores in the larger group are still idle. This is caused by the
> prefer_sibling path
> which really wants to see both groups run the same number of tasks.

yes and it considers that there are the same number of CPUs per group

>
> > >
> > > The task is added to the 2nd scheduler group, as the scheduler has the
> > > assumption that scheduler groups are of the same size, so they should
> > > also host the same number of tasks. This makes CPU 7 run into SMT
> > > thread, which comes with a performance penalty. This means, that in
> > > the window of 6-8 tasks, load balancing is done suboptimally, because
> > > SMT is used although there is no reason to do so as fully idle CPUs
> > > are still available.
> > >
> > > Taking the weight of the scheduler groups into account, ensures that
> > > a load balancing CPU within a smaller group will not try to pull tasks
> > > from a bigger group while the bigger group still has idle CPUs
> > > available.
> > >
> > > Signed-off-by: Tobias Huschle <huschle@xxxxxxxxxxxxx>
> > > ---
> > > kernel/sched/fair.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 48b6f0ca13ac..b1307d7e4065 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10426,7 +10426,8 @@ static struct sched_group
> > > *find_busiest_group(struct lb_env *env)
> > > * group's child domain.
> > > */
> > > if (sds.prefer_sibling && local->group_type ==
> > > group_has_spare &&
> > > - busiest->sum_nr_running > local->sum_nr_running + 1)
> > > + busiest->sum_nr_running * local->group_weight >
> > > + local->sum_nr_running *
> > > busiest->group_weight + 1)


what you want to test here is that moving 1 task from busiest to local group
would help and balance the ratio of tasks per cpu

(busiest->sum_nr_running - 1) / busiest->group_weight > (local->sum_nr_running + 1) / local->group_weight

which can be develop into

busiest->sum_nr_running * local->group_weight >= local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight

and you also have to change how we calculate the imbalance which just provide the half of the diff of nr_running

by something like

(busiest->sum_nr_running * local->group_weight) - (local->sum_nr_running * busiest->group_weight) / (busiest->group_weight + local->group_weight)

> >
> > This is the prefer_sibling path. Could it be that you should disable
> > prefer_siling between your sockets for such topology ? the default
> > path compares the number of idle CPUs when groups has spare capacity
> >
> >
>
> If I disable prefer_sibling (I played around with it a bit), I run into the
> problem,
> that the tasks are distributed s.t. each group has the same amount of idle
> CPUs, which
> yields distributions similar to this:
>
> 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
> x x x x x x x x
>
> Now both groups have 4 idle CPUs which fulfills the criteria imposed by the
> load balancer,
> but the larger group is now running SMT while the smaller one is just idle.
>
> So, in this asymmetric setup, both criteria seem to not work in an optimal
> way. Going for
> the same number of idle CPUs or alternatively for the same number of running
> processes
> both cause a sub-optimal distribution of tasks, leading to unnecessary SMT.

there is the same behavior and assumption here too


>
> It seems also to be possible to address the regular load balancing path by
> aiming to have the
> same unused capacity between groups instead of the same number of idle CPUs.
> This seems to
> have been considered in the past, but the choice went in favor of the number
> of idle CPUs.

unused capacity doesn't give the instantaneous state so a group can be idle but without
unused capacity

> Since this decision was actively taken already, I focused on the
> prefer_sibling path.
>
> The question now would be how to address this properly (or if I'm missing
> something here).
> As mentioned in the cover letter, this was the most simplistic and least
> invasive approach
> I could find, others might be more sophisticated but also have some
> side-effects.
>
> I have a bit of a hard time leaving this one as-is, as it just introduces
> two additional
> multiplications with no effect for most architectures. Maybe an
> architectures specific
> inline function that the compiler can optimize away if not needed?
>
> > > goto force_balance;
> > >
> > > if (busiest->group_type != group_overloaded) {
> > > --
> > > 2.34.1
> > >
>