Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

From: Tobias Huschle
Date: Fri Jul 07 2023 - 03:46:07 EST


On 2023-07-06 19:19, Shrikanth Hegde wrote:
On 5/15/23 5:16 PM, Tobias Huschle wrote:
The current load balancer implementation implies that scheduler groups
within the same domain all host the same number of CPUs. This is
reflected in the condition that a scheduler group which is load
balancing and classified as having spare capacity should pull work
from the busiest group if the local group runs fewer processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390
architectures. Nevertheless, s390 can have scheduler groups of unequal
size.

This introduces a performance degradation in the following scenario:

Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
the remaining 2 are located on another socket:

Socket -----1----- -2-
CPU 1 2 3 4 5 6 7 8

Placing some workload ( x = one task ) yields the following
scenarios:

The first 5 tasks are distributed evenly across the two groups.

Socket -----1----- -2-
CPU 1 2 3 4 5 6 7 8
x x x x x

Adding a 6th task yields the following distribution:

Socket -----1----- -2-
CPU 1 2 3 4 5 6 7 8
SMT1 x x x x x
SMT2 x

The task is added to the 2nd scheduler group, as the scheduler assumes
that scheduler groups are of the same size and should therefore also
host the same number of tasks. This makes CPU 7 run on its second SMT
thread, which comes with a performance penalty. This means that in
the window of 6-8 tasks, load balancing is done suboptimally, because
SMT is used although there is no reason to do so while fully idle CPUs
are still available.

Taking the weight of the scheduler groups into account ensures that
a load balancing CPU within a smaller group will not try to pull tasks
from a bigger group while the bigger group still has idle CPUs
available.

Signed-off-by: Tobias Huschle <huschle@xxxxxxxxxxxxx>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
* group's child domain.
*/
if (sds.prefer_sibling && local->group_type == group_has_spare &&
- busiest->sum_nr_running > local->sum_nr_running + 1)
+ busiest->sum_nr_running * local->group_weight >
+ local->sum_nr_running * busiest->group_weight + 1)
goto force_balance;



I assume it's SMT2 here. A sched group is denoted as [busy_cpus/idle_cpus/weight].

Let's take an example: we will begin with the said imbalance.
[3/9/12] -- local group
[3/1/4] -- busiest group
We will evaluate 3*12 > 3*4 + 1, which is true, and proceed further to
balance. But won't calculate_imbalance then compute zero as the
imbalance? In the prefer-sibling case it finds the difference of
sum_nr_running of the two groups, which will be 3-3 = 0.

This would need modifications to calculate_imbalance.
Maybe something like this will help? NOT TESTED AT ALL.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e027d4086edc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
return;
}

- if (busiest->group_weight == 1 || sds->prefer_sibling) {
+ if (busiest->group_weight == 1 ||
+ (sds->prefer_sibling &&
+ busiest->group_weight == local->group_weight)) {
unsigned int nr_diff = busiest->sum_nr_running;
/*
* When prefer sibling, evenly spread running tasks on
@@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
env->migration_type = migrate_task;
lsub_positive(&nr_diff, local->sum_nr_running);
env->imbalance = nr_diff;
+ }
+ if (sds->prefer_sibling &&
+ busiest->group_weight != local->group_weight) {
+ env->migration_type = migrate_task;
+ env->imbalance = 1;
} else {


I also had a look at this when Vincent pointed out that this part is missing.
The formula proposed by Vincent works pretty well; it is not even necessary
to add additional if-statements here.

---------------------------------------------------------------------------------------------------
On a tangential dimension.


Since prefer_siblings makes the inherent assumption that all groups have
equal weight, it will cause complications when group weights are different.
I think that becomes very tricky when there are more than two groups.
Depending on which CPU is doing load_balance, there can be unnecessary
migrations.


Currently, even in NUMA, this can be a similar case, right? There will be
an unequal number of CPUs, right?
If we fix that case and remove prefer_siblings on your arch, will that work?

Maybe something like this? NOT TESTED AT ALL.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..fc6377f48ced 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10514,6 +10514,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto out_balanced;

if (busiest->group_weight > 1 &&
+ busiest->group_weight == local->group_weight &&
local->idle_cpus <= (busiest->idle_cpus + 1))
/*
* If the busiest group is not overloaded
@@ -10526,6 +10527,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
*/
goto out_balanced;

+ if ((busiest->group_weight != local->group_weight) &&
+ (local->idle_cpus * busiest->group_weight <=
+ local->group_weight * (busiest->idle_cpus + 1)))
+ goto out_balanced;
+
if (busiest->sum_h_nr_running == 1)
/*
* busiest doesn't have any tasks waiting to run


I played around with alternate solutions as well; yours could be interesting in order
to essentially declare the problematic state as balanced. I wouldn't be opposed to
removing prefer_siblings. The question would be whether it is necessary to address
both scenarios anyway.