Re: [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups

From: Tobias Huschle
Date: Fri Jul 14 2023 - 10:23:40 EST


On 2023-07-14 15:14, Shrikanth Hegde wrote:
On 7/8/23 4:27 AM, Tim Chen wrote:
From: Tim C Chen <tim.c.chen@xxxxxxxxxxxxxxx>

In the current prefer sibling load balancing code, there is an implicit
assumption that the busiest sched group and local sched group are
equivalent, hence the number of tasks to be moved is simply the difference in
the number of tasks between the two groups (i.e. the imbalance) divided by two.

However, we may have a different number of cores between the cluster groups,
say when we take a CPU offline or when we have hybrid groups. In that case,
we should balance between the two groups such that the #tasks/#cores ratio
is the same between the same between both groups. Hence the imbalance

nit: typo here. "the same between" is repeated.

computed will need to reflect this.

Adjust the sibling imbalance computation to take the above considerations
into account.
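(As a made-up illustration: a group of 8 cores next to a group of 4 cores,
with 12 runnable tasks in total, should settle at an 8/4 task split, i.e.
one task per core on each side, rather than at the 6/6 split the plain
difference-divided-by-two logic would aim for.)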

Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f636d6c09dc6..f491b94908bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9372,6 +9372,41 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
 	return false;
 }

+static inline long sibling_imbalance(struct lb_env *env,
+				struct sd_lb_stats *sds,
+				struct sg_lb_stats *busiest,
+				struct sg_lb_stats *local)
+{
+	int ncores_busiest, ncores_local;
+	long imbalance;

Can imbalance be unsigned int or unsigned long, since sum_nr_running is
unsigned int?

+
+	if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+		return 0;
+
+	ncores_busiest = sds->busiest->cores;
+	ncores_local = sds->local->cores;
+
+	if (ncores_busiest == ncores_local) {
+		imbalance = busiest->sum_nr_running;
+		lsub_positive(&imbalance, local->sum_nr_running);
+		return imbalance;
+	}
+
+	/* Balance such that nr_running/ncores ratio are same on both groups */
+	imbalance = ncores_local * busiest->sum_nr_running;
+	lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
+	/* Normalize imbalance and do rounding on normalization */
+	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
+	imbalance /= ncores_local + ncores_busiest;
+
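For the record, the ratio-matching reasoning behind those two lines, writing
B/L for the busiest/local task counts and Nb/Nl for the core counts: moving k
tasks equalizes the ratios when (B - k)/Nb = (L + k)/Nl, i.e.
k = (Nl*B - Nb*L)/(Nb + Nl). The code computes roughly twice that,
(2*(Nl*B - Nb*L) + Nb + Nl)/(Nb + Nl), which keeps it in the same units as the
equal-core early return above, where the raw task difference (about twice the
number of tasks to move) is returned.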

Could this work for the case where the number of CPUs/cores differs between
two sched groups in a sched domain, such as the problem pointed out by Tobias
on s390? It would be nice if this patch worked for that case as well. I ran
the numbers for a few cases and it looks to work.
https://lore.kernel.org/lkml/20230704134024.GV4253@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#rb0a7dcd28532cafc24101e1d0aed79e6342e3901
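For anyone who wants to replay the numbers, below is a minimal user-space
sketch of the v3 arithmetic with plain integers in place of the scheduler
structures (lsub_positive() open-coded, the env->idle check dropped, and the
sibling_imbalance_sketch() name plus all core/task counts made up purely for
illustration):

#include <stdio.h>

/*
 * Stand-alone copy of the proposed sibling_imbalance() arithmetic, only for
 * playing with example numbers outside the kernel.
 */
static long sibling_imbalance_sketch(long ncores_busiest, long ncores_local,
				     long busiest_nr, long local_nr)
{
	long imbalance;

	if (!busiest_nr)
		return 0;

	if (ncores_busiest == ncores_local) {
		imbalance = busiest_nr - local_nr;
		return imbalance > 0 ? imbalance : 0;
	}

	/* Balance such that nr_running/ncores is the same on both groups */
	imbalance = ncores_local * busiest_nr - ncores_busiest * local_nr;
	if (imbalance < 0)
		imbalance = 0;

	/* Normalize imbalance and round, as in the patch */
	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
	imbalance /= ncores_local + ncores_busiest;

	/* Take advantage of resources in an empty sched group */
	if (imbalance == 0 && local_nr == 0 && busiest_nr > 1)
		imbalance = 2;

	return imbalance;
}

int main(void)
{
	/* hypothetical asymmetric groups, e.g. 6 cores vs. 2 after offlining */
	printf("%ld\n", sibling_imbalance_sketch(6, 2, 8, 0)); /* prints 5 */
	printf("%ld\n", sibling_imbalance_sketch(6, 2, 6, 2)); /* prints 1: ratios already equal */
	printf("%ld\n", sibling_imbalance_sketch(4, 4, 5, 1)); /* prints 4: old nr_diff behaviour */
	return 0;
}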



Just stumbled upon this patch series as well. In this version it looks
similar to the prototypes I played around with, but more complete.
So I'm happy that my understanding of the load balancer was kinda correct :)

From a functional perspective this appears to address the issues we saw on s390.



+	/* Take advantage of resource in an empty sched group */
+	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	    busiest->sum_nr_running > 1)
+		imbalance = 2;
+

I don't see how this case would be true. When there is an unequal number of
cores, local->sum_nr_running is 0, and busiest->sum_nr_running is at least 2,
imbalance will already be at least 1.
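For instance, plugging in made-up numbers with ncores_local = 2,
ncores_busiest = 8, local->sum_nr_running = 0 and busiest->sum_nr_running = 2:
the un-normalized value is 2 * 2 = 4, and (2 * 4 + 2 + 8) / (2 + 8) = 18 / 10
= 1 after the integer division, so the normalized imbalance is already
non-zero before this check is reached.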


Reviewed-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>

+	return imbalance;
+}
+
 static inline bool
 sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 {
@@ -10230,14 +10265,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	}

 	if (busiest->group_weight == 1 || sds->prefer_sibling) {
-		unsigned int nr_diff = busiest->sum_nr_running;
 		/*
 		 * When prefer sibling, evenly spread running tasks on
 		 * groups.
 		 */
 		env->migration_type = migrate_task;
-		lsub_positive(&nr_diff, local->sum_nr_running);
-		env->imbalance = nr_diff;
+		env->imbalance = sibling_imbalance(env, sds, busiest, local);
 	} else {

 		/*
@@ -10424,7 +10457,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * group's child domain.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_nr_running > local->sum_nr_running + 1)
+	    sibling_imbalance(env, &sds, busiest, local) > 1)
 		goto force_balance;

 	if (busiest->group_type != group_overloaded) {