Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups

From: Tobias Huschle
Date: Tue Jul 18 2023 - 02:10:19 EST


On 2023-07-15 01:05, Tim Chen wrote:
On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:



If we consider symmetric platforms which have SMT4 such as power10.
we have a topology like below. multiple such MC will form DIE(PKG)


[0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
[--SMT--][--SMT--][----SMT---][---SMT----]
[--sg1--][--sg1--][---sg1----][---sg1----]
[--------------MC------------------------]

In case of SMT4, if there is any group which has 2 or more tasks, that
group will be marked as group_smt_balance. previously, if that group had 2
or 3 tasks, it would have been marked as group_has_spare. Since all the groups have
SMT that means behavior would be same fully busy right? That can cause some
corner cases. No?

You raised a good point. I was looking from SMT2
perspective so group_smt_balance implies group_fully_busy.
That is no longer true for SMT4.

I am thinking of the following fix on the current patch
to take care of SMT4. Do you think this addresses
concerns from you and Tobias?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 294a662c9410..3fc8d3a3bd22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
break;

case group_smt_balance:
+ /* no idle cpus on both groups handled by
group_fully_busy below */
+ if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
+ if (sgs->idle_cpus > busiest->idle_cpus)
+ return false;
+ if (sgs->idle_cpus < busiest->idle_cpus)
+ return true;
+ if (sgs->sum_nr_running <= busiest_sum_nr_running)
+ return false;
+ else
+ return true;
+ }


I will be on vacation next three weeks so my response will be slow.

Tim


What if the setup is asymmetric, where SMT2 and SMT4 would mix, e.g.

[0 1][2 3 4 5]
[SMT][--SMT--]

If now CPUs 0,2,3 have a running task, both groups would be classified as
smt_balance. But if it comes to the selection of the busiest group, the smaller
group would be selected, as it has less idle CPUs, right? Which could lead
to the smaller group being left with no tasks.
Using the absolute numbers of task is what made the prefer_sibling path problematic,
I would assume that the same holds true here. Therefore, I would prefer avg_load,
or, similar to prefer_siblings, a ratio over the number of cores.

I can't really test that on s390 as we always have SMT2. But, we can have these
asymmetries on higher levels, e.g.

[0 1][2 3][4 5][6 7][8 9]
[SMT][SMT][SMT][SMT][SMT]
[-----core----][--core--]

For large configurations this can be true for even higher levels.
Therefore, the idea was to move the smt_balance state around and adapt its
conditions to something like this (which would require to reorder the commits):

@@ -8330,6 +8330,11 @@ enum fbq_type { regular, remote, all };
enum group_type {
/* The group has spare capacity that can be used to run more tasks. */
group_has_spare = 0,
+ /*
+ * Balance SMT group that's fully busy. Can benefit from migration
+ * a task on SMT with busy sibling to another CPU on idle core.
+ */
+ group_smt_balance,
/*
* The group is fully used and the tasks don't compete for more CPU
* cycles. Nevertheless, some tasks might wait before running.
@@ -8340,11 +8345,6 @@ enum group_type {
* more powerful CPU.
*/
group_misfit_task,
- /*
- * Balance SMT group that's fully busy. Can benefit from migration
- * a task on SMT with busy sibling to another CPU on idle core.
- */
- group_smt_balance,
/*
* SD_ASYM_PACKING only: One local CPU with higher capacity is available,
* and the task should be migrated to it instead of running on the
@@ -9327,15 +9327,15 @@ group_type group_classify(unsigned int imbalance_pct,
if (sgs->group_asym_packing)
return group_asym_packing;

- if (sgs->group_smt_balance)
- return group_smt_balance;
-
if (sgs->group_misfit_task_load)
return group_misfit_task;

if (!group_has_capacity(imbalance_pct, sgs))
return group_fully_busy;

+ if (sgs->group_smt_balance)
+ return group_smt_balance;
+
return group_has_spare;
}

@@ -9457,8 +9457,7 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
* Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
* will not be on.
*/
- if (group->flags & SD_SHARE_CPUCAPACITY &&
- sgs->sum_h_nr_running > 1)
+ if (sgs->sum_h_nr_running > group->cores)
return true;

return false;

The s390 problem is currently solved by changing the prefer_sibling path. When
disabling that flag, we might have an issue, will have to verify that though.


One example is Lets say sg1 has 4 tasks. and sg2 has 0 tasks and is trying to do
load balance. Previously imbalance would have been 2, instead now imbalance would be 1.
But in subsequent lb it would be balanced.



> + return false;
> +}
> +
> static inline bool
> sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
> {
> @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->group_asym_packing = 1;
> }
>
> + /* Check for loaded SMT group to be balanced to dst CPU */
> + if (!local_group && smt_balance(env, sgs, group))
> + sgs->group_smt_balance = 1;
> +
> sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>
> /* Computing avg_load makes sense only when group is overloaded */
> @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> return false;
> break;
>
> + case group_smt_balance:
> case group_fully_busy:
> /*
> * Select the fully busy group with highest avg_load. In
> @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> break;
>
> case group_has_spare:
> + /*
> + * Do not pick sg with SMT CPUs over sg with pure CPUs,
> + * as we do not want to pull task off SMT core with one task
> + * and make the core idle.
> + */
> + if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> + if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
> + return false;
> + else
> + return true;> + }
> +
> /*
> * Select not overloaded group with lowest number of idle cpus
> * and highest number of running tasks. We could also compare
> @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
>
> case group_imbalanced:
> case group_asym_packing:
> + case group_smt_balance:
> /* Those types are not used in the slow wakeup path */
> return false;
>
> @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
> case group_imbalanced:
> case group_asym_packing:
> + case group_smt_balance:
> /* Those type are not used in the slow wakeup path */
> return NULL;
>
> @@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> return;
> }
>
> + if (busiest->group_type == group_smt_balance) {
> + /* Reduce number of tasks sharing CPU capacity */
> + env->migration_type = migrate_task;
> + env->imbalance = 1;
> + return;
> + }
> +
> if (busiest->group_type == group_imbalanced) {
> /*
> * In the group_imb case we cannot rely on group-wide averages
> @@ -10363,16 +10428,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> goto force_balance;
>
> if (busiest->group_type != group_overloaded) {
> - if (env->idle == CPU_NOT_IDLE)
> + if (env->idle == CPU_NOT_IDLE) {
> /*
> * If the busiest group is not overloaded (and as a
> * result the local one too) but this CPU is already
> * busy, let another idle CPU try to pull task.
> */
> goto out_balanced;
> + }
> +
> + if (busiest->group_type == group_smt_balance &&
> + smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
> + /* Let non SMT CPU pull from SMT CPU sharing with sibling */
> + goto force_balance;
> + }
>
> if (busiest->group_weight > 1 &&
> - local->idle_cpus <= (busiest->idle_cpus + 1))
> + local->idle_cpus <= (busiest->idle_cpus + 1)) {
> /*
> * If the busiest group is not overloaded
> * and there is no imbalance between this and busiest
> @@ -10383,12 +10455,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> * there is more than 1 CPU per group.
> */
> goto out_balanced;
> + }
>
> - if (busiest->sum_h_nr_running == 1)
> + if (busiest->sum_h_nr_running == 1) {
> /*
> * busiest doesn't have any tasks waiting to run
> */
> goto out_balanced;
> + }
> }
>
> force_balance: