Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups

From: Shrikanth Hegde
Date: Fri Jul 14 2023 - 09:07:01 EST




On 7/8/23 4:27 AM, Tim Chen wrote:
> From: Tim C Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>

Hi Tim. Sorry for the delayed response.

> On hybrid CPUs with scheduling cluster enabled, we will need to
> consider balancing between SMT CPU cluster, and Atom core cluster.
>
> Below shows such a hybrid x86 CPU with 4 big cores and 8 atom cores.
> Each scheduling cluster span a L2 cache.
>
> --L2-- --L2-- --L2-- --L2-- ----L2---- -----L2------
> [0, 1] [2, 3] [4, 5] [5, 6] [7 8 9 10] [11 12 13 14]
> Big Big Big Big Atom Atom
> core core core core Module Module
>
> If the busiest group is a big core with both SMT CPUs busy, we should
> active load balance if destination group has idle CPU cores. Such
> condition is considered by asym_active_balance() in load balancing but not
> considered when looking for busiest group and computing load imbalance.
> Add this consideration in find_busiest_group() and calculate_imbalance().
>
> In addition, update the logic determining the busier group when one group
> is SMT and the other group is non SMT but both groups are partially busy
> with idle CPU. The busier group should be the group with idle cores rather
> than the group with one busy SMT CPU. We do not want to make the SMT group
> the busiest one to pull the only task off SMT CPU and causing the whole core to
> go empty.
>
> Otherwise suppose in the search for the busiest group, we first encounter
> an SMT group with 1 task and set it as the busiest. The destination
> group is an atom cluster with 1 task and we next encounter an atom
> cluster group with 3 tasks, we will not pick this atom cluster over the
> SMT group, even though we should. As a result, we do not load balance
> the busier Atom cluster (with 3 tasks) towards the local atom cluster
> (with 1 task). And it doesn't make sense to pick the 1 task SMT group
> as the busier group as we also should not pull task off the SMT towards
> the 1 task atom cluster and make the SMT core completely empty.
>
> Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 77 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 87317634fab2..f636d6c09dc6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8279,6 +8279,11 @@ enum group_type {
> * more powerful CPU.
> */
> group_misfit_task,
> + /*
> + * Balance SMT group that's fully busy. Can benefit from migration
> + * a task on SMT with busy sibling to another CPU on idle core.
> + */
> + group_smt_balance,

Could you please explain what group_smt_balance does differently? AFAIU it does the same
thing as group_fully_busy, but for the domain one level above the SMT domain, right?


> /*
> * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
> * and the task should be migrated to it instead of running on the
> @@ -8987,6 +8992,7 @@ struct sg_lb_stats {
> unsigned int group_weight;
> enum group_type group_type;
> unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> + unsigned int group_smt_balance; /* Task on busy SMT be moved */
> unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> @@ -9260,6 +9266,9 @@ group_type group_classify(unsigned int imbalance_pct,
> if (sgs->group_asym_packing)
> return group_asym_packing;
>
> + if (sgs->group_smt_balance)
> + return group_smt_balance;
> +
> if (sgs->group_misfit_task_load)
> return group_misfit_task;
>
> @@ -9333,6 +9342,36 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> }
>
> +/* One group has more than one SMT CPU while the other group does not */
> +static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
> + struct sched_group *sg2)
> +{
> + if (!sg1 || !sg2)
> + return false;
> +
> + return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
> + (sg2->flags & SD_SHARE_CPUCAPACITY);
> +}
> +
> +static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> + struct sched_group *group)
> +{
> + if (env->idle == CPU_NOT_IDLE)
> + return false;
> +
> + /*
> + * For SMT source group, it is better to move a task
> + * to a CPU that doesn't have multiple tasks sharing its CPU capacity.
> + * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
> + * will not be on.
> + */
> + if (group->flags & SD_SHARE_CPUCAPACITY &&
> + sgs->sum_h_nr_running > 1)
> + return true;
> +

Consider symmetric platforms with SMT4, such as Power10. The topology looks
like the one below; multiple such MC domains form the DIE (PKG) domain.


[0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
[--SMT--][--SMT--][----SMT---][---SMT----]
[--sg1--][--sg1--][---sg1----][---sg1----]
[--------------MC------------------------]

With SMT4, any group that has 2 or more tasks will be marked as
group_smt_balance. Previously, a group with 2 or 3 tasks would have been
marked as group_has_spare. Since all the groups have SMT, does that mean the
behavior would be the same as fully busy? That can cause some corner cases, no?

One example: let's say sg1 has 4 tasks, and sg2 has 0 tasks and is trying to do
load balance. Previously the imbalance would have been 2; now it would be 1.
But it would be balanced in a subsequent lb.
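
To make the SMT4 example concrete, here is a rough userspace model, not kernel
code. The struct and function names (sg_model, model_smt_balance,
imbalance_before, imbalance_after) are made up for illustration. The
"before" path assumes the existing has_spare handling moves half of the
idle-CPU difference between the local and busiest groups; the "after" path
mirrors the env->imbalance = 1 assignment from the calculate_imbalance() hunk.
The env->idle check in smt_balance() is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-group stats. */
struct sg_model {
	unsigned int weight;      /* CPUs in the group (4 for one SMT4 core) */
	unsigned int nr_running;  /* runnable tasks in the group */
	bool share_cpucapacity;   /* models SD_SHARE_CPUCAPACITY on the group */
};

/* Mirrors the condition in the proposed smt_balance(), minus env->idle. */
static bool model_smt_balance(const struct sg_model *sg)
{
	return sg->share_cpucapacity && sg->nr_running > 1;
}

static unsigned int model_idle_cpus(const struct sg_model *sg)
{
	assert(sg->weight > 0);
	return sg->nr_running >= sg->weight ? 0 : sg->weight - sg->nr_running;
}

/* Assumed pre-patch behaviour: even out idle CPUs, i.e. move half the gap. */
static unsigned int imbalance_before(const struct sg_model *local,
				     const struct sg_model *busiest)
{
	unsigned int li = model_idle_cpus(local);
	unsigned int bi = model_idle_cpus(busiest);

	return li > bi ? (li - bi) / 2 : 0;
}

/* With the patch: a group_smt_balance busiest group yields exactly one task. */
static unsigned int imbalance_after(const struct sg_model *local,
				    const struct sg_model *busiest)
{
	if (model_smt_balance(busiest))
		return 1;

	return imbalance_before(local, busiest);
}
```

With sg1 = { 4, 4, true } as busiest and sg2 = { 4, 0, true } as local, the
model gives 2 for imbalance_before() and 1 for imbalance_after(), which is the
difference described above.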



> + return false;
> +}
> +
> static inline bool
> sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
> {
> @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->group_asym_packing = 1;
> }
>
> + /* Check for loaded SMT group to be balanced to dst CPU */
> + if (!local_group && smt_balance(env, sgs, group))
> + sgs->group_smt_balance = 1;
> +
> sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>
> /* Computing avg_load makes sense only when group is overloaded */
> @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> return false;
> break;
>
> + case group_smt_balance:
> case group_fully_busy:
> /*
> * Select the fully busy group with highest avg_load. In
> @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> break;
>
> case group_has_spare:
> + /*
> + * Do not pick sg with SMT CPUs over sg with pure CPUs,
> + * as we do not want to pull task off SMT core with one task
> + * and make the core idle.
> + */
> + if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> + if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
> + return false;
> + else
> + return true;
> + }
> +
> /*
> * Select not overloaded group with lowest number of idle cpus
> * and highest number of running tasks. We could also compare
> @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
>
> case group_imbalanced:
> case group_asym_packing:
> + case group_smt_balance:
> /* Those types are not used in the slow wakeup path */
> return false;
>
> @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>
> case group_imbalanced:
> case group_asym_packing:
> + case group_smt_balance:
> /* Those type are not used in the slow wakeup path */
> return NULL;
>
> @@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> return;
> }
>
> + if (busiest->group_type == group_smt_balance) {
> + /* Reduce number of tasks sharing CPU capacity */
> + env->migration_type = migrate_task;
> + env->imbalance = 1;
> + return;
> + }
> +
> if (busiest->group_type == group_imbalanced) {
> /*
> * In the group_imb case we cannot rely on group-wide averages
> @@ -10363,16 +10428,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> goto force_balance;
>
> if (busiest->group_type != group_overloaded) {
> - if (env->idle == CPU_NOT_IDLE)
> + if (env->idle == CPU_NOT_IDLE) {
> /*
> * If the busiest group is not overloaded (and as a
> * result the local one too) but this CPU is already
> * busy, let another idle CPU try to pull task.
> */
> goto out_balanced;
> + }
> +
> + if (busiest->group_type == group_smt_balance &&
> + smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
> + /* Let non SMT CPU pull from SMT CPU sharing with sibling */
> + goto force_balance;
> + }
>
> if (busiest->group_weight > 1 &&
> - local->idle_cpus <= (busiest->idle_cpus + 1))
> + local->idle_cpus <= (busiest->idle_cpus + 1)) {
> /*
> * If the busiest group is not overloaded
> * and there is no imbalance between this and busiest
> @@ -10383,12 +10455,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> * there is more than 1 CPU per group.
> */
> goto out_balanced;
> + }
>
> - if (busiest->sum_h_nr_running == 1)
> + if (busiest->sum_h_nr_running == 1) {
> /*
> * busiest doesn't have any tasks waiting to run
> */
> goto out_balanced;
> + }
> }
>
> force_balance: