Re: [RFC PATCH] sched/fair: Fix impossible migrate_util scenario in load balance

From: Vincent Guittot
Date: Tue Jul 18 2023 - 08:48:28 EST


On Sunday 16 Jul 2023 at 02:41:25 (+0100), Qais Yousef wrote:
> We've seen cases while running Geekbench where an idle little core
> never pulls a task from a bigger overloaded cluster for hundreds of
> ms, and sometimes for over a second.
>
> It turned out that the load balance identifies this as a migrate_util
> type since the local group (little cluster) has spare capacity and
> will try to pull a task. But the little cluster capacity is very small
> nowadays (around 200 or less), and if two busy tasks are stuck on a mid
> core which has a capacity of over 700, the util of each task will be
> in the 350+ range, which is always bigger than the spare capacity of
> the little group with a single idle core.
>
> When trying to detach_tasks(), we then bail out because of this
> comparison:
>
> 	if (util > env->imbalance)
> 		goto next;
>
> In calculate_imbalance() we convert a migrate_util type into a
> migrate_task type if the CPU trying to do the pull is idle. But we
> only do this if env->imbalance is 0, which I can't understand. AFAICT
> env->imbalance contains the local group's spare capacity. If it is 0,
> this means the local group is fully busy.
>
> Removing this condition fixes the problem, but since I can't fully
> understand why it checks for 0, I'm sending this as an RFC. Could it
> be a typo that was meant to check for
>
> 	env->imbalance != 0
>
> instead?
>
> Signed-off-by: Qais Yousef (Google) <qyousef@xxxxxxxxxxx>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..682d9d6a8691 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10288,7 +10288,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		 * waiting task in this overloaded busiest group. Let's
>  		 * try to pull it.
>  		 */
> -		if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
> +		if (env->idle != CPU_NOT_IDLE) {

With this change you completely skip migrate_util for the idle and
newly idle cases, which would be too aggressive.
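
For context, this is roughly what the code looks like with your change
applied (abridged from memory of calculate_imbalance(), not a verbatim
copy of fair.c): the migrate_util imbalance computed just above the
quoted hunk would be unconditionally discarded whenever the pulling CPU
is idle or newly idle, and every such balance would degenerate into a
single-task pull.

	env->migration_type = migrate_util;
	env->imbalance = max(local->group_capacity, local->group_util) -
			 local->group_util;

	/*
	 * ... There is at least one waiting task in this overloaded
	 * busiest group. Let's try to pull it.
	 */
	if (env->idle != CPU_NOT_IDLE) {	/* "== 0" check removed */
		/* Always taken on idle/newly idle balance */
		env->migration_type = migrate_task;
		env->imbalance = 1;
	}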

We can do something similar to what is done for migrate_load in
detach_tasks():

---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3df5b1642a6..64111ac7e137 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8834,7 +8834,13 @@ static int detach_tasks(struct lb_env *env)
 		case migrate_util:
 			util = task_util_est(p);
 
-			if (util > env->imbalance)
+			/*
+			 * Make sure that we don't migrate too much
+			 * utilization. Nevertheless, let's relax the
+			 * constraint if the scheduler fails to find a good
+			 * waiting task to migrate.
+			 */
+			if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
 				goto next;
 
 			env->imbalance -= util;
--
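
For reference, shr_bound() (defined in kernel/sched/sched.h) is just a
right shift with the shift amount clamped so it stays defined
behaviour, i.e. the util threshold is halved on each consecutive failed
balance attempt. A standalone userspace sketch (not kernel code; the
350/200 figures are the ones from your report) of how the check
relaxes:

	#include <stdio.h>

	/* Userspace model of shr_bound(): right shift with a clamped shift */
	static unsigned long shr_bound(unsigned long val, unsigned int shift)
	{
		unsigned int max_shift = sizeof(val) * 8 - 1;

		return val >> (shift < max_shift ? shift : max_shift);
	}

	int main(void)
	{
		unsigned long util = 350;	/* util of each stuck task */
		unsigned long imbalance = 200;	/* little group spare capacity */
		unsigned int failed;

		for (failed = 0; failed <= 2; failed++)
			printf("nr_balance_failed=%u: threshold=%lu -> %s\n",
			       failed, shr_bound(util, failed),
			       shr_bound(util, failed) > imbalance ?
					"goto next" : "detach task");
		return 0;
	}

So with util ~350 and an imbalance of ~200, the task becomes eligible
after a single failed attempt (350 >> 1 = 175 <= 200), and the idle
little CPU can pull it on the next balance instead of staying idle for
hundreds of ms.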



>  			env->migration_type = migrate_task;
>  			env->imbalance = 1;
>  		}
> --
> 2.25.1
>