Re: [RFC PATCH] sched/fair: Fix impossible migrate_util scenario in load balance

From: Vincent Guittot
Date: Tue Jul 18 2023 - 12:32:13 EST


On Tue, 18 Jul 2023 at 18:18, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> On 07/18/23 14:48, Vincent Guittot wrote:
> > On Sunday 16 Jul 2023 at 02:41:25 (+0100), Qais Yousef wrote:
> > > We've seen cases while running geekbench where an idle little core never
> > > pulls a task from a bigger overloaded cluster for 100s of ms, and
> > > sometimes over a second.
> > >
> > > It turned out that the load balance identifies this as a migrate_util
> > > type since the local group (little cluster) has a spare capacity and
> > > will try to pull a task. But the little cluster capacity is very small
> > > nowadays (around 200 or less) and if two busy tasks are stuck on a mid
> > > core which has a capacity of over 700, this means the util of each task
> > > will be in the 350+ range, which is always bigger than the spare
> > > capacity of the little group with a single idle core.
> > >
> > > When trying to detach_tasks() we bail out then because of the comparison
> > > of:
> > >
> > > if (util > env->imbalance)
> > > goto next;
> > >
> > > In calculate_imbalance() we convert a migrate_util into migrate_task
> > > type if the CPU trying to do the pull is idle. But we only do this if
> > > env->imbalance is 0; which I can't understand. AFAICT env->imbalance
> > > contains the local group's spare capacity. If it is 0, this means it's
> > > fully busy.
> > >
> > > Removing this condition fixes the problem, but since I can't fully
> > > understand why it checks for 0, sending this as RFC. It could be a typo
> > > and meant to check for
> > >
> > > env->imbalance != 0
> > >
> > > instead?
> > >
> > > Signed-off-by: Qais Yousef (Google) <qyousef@xxxxxxxxxxx>
> > > ---
> > > kernel/sched/fair.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a80a73909dc2..682d9d6a8691 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10288,7 +10288,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > > * waiting task in this overloaded busiest group. Let's
> > > * try to pull it.
> > > */
> > > - if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
> > > + if (env->idle != CPU_NOT_IDLE) {
> >
> > With this change you completely skip migrate_util for the idle and newly
> > idle cases, which would be too aggressive.
>
> Yeah I didn't have great confidence in it to be honest.
>
> Could you help me understand the meaning of env->imbalance == 0 though? At this
> stage its value is
>
> env->imbalance = max(local->group_capacity, local->group_util) - local->group_util;
>
> which AFAICT is calculating the _spare_ capacity, right? So when we check
> env->imbalance == 0 we say if this_cpu is (idle OR newly idle) AND the local
> group is fully utilized? Why must it be fully utilized to do the pull? It's
> counter intuitive to me. I'm probably misinterpreting something but can't see

This is a special case. We have some situations where group_util is
higher than group_capacity, for example because of tasks newly migrated
to this group, so the spare capacity is null even though one cpu is
idle or newly idle. In this case we try to pull a task, at the risk
that this group becomes overloaded. That's why we do not try to pull
a task every time.
But that might be a good choice all the time.

> it.
>
> >
> > We can do something similar to migrate_load in detach_tasks():
> >
> > ---
> > kernel/sched/fair.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d3df5b1642a6..64111ac7e137 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8834,7 +8834,13 @@ static int detach_tasks(struct lb_env *env)
> > case migrate_util:
> > util = task_util_est(p);
> >
> > - if (util > env->imbalance)
> > + /*
> > + * Make sure that we don't migrate too much utilization.
> > + * Nevertheless, let's relax the constraint if the
> > + * scheduler fails to find a good waiting task to
> > + * migrate.
> > + */
> > + if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
> > goto next;
>
> Thanks! This looks better but I still see a 100 or 200 ms delay sometimes.
> Still debugging it but I _think_ it's a combination of two things:
>
> 1. nr_balance_failed doesn't increment as fast - I see a lot of 0s with
> occasional 1s and less frequent 2s
> 2. something might wake up briefly on that cpu in between load balances,
>    and given how small the littles are, they push the nr_balance_failed
>    required to tip the scale even higher
>
>
> Thanks
>
> --
> Qais Yousef
>
> >
> > env->imbalance -= util;
> > --
> >
> >
> >
> > > env->migration_type = migrate_task;
> > > env->imbalance = 1;
> > > }
> > > --
> > > 2.25.1
> > >