Re: [RFC PATCH] sched/fair: Fix impossible migrate_util scenario in load balance

From: Vincent Guittot
Date: Thu Jul 20 2023 - 08:32:02 EST


On Tue, 18 Jul 2023 at 19:25, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> On 07/18/23 18:31, Vincent Guittot wrote:
> > On Tue, 18 Jul 2023 at 18:18, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> > >
> > > On 07/18/23 14:48, Vincent Guittot wrote:
> > > > Le dimanche 16 juil. 2023 à 02:41:25 (+0100), Qais Yousef a écrit :
> > > > > We've seen cases while running geekbench where an idle little core
> > > > > never pulls a task from a bigger overloaded cluster for 100s of ms,
> > > > > and sometimes over a second.
> > > > >
> > > > > It turned out that load balance identifies this as a migrate_util
> > > > > type since the local group (little cluster) has spare capacity and
> > > > > will try to pull a task. But the little cluster capacity is very small
> > > > > nowadays (around 200 or less), and if two busy tasks are stuck on a mid
> > > > > core which has a capacity of over 700, the util of each task will be
> > > > > in the 350+ range, which is always bigger than the spare capacity of
> > > > > the little group with a single idle core.
> > > > >
> > > > > When trying to detach_tasks() we then bail out because of this
> > > > > comparison:
> > > > >
> > > > > if (util > env->imbalance)
> > > > > goto next;
> > > > >
> > > > > In calculate_imbalance() we convert a migrate_util into migrate_task
> > > > > type if the CPU trying to do the pull is idle. But we only do this if
> > > > > env->imbalance is 0, which I can't understand. AFAICT env->imbalance
> > > > > contains the local group's spare capacity. If it is 0, this means it's
> > > > > fully busy.
> > > > >
> > > > > Removing this condition fixes the problem, but since I can't fully
> > > > > understand why it checks for 0, I'm sending this as an RFC. It could be
> > > > > a typo that meant to check for
> > > > >
> > > > > env->imbalance != 0
> > > > >
> > > > > instead?
> > > > >
> > > > > Signed-off-by: Qais Yousef (Google) <qyousef@xxxxxxxxxxx>
> > > > > ---
> > > > > kernel/sched/fair.c | 2 +-
> > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index a80a73909dc2..682d9d6a8691 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -10288,7 +10288,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > > > > * waiting task in this overloaded busiest group. Let's
> > > > > * try to pull it.
> > > > > */
> > > > > - if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
> > > > > + if (env->idle != CPU_NOT_IDLE) {
> > > >
> > > > With this change you completely skip migrate_util for the idle and newly
> > > > idle cases, and this would be too aggressive.
> > >
> > > Yeah I didn't have great confidence in it to be honest.
> > >
> > > Could you help me understand the meaning of env->imbalance == 0 though? At this
> > > stage its value is
> > >
> > > env->imbalance = max(local->group_capacity, local->group_util) - local->group_util;
> > >
> > > which AFAICT is calculating the _spare_ capacity, right? So when we check
> > > env->imbalance == 0 we say: if this_cpu is (idle OR newly idle) AND the local
> > > group is fully utilized? Why must it be fully utilized to do the pull? It's
> > > counter-intuitive to me. I'm probably misinterpreting something but can't see
> >
> > This is a special case. We have some situations where group_util is
> > higher than the capacity because of tasks newly migrated to this group,
> > for example, so the spare capacity is null but one CPU is idle or newly
> > idle. In this case we try to pull a task with the risk that this group
> > becomes overloaded. That's why we do not try to pull a task every
> > time.
> > But that might be a good choice all the time.
>
> So on misfit, I do see that a bigger cpu will pull the task quickly as soon as
> one gets idle.
>
> This scenario is the opposite. Maybe the exception in my case is that the
> little cpu has spare capacity as it's mostly idle all the time. It's just that
> unlucky circumstances at wake-up ended up putting two tasks on the bigger core.

I was trying to reproduce the behavior but I kept failing until I
realized that this code path is used when the 2 groups are not sharing
their cache. Which topology do you use? I thought that DynamIQ, with a
shared cache between all 8 CPUs, was the norm for arm64 embedded devices
now.

Also, when you say "the little cluster capacity is very small nowadays
(around 200 or less)", is that the capacity of 1 core or of the cluster?
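
In the meantime, to check that I follow the numbers, here is a small
userspace sketch of the arithmetic I understand from your report (the
capacity and utilization values are assumptions taken from your
description, not something I have measured):

#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	/* Assumed from the report: the little group is mostly idle */
	unsigned long local_capacity = 200;	/* little cluster capacity */
	unsigned long local_util = 20;		/* little group utilization */
	/* Two busy tasks are stuck on a mid core with a capacity of ~700 */
	unsigned long task_util = 350;		/* task_util_est() of each task */

	/* calculate_imbalance(), group_has_spare / migrate_util case */
	unsigned long imbalance = MAX(local_capacity, local_util) - local_util;

	printf("imbalance = %lu\n", imbalance);	/* spare capacity, ~180 */

	/* detach_tasks(), migrate_util case */
	if (task_util > imbalance)
		printf("util %lu > imbalance %lu -> goto next, nothing pulled\n",
		       task_util, imbalance);

	return 0;
}

With these numbers both tasks always fail the "util > env->imbalance"
check, so nothing is ever detached, which matches what you describe.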

>
> Specifically, at the start of some of the sub-tests, there's a good chance that
> we have simultaneous wake-ups, and there's a limitation/race in EAS because of
> the gap between select_task_rq_fair() and enqueue_task_fair(). If two tasks wake
> up simultaneously, select_task_rq_fair() could be called twice before
> enqueue_task_fair() and end up selecting the same CPU for both tasks, not
> realizing one of them is just waiting to be enqueued. IOW, EAS will not take
> into account the updated util of one of the CPUs because of the (short) delay
> to enqueue it.
>
> This should be fixed (the wake up race), but this is a different story and
> a bit trickier.
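
Just to make sure I understand the race you describe, here is a toy
userspace model of it (the utilization numbers are made up, and the real
path of course selects on energy rather than a plain min-util, but the
stale-util aspect is the same): both selections run before either task
is enqueued, so both see the same stale utilization and pick the same CPU.

#include <stdio.h>

#define NR_CPUS 2

/* Toy model: per-CPU utilization as seen by the wake-up path */
static unsigned long cpu_util[NR_CPUS] = { 100, 120 };

/* Stand-in for select_task_rq_fair(): pick the CPU with the lowest util */
static int select_cpu(void)
{
	int best = 0;

	for (int cpu = 1; cpu < NR_CPUS; cpu++)
		if (cpu_util[cpu] < cpu_util[best])
			best = cpu;
	return best;
}

/* Stand-in for enqueue_task_fair(): only now does the task's util show up */
static void enqueue(int cpu, unsigned long util)
{
	cpu_util[cpu] += util;
}

int main(void)
{
	/* Two simultaneous wake-ups: both selections happen before either enqueue */
	int cpu_a = select_cpu();
	int cpu_b = select_cpu();

	enqueue(cpu_a, 350);
	enqueue(cpu_b, 350);

	/* Both tasks end up on the same CPU */
	printf("task A -> cpu%d, task B -> cpu%d\n", cpu_a, cpu_b);
	return 0;
}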
>
> The risks of always pulling are:
>
> 1. Risk of force migrating the prev task if it woke up shortly after the
> pull, which is no worse IMHO than a misfit task going almost immediately
> to a bigger core.
>
> 2. Not sure if we're being smart enough about which task to pull. I can
> envisage other scenarios where one of the two tasks is better to
> pull. In geekbench both tasks are equal. But maybe in other use
> cases one of them is less impactful. For example if one of them has
> a low uclamp_max but the other doesn't. But this case is an unsupported
> feature at the moment. My plan (hope) is to treat these uclamp_max cases
> as misfit migration, which I think is the better path in general for
> treating special cases. So for migrate_util this behavior might be
> sensible all the time. We are working too hard, let's use all of our
> resources and make use of all idle cpus. If prev_task wakes up,
> there's no harm; I doubt cache hotness is a problem anyway given the
> two tasks are busy all the time thrashing L1.
>
> 3. Not sure what will happen in cases where we have nr_running > nr_cpus and
> some tasks happen to sleep for brief periods of time. Two tasks
> stuck on a little core is worse than two tasks stuck on a mid or big
> core. But maybe migrate_util will pull them back again given how
> little capacity the littles have?
>
> Cheers
>
> --
> Qais Yousef
>
> >
> > > it.
> > >
> > > >
> > > > We can do something similar to migrate_load in detach_tasks():
> > > >
> > > > ---
> > > > kernel/sched/fair.c | 8 +++++++-
> > > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index d3df5b1642a6..64111ac7e137 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8834,7 +8834,13 @@ static int detach_tasks(struct lb_env *env)
> > > > case migrate_util:
> > > > util = task_util_est(p);
> > > >
> > > > - if (util > env->imbalance)
> > > > + /*
> > > > + * Make sure that we don't migrate too much utilization.
> > > > + * Nevertheless, let's relax the constraint if the
> > > > + * scheduler fails to find a good waiting task to
> > > > + * migrate.
> > > > + */
> > > > + if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
> > > > goto next;
> > >
> > > Thanks! This looks better but I still see a 100 or 200 ms delay sometimes.
> > > Still debugging it, but I _think_ it's a combination of two things:
> > >
> > > 1. nr_balance_failed doesn't increment as fast - I see a lot of 0s with
> > > occasional 1s and less frequent 2s
> > > 2. something might wake up briefly on that cpu in between load balances,
> > > and given how small the littles are, the nr_balance_failed required to
> > > tip the scale is even higher
> > >
> > >
> > > Thanks
> > >
> > > --
> > > Qais Yousef
> > >
> > > >
> > > > env->imbalance -= util;
> > > > --
> > > >
> > > >
> > > >
> > > > > env->migration_type = migrate_task;
> > > > > env->imbalance = 1;
> > > > > }
> > > > > --
> > > > > 2.25.1
> > > > >
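
Coming back to the nr_balance_failed point above: to make the effect of
the relaxed constraint more concrete, here is a small userspace
approximation of what shr_bound() does in the detach_tasks() hunk
(again using the assumed util/imbalance values from your report):

#include <stdio.h>

/*
 * Userspace approximation of the kernel's shr_bound(): a right shift
 * clamped so we never shift by the full width of the type.
 */
#define BITS_PER_TYPE(v)	(sizeof(v) * 8)
#define MIN(a, b)		((a) < (b) ? (a) : (b))
#define shr_bound(val, shift) \
	((val) >> MIN((unsigned long)(shift), BITS_PER_TYPE(val) - 1))

int main(void)
{
	unsigned long util = 350;	/* assumed task_util_est() */
	unsigned long imbalance = 180;	/* assumed little group spare capacity */

	for (unsigned int failed = 0; failed <= 3; failed++) {
		unsigned long bounded = shr_bound(util, failed);

		printf("nr_balance_failed=%u: bounded util=%lu -> %s\n",
		       failed, bounded,
		       bounded > imbalance ? "skip the task" : "pull allowed");
	}

	return 0;
}

So with these numbers a single failed attempt is already enough to let
the pull happen, which seems consistent with the remaining 100-200 ms
delays you see while nr_balance_failed stays at 0.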