Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when updating misfit

From: Qais Yousef
Date: Sun Jan 28 2024 - 18:50:23 EST


On 01/26/24 15:08, Vincent Guittot wrote:

> > TBH I had a bit of confirmation bias that this is a problem based on the fix
> > (0ae78eec8aa6) that we had in the past. So on verification I looked at
> > balance_interval and this reproducer which is a not the same as the original
> > one and it might be exposing another problem and I didn't think twice about it.
>
> I checked the behavior more deeply and I confirm that I don't see
> improvement for the use case described above. I would say that it's
> even worse as I can see some runs where the task stays on little
> whereas a big core has been added in the affinity. Having in mind that
> my system is pretty idle which means that there is almost no other
> reason to trigger an ilb than the misfit task, the change in
> check_misfit_status() is probably the reason for never kicking an ilb
> for such case

It seems I reproduced another problem while trying to reproduce the original
issue, eh.

I did dig more and from what I see the issue is that the rd->overload is not
being set correctly. Which I believe what causes the delays (see attached
picture how rd.overloaded is 0 with some spikes). Only when CPU7
newidle_balance() coincided with rd->overload being 1 that the migration
happens. With the below hack I can see that rd->overload is 1 all the time
(even after the move as we still trigger a misfit on the big CPU). With my
patch only rd->overload is set to 1 (because of this task) only for a short
period after we change affinity.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df348aa55d3c..86069fe527f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9707,8 +9707,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
}

- if (local_group)
- continue;
+ /* if (local_group) */
+ /* continue; */

if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
/* Check for a misfit task on the cpu */

I am not sure what the right fix is, but it seems this condition is required
for the 2nd leg of this if condition when we compare with load? I don't think
we should skip the misfit check.


Thanks

--
Qais Yousef

Attachment: misfit_overloaded.png
Description: PNG image