Re: [PATCH] sched/fair: remove util_est boosting

From: Vincent Guittot
Date: Mon Jul 24 2023 - 09:06:49 EST


On Fri, 21 Jul 2023 at 18:09, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 12/07/2023 17:30, Vincent Guittot wrote:
> > On Tue, 11 Jul 2023 at 17:47, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >>
> >> On 07/06/23 15:51, Vincent Guittot wrote:
> >>> There is no need to use runnable_avg when estimating util_est and that
> >>> even generates wrong behavior because one includes blocked tasks whereas
> >>> the other one doesn't. This can lead to accounting twice the waking task p,
> >>> once with the blocked runnable_avg and another one when adding its
> >>> util_est.
>
> ... and we don't have this issue for the util_avg case since we have:
>
> 7317 } else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
>                      ^^^^^^^^^^^^^^^^^^
> 7318         util += task_util(p);
>
> >>> cpu's runnable_avg is already used when computing util_avg which is then
> >>> compared with util_est.
>
> We discussed why I have to use max(X, runnable) for X=util and
> X=util_est in v2:
>
> https://lkml.kernel.org/r/251b524a-2c44-3892-1bae-03f879d6a64b@xxxxxxx
>
> -->
>
> I need the util_est = max(util_est, runnable) further down as well. Just
> want to fetch runnable only once.
>
> util = 50, task_util = 5, util_est = 60, task_util_est = 10, runnable = 70
>
> max(70 + 5, 60 + 10) != max (70 + 5, 70 + 10) when dst_cpu == cpu
>

Hmm, I don't get your point here. Why should they be equal?

Below is an example that describes my problem:

task A with util_avg=200 util_est=300 runnable=200
task A is attached to CPU0 so it contributes to CPU0's util_avg and
runnable_avg.

In eenv_pd_max_util() we call cpu_util(cpu, p, dst_cpu, 1) to get the
max utilization and the OPP to use to compute energy.

Let's say that nothing else is running on CPU0 and CPU1, and that they
both belong to the same performance domain, so:
CPU0 util_avg=200 util_est=0 runnable_avg=200
CPU1 util_avg=0 util_est=0 runnable_avg=0

For CPU0, cpu_util(cpu, p, dst_cpu, 1) will return (200 + 300) = 500
For CPU1, cpu_util(cpu, p, dst_cpu, 1) will return (0 + 300) = 300

If there is an OPP with a capacity between these 2 values, CPU1 will
use a lower OPP than CPU0 and its computed energy will be lower.
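
To make these numbers concrete, here is a minimal userspace sketch of the
wake-up path of cpu_util(). It is not the kernel code itself: the struct
names, model_cpu_util() and the boost_util_est flag are hypothetical and
only exist for illustration, and the capacity of 1024 used with it is an
assumption. With boost_util_est set, util_est is first raised to
runnable_avg as tip/sched/core does today; with it cleared, the model
behaves as with the patch.

```c
#include <stdbool.h>

/* Hypothetical userspace model of per-CPU CFS signals; not kernel types. */
struct cpu_sig {
	unsigned long util_avg;
	unsigned long util_est;     /* util_est of the tasks currently enqueued */
	unsigned long runnable_avg; /* still includes blocked tasks */
	unsigned long capacity;     /* assumed CPU capacity */
};

/* Hypothetical model of the waking task p (not enqueued yet). */
struct task_sig {
	int cpu;                    /* CPU the task is attached to */
	unsigned long util_avg;
	unsigned long util_est;
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Subtract without underflow, in the spirit of lsub_positive(). */
static unsigned long sub_positive(unsigned long a, unsigned long b)
{
	return a > b ? a - b : 0;
}

/*
 * Simplified cpu_util() for the wake-up case only.
 * boost_util_est == true models the behavior before the patch.
 */
static unsigned long model_cpu_util(const struct cpu_sig *cpus, int cpu,
				    const struct task_sig *p, int dst_cpu,
				    bool boost, bool boost_util_est)
{
	const struct cpu_sig *c = &cpus[cpu];
	unsigned long util = c->util_avg;
	unsigned long util_est = c->util_est;

	if (boost)
		util = max_ul(util, c->runnable_avg);

	/* util_avg side: p is only added when it comes from another CPU. */
	if (p->cpu == cpu && dst_cpu != cpu)
		util = sub_positive(util, p->util_avg);
	else if (p->cpu != cpu && dst_cpu == cpu)
		util += p->util_avg;

	/* The boost removed by the patch: runnable_avg may already contain p. */
	if (boost && boost_util_est)
		util_est = max_ul(util_est, c->runnable_avg);

	/* util_est side: p is always added on its destination CPU. */
	if (dst_cpu == cpu)
		util_est += p->util_est;

	util = max_ul(util, util_est);
	return min_ul(util, c->capacity);
}
```

With the values of the example above and task A attached to CPU0, the
model gives 500 for cpu == dst_cpu == 0 with the util_est boost and 300
without it, whereas cpu == dst_cpu == 1 gives 300 in both cases.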

The condition if (max_spare_cap_cpu >= 0 && max_spare_cap >
prev_spare_cap) filters out the cases where CPU0 and CPU1 have the exact
same spare capacity. But we often see a smaller spare capacity for
CPU0 because of small side activities like cpufreq, timer, irq, rcu
... The difference is often only 1, but that is enough to bypass the
condition above. Task A will then migrate to CPU1 although there is no
need to. It will move back to CPU0 once CPU1 has the smaller spare
capacity.
I ran a test on a Snapdragon RB5 with the latest tip/sched/core. I start
3 tasks that run for 30 seconds: 1 large enough to be on medium CPUs and
2 small enough to stay on little CPUs.
With tip/sched/core, the 3 tasks migrate around 3665 times.
With the patch, there are only 8 migrations at the beginning of the test.

> <--
>
> But I assume your point is that:
>
> 7327 if (boost)
> 7328 util_est = max(util_est, runnable);
>
> 7356 if (dst_cpu == cpu) <-- (1)
> 7357 util_est += _task_util_est(p);
> 7358 else if (p && unlikely(task_on_rq_queued(p) || current == p))
> 7359 lsub_positive(&util_est, _task_util_est(p));
> 7360
> 7361 util = max(util, util_est);
>
> --> (1) doesn't work anymore in case `util_est == runnable`.
>
> It will break the assumption for the if condition depicted in
> cpu_util()'s comment:

exactly
>
> 7331 * During wake-up (2) @p isn't enqueued yet and doesn't contribute
> 7332 * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
> 7333 * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
> 7334 * has been enqueued.
>
> (2) eenv_pd_max_util() and find_energy_efficient_cpu() call-site.
>
> <---
>
> Rerunning Jankbench tests on Pix6 will tell if boosting util_avg instead
> of both will still show the anticipated results. Likelihood is high that
> it will since we do `util = max(util, util_est)` at the end of cpu_util().

I think the same

>
> >>> In some situation, feec will not select prev_cpu but another one on the
> >>> same performance domain because of higher max_util
> >>>
> >>> Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
> >>> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> >>> ---
> >>
> >> Can we verify the numbers that introduced this magic boost are still valid
> >> please?
> >
> > TBH I don't expect it but I agree it's worth checking. Dietmar could
> > you rerun your tests with this change ?
>
> Could do. But first lets understand the issue properly.
>
> >> Otherwise LGTM.
> >>
> >>
> >> Thanks!
> >>
> >> --
> >>
> >> Qais Yousef
> >>
> >>> kernel/sched/fair.c | 3 ---
> >>> 1 file changed, 3 deletions(-)
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index a80a73909dc2..77c9f5816c31 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -7289,9 +7289,6 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> >>>
> >>> util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> >>>
> >>> - if (boost)
> >>> - util_est = max(util_est, runnable);
> >>> -
> >>> /*
> >>> * During wake-up @p isn't enqueued yet and doesn't contribute
> >>> * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
> >>> --
> >>> 2.34.1
> >>>
>