Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect capacity inversion

From: Vincent Guittot
Date: Thu Dec 15 2022 - 12:47:12 EST


On Tue, 13 Dec 2022 at 18:38, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 12/12/2022 19:43, Qais Yousef wrote:
> > On 12/09/22 17:47, Vincent Guittot wrote:
>
> [...]
>
> > HMP systems for 1k servers just don't make any sense. A desktop with 128 or
> > even 256 HMP cores is a big stretch; and if that exist I don't think there's an
> > overhead to worry about here; and I *did* consider this. I measured the impact
> > if we have 128 and it was mere 1 or 2 us extra. And that's on under powered
> > pine book pro. If such a system exist it'd probably be more performant.
> >
> >> uclamp_min must not set a CPU overutilized because the CPU is not overutilized
> >> in this case. It's only the task that is misfit. You mostly try to bias some
> >> behavior to fit your use case.
> >
> > Maybe we are talking about different things over here. As long as we agree it's
> > a misfit task then we are aligned.
>
> IMHO, utilization is about the running task and uclamp is maintained
> taking the runnable tasks into consideration as well. Maybe that's the
> source of the different views here?
>
> > As far as I know misfit required overutilized to re-enable load balance. But
> > maybe there's a detail that's creating this confusion.
>
> I think that Vincent is suggesting to let MF load balance happening even
> in !OverUtilized (OU). We gather the necessary load-balance statistics
> already today in !OU so it is easily to do.
>
> >>>> doesn means that the cpu is overutilized and uclamp_max should be used
> >>>
> >>> It is a misfit task; which requires overutilized to be set to re-enable load
> >>> balance for uclamp_min to upgrate it. For uclamp max we should not set
> >>> overutilized, agreed and that's what we should be doing.
> >>
> >> That's probably the root of your problem here. The load balance is still
> >> periodically called even when EAS is enabled but the latter prevents task
> >> migration unless overutilized in order to not break the energy aware task
> >
> > Okay. For me this means load_balance is disabled since it's effectively doing
> > nothing. So maybe it's a terminology problem of what I meant with load balance
> > is disabled.
> >
> >> placement. But if a task is misplaced and a cpu finds it can help,
> >> we should let it pull the task without disabling EAS. This will not enable the
> >> performance spread behavior and we can expect the other small tasks to be
> >> packed by EAS on the best cpu at next wakeup.
> >>
> >> So intead of trying to detect a very specific capacity inversion use case
> >> during this periodic this load balance, it's better to let a CPU that can
> >> fix the misfit situation to pull the task.
> >
> > I can't see the relation here.
>
> To me it looks like that there are 2 stories here:
>
> (A) Why do we need `CPU capacity inversion` (CapInv)? What do we gain
> by changing `cap = capacity_orig - arch_scale_thermal_pressure()`
> (i) to `cap = capacity_orig - thermal_load_avg()` (ii)? Do we want
> to avoid a CPU as a possible target in cases where pressure is
> already back to 0 and we still have a thermal PELT signal > 0
> because of it's historic information? But this is then only
> considering the decaying side of the thermal PELT signal.
>
> Vincent replaces (i) by (ii) so no need to switch to (ii) for
> CapInv.
>
> (B) Don't use util_fits_cpu() in cpu_overutilized() and replace this by
> MF load balance in !OU.
>
> [...]
>
> >> I have reverted patches:
> >> Revert "sched/fair: Detect capacity inversion"
> >> Revert "sched/fair: Consider capacity inversion in util_fits_cpu()"
> >> Revert "sched/uclamp: Make cpu_overutilized() use util_fits_cpu()"
> >>
> >> ---
> >> kernel/sched/fair.c | 32 +++++++++++++++++---------------
> >> 1 file changed, 17 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 4423681baf15..6e54afc0a6ec 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4539,7 +4539,7 @@ static inline int util_fits_cpu(unsigned long util,
> >> * the time.
> >> */
> >> capacity_orig = capacity_orig_of(cpu);
> >> - capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
> >> + capacity_orig_thermal = capacity_orig - thermal_load_avg(cpu_rq(cpu));
> >>
> >> /*
> >> * We want to force a task to fit a cpu as implied by uclamp_max.
> >> @@ -4578,8 +4578,7 @@ static inline int util_fits_cpu(unsigned long util,
> >> * 2. The system is being saturated when we're operating near
> >> * max capacity, it doesn't make sense to block overutilized.
> >> */
> >> - uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> >> - uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> >> + uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
> >> fits = fits || uclamp_max_fits;
> >>
> >> /*
> >> @@ -4614,7 +4613,7 @@ static inline int util_fits_cpu(unsigned long util,
> >> * handle the case uclamp_min > uclamp_max.
> >> */
> >> uclamp_min = min(uclamp_min, uclamp_max);
> >> - if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> >> + if (util < uclamp_min)
> >> fits = fits && (uclamp_min <= capacity_orig_thermal);
> >>
> >> return fits;
> >> @@ -6064,7 +6063,10 @@ static inline void hrtick_update(struct rq *rq)
> >> #ifdef CONFIG_SMP
> >> static inline bool cpu_overutilized(int cpu)
> >> {
> >> - return !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
> >> + unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> >> +
> >> + return !(fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu)) ||
> >> + rq_util_max <= capacity_orig_of(cpu));
> >> }
> >>
> >> static inline void update_overutilized_status(struct rq *rq)
> >> @@ -10164,24 +10166,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >> */
> >> update_sd_lb_stats(env, &sds);
> >>
> >> - if (sched_energy_enabled()) {
> >> - struct root_domain *rd = env->dst_rq->rd;
> >> -
> >> - if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> >> - goto out_balanced;
> >> - }
> >> -
> >> - local = &sds.local_stat;
> >> - busiest = &sds.busiest_stat;
> >> -
> >> /* There is no busy sibling group to pull tasks from */
> >> if (!sds.busiest)
> >> goto out_balanced;
> >>
> >> + busiest = &sds.busiest_stat;
> >> +
> >> /* Misfit tasks should be dealt with regardless of the avg load */
> >> if (busiest->group_type == group_misfit_task)
> >> goto force_balance;
> >>
> >> + if (sched_energy_enabled()) {
> >> + struct root_domain *rd = env->dst_rq->rd;
> >> +
> >> + if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> >> + goto out_balanced;
> >> + }
> >> +
> >> /* ASYM feature bypasses nice load balance check */
> >> if (busiest->group_type == group_asym_packing)
> >> goto force_balance;
> >> @@ -10194,6 +10195,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >> if (busiest->group_type == group_imbalanced)
> >> goto force_balance;
> >>
> >> + local = &sds.local_stat;
> >> /*
> >> * If the local group is busier than the selected busiest group
> >> * don't try and pull any tasks.
> >> --
> >> 2.17.1
>
> Specific questions regarding this patch proposal so I can fully
> understand the intention:
>
> I see 4 major changes here:
>
> (1) Remove `CPU capacity inversion` (CapInv)
>
> (2) Don't use util_fits_cpu() in cpu_overutilized()
>
> (3) Use `capacity = capacity_orig - thermal_load_avg() (ii)` (and not `-
> arch_scale_thermal_pressure() (i)` in util_fits_cpu(), i.e. in
> sis(), feec(), MF handling (update + load balance (lb))
>
> (4) Allow MF lb in !OverUtilized (OU)
>
> (1) CapInv wouldn't be needed in case only (ii) is used (so we don't
> have to switch from (i) to (ii) (because of 3)?
>
> (2) is there since we don't have to raise OU anymore when `rq_util_min >
> capacity_orig_thermal`. This is replaced by MF handling in !OU lb (4)?

yes and also because a task can misfit on a cpu because of uclamp_min
but the cpu is not overloaded

>
> [...]