Re: [PATCH 1/2] sched/schedutil: rework performance estimation

From: Dietmar Eggemann
Date: Fri Oct 20 2023 - 05:48:35 EST


On 13/10/2023 17:14, Vincent Guittot wrote:
> The current method to take into account uclamp hints when estimating the
> target frequency can end up in a situation where the selected target
> frequency is finally higher than uclamp hints whereas there are no real
> needs. Such cases mainly happen because we are currently mixing the
> traditional scheduler utilization signal with the uclamp performance
> hints. By adding these 2 metrics, we lose important information when
> it comes to selecting the target frequency and we have to make some
> assumptions which can't fit all cases.
>
> Rework the interface between the scheduler and schedutil governor in order
> to propagate all information down to the cpufreq governor.

So we change from:

max(util -> uclamp, iowait_boost -> uclamp) -> head_room()

to:

util = max(util, iowait_boost)          -> util = head_room(util)

_min = max(irq + cpu_bw_dl, uclamp_min) ->                         -> max(_min, _max)

_max = min(scale, uclamp_max)           -> _max = min(util, _max)
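
To put numbers on the two flows, here is a rough user-space model (not
kernel code). The names old_flow(), new_flow(), head_room() and SCALE are
made up for illustration, head_room() assumes the 25% margin that
map_util_perf() applies, and the RT-runnable shortcut is left out.

```c
#include <assert.h>

#define SCALE 1024UL	/* assumed CPU capacity (SCHED_CAPACITY_SCALE) */

static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }
static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }

static unsigned long clamp_ul(unsigned long v, unsigned long lo, unsigned long hi)
{
	assert(lo <= hi);
	return min_ul(max_ul(v, lo), hi);
}

/* Assumption: 25% headroom, util + util/4, like map_util_perf(). */
static unsigned long head_room(unsigned long util)
{
	return util + (util >> 2);
}

/* Old flow: clamp both signals first, add headroom on top of the result. */
static unsigned long old_flow(unsigned long util, unsigned long iowait_boost,
			      unsigned long uclamp_min, unsigned long uclamp_max)
{
	unsigned long u = max_ul(clamp_ul(util, uclamp_min, uclamp_max),
				 clamp_ul(iowait_boost, uclamp_min, uclamp_max));

	return head_room(u);
}

/* New flow: headroom on the raw signal, then cap by _max, then floor by _min. */
static unsigned long new_flow(unsigned long util, unsigned long iowait_boost,
			      unsigned long irq, unsigned long bw_dl,
			      unsigned long uclamp_min, unsigned long uclamp_max)
{
	unsigned long perf_min, perf_max;

	util = head_room(max_ul(util, iowait_boost));
	perf_min = max_ul(irq + bw_dl, uclamp_min);
	perf_max = min_ul(SCALE, uclamp_max);
	perf_max = min_ul(util, perf_max);

	return max_ul(perf_min, perf_max);
}
```

With util = 400 and uclamp_max = 300, old_flow() gives 375, i.e. above the
uclamp_max hint, whereas new_flow() stays at 300.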

> effective_cpu_util() interface changes and now returns the actual
> utilization of the CPU with 2 optional inputs:
> - The minimum performance for this CPU; typically the capacity to handle
> the deadline task and the interrupt pressure. But also uclamp_min
> request when available.
> - The maximum targeting performance for this CPU which reflects the
> maximum level that we would like to not exceed. By default it will be
> the CPU capacity but can be reduced because of some performance hints
> set with uclamp. The value can be lower than actual utilization and/or
> min performance level.
>
> A new sugov_effective_cpu_perf() interface is also available to compute
> the final performance level that is targeted for the CPU after applying
> some cpufreq headroom and taking into account all inputs.
>
> With these 2 functions, schedutil is now able to decide when it must go
> above uclamp hints. It now also has a generic way to get the min
> performance level.
>
> The dependency between energy model and cpufreq governor and its headroom
> policy doesn't exist anymore.

But the dependency that both are doing the same thing still exists, right?

sugov_get_util() and eenv_pd_max_util() are calling the same functions:

util = effective_cpu_util(cpu, util, &min, &max)

/* ioboost, bw_min = head_room(min) resp. uclamp tsk handling */

util = sugov_effective_cpu_perf(cpu, util, min, max)

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a3f9cd52eec5..78228abd1219 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7381,18 +7381,13 @@ int sched_core_idle_cpu(int cpu)
> * required to meet deadlines.
> */
> unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> - enum cpu_util_type type,
> - struct task_struct *p)
> + unsigned long *min,
> + unsigned long *max)

FREQUENCY_UTIL corresponds to min != NULL and max != NULL.

ENERGY_UTIL corresponds to min == NULL and max == NULL.

So both must be either NULL or !NULL.

Calling it with one equal to NULL and the other !NULL should be
undefined, right?
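
If the both-or-neither rule were to be enforced rather than just implied,
a check along these lines would do. This is only a sketch: both helpers
are hypothetical and not part of the patch, and in the kernel the
assert() would more likely be a WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true when both pointers are set or both are NULL. */
static bool minmax_args_consistent(const unsigned long *min,
				   const unsigned long *max)
{
	return (min == NULL) == (max == NULL);
}

/* Hypothetical guard a caller-facing wrapper could run first. */
static void check_minmax_args(const unsigned long *min,
			      const unsigned long *max)
{
	assert(minmax_args_consistent(min, max));
}
```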

[...]

> @@ -7400,45 +7395,36 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> * update_irq_load_avg().
> */
> irq = cpu_util_irq(rq);
> - if (unlikely(irq >= max))
> - return max;
> + if (unlikely(irq >= scale)) {
> + if (min)
> + *min = scale;
> + if (max)
> + *max = scale;
> + return scale;
> + }
> +
> + /* The minimum utilization returns the highest level between:
> + * - the computed DL bandwidth needed with the irq pressure which
> + * steals time to the deadline task.
> + * - The minimum bandwidth requirement for CFS.

rq UCLAMP_MIN can also be driven by RT, not only CFS.

> + */
> + if (min)
> + *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
>
> /*
> * Because the time spend on RT/DL tasks is visible as 'lost' time to
> * CFS tasks and we use the same metric to track the effective
> * utilization (PELT windows are synchronized) we can directly add them
> * to obtain the CPU's actual utilization.
> - *
> - * CFS and RT utilization can be boosted or capped, depending on
> - * utilization clamp constraints requested by currently RUNNABLE
> - * tasks.
> - * When there are no CFS RUNNABLE tasks, clamps are released and
> - * frequency will be gracefully reduced with the utilization decay.
> */
> util = util_cfs + cpu_util_rt(rq);
> - if (type == FREQUENCY_UTIL)
> - util = uclamp_rq_util_with(rq, util, p);
> -
> - dl_util = cpu_util_dl(rq);
> -
> - /*
> - * For frequency selection we do not make cpu_util_dl() a permanent part
> - * of this sum because we want to use cpu_bw_dl() later on, but we need
> - * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
> - * that we select f_max when there is no idle time.
> - *
> - * NOTE: numerical errors or stop class might cause us to not quite hit
> - * saturation when we should -- something for later.
> - */
> - if (util + dl_util >= max)
> - return max;
> + util += cpu_util_dl(rq);
>
> - /*
> - * OTOH, for energy computation we need the estimated running time, so
> - * include util_dl and ignore dl_bw.
> - */
> - if (type == ENERGY_UTIL)
> - util += dl_util;
> + if (util >= scale) {
> + if (max)
> + *max = scale;

But that means that uclamp_max cannot constrain a system in which
'util > uclamp_max'. I guess that's related to you saying uclamp_min is a
hard requirement and uclamp_max is a soft one. I don't think that's in
sync with the rest of the uclamp_max implementation.
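
A small model of what ends up in *max shows the jump at saturation. This
is a simplification, not the kernel function: max_hint() and SCALE are
illustrative names, and util stands for the CFS+RT+DL sum before irq
scaling.

```c
#define SCALE 1024UL	/* assumed CPU capacity */

static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }

/* Models only the value written to *max in the patched function. */
static unsigned long max_hint(unsigned long util, unsigned long uclamp_max)
{
	if (util >= SCALE)
		return SCALE;	/* saturated: uclamp_max is never consulted */

	return min_ul(SCALE, uclamp_max);
}
```

With uclamp_max = 512, max_hint(1023, 512) is 512 but max_hint(1024, 512)
is 1024, so the cap disappears exactly when the CPU becomes fully busy.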

> + return scale;
> + }
>
> /*
> * There is still idle time; further improve the number by using the
> @@ -7449,28 +7435,21 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> * U' = irq + --------- * U
> * max
> */
> - util = scale_irq_capacity(util, irq, max);
> + util = scale_irq_capacity(util, irq, scale);
> util += irq;
>
> - /*
> - * Bandwidth required by DEADLINE must always be granted while, for
> - * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
> - * to gracefully reduce the frequency when no tasks show up for longer
> - * periods of time.
> - *
> - * Ideally we would like to set bw_dl as min/guaranteed freq and util +
> - * bw_dl as requested freq. However, cpufreq is not yet ready for such
> - * an interface. So, we only do the latter for now.
> + /* The maximum hint is a soft bandwidth requirement which can be lower
> + * than the actual utilization because of max uclamp requirments
> */
> - if (type == FREQUENCY_UTIL)
> - util += cpu_bw_dl(rq);
> + if (max)
> + *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
>
> - return min(max, util);
> + return min(scale, util);
> }

effective_cpu_util() for FREQUENCY_UTIL (i.e. min != NULL && max != NULL)
is slightly different.

missing:

  if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
          return max;

probably moved into sugov_effective_cpu_perf() (which is only called
for FREQUENCY_UTIL)?


old:

irq_cap_scaling(util_cfs, util_rt) + irq + cpu_bw_dl()
                                           ^^^^^^^^^^^

new:

irq_cap_scaling(util_cfs + util_rt + util_dl) + irq
                                     ^^^^^^^
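
For a feel of how far the two sums can drift apart, here is a user-space
sketch. scale_irq_capacity() mirrors the kernel helper of the same name
(util * (max - irq) / max); old_freq_util(), new_util() and SCALE are
illustrative names, and the early saturation returns and uclamp handling
are left out.

```c
#define SCALE 1024UL	/* assumed CPU capacity */

static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }

/* Scale util by the share of time not consumed by irq. */
static unsigned long scale_irq_capacity(unsigned long util, unsigned long irq,
					unsigned long max)
{
	return util * (max - irq) / max;
}

/* Old FREQUENCY_UTIL sum: DL bandwidth added after irq scaling. */
static unsigned long old_freq_util(unsigned long cfs, unsigned long rt,
				   unsigned long bw_dl, unsigned long irq)
{
	unsigned long util = scale_irq_capacity(cfs + rt, irq, SCALE);

	return min_ul(SCALE, util + irq + bw_dl);
}

/* New sum: DL utilization is irq-scaled together with CFS and RT. */
static unsigned long new_util(unsigned long cfs, unsigned long rt,
			      unsigned long util_dl, unsigned long irq)
{
	unsigned long util = scale_irq_capacity(cfs + rt + util_dl, irq, SCALE);

	return min_ul(SCALE, util + irq);
}
```

With cfs = 300, rt = 100, irq = 128 and both bw_dl and util_dl at 200, the
old sum is 678 and the new one 653. With irq = 0 they are both 600.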

[...]

> +unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
> + unsigned long min,
> + unsigned long max)
> +{
> + unsigned long target;
> + struct rq *rq = cpu_rq(cpu);
> +
> + if (rt_rq_is_runnable(&rq->rt))
> + return max;
> +
> + /* Provide at least enough capacity for DL + irq */
> + target = min;
> +
> + actual = map_util_perf(actual);
> + /* Actually we don't need to target the max performance */
> + if (actual < max)
> + max = actual;
> +
> + /*
> + * Ensure at least minimum performance while providing more compute
> + * capacity when possible.
> + */
> + return max(target, max);

Can you not just use:

return max(min, max)

and skip target?
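
A user-space sketch of both variants, to show they return the same value.
Function names are illustrative, head_room() assumes the 25% margin of
map_util_perf(), and the rt_rq_is_runnable() shortcut is omitted.

```c
static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }

/* Assumption: 25% headroom, like map_util_perf(). */
static unsigned long head_room(unsigned long util)
{
	return util + (util >> 2);
}

/* As posted: min is copied into target before max is lowered. */
static unsigned long perf_with_target(unsigned long actual, unsigned long min,
				      unsigned long max)
{
	unsigned long target = min;

	actual = head_room(actual);
	if (actual < max)
		max = actual;

	return max_ul(target, max);
}

/* Suggested: min is never modified, so it can be used directly. */
static unsigned long perf_without_target(unsigned long actual, unsigned long min,
					 unsigned long max)
{
	actual = head_room(actual);
	if (actual < max)
		max = actual;

	return max_ul(min, max);
}
```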

> +}
> +
> static void sugov_get_util(struct sugov_cpu *sg_cpu)
> {
> - unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
> - struct rq *rq = cpu_rq(sg_cpu->cpu);
> + unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
>
> - sg_cpu->bw_dl = cpu_bw_dl(rq);
> - sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
> - FREQUENCY_UTIL, NULL);
> + util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> + sg_cpu->bw_min = map_util_perf(min);
> + sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> }
>
> /**
> @@ -306,7 +329,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> */
> static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
> {
> - if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
> + if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)

bw_min is more than just DL, right?

bw_min = head_room(max(irq + cpu_bw_dl, rq's UCLAMP_MIN))
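
A quick model of the comparison in ignore_dl_rate_limit() shows what this
could mean in practice. bw_min_of() and dl_bw_increase_detected() are
illustrative names, and head_room() again assumes the 25% margin of
map_util_perf().

```c
#include <stdbool.h>

static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }

/* Assumption: 25% headroom, like map_util_perf(). */
static unsigned long head_room(unsigned long util)
{
	return util + (util >> 2);
}

/* Value stored in sg_cpu->bw_min according to the patch. */
static unsigned long bw_min_of(unsigned long irq, unsigned long bw_dl,
			       unsigned long uclamp_min)
{
	return head_room(max_ul(irq + bw_dl, uclamp_min));
}

/* The check in ignore_dl_rate_limit(): new DL bandwidth vs. stored value. */
static bool dl_bw_increase_detected(unsigned long new_bw_dl,
				    unsigned long stored)
{
	return new_bw_dl > stored;
}
```

With irq = 0, bw_dl = 100 and a rq UCLAMP_MIN of 512, bw_min is 640. A DL
bandwidth increase from 100 to 300 was detected against the old bw_dl
(300 > 100) but is not detected against bw_min (300 > 640 is false).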

[...]