Re: [RFC PATCH 3/6] sched/fair: Use CFS util_avg_uclamp for utilization and frequency

From: Dietmar Eggemann
Date: Wed Nov 01 2023 - 18:34:27 EST


On 04/10/2023 11:04, Hongyan Xia wrote:
> From: Hongyan Xia <hongyan.xia2@xxxxxxx>
>
> Switch to the new util_avg_uclamp for task and runqueue utilization.
> Since util_est() calls task_util(), this means util_est is now also a
> clamped value.

s/util_est()/task_util_est()/

but task_util_est() is max(task_util(p), _task_util_est(p))

So I don't immediately see why util_est is now a clamped value as well.
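For reference (kernel/sched/fair.c):

  static inline unsigned long task_util_est(struct task_struct *p)
  {
          return max(task_util(p), _task_util_est(p));
  }

The clamping via task_util(p) only covers the first term directly;
_task_util_est(p) reads p->se.avg.util_est.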

We have a naming mismatch between the CPU- and task-related functions on
this level: cpu_util() vs. task_util_est().

> Now that we have the sum aggregated CFS util value, we do not need to
> consult uclamp buckets to know how the frequency should be clamped. We
> simply look at the aggregated top level root_cfs_util_uclamp to know
> what frequency to choose. Because we simulate PELT decay in
> root_cfs_util_uclamp anyway, there's no need in cpufreq_schedutil.c to
> avoid premature frequency drops.
>
> Consequently, there is no need for uclamp_rq_util_with(). This function
> takes the un-clamped util value and sends it through various clamping
> filters to get the final value. However, util_avg_uclamp is propagated
> with clamping in mind already, so it does not need to be clamped again.
>
> TODO: There are two major caveats in this patch.
> 1. At the moment sum aggregation does not consider RT tasks. The avg_rt
> signal considers all RT tasks on this rq as a single entity, which
> means the utilization of individual RT tasks is not tracked
> separately. If we want to use sum aggregation, we might have to track
> utilization of RT tasks individually.

Not sure if the RT class will accept per-task PELT tracking (plus there
is CONFIG_RT_GROUP_SCHED too).

> 2. Busy time accounting in compute_energy() now takes the uclamp'ed
> value. Ideally, it should reflect reality and use the un-clamp'ed
> values. However, that would require maintaining both the normal and
> uclamp'ed values for util_est. This needs to be revisited if it
> causes real problems in practice.

You could use your new rq->root_cfs_util_uclamp for eenv_pd_max_util
(FREQUENCY_UTIL) and use rq->cfs.avg.util_avg in eenv_pd_busy_time()
(ENERGY_UTIL).
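
Minimal sketch on top of this patch (assuming the new 3-parameter
effective_cpu_util() and that root_cfs_util(rq) returns
rq->root_cfs_util_uclamp; the *p discount that cpu_util() does is left
out for brevity):

  static inline void eenv_pd_busy_time(struct energy_env *eenv,
                                       struct cpumask *pd_cpus,
                                       struct task_struct *p)
  {
          unsigned long busy_time = 0;
          int cpu;

          for_each_cpu(cpu, pd_cpus) {
                  /* ENERGY_UTIL: bill busy time on the unclamped signal */
                  unsigned long util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_avg);

                  busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL);
          }

          eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
  }

and then, per CPU in eenv_pd_max_util():

          /* FREQUENCY_UTIL: pick the frequency from the clamped signal */
          util = root_cfs_util(cpu_rq(cpu));
          max_util = max(max_util, effective_cpu_util(cpu, util, FREQUENCY_UTIL));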

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index efe3848978a0..32511ee63f01 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7402,10 +7402,12 @@ int sched_core_idle_cpu(int cpu)
> * The DL bandwidth number otoh is not a measured metric but a value computed
> * based on the task model parameters and gives the minimal utilization
> * required to meet deadlines.
> + *
> + * The util_cfs parameter has already taken uclamp into account (unless uclamp
> + * support is not compiled in).
> */
> unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> - enum cpu_util_type type,
> - struct task_struct *p)
> + enum cpu_util_type type)

There are changes proposed in the area of uclamping right now in:

https://lkml.kernel.org/r/20231026170913.32605-2-vincent.guittot@xxxxxxxxxx

[...]

> /**
> @@ -282,7 +281,11 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> * into the same scale so we can compare.
> */
> boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
> - boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
> + /*
> + * TODO: Investigate what should be done here. In sum aggregation there
> + * is no such thing as uclamp_max on a rq, so how do we cap the boost
> + * value, or do we want to cap the boost frequency here at all?
> + */

https://lkml.kernel.org/r/20231026170913.32605-3-vincent.guittot@xxxxxxxxxx

is proposing to cap iowait boost with max (set in effective_cpu_util()),
and max there depends on uclamp_rq_get(rq, UCLAMP_MAX) too.

So you could cap iowait boost in case uclamp_rq_is_capped(), i.e. when:

if (rq->cfs.avg.util_avg > rq->root_cfs_util_uclamp + margin)
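
In sugov_iowait_apply() that could look like this (sketch only; 'margin'
is a made-up headroom value and capping to root_cfs_util(rq) is just one
possible policy):

          struct rq *rq = cpu_rq(sg_cpu->cpu);

          boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
          /* rq is capped: the unclamped sum runs above the clamped one */
          if (rq->cfs.avg.util_avg > rq->root_cfs_util_uclamp + margin)
                  boost = min(boost, root_cfs_util(rq));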

[...]

> @@ -7468,11 +7459,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> static unsigned long
> cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> {
> - struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> - unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> + struct rq *rq = cpu_rq(cpu);
> + struct cfs_rq *cfs_rq = &rq->cfs;
> + unsigned long util = root_cfs_util(rq);
> + bool capped = uclamp_rq_is_capped(rq);
> unsigned long runnable;
>
> - if (boost) {
> + if (boost && !capped) {
> runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
> util = max(util, runnable);
> }

IMHO, this makes sense. Only allow runnable boosting when the rq is not
capped by uclamp_max.

[...]