Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()

From: Vincent Guittot
Date: Wed Jan 03 2024 - 08:43:17 EST


On Fri, 29 Dec 2023 at 01:25, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> On 12/12/23 12:40, Qais Yousef wrote:
> > On 12/12/23 12:06, Vincent Guittot wrote:
> >
> > > > @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > enqueue_throttle:
> > > > assert_list_leaf_cfs_rq(rq);
> > > >
> > >
> > > Here and in the other places below, you lose:
> > >
> > > - } else if (decayed) {
> > >
> > > The decayed condition ensures a rate limit (~1ms) on the number of
> > > calls to cpufreq_update_util().
> > >
> > > enqueue/dequeue/tick don't create any sudden change in the PELT
> > > signals that would require updating cpufreq, unlike
> > > attach/detach
> >
> > Okay, thanks for the clue. Let me rethink this again.
>
> Thinking more about this. Do we really need to send freq updates at
> enqueue/attach etc?

Yes. Attach and detach are the two events that can make abrupt and
significant changes in the utilization of the CPU.
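
For reference, attach folds a whole entity's signal into the cfs_rq in one
step and reports it right away; roughly (paraphrasing mainline
attach_entity_load_avg(), heavily trimmed):

        /* attach_entity_load_avg(), heavily trimmed */
        cfs_rq->avg.util_avg += se->avg.util_avg;
        cfs_rq->avg.util_sum += se->avg.util_sum;
        ...
        /* report the step change in utilization to cpufreq */
        cfs_rq_util_change(cfs_rq, 0);

So the CPU utilization can jump by a full task's worth of util_avg at once,
which is why attach/detach want an immediate freq update instead of waiting
for the next decay.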

>
> I did an experiment to remove all the updates except in three places:
>
> 1. Context switch (done unconditionally)
> 2. Tick
> 3. update_blocked_averages()

From the PoV of util_avg, attach, detach, tick and
update_blocked_averages are mandatory events to report to cpufreq to
correctly follow utilization.
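
In current mainline, the tick path goes through update_load_avg(), where the
decayed condition already gates the cpufreq update (update_blocked_averages()
applies the same decayed check before calling cpufreq_update_util()). Roughly
(paraphrasing update_load_avg(), trimmed):

        decayed  = update_cfs_rq_load_avg(now, cfs_rq);
        decayed |= propagate_entity_load_avg(se);

        if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
                /* abrupt change: attach_entity_load_avg() reports it itself */
                attach_entity_load_avg(cfs_rq, se);
                ...
        } else if (decayed) {
                /*
                 * decayed is essentially only set once a full PELT period
                 * (1024us) has elapsed, hence the ~1ms rate limit
                 * mentioned above.
                 */
                cfs_rq_util_change(cfs_rq, 0);
                ...
        }

That's what keeps the tick and blocked-average updates both correct and
naturally rate limited.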

>
> I tried the patch below on a Mac mini with an M1 SoC, 6.6 kernel; Speedometer
> scores were the same (against this series).
>
> I ran perf bench sched pipe to see if the addition in the context switch path
> would make things worse.
>
> before (this series applied):
>
> # Running 'sched/pipe' benchmark:
> # Executed 1000000 pipe operations between two processes
>
> Total time: 20.505 [sec]
>
> 20.505628 usecs/op
> 48767 ops/sec
>
> after (proposed patch below applied on top):
>
> # Running 'sched/pipe' benchmark:
> # Executed 1000000 pipe operations between two processes
>
> Total time: 19.475 [sec]
>
> 19.475838 usecs/op
> 51345 ops/sec
>
> I also tried against the vanilla kernel (without any patches, including some
> dependency backports) using the schedutil governor:
>
> # Running 'sched/pipe' benchmark:
> # Executed 1000000 pipe operations between two processes
>
> Total time: 22.428 [sec]
>
> 22.428166 usecs/op
> 44586 ops/sec
>
> (I got higher run-to-run variance against this kernel)
>
> So things got better. I think we can actually unify all updates to be at
> context switch and tick for all classes.
>
> The one hole is for long-running CFS tasks without a context switch; there are
> no updates until the tick, which means the DVFS headroom now needs to cater
> for that as the worst-case scenario. I think this is actually the behavior
> today; but accidental updates due to enqueue/dequeue could have helped issue
> more updates. If none of that happens, then updating load_avg at entity_tick()
> is what would have triggered a frequency update.
>
> I'm not sure if the blocked averages one is required, to be honest. But I
> feared that when the CPU goes idle we might miss reducing the frequency vote
> for that CPU - which matters on shared cpufreq policies.
>
> I haven't done comprehensive testing, of course. But I would love to hear any
> thoughts on how we can be more strategic and reduce the number of cpufreq
> updates we send. And the iowait boost state needs to be verified.
>
> While testing this series more I did notice that short kworker context
> switches can still cause the frequency to go high. I traced this back to
> __balance_callbacks() in finish_lock_switch() after calling
> uclamp_context_switch(). So I think we do have a problem of updates being
> 'accidental', and we need to improve the interaction with the governor to be
> more intentional, keeping in mind all the constraints we have today in
> hardware and software.
>
> --->8---
>
>
> From 6deed09be1d075afa3858ca62dd54826cdb60d44 Mon Sep 17 00:00:00 2001
> From: Qais Yousef <qyousef@xxxxxxxxxxx>
> Date: Tue, 26 Dec 2023 01:23:57 +0000
> Subject: [PATCH] sched/fair: Update freq on tick and context switch and
> blocked avgs
>
> Signed-off-by: Qais Yousef (Google) <qyousef@xxxxxxxxxxx>
> ---
> kernel/sched/cpufreq_schedutil.c | 3 ---
> kernel/sched/fair.c | 13 +------------
> kernel/sched/sched.h | 15 +--------------
> 3 files changed, 2 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index c0879a985097..553a3d7f02d8 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -166,9 +166,6 @@ static inline bool ignore_short_tasks(int cpu,
> struct task_struct *p = cpu_rq(cpu)->curr;
> unsigned long task_util;
>
> - if (!(flags & SCHED_CPUFREQ_PERF_HINTS))
> - return false;
> -
> if (!fair_policy(p->policy))
> return false;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d63eae534cec..3a30f78b37d3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5717,8 +5717,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> sub_nr_running(rq, task_delta);
>
> done:
> - cpufreq_update_util(rq, 0);
> -
> /*
> * Note: distribution will already see us throttled via the
> * throttled-list. rq->lock protects completion.
> @@ -5811,8 +5809,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> unthrottle_throttle:
> assert_list_leaf_cfs_rq(rq);
>
> - cpufreq_update_util(rq, 0);
> -
> /* Determine whether we need to wake up potentially idle CPU: */
> if (rq->curr == rq->idle && rq->cfs.nr_running)
> resched_curr(rq);
> @@ -6675,8 +6671,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> enqueue_throttle:
> assert_list_leaf_cfs_rq(rq);
>
> - cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
> -
> hrtick_update(rq);
> }
>
> @@ -6754,7 +6748,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> dequeue_throttle:
> util_est_update(&rq->cfs, p, task_sleep);
> - cpufreq_update_util(rq, 0);
> hrtick_update(rq);
> }
>
> @@ -8338,7 +8331,6 @@ done: __maybe_unused;
>
> update_misfit_status(p, rq);
> sched_fair_update_stop_tick(rq, p);
> - cpufreq_update_util(rq, 0);
>
> return p;
>
> @@ -12460,7 +12452,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> update_misfit_status(curr, rq);
> update_overutilized_status(task_rq(curr));
> - cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
> + cpufreq_update_util(rq, current->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
>
> task_tick_core(rq, curr);
> }
> @@ -12585,7 +12577,6 @@ static void detach_task_cfs_rq(struct task_struct *p)
> struct sched_entity *se = &p->se;
>
> detach_entity_cfs_rq(se);
> - cpufreq_update_util(task_rq(p), 0);
> }
>
> static void attach_task_cfs_rq(struct task_struct *p)
> @@ -12593,7 +12584,6 @@ static void attach_task_cfs_rq(struct task_struct *p)
> struct sched_entity *se = &p->se;
>
> attach_entity_cfs_rq(se);
> - cpufreq_update_util(task_rq(p), 0);
> }
>
> static void switched_from_fair(struct rq *rq, struct task_struct *p)
> @@ -12839,7 +12829,6 @@ static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
> update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
> update_cfs_group(se);
> }
> - cpufreq_update_util(rq, 0);
> rq_unlock_irqrestore(rq, &rf);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 516187ea2b81..e1622e2b82be 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3076,20 +3076,7 @@ static inline bool uclamp_rq_is_capped(struct rq *rq)
> /* Request freq update on context switch if necessary */
> static inline void uclamp_context_switch(struct rq *rq)
> {
> - unsigned long uclamp_min;
> - unsigned long uclamp_max;
> - unsigned long util;
> -
> - /* Only RT and FAIR tasks are aware of uclamp */
> - if (!rt_policy(current->policy) && !fair_policy(current->policy))
> - return;
> -
> - uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> - uclamp_max = uclamp_eff_value(current, UCLAMP_MAX);
> - util = rq->cfs.avg.util_avg;
> -
> - if (uclamp_min > util || uclamp_max < util)
> - cpufreq_update_util(rq, SCHED_CPUFREQ_PERF_HINTS);
> + cpufreq_update_util(rq, current->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
> }
> #else /* CONFIG_UCLAMP_TASK */
> static inline unsigned long uclamp_eff_value(struct task_struct *p,
> --
> 2.40.1
>