Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()

From: Qais Yousef
Date: Thu Dec 28 2023 - 19:25:45 EST


On 12/12/23 12:40, Qais Yousef wrote:
> On 12/12/23 12:06, Vincent Guittot wrote:
>
> > > @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > enqueue_throttle:
> > > assert_list_leaf_cfs_rq(rq);
> > >
> >
> > Here and in the other places below, you lose :
> >
> > - } else if (decayed) {
> >
> > The decayed condition ensures a rate limit (~1ms) in the number of
> > calls to cpufreq_update_util.
> >
> > enqueue/dequeue/tick don't create any sudden change in the PELT
> > signals that would require to update cpufreq of the change unlike
> > attach/detach
>
> Okay, thanks for the clue. Let me rethink this again.

Thinking more about this. Do we really need to send freq updates at
enqueue/attach etc?

I did an experiment to remove all the updates except in three places:

1. Context switch (done unconditionally)
2. Tick
3. update_blocked_averages()

I tried the patch below on a Mac mini with the M1 SoC, 6.6 kernel; Speedometer
scores were the same (compared against this series).

I ran perf bench sched pipe to see whether the unconditional update at context
switch makes things worse.

before (this series applied):

# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes

Total time: 20.505 [sec]

20.505628 usecs/op
48767 ops/sec

after (proposed patch below applied on top):

# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes

Total time: 19.475 [sec]

19.475838 usecs/op
51345 ops/sec

I also tried the vanilla kernel (without any patches, including some dependency
backports) using the schedutil governor:

# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes

Total time: 22.428 [sec]

22.428166 usecs/op
44586 ops/sec

(I saw higher run-to-run variance with this kernel.)

So things got better. I think we can actually unify all updates to be at
context switch and tick for all classes.

The one hole is long-running CFS tasks that don't context switch: there are no
updates until the tick, which means the DVFS headroom now needs to cater for
that as the worst-case scenario. I think this is the behavior today anyway; but
accidental updates due to enqueue/dequeue could have helped issue more updates.
If none of that happens, then the load_avg update at entity_tick() is what
would have triggered a frequency update.

I'm not sure the update_blocked_averages() one is required, to be honest. But
I feared that when a CPU goes idle we might miss reducing the frequency vote
for that CPU - which matters on shared cpufreq policies.

I haven't done comprehensive testing, of course. But I would love to hear any
thoughts on how we can be more strategic and reduce the number of cpufreq
updates we send. The iowait boost state also needs to be verified.

While testing this series further, I noticed that short kworker context
switches can still cause the frequency to go high. I traced this back to
__balance_callbacks() in finish_lock_switch(), after the call to
uclamp_context_switch(). So I think we do have a problem of updates being
'accidental', and we need to make the interaction with the governor more
intentional, keeping in mind all the constraints we have today in hardware
and software.

--->8---