Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect capacity inversion

From: Qais Yousef
Date: Mon Dec 05 2022 - 06:02:12 EST


On 12/04/22 12:35, Vincent Guittot wrote:
> On Sat, 3 Dec 2022 at 15:33, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >
> > On 12/02/22 15:57, Vincent Guittot wrote:
> >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 7c0dd57e562a..4bbbca85134b 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8856,23 +8856,20 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> > > > * * Thermal pressure will impact all cpus in this perf domain
> > > > * equally.
> > > > */
> > > > - if (sched_energy_enabled()) {
> > > > + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> > > > unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
> > > > - struct perf_domain *pd = rcu_dereference(rq->rd->pd);
> > > > + struct cpufreq_policy *policy, __maybe_unused *policy_n;
> > > >
> > > > rq->cpu_capacity_inverted = 0;
> > > >
> > > > - SCHED_WARN_ON(!rcu_read_lock_held());
> > > > -
> > > > - for (; pd; pd = pd->next) {
> > > > - struct cpumask *pd_span = perf_domain_span(pd);
> > > > + for_each_active_policy_safe(policy, policy_n) {
> > >
> > > So you are looping all cpufreq policy (and before the perf domain) in
> > > the period load balance. That' really not something we should or want
> > > to do
> >
> > Why is it not acceptable in the period load balance but acceptable in the hot
> > wake up path in feec()? What's the difference?
>
> This patch loops on all cpufreq policy in sched softirq, how can this
> be sane ? and not only in eas mode but also in the default asymmetric

Hmm I'm still puzzled. Why is it not sane to do it here, but okay to do it in
the wake up path in feec()?

AFAICT cpufreq policies and perf domains have a 1:1 mapping. In feec() we not
only loop the perf domains, but we also go through each CPU of each domain to
find max_spare_cap and then call compute_energy(), which is more intensive.

In the worst case scenario where there's a perf domain for each cpu, we'll
loop through every CPU *and* compute its energy cost in the wake up path.
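
For the record, the shape I'm comparing against is roughly this. It's a
heavily simplified sketch of feec(), not the exact fair.c code (the details
and compute_energy()'s arguments differ between versions, so I've left those
as comments):

	struct perf_domain *pd = rcu_dereference(rd->pd);
	int cpu;

	for (; pd; pd = pd->next) {
		unsigned long max_spare_cap = 0;
		int max_spare_cap_cpu = -1;

		/* walk every CPU of this perf domain, on every task wake up */
		for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
			/* util_fits_cpu() checks; track the CPU with the
			 * biggest spare capacity as this domain's candidate */
		}

		if (max_spare_cap_cpu >= 0) {
			/* compute_energy() for the candidate, which walks
			 * the CPUs of the perf domain yet again */
		}
	}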

Why is it deemed acceptable in the wake up path, which is more critical AFAIU,
but not in the periodic load balance, which is expected to do more work? What
am I missing?

> performance one.
>
> This inverted detection doesn't look like the right way to fix your
> problem IMO. That being said, i agree that I haven't made any other
> proposal apart that I think that you should use a different rules for
> task and for overutilized and part of your problem comes from this.

We discussed this before; I need to revisit the thread, but I can't see how
using different rules for overutilized than for task placement will fix the
issue. They should be unified by design.

I'm all ears if there's a simpler way to address the problem :-)

The problem is that thermal pressure on a big CPU is not important from the
uclamp perspective until it is in an inversion state. It is quite common to
have a system where the medium capacity is in the 500 range. If the big is
under thermal pressure such that it drops to 800, then it is still a fitting
CPU from the uclamp perspective. Keep in mind uclamp_min is useful for tasks
whose utilization is small. So we need to be selective about when thermal
pressure is actually helping out and when it is just creating unnecessary
problems.
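
To put numbers on it (made-up values; medium_cap_orig is just an illustrative
name for the next lower capacity_orig, and the condition mirrors the detection
loop in this patch):

	unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);

	/*
	 * big: capacity_orig = 1024, medium: capacity_orig = 500.
	 *
	 * Thermal pressure of 224 leaves inv_cap = 800. The big CPU can
	 * still serve any uclamp_min up to 800, so it should keep passing
	 * the fitness check: no inversion, ignore the thermal pressure.
	 *
	 * Thermal pressure of 600 leaves inv_cap = 424 < 500. The medium
	 * is now effectively the bigger CPU, so record the inversion and
	 * only then let thermal pressure fail the uclamp_min fit.
	 */
	if (inv_cap < medium_cap_orig)
		rq->cpu_capacity_inverted = inv_cap;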

The only other option I had in mind was to do the detection when we update
thermal_pressure in the topology code, but that didn't look like a better
alternative to me.
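
Roughly, that alternative would look something like this, purely as an
illustration (update_capacity_inversion() is a hypothetical helper called
from wherever the arch code writes the new thermal pressure value):

	/* hypothetical helper, not proposed code */
	static void update_capacity_inversion(int cpu)
	{
		/*
		 * Compare (capacity_orig - new thermal pressure) against the
		 * capacity_orig of the lower capacity levels and update
		 * rq->cpu_capacity_inverted here, once per thermal pressure
		 * update, instead of re-detecting it in every load balance.
		 */
	}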

>
> Then this make eas and util_fits_cpu even more Arm specific and I

What is the Arm specific part about it? Why wouldn't it work on non-Arm
systems?

> still hope to merge sched_asym_cpucapacity and asym_packing a some
> levels because they looks more and more similar but each side is
> trying to add some SoC specific policy

Oh, it seems Intel relies on asym_packing for their hybrid support approach?
I think sched_asym_cpucapacity was designed to be generic. If I gathered
correctly, lack of SMT support and the inability to provide an energy model
outside of DT were some of the required extensions.

To be honest, I personally think EAS can be useful on SMP systems and it would
be nice to enable it outside of sched_asym_cpucapacity.

I'm actually interested to hear more about this unification idea. If you feel
like describing in more detail how you see this being unified, that could be
enlightening for some of us who work in this area :-)


Thanks!

--
Qais Yousef