Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect capacity inversion

From: Qais Yousef
Date: Mon Dec 05 2022 - 09:09:35 EST


On 12/05/22 13:39, Rafael J. Wysocki wrote:
> On Sat, Dec 3, 2022 at 3:32 PM Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >
> > On 11/30/22 19:27, Rafael J. Wysocki wrote:
> >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 7c0dd57e562a..4bbbca85134b 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8856,23 +8856,20 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> > > > * * Thermal pressure will impact all cpus in this perf domain
> > > > * equally.
> > > > */
> > > > - if (sched_energy_enabled()) {
> > > > + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> > > > unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
> > > > - struct perf_domain *pd = rcu_dereference(rq->rd->pd);
> > > > + struct cpufreq_policy *policy, __maybe_unused *policy_n;
> > > >
> > > > rq->cpu_capacity_inverted = 0;
> > > >
> > > > - SCHED_WARN_ON(!rcu_read_lock_held());
> > > > -
> > > > - for (; pd; pd = pd->next) {
> > > > - struct cpumask *pd_span = perf_domain_span(pd);
> > > > + for_each_active_policy_safe(policy, policy_n) {
> > >
> > > 1. Is the "safe" part sufficient for protection against concurrent
> > > deletion and freeing of list entries? cpufreq driver removal can do
> > > that AFAICS.
> >
> > The freeing part is not safe probably.
>
> Well, I don't even think that the traversal part is safe against
> concurrent removal of list entries (it is safe against removal of list
> entries in the loop itself).
>
> list_for_each_entry_safe() assumes that n will always point to a valid
> list entry, but what if the entry pointed to by it is deleted by a
> concurrent thread?

Yeah, I was being too hopeful here that the "safe" variant does magic.
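
For reference, the "safe" variant just caches the next pointer up front; from
memory the list.h macro is roughly:

	#define list_for_each_entry_safe(pos, n, head, member)			\
		for (pos = list_first_entry(head, typeof(*pos), member),	\
			n = list_next_entry(pos, member);			\
		     !list_entry_is_head(pos, head, member);			\
		     pos = n, n = list_next_entry(n, member))

So it only protects against the loop body itself removing the current entry,
not against a concurrent writer deleting and freeing entries under us.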

>
> > I need to research this more. Do you
> > have issues against the exportation of this traversal in principle?
> >
> > Switching them to be RCU protected could be the best safe option, anything
> > against that too?
>
> Not really, it just occurred to me that you may end up dealing with a
> corrupted list here.

Yes.

>
> > I might not end up needing that. I need to dig more.
> >
> > > 2. For a casual reader of this code it may not be clear why cpufreq
> > > policies matter here.
> >
> > I'm looking for a way to traverse the list of capacities of the system and
> > know their related CPUs.
>
> So why don't you mention this in a comment, for instance?

Sorry, my bad. I realize now that my questions could have been clearer.

>
> > AFAICT this information already exists in the performance domains and
> > cpufreq_policy. Performance domains are conditional to energy model and
> > schedutil. So trying to switch to cpufreq_policy.
> >
> > Assuming this question wasn't a request to add a comment :-)
>
> Yes, it was. :-)
>
> That said though, this change makes the scheduler kind of depend on
> cpufreq which feels a bit like a corner cut TBH.

I share your sentiment; hence my request for feedback before I spend more time
on this. And apologies for not being clearer about my questions.

The dependency is soft, and it already exists in the sense that performance
domains, schedutil and thermal pressure all interact with the cpufreq
subsystem already.

I can certainly add a better comment, if this ends up being the right thing to
do.

> I do realize that cpufreq happens to be maintaining a data structure
> that turns out to be useful here, but the reason why it is there has
> nothing to do with this code AFAICS.

Are the comments about needing to check the performance domains to detect
capacity inversion not clear? Sorry, I realize now that all of this might look
out of the blue to you.

The issue I'm dealing with is that I want to know when thermal pressure has
gone so far that the capacity of a performance domain is now lower than the
capacity of another domain in the system; aka capacity inversion.
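
I.e. a CPU is in capacity inversion when:

	capacity_orig_of(cpu) - thermal_pressure(cpu) < capacity_orig_of(other)

for some CPU 'other' that belongs to a different performance domain with a
lower original capacity.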

The issue at hand is that when using uclamp_min, we need to consider thermal
pressure; but not unconditionally, as we have corner cases like this one:

p->util_avg = 10%
p->uclamp_min = 90%

capacity_orig_of(big) = 100%
capacity_orig_of(med) = 60%

the task wants to run near the top performance level of the system; but if we
take thermal pressure into account blindly, then we will fail to place the
task on the big core even though it is still the most performant core. For
example, if

capacity_orig_of(big) - thermal_pressure = 80%

then from the uclamp_min hint's perspective the big core is fine and it's the
best placement we can get. The task is not getting what it asked for, but 80%
is better than going down to 60% or lower.

On the other hand if

p->util_avg = 10%
p->uclamp_min = 60%

The task can be placed on a medium or a big core. But if the medium core is
under thermal pressure, then we do have the option to place it on a big core;
and if we ignore thermal pressure, the task could miss a deadline because of
this unnecessarily bad placement decision.

By knowing when we are in capacity inversion, we can ignore thermal pressure
on big cores until we hit the inversion point; once we do hit it, we know that
there is another, higher performance point in the system to consider.
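
In other words, on the fitting side the intent is roughly the below. This is
an illustrative sketch only, not the actual fits-capacity change;
cpu_in_capacity_inversion() here is a placeholder for whatever ends up
exposing the state computed in update_cpu_capacity():

	static bool uclamp_min_fits_cpu(struct task_struct *p, int cpu)
	{
		unsigned long capacity = capacity_orig_of(cpu);
		unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);

		/*
		 * Only subtract thermal pressure once the CPU is in
		 * capacity inversion. Before that point there is no better
		 * CPU in the system for a high uclamp_min hint, so the
		 * (thermally pressured) big CPU is still the best placement
		 * we can offer.
		 */
		if (cpu_in_capacity_inversion(cpu))
			capacity -= thermal_load_avg(cpu_rq(cpu));

		return uclamp_min <= capacity;
	}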

We update the capacity of the CPUs in this code, hence why I added the extra
logic to detect the inversion here. The only way I found to do that is by
going via the perf domains, but Dietmar made me realize they depend on the
energy model; so I'm trying to use cpufreq policies instead. If cpufreq is not
present, then there are no thermal pressure signals either, and this whole
loop should be optimized out by the compiler.
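
To illustrate, the loop body (elided from the hunk quoted above) is meant to
do roughly the following; a sketch rather than the actual patch, and it
assumes all CPUs in policy->related_cpus share the same original capacity:

	for_each_active_policy_safe(policy, policy_n) {
		int op_cpu = cpumask_any(policy->related_cpus);
		unsigned long cap_orig = arch_scale_cpu_capacity(op_cpu);

		/* Only domains with a lower original capacity matter. */
		if (cap_orig >= capacity_orig)
			continue;

		/*
		 * If a smaller domain still offers more capacity than we can
		 * deliver under thermal pressure, we are in capacity
		 * inversion.
		 */
		if (cap_orig > inv_cap) {
			rq->cpu_capacity_inverted = inv_cap;
			break;
		}
	}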

Hope this helps to clarify and doesn't add more confusion! This logic is used
in the scheduler when deciding whether a task fits a CPU based on its util_avg
and uclamp values.

I wanted to check whether it's alright to make these macros visible in
cpufreq.h and use them here, and to check what the right locking rules would
be. It seems converting the list to be RCU protected is the right way forward.
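
The traversal side would then boil down to something like this (a sketch;
it assumes cpufreq_policy_list and policy_is_inactive() become visible outside
cpufreq.c and that the list is made RCU safe):

	rcu_read_lock();
	list_for_each_entry_rcu(policy, &cpufreq_policy_list, policy_list) {
		if (policy_is_inactive(policy))
			continue;
		/* inspect policy->related_cpus / capacities here */
	}
	rcu_read_unlock();

with the removal path doing list_del_rcu() and waiting for a grace period
(or using kfree_rcu()) before the policy is freed.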

Assuming Vincent (or you) doesn't tell me there's actually a better way to
handle all of this that I missed :-)


Thanks!

--
Qais Yousef