Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect capacity inversion

From: Rafael J. Wysocki
Date: Mon Dec 05 2022 - 07:40:26 EST


On Sat, Dec 3, 2022 at 3:32 PM Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> On 11/30/22 19:27, Rafael J. Wysocki wrote:
>
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 7c0dd57e562a..4bbbca85134b 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8856,23 +8856,20 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> > > * * Thermal pressure will impact all cpus in this perf domain
> > > * equally.
> > > */
> > > - if (sched_energy_enabled()) {
> > > + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> > > unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
> > > - struct perf_domain *pd = rcu_dereference(rq->rd->pd);
> > > + struct cpufreq_policy *policy, __maybe_unused *policy_n;
> > >
> > > rq->cpu_capacity_inverted = 0;
> > >
> > > - SCHED_WARN_ON(!rcu_read_lock_held());
> > > -
> > > - for (; pd; pd = pd->next) {
> > > - struct cpumask *pd_span = perf_domain_span(pd);
> > > + for_each_active_policy_safe(policy, policy_n) {
> >
> > 1. Is the "safe" part sufficient for protection against concurrent
> > deletion and freeing of list entries? cpufreq driver removal can do
> > that AFAICS.
>
> The freeing part is not safe probably.

Well, I don't even think that the traversal part is safe against
concurrent removal of list entries (it is safe against removal of list
entries in the loop itself).

list_for_each_entry_safe() assumes that n will always point to a valid
list entry, but what if the entry pointed to by it is deleted by a
concurrent thread?

> I need to research this more. Do you
> have issues against the exportation of this traversal in principle?
>
> Switching them to be RCU protected could be the best safe option, anything
> against that too?

Not really, it just occurred to me that you may end up dealing with a
corrupted list here.

> I might not end up needing that. I need to dig more.
>
> > 2. For a casual reader of this code it may not be clear why cpufreq
> > policies matter here.
>
> I'm looking for a way to traverse the list of capacities of the system and
> know their related CPUs.

So why don't you mention this in a comment, for instance?

> AFAICT this information already exists in the performance domains and
> cpufreq_policy. Performance domains are conditional to energy model and
> schedutil. So trying to switch to cpufreq_policy.
>
> Assuming this question wasn't a request to add a comment :-)

Yes, it was. :-)

That said though, this change makes the scheduler kind of depend on
cpufreq which feels a bit like a corner cut TBH.

I do realize that cpufreq happens to be maintaining a data structure
that turns out to be useful here, but the reason why it is there has
nothing to do with this code AFAICS.