Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when updating misfit

From: Qais Yousef
Date: Mon Jan 22 2024 - 13:36:47 EST


On 01/22/24 09:59, Dietmar Eggemann wrote:
> On 05/01/2024 23:20, Qais Yousef wrote:
> > From: Qais Yousef <qais.yousef@xxxxxxx>
> >
> > If a misfit task is affined to a subset of the possible cpus, we need to
> > verify that one of these cpus can fit it. Otherwise the load balancer
> > code will continuously trigger needlessly leading the balance_interval
> > to increase in return and eventually end up with a situation where real
> > imbalances take a long time to address because of this impossible
> > imbalance situation.
> >
> > This can happen in Android world where it's common for background tasks
> > to be restricted to little cores.
> >
> > Similarly if we can't fit the biggest core, triggering misfit is
> > pointless as it is the best we can ever get on this system.
> >
> > To be able to detect that; we use asym_cap_list to iterate through
> > capacities in the system to see if the task is able to run at a higher
> > capacity level based on its p->cpus_ptr. To do so safely, we convert the
> > list to be RCU protected.
> >
> > To be able to iterate through capacity levels, export asym_cap_list to
> > allow for fast traversal of all available capacity levels in the system.
> >
> > Test:
> > =====
> >
> > Add
> >
> > trace_printk("balance_interval = %lu\n", interval)
> >
> > in get_sd_balance_interval().
> >
> > run
> > if [ "$MASK" != "0" ]; then
> > adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
> > fi
> > sleep 10
> > // parse ftrace buffer counting the occurrence of each valaue
> >
> > Where MASK is either:
> >
> > * 0: no busy task running
>
> ... no busy task stands for no misfit scenario?

Yes

>
> > * 1: busy task is pinned to 1 cpu; handled today to not cause
> > misfit
> > * f: busy task pinned to little cores, simulates busy background
> > task, demonstrates the problem to be fixed
> >
>
> [...]
>
> > + /*
> > + * If the task affinity is not set to default, make sure it is not
> > + * restricted to a subset where no CPU can ever fit it. Triggering
> > + * misfit in this case is pointless as it has no where better to move
> > + * to. And it can lead to balance_interval to grow too high as we'll
> > + * continuously fail to move it anywhere.
> > + */
> > + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
>
> Shouldn't this be cpu_active_mask ?

Hmm. So the intention was to check if the affinity was changed from default.

If we hotplug all but little we could end up with the same problem, yes you're
right.

But if the affinity is set to only to littles and cpu_active_mask is only for
littles too, then we'll also end up with the same problem as they both are
equal.

Better to drop this check then? With the sorted list the common case should be
quick to return as they'll have 1024 as a possible CPU.

>
> include/linux/cpumask.h
>
> * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
> * cpu_present_mask - has bit 'cpu' set iff cpu is populated
> * cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
> * cpu_active_mask - has bit 'cpu' set iff cpu available to migration
>
>
> > + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> > + bool has_fitting_cpu = false;
> > + struct asym_cap_data *entry;
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> > + if (entry->capacity > cpu_cap) {
> > + cpumask_t *cpumask;
> > +
> > + if (clamped_util > entry->capacity)
> > + continue;
> > +
> > + cpumask = cpu_capacity_span(entry);
> > + if (!cpumask_intersects(p->cpus_ptr, cpumask))
> > + continue;
> > +
> > + has_fitting_cpu = true;
> > + break;
> > + }
> > + }
>
> What happen when we hotplug out all CPUs of one CPU capacity value?
> IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
> (partition_sched_domains_locked()).

Right. I missed that. We can add another intersection check against
cpu_active_mask.

I am assuming the skipping was done by design, not a bug that needs fixing?
I see for suspend (cpuhp_tasks_frozen) the domains are rebuilt, but not for
hotplug.

>
> > + rcu_read_unlock();
> > +
> > + if (!has_fitting_cpu)
> > + goto out;
> > }
> >
> > /*
> > @@ -5083,6 +5127,9 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > * task_h_load() returns 0.
> > */
> > rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
> > + return;
> > +out:
> > + rq->misfit_task_load = 0;
> > }
> >
> > #else /* CONFIG_SMP */
> > @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> > */
> > static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> > {
> > - return rq->misfit_task_load &&
> > - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> > - check_cpu_capacity(rq, sd));
> > + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
>
> You removed 'arch_scale_cpu_capacity(rq->cpu) <
> rq->rd->max_cpu_capacity' here. Why? I can see that with the standard

Based on Pierre review since we no longer trigger misfit for big cores.
I thought Pierre's remark was correct so did the change in v3

https://lore.kernel.org/lkml/bae88015-4205-4449-991f-8104436ab3ba@xxxxxxx/

> setup (max CPU capacity equal 1024) which is what we probably use 100%
> of the time now. It might get useful again when Vincent will introduce
> his 'user space system pressure' implementation?

I don't mind putting it back if you think it'd be required again in the near
future. I still didn't get a chance to look at Vincent patches properly, but if
there's a clash let's reduce the work.

>
> > }
>
> [...]
>
> > @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
> >
> > list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> > if (cpumask_empty(cpu_capacity_span(entry))) {
> > - list_del(&entry->link);
> > - kfree(entry);
> > + list_del_rcu(&entry->link);
> > + call_rcu(&entry->rcu, free_asym_cap_entry);
>
> Looks like there could be brief moments in which one CPU capacity group
> of CPUs could be twice in asym_cap_list. I'm thinking about initial
> startup + max CPU frequency related adjustment of CPU capacity
> (init_cpu_capacity_callback()) for instance. Not sure if this is really
> an issue?

I don't think so. As long as the reader sees a consistent value and no crashes
ensued, a momentarily wrong decision in transient or extra work is fine IMO.
I don't foresee a big impact.


Thanks!

--
Qais Yousef

>
> [...]
>