Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when updating misfit

From: Qais Yousef
Date: Wed Jan 24 2024 - 17:39:07 EST


On 01/23/24 18:22, Vincent Guittot wrote:

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bcea3d55d95d..0830ceb7ca07 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5065,17 +5065,61 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> >
> > static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > {
> > + unsigned long uclamp_min, uclamp_max;
> > + unsigned long util, cpu_cap;
> > + int cpu = cpu_of(rq);
> > +
> > if (!sched_asym_cpucap_active())
> > return;
> >
> > - if (!p || p->nr_cpus_allowed == 1) {
> > - rq->misfit_task_load = 0;
> > - return;
> > - }
> > + if (!p || p->nr_cpus_allowed == 1)
> > + goto out;
> >
> > - if (task_fits_cpu(p, cpu_of(rq))) {
> > - rq->misfit_task_load = 0;
> > - return;
> > + cpu_cap = arch_scale_cpu_capacity(cpu);
> > +
> > + /* If we can't fit the biggest CPU, that's the best we can ever get. */
> > + if (cpu_cap == SCHED_CAPACITY_SCALE)
> > + goto out;
> > +
> > + uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> > + uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> > + util = task_util_est(p);
> > +
> > + if (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0)
> > + goto out;
> > +
> > + /*
> > + * If the task affinity is not set to default, make sure it is not
> > + * restricted to a subset where no CPU can ever fit it. Triggering
> > + * misfit in this case is pointless as it has no where better to move
> > + * to. And it can lead to balance_interval to grow too high as we'll
> > + * continuously fail to move it anywhere.
> > + */
> > + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
> > + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> > + bool has_fitting_cpu = false;
> > + struct asym_cap_data *entry;
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
>
> Do we really want to potentially do this loop at every pick_next task ?

The common case should return quickly as the biggest CPU should be present
in every task by default. And after sorting the biggest CPU will be the first
entry and we should return after one check.

Could we move the update to another less expensive location instead?

We could try to do better tracking for CPUs that has their affinity changed,
but I am not keen on sprinkling more complexity else where to deal with this.

We could keep the status quouo and just prevent the misfit load balancing from
increment nr_failed similar to newidle_balance too. I think this should have
a similar effect. Not ideal but if this is considered too expensive still
I can't think of other options that don't look ugly to me FWIW.


Thanks

--
Qais Yousef