Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when updating misfit

From: Qais Yousef
Date: Wed Jan 24 2024 - 17:43:47 EST


On 01/23/24 18:07, Dietmar Eggemann wrote:
> On 22/01/2024 19:02, Qais Yousef wrote:
> > On 01/22/24 09:59, Dietmar Eggemann wrote:
> >> On 05/01/2024 23:20, Qais Yousef wrote:
> >>> From: Qais Yousef <qais.yousef@xxxxxxx>
>
> [...]
>
> >>> + /*
> >>> + * If the task affinity is not set to default, make sure it is not
> >>> + * restricted to a subset where no CPU can ever fit it. Triggering
> >>> + * misfit in this case is pointless as it has nowhere better to move
> >>> + * to. And it can lead balance_interval to grow too high as we'll
> >>> + * continuously fail to move it anywhere.
> >>> + */
> >>> + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
> >>
> >> Shouldn't this be cpu_active_mask ?
> >
> > Hmm. So the intention was to check if the affinity was changed from default.
> >
> > If we hotplug out all but the little CPUs we could end up with the same
> > problem, yes you're right.
> >
> > But if the affinity is set only to the littles and cpu_active_mask covers
> > only the littles too, then we'll also end up with the same problem, as the
> > two masks are equal.
>
> Yes, that's true.
>
> > Better to drop this check then? With the sorted list, the common case should
> > be quick to return, as they'll have a capacity-1024 CPU among their possible
> > CPUs.
>
> Or you keep 'cpu_possible_mask' and rely on the fact that the
> asym_cap_list entries are removed if those CPUs are hotplugged out. In
> this case the !has_fitting_cpu path should prevent useless Misfit load
> balancing attempts.

IIUC this removal doesn't happen today, as outlined below, but it is something
we need to do, right? That would be better, yes.
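
For reference, the reader side this series adds is roughly the below, sitting
in update_misfit_status() after the cpu_cap/goto out hunk quoted further down
(paraphrasing the patch; entry->capacity and cpu_capacity_span() are the
capacity value and CPU span of an asym_cap_list entry):

        if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
                bool has_fitting_cpu = false;
                struct asym_cap_data *entry;

                rcu_read_lock();
                list_for_each_entry_rcu(entry, &asym_cap_list, link) {
                        /*
                         * The task fits somewhere if its affinity
                         * intersects a capacity level bigger than
                         * this CPU's.
                         */
                        if (entry->capacity > cpu_cap &&
                            cpumask_intersects(p->cpus_ptr,
                                               cpu_capacity_span(entry))) {
                                has_fitting_cpu = true;
                                break;
                        }
                }
                rcu_read_unlock();

                /* Nowhere better to move to; don't flag misfit. */
                if (!has_fitting_cpu)
                        goto out;
        }

With the entries kept in sync with hotplug as you suggest, an offlined
capacity level simply disappears from this walk, so nothing bigger intersects
the task's mask and the !has_fitting_cpu path does the right thing.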

>
> [...]
>
> >> What happens when we hotplug out all CPUs of one CPU capacity value?
> >> IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
> >> (partition_sched_domains_locked()).
> >
> > Right. I missed that. We can add another intersection check against
> > cpu_active_mask.
> >
> > I am assuming the skipping was done by design, not a bug that needs fixing?
> > I see for suspend (cpuhp_tasks_frozen) the domains are rebuilt, but not for
> > hotplug.
>
> IMHO, it's by design. We set up asym_cap_list only when new_topology is
> set (update_topology_flags_workfn() from init_cpu_capacity_callback() or
> topology_init_cpu_capacity_cppc()). I.e. when the (max) CPU capacity can
> change.
> In all the other !new_topology cases we check `has_asym |= sd->flags &
> SD_ASYM_CPUCAPACITY` and set sched_asym_cpucapacity accordingly in
> build_sched_domains(). Before we always reset sched_asym_cpucapacity in
> detach_destroy_domains().
> But now we would have to keep asym_cap_list in sync with the active CPUs,
> I guess.

Okay, so you suggest we need to update the code to keep it in sync. Let's first
see whether Vincent is satisfied with this list traversal or whether we need to
go another way :-)
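
If we go that route, one option is to rescan against the active mask whenever
the domains are rebuilt. A rough sketch, not tested, keeping the rest of
asym_cpu_capacity_scan() as it is today:

        static void asym_cpu_capacity_scan(void)
        {
                struct asym_cap_data *entry, *next;
                int cpu;

                list_for_each_entry(entry, &asym_cap_list, link)
                        cpumask_clear(cpu_capacity_span(entry));

                /*
                 * Scan the active mask instead of the possible mask so
                 * that a capacity level whose CPUs are all hotplugged
                 * out loses its entry and the list stays in sync.
                 */
                for_each_cpu_and(cpu, cpu_active_mask,
                                 housekeeping_cpumask(HK_TYPE_DOMAIN))
                        asym_cpu_capacity_update_data(cpu);

                list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
                        if (cpumask_empty(cpu_capacity_span(entry))) {
                                list_del_rcu(&entry->link);
                                call_rcu(&entry->rcu, free_asym_cap_entry);
                        }
                }
        }

I think we'd also need to call it from the !new_topology path in
partition_sched_domains_locked() for this to actually run on hotplug.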

I think it is worth having this asym_capacity list available. We have needed it
several times now, and only a little work is required to make it available for
potential future users, even if we don't merge this immediately.

>
> [...]
>
> >>> #else /* CONFIG_SMP */
> >>> @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> >>> */
> >>> static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> >>> {
> >>> - return rq->misfit_task_load &&
> >>> - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> >>> - check_cpu_capacity(rq, sd));
> >>> + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
> >>
> >> You removed 'arch_scale_cpu_capacity(rq->cpu) <
> >> rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
> >
> > Based on Pierre's review, since we no longer trigger misfit for big cores.
> > I thought Pierre's remark was correct, so I made the change in v3
>
> Ah, this is the replacement:
>
> - if (task_fits_cpu(p, cpu_of(rq))) { <- still MF for util > 0.8 * 1024
> - rq->misfit_task_load = 0;
> - return;
> + cpu_cap = arch_scale_cpu_capacity(cpu);
> +
> + /* If we can't fit the biggest CPU, that's the best we can ever get */
> + if (cpu_cap == SCHED_CAPACITY_SCALE)
> + goto out;

Yep.

>
> >
> > https://lore.kernel.org/lkml/bae88015-4205-4449-991f-8104436ab3ba@xxxxxxx/
> >
> >> setup (max CPU capacity equal to 1024), which is what we probably use 100%
> >> of the time now. It might become useful again when Vincent introduces
> >> his 'user space system pressure' implementation?
> >
> > I don't mind putting it back if you think it'd be required again in the near
> > future. I still haven't had a chance to look at Vincent's patches properly,
> > but if there's a clash let's reduce the work.
>
> Vincent already commented on this in this thread.
>
> [...]
>
> >>> @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
> >>>
> >>> list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> >>> if (cpumask_empty(cpu_capacity_span(entry))) {
> >>> - list_del(&entry->link);
> >>> - kfree(entry);
> >>> + list_del_rcu(&entry->link);
> >>> + call_rcu(&entry->rcu, free_asym_cap_entry);
> >>
> >> Looks like there could be brief moments in which one CPU capacity group
> >> of CPUs appears twice in asym_cap_list. I'm thinking about initial
> >> startup + max CPU frequency related adjustment of CPU capacity
> >> (init_cpu_capacity_callback()) for instance. Not sure if this is really
> >> an issue?
> >
> > I don't think so. As long as the reader sees a consistent value and no crashes
> > ensue, a momentarily wrong decision during the transient, or some extra work,
> > is fine IMO. I don't foresee a big impact.
>
> OK.

Thanks!
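
FWIW, the reason the transient duplicate is benign is the usual RCU list
pattern: list_del_rcu() unlinks the entry so new readers can't find it, and
call_rcu() defers the actual free until every reader that might still hold a
reference has left its read-side critical section. The callback then boils
down to:

        static void free_asym_cap_entry(struct rcu_head *head)
        {
                struct asym_cap_data *entry =
                        container_of(head, struct asym_cap_data, rcu);

                /* Safe now: no reader can still reach this entry. */
                kfree(entry);
        }

So a reader walking the list under rcu_read_lock() sees either the old or the
new entry for a capacity level, never freed memory.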

--
Qais Yousef