Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when updating misfit

From: Dietmar Eggemann
Date: Tue Jan 23 2024 - 13:07:52 EST


On 22/01/2024 19:02, Qais Yousef wrote:
> On 01/22/24 09:59, Dietmar Eggemann wrote:
>> On 05/01/2024 23:20, Qais Yousef wrote:
>>> From: Qais Yousef <qais.yousef@xxxxxxx>

[...]

>>> + /*
>>> + * If the task affinity is not set to default, make sure it is not
>>> + * restricted to a subset where no CPU can ever fit it. Triggering
>>> + * misfit in this case is pointless as it has nowhere better to move
>>> + * to. And it can cause balance_interval to grow too high as we'll
>>> + * continuously fail to move it anywhere.
>>> + */
>>> + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
>>
>> Shouldn't this be cpu_active_mask ?
>
> Hmm. So the intention was to check if the affinity was changed from default.
>
> If we hotplug out all but the littles we could end up with the same problem;
> yes, you're right.
>
> But if the affinity is set to littles only and cpu_active_mask covers only
> littles too, then we'll also end up with the same problem, as the two masks
> are equal.

Yes, that's true.

> Better to drop this check then? With the sorted list the common case should
> return quickly, as tasks will usually have a 1024-capacity CPU in their
> affinity.

Or you keep 'cpu_possible_mask' and rely on the fact that the
asym_cap_list entries are removed when those CPUs are hotplugged out. In
that case the !has_fitting_cpu path should prevent useless misfit
load-balancing attempts.
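
Something like this is what I have in mind (just a sketch, not your
exact v4 code: the helper name task_has_fitting_cpu() is mine and the
fit test is simplified to a plain capacity comparison against
task_util()):

static bool task_has_fitting_cpu(struct task_struct *p)
{
        struct asym_cap_data *entry;
        bool fits = false;

        rcu_read_lock();
        list_for_each_entry_rcu(entry, &asym_cap_list, link) {
                /* Simplified fit test; the real one would be uclamp aware */
                if (entry->capacity < task_util(p))
                        continue;

                /* Any allowed CPU in this capacity group fits the task */
                if (cpumask_intersects(p->cpus_ptr,
                                       cpu_capacity_span(entry))) {
                        fits = true;
                        break;
                }
        }
        rcu_read_unlock();

        return fits;
}

update_misfit_status() would keep the cheap
!cpumask_equal(p->cpus_ptr, cpu_possible_mask) fast path and only fall
back to this scan for restricted affinities.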

[...]

>> What happens when we hotplug out all CPUs of one CPU capacity value?
>> IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
>> (partition_sched_domains_locked()).
>
> Right. I missed that. We can add another intersection check against
> cpu_active_mask.
>
> I am assuming the skipping was done by design, not a bug that needs fixing?
> I see for suspend (cpuhp_tasks_frozen) the domains are rebuilt, but not for
> hotplug.

IMHO, it's by design. We set up asym_cap_list only when new_topology is
set (update_topology_flags_workfn() from init_cpu_capacity_callback() or
topology_init_cpu_capacity_cppc()), i.e. when the (max) CPU capacity can
change.
In all the other !new_topology cases we check `has_asym |= sd->flags &
SD_ASYM_CPUCAPACITY` and set sched_asym_cpucapacity accordingly in
build_sched_domains(). Before that, we always reset
sched_asym_cpucapacity in detach_destroy_domains().
But now we would have to keep asym_cap_list in sync with the active
CPUs, I guess.
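
Otherwise the intersection check against cpu_active_mask you mentioned
would cover it without keeping the list in sync, at the cost of one
more cpumask op per entry in the scan (again only a sketch):

        list_for_each_entry_rcu(entry, &asym_cap_list, link) {
                /*
                 * Skip capacity groups whose CPUs are all offline, so a
                 * stale asym_cap_list (only rebuilt when new_topology is
                 * set) can't report a fitting CPU the task can't run on.
                 */
                if (!cpumask_intersects(cpu_capacity_span(entry),
                                        cpu_active_mask))
                        continue;

                /* fit + affinity checks as in the sketch above */
        }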

[...]

>>> #else /* CONFIG_SMP */
>>> @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>>> */
>>> static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>>> {
>>> - return rq->misfit_task_load &&
>>> - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
>>> - check_cpu_capacity(rq, sd));
>>> + return rq->misfit_task_load && check_cpu_capacity(rq, sd);
>>
>> You removed 'arch_scale_cpu_capacity(rq->cpu) <
>> rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
>
> Based on Pierre's review, since we no longer trigger misfit for big cores.
> I thought Pierre's remark was correct, so I made the change in v3:

Ah, this is the replacement:

- if (task_fits_cpu(p, cpu_of(rq))) { <- still MF for util > 0.8 * 1024
- rq->misfit_task_load = 0;
- return;
+ cpu_cap = arch_scale_cpu_capacity(cpu);
+
+ /* If we can't fit the biggest CPU, that's the best we can ever get */
+ if (cpu_cap == SCHED_CAPACITY_SCALE)
+ goto out;
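
With the 1024-capacity early-out above, the remaining condition in
check_misfit_status() only fires when the CPU's capacity is noticeably
reduced by pressure. For reference, check_cpu_capacity() is roughly
(quoting from memory):

/*
 * True if rq's capacity is noticeably reduced vs. its original
 * capacity (rt/irq/thermal pressure), scaled by the sched domain's
 * imbalance_pct, e.g. 117 fires below ~85% of the original capacity.
 */
static inline int check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
        return ((rq->cpu_capacity * sd->imbalance_pct) <
                        (rq->cpu_capacity_orig * 100));
}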

>
> https://lore.kernel.org/lkml/bae88015-4205-4449-991f-8104436ab3ba@xxxxxxx/
>
>> setup (max CPU capacity equal to 1024), which is what we probably use
>> 100% of the time now, the check isn't needed. It might become useful
>> again when Vincent introduces his 'user space system pressure'
>> implementation?
>
> I don't mind putting it back if you think it'd be required again in the near
> future. I still haven't had a chance to look at Vincent's patches properly,
> but if there's a clash let's reduce the work.

Vincent already commented on this in this thread.

[...]

>>> @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
>>>
>>> list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
>>> if (cpumask_empty(cpu_capacity_span(entry))) {
>>> - list_del(&entry->link);
>>> - kfree(entry);
>>> + list_del_rcu(&entry->link);
>>> + call_rcu(&entry->rcu, free_asym_cap_entry);
>>
>> Looks like there could be brief moments in which one CPU capacity group
>> of CPUs appears twice in asym_cap_list. I'm thinking about initial
>> startup + max CPU frequency related adjustment of CPU capacity
>> (init_cpu_capacity_callback()) for instance. Not sure if this is really
>> an issue?
>
> I don't think so. As long as the reader sees a consistent value and no crashes
> ensue, a momentarily wrong decision during the transition, or some extra work,
> is fine IMO. I don't foresee a big impact.

OK.
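
RCU guarantees that readers see either the old or the new entry and
never a freed one, so a transient duplicate only costs an extra loop
iteration. The callback is presumably just the usual container_of() +
kfree() pattern:

static void free_asym_cap_entry(struct rcu_head *head)
{
        struct asym_cap_data *entry =
                container_of(head, struct asym_cap_data, rcu);

        /* Runs after a grace period; no reader can still hold entry */
        kfree(entry);
}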