Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when updating misfit

From: Dietmar Eggemann
Date: Mon Jan 22 2024 - 05:00:00 EST


On 05/01/2024 23:20, Qais Yousef wrote:
> From: Qais Yousef <qais.yousef@xxxxxxx>
>
> If a misfit task is affined to a subset of the possible cpus, we need to
> verify that one of these cpus can fit it. Otherwise the load balancer
> code will continuously trigger needlessly leading the balance_interval
> to increase in return and eventually end up with a situation where real
> imbalances take a long time to address because of this impossible
> imbalance situation.
>
> This can happen in Android world where it's common for background tasks
> to be restricted to little cores.
>
> Similarly if we can't fit the biggest core, triggering misfit is
> pointless as it is the best we can ever get on this system.
>
> To be able to detect that; we use asym_cap_list to iterate through
> capacities in the system to see if the task is able to run at a higher
> capacity level based on its p->cpus_ptr. To do so safely, we convert the
> list to be RCU protected.
>
> To be able to iterate through capacity levels, export asym_cap_list to
> allow for fast traversal of all available capacity levels in the system.
>
> Test:
> =====
>
> Add
>
> trace_printk("balance_interval = %lu\n", interval)
>
> in get_sd_balance_interval().
>
> run
> if [ "$MASK" != "0" ]; then
> adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
> fi
> sleep 10
> // parse ftrace buffer counting the occurrence of each value
>
> Where MASK is either:
>
> * 0: no busy task running

... so "no busy task" stands for the no-misfit scenario?

> * 1: busy task is pinned to 1 cpu; handled today to not cause
> misfit
> * f: busy task pinned to little cores, simulates busy background
> task, demonstrates the problem to be fixed
>

[...]

> + /*
> + * If the task affinity is not set to default, make sure it is not
> + * restricted to a subset where no CPU can ever fit it. Triggering
> + * misfit in this case is pointless as it has no where better to move
> + * to. And it can lead to balance_interval to grow too high as we'll
> + * continuously fail to move it anywhere.
> + */
> + if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {

Shouldn't this be cpu_active_mask?

From include/linux/cpumask.h:

* cpu_possible_mask - has bit 'cpu' set iff cpu is populatable
* cpu_present_mask  - has bit 'cpu' set iff cpu is populated
* cpu_online_mask   - has bit 'cpu' set iff cpu available to scheduler
* cpu_active_mask   - has bit 'cpu' set iff cpu available to migration
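
To make the difference between the two masks concrete, here is a small
userspace model (not kernel code). Masks are plain 64-bit words, and
affinity_is_restricted() is a hypothetical helper standing in for the
'!cpumask_equal(p->cpus_ptr, ...)' test. It assumes a task whose affinity
word still has every possible CPU set after one CPU has gone offline; the
two reference masks then give different answers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only: one bit per CPU, standing in for struct cpumask. */
typedef uint64_t cpumask_model_t;

/* Stand-in for cpumask_equal(). */
static bool mask_equal(cpumask_model_t a, cpumask_model_t b)
{
	return a == b;
}

/* Clear one CPU from a mask, e.g. to model hotplugging it out. */
static cpumask_model_t mask_clear_cpu(cpumask_model_t mask, unsigned int cpu)
{
	assert(cpu < 64);
	return mask & ~(UINT64_C(1) << cpu);
}

/*
 * Hypothetical helper: true when the task's affinity differs from the
 * reference mask, i.e. when the quoted 'if' block would be entered.
 */
static bool affinity_is_restricted(cpumask_model_t cpus_ptr,
				   cpumask_model_t ref)
{
	return !mask_equal(cpus_ptr, ref);
}
```

With 8 possible CPUs (0xff) and CPU7 offline, the modelled active mask is
0x7f. An affinity of 0xff compares equal to the possible mask but not to
the active one, so the choice of reference mask decides whether the
capacity scan runs at all.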


> + unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> + bool has_fitting_cpu = false;
> + struct asym_cap_data *entry;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> + if (entry->capacity > cpu_cap) {
> + cpumask_t *cpumask;
> +
> + if (clamped_util > entry->capacity)
> + continue;
> +
> + cpumask = cpu_capacity_span(entry);
> + if (!cpumask_intersects(p->cpus_ptr, cpumask))
> + continue;
> +
> + has_fitting_cpu = true;
> + break;
> + }
> + }

What happens when we hotplug out all CPUs of one CPU capacity value?
IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
(partition_sched_domains_locked()).
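
A rough userspace model of the scenario I have in mind (not kernel code).
struct cap_entry_model is a simplified stand-in for struct asym_cap_data,
the capacity values are made up, and the 'usable' parameter is a
hypothetical extra filter that the patch does not have. Passing all-ones
reproduces the logic of the quoted loop; passing a mask of still-active
CPUs shows what changes if an entry keeps listing CPUs that went away.
Whether an entry really stays stale like this is exactly the open
question.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: one bit per CPU. */
typedef uint64_t cpumask_model_t;

/* Simplified stand-in for struct asym_cap_data. */
struct cap_entry_model {
	unsigned long capacity;	/* capacity level */
	cpumask_model_t span;	/* CPUs recorded at this level */
};

/*
 * Same decisions as the quoted loop: look for a capacity level above
 * cpu_cap that fits clamped_util and intersects the task's affinity.
 * 'usable' is a hypothetical filter (assumption, not in the patch).
 */
static bool has_fitting_cpu_model(const struct cap_entry_model *list,
				  size_t n, unsigned long cpu_cap,
				  unsigned long clamped_util,
				  cpumask_model_t affinity,
				  cpumask_model_t usable)
{
	for (size_t i = 0; i < n; i++) {
		if (list[i].capacity <= cpu_cap)
			continue;	/* not a higher capacity level */
		if (clamped_util > list[i].capacity)
			continue;	/* task does not fit here either */
		if (!(affinity & list[i].span & usable))
			continue;	/* no allowed, usable CPU at this level */
		return true;
	}
	return false;
}
```

Example: little CPUs 0-3 at capacity 512, big CPUs 4-7 at 1024, a task
with affinity 0xff and clamped_util 600 running on a little CPU. With
usable = ~0 the model reports a fitting CPU. If CPUs 4-7 are gone but the
1024 entry still spans 0xf0, only usable = 0x0f makes the model report
that nothing fits.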

> + rcu_read_unlock();
> +
> + if (!has_fitting_cpu)
> + goto out;
> }
>
> /*
> @@ -5083,6 +5127,9 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> * task_h_load() returns 0.
> */
> rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
> + return;
> +out:
> + rq->misfit_task_load = 0;
> }
>
> #else /* CONFIG_SMP */
> @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> */
> static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> {
> - return rq->misfit_task_load &&
> - (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> - check_cpu_capacity(rq, sd));
> + return rq->misfit_task_load && check_cpu_capacity(rq, sd);

You removed 'arch_scale_cpu_capacity(rq->cpu) <
rq->rd->max_cpu_capacity' here. Why? I can see that it isn't needed with
the standard setup (max CPU capacity equal to 1024), which is probably
what we use 100% of the time now. But it might become useful again when
Vincent introduces his 'user space system pressure' implementation?
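
Purely as boolean logic, the old and new return expressions can be put
side by side in a userspace model (not kernel code). cap_check stands in
for the result of check_cpu_capacity(rq, sd); what that function computes
is not modelled here, and the function names below are hypothetical.

```c
#include <stdbool.h>

/* Old expression: the capacity comparison alone can make it true. */
static bool misfit_check_before(unsigned long misfit_task_load,
				unsigned long cpu_cap,
				unsigned long max_cpu_cap,
				bool cap_check)
{
	return misfit_task_load && (cpu_cap < max_cpu_cap || cap_check);
}

/* New expression: only cap_check can make it true. */
static bool misfit_check_after(unsigned long misfit_task_load,
			       bool cap_check)
{
	return misfit_task_load && cap_check;
}
```

The two disagree in exactly one case: misfit_task_load is non-zero, the
CPU's capacity is below max_cpu_cap, and cap_check is false.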

> }

[...]

> @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
>
> list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> if (cpumask_empty(cpu_capacity_span(entry))) {
> - list_del(&entry->link);
> - kfree(entry);
> + list_del_rcu(&entry->link);
> + call_rcu(&entry->rcu, free_asym_cap_entry);

It looks like there could be brief moments in which one capacity group
of CPUs appears twice in asym_cap_list. I'm thinking of the initial
startup plus the max-CPU-frequency-related adjustment of CPU capacity
(init_cpu_capacity_callback()), for instance. Not sure if this is really
an issue?

[...]