[PATCH v2] sched/fair: prefer prev cpu in asymmetric wakeup path

From: Vincent Guittot
Date: Wed Oct 28 2020 - 17:43:15 EST


In the fast wakeup path, the scheduler always checks whether the local or prev
cpus are good candidates for the task before looking at other cpus in the
domain. With
commit b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")
heterogeneous systems gained a dedicated path, but it doesn't try to reuse the
prev cpu whenever possible. If the previous cpu is idle and belongs to the
LLC domain, we should check it first before looking for another cpu because
it remains one of the best candidates, and this also stabilizes task placement
on the system.

This change aligns the asymmetric path's behavior with the symmetric one and
reduces the cases where the task migrates across all cpus of the
sd_asym_cpucapacity domain at wakeup.

This change does not impact the normal EAS mode; it only affects the
overloaded case or systems where EAS is not used.
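
For reference, the resulting decision order in select_idle_sibling() boils
down to roughly the following (a simplified sketch of the diff below, not the
literal kernel code; idle() stands for the available_idle_cpu() ||
sched_idle_cpu() test and asym_domain() for the per_cpu(sd_asym_cpucapacity,
target) lookup):

	/* Sketch of select_idle_sibling() ordering after this patch: */
	if (idle(target) && asym_fits_capacity(task_util, target))
		return target;
	if (prev != target && cpus_share_cache(prev, target) &&
	    idle(prev) && asym_fits_capacity(task_util, prev))
		return prev;
	/* ... per-cpu kthread and recent_used_cpu shortcuts ... */
	if (sched_asym_cpucapacity && (sd = asym_domain(target)))
		return select_idle_capacity(p, sd, target);
	/* ... otherwise the usual sd_llc scan ... */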

- On hikey960 with performance governor (EAS disabled)

./perf bench sched pipe -T -l 50000
                  mainline             w/ patch
# migrations      999364               0
ops/sec           149313 (+/-0.28%)    182587 (+/-0.40%)    +22%

- On hikey with performance governor

./perf bench sched pipe -T -l 50000
                  mainline             w/ patch
# migrations      0                    0
ops/sec           47721 (+/-0.76%)     47899 (+/-0.56%)     +0.4%

According to the tests on hikey, the patch doesn't impact symmetric systems
compared to the current implementation (only tested on arm64).

Also read the uclamped value of the task's utilization at most twice instead
of each time we compare the task's utilization with a cpu's capacity.
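
For reference, the helpers this relies on look roughly like the following in
the v5.9-era kernel/sched/fair.c (quoted from memory as a sketch, so check the
tree; fits_capacity() keeps ~20% of headroom):

	#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)

	static inline unsigned long uclamp_task_util(struct task_struct *p)
	{
		return clamp(task_util_est(p),
			     uclamp_eff_value(p, UCLAMP_MIN),
			     uclamp_eff_value(p, UCLAMP_MAX));
	}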

Fixes: b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")
Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---
Changes in v2:
- merge the asymmetric and symmetric paths instead of duplicating the tests on
target, prev and the other special cases.

- factorize the call to uclamp_task_util(p) and use fits_capacity(). This could
explain part of the additional improvement compared to v1 (+22% instead of
+17% in v1).

- Keep using the LLC instead of the asym domain for the early checks of target,
prev and recent_used_cpu to ensure cache sharing between the tasks. This doesn't
change anything for DynamIQ but will ensure the same cache for legacy big.LITTLE
and also simplify the changes.

- Don't check the capacity for the per-cpu kthread case because the assumption
is that the wakee queued work for the per-cpu kthread that is now complete and
the task was already on this cpu (see the snippet quoted after this list).

- On an asymmetric system where an exclusive cpuset defines a symmetric island,
the task's load is synced and tested although it's not needed. But taking care
of this special case by testing whether sd_asym_cpucapacity is not null impacts
the performance of the default sched_asym_cpucapacity path by more than 4%.

- The huge increase in the number of migrations for hikey960 mainline comes
from the fact that the ftrace buffer was overloaded by events in the tests
done with v1.
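
The per-cpu kthread case mentioned above refers to this pre-existing shortcut
in select_idle_sibling(), which the patch intentionally leaves without a
capacity check (quoted roughly from the v5.9 code, for context only):

	/*
	 * Allow a per-cpu kthread to stack with the wakee if the
	 * kworker thread and the tasks previous CPUs are the same.
	 * The assumption is that the wakee queued work for the
	 * per-cpu kthread that is now complete and the wakeup is
	 * essentially a sync wakeup.
	 */
	if (is_per_cpu_kthread(current) &&
	    prev == smp_processor_id() &&
	    this_rq()->nr_running <= 1)
		return prev;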

kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++-----------------
1 file changed, 43 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa4c6227cd6d..131b917b70f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6173,20 +6173,20 @@ static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
unsigned long best_cap = 0;
- int cpu, best_cpu = -1;
+ int task_util, cpu, best_cpu = -1;
struct cpumask *cpus;

- sync_entity_load_avg(&p->se);
-
cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

+ task_util = uclamp_task_util(p);
+
for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);

if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
continue;
- if (task_fits_capacity(p, cpu_cap))
+ if (fits_capacity(task_util, cpu_cap))
return cpu;

if (cpu_cap > best_cap) {
@@ -6198,44 +6198,41 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
return best_cpu;
}

+static inline int asym_fits_capacity(int task_util, int cpu)
+{
+ if (static_branch_unlikely(&sched_asym_cpucapacity))
+ return fits_capacity(task_util, capacity_of(cpu));
+
+ return 1;
+}
+
/*
* Try and locate an idle core/thread in the LLC cache domain.
*/
static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
struct sched_domain *sd;
- int i, recent_used_cpu;
+ int i, recent_used_cpu, task_util;

/*
- * For asymmetric CPU capacity systems, our domain of interest is
- * sd_asym_cpucapacity rather than sd_llc.
+ * On asymmetric systems, update the task utilization because we will check
+ * that the task fits the cpu's capacity.
*/
if (static_branch_unlikely(&sched_asym_cpucapacity)) {
- sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
- /*
- * On an asymmetric CPU capacity system where an exclusive
- * cpuset defines a symmetric island (i.e. one unique
- * capacity_orig value through the cpuset), the key will be set
- * but the CPUs within that cpuset will not have a domain with
- * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
- * capacity path.
- */
- if (!sd)
- goto symmetric;
-
- i = select_idle_capacity(p, sd, target);
- return ((unsigned)i < nr_cpumask_bits) ? i : target;
+ sync_entity_load_avg(&p->se);
+ task_util = uclamp_task_util(p);
}

-symmetric:
- if (available_idle_cpu(target) || sched_idle_cpu(target))
+ if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+ asym_fits_capacity(task_util, target))
return target;

/*
* If the previous CPU is cache affine and idle, don't be stupid:
*/
if (prev != target && cpus_share_cache(prev, target) &&
- (available_idle_cpu(prev) || sched_idle_cpu(prev)))
+ (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+ asym_fits_capacity(task_util, prev))
return prev;

/*
@@ -6258,7 +6255,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
- cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr)) {
+ cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
+ asym_fits_capacity(task_util, recent_used_cpu)) {
/*
* Replace recent_used_cpu with prev as it is a potential
* candidate for the next wake:
@@ -6267,6 +6265,26 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return recent_used_cpu;
}

+ /*
+ * For asymmetric CPU capacity systems, our domain of interest is
+ * sd_asym_cpucapacity rather than sd_llc.
+ */
+ if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+ sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
+ /*
+ * On an asymmetric CPU capacity system where an exclusive
+ * cpuset defines a symmetric island (i.e. one unique
+ * capacity_orig value through the cpuset), the key will be set
+ * but the CPUs within that cpuset will not have a domain with
+ * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
+ * capacity path.
+ */
+ if (sd) {
+ i = select_idle_capacity(p, sd, target);
+ return ((unsigned)i < nr_cpumask_bits) ? i : target;
+ }
+ }
+
sd = rcu_dereference(per_cpu(sd_llc, target));
if (!sd)
return target;
--
2.17.1