Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU

From: Chen Yu
Date: Sat Sep 30 2023 - 03:12:12 EST


Hi Mathieu,

On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote:
> Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases
> select_task_rq towards the previous CPU if it was almost idle
> (avg_load <= 0.1%).

Yes, this is a promising direction IMO. One question is:
can cfs_rq->avg.load_avg be used for a percentage comparison?
If I understand correctly, load_avg reflects that more than
one task could have been running on this runqueue, and
load_avg is directly proportional to the load_weight of that
cfs_rq. Besides, LOAD_AVG_MAX does not seem to be the maximum
value that load_avg can reach; it is the sum of
1024 * (y^0 + y^1 + y^2 + ...)
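which, if I did the math right, converges to 1024 / (1 - y), with
y^32 = 0.5. A quick userspace check (my own sketch, not kernel code,
just to illustrate the limit):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* PELT decay factor: the contribution halves every 32 periods */
	double y = pow(0.5, 1.0 / 32.0);
	/* closed form of 1024 * (y^0 + y^1 + y^2 + ...) */
	double limit = 1024.0 / (1.0 - y);

	printf("series limit: %.0f\n", limit);	/* prints ~47788 */
	return 0;
}

This is in the same ballpark as LOAD_AVG_MAX=47742 (the kernel
constant is generated with fixed-point arithmetic, I believe), but it
only bounds the 1024-scaled sum, not the load_avg of a cfs_rq whose
tasks have boosted load weights.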

For example,
taskset -c 1 nice -n -20 stress -c 1
cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg"
.load_avg : 88763
.load_avg : 1024

88763 is higher than LOAD_AVG_MAX=47742: a nice -20 task has a load
weight of 88761, and load_avg scales with that weight rather than
being bounded by LOAD_AVG_MAX. Maybe util_avg could be used for the
percentage comparison instead?

> It eliminates frequent task migrations from an almost idle CPU to
> completely idle CPUs. This is achieved by using the CPU load of the
> previously used CPU as the "almost idle" criterion in
> wake_affine_idle() and select_idle_sibling().
>
> The following benchmarks are performed on a v6.5.5 kernel with
> mitigations=off.
>
> This speeds up the following hackbench workload on a 192-core AMD
> EPYC 9654 96-Core Processor system (2 sockets):
>
> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>
> from 49s to 32s. (34% speedup)
>
> We can observe that the number of migrations is reduced significantly
> (-94%) with this patch, which may explain the speedup:
>
> Baseline: 118M cpu-migrations (9.286 K/sec)
> With patch: 7M cpu-migrations (0.709 K/sec)
>
> As a consequence, the stalled-cycles-backend are reduced:
>
> Baseline: 8.16% backend cycles idle
> With patch: 6.70% backend cycles idle
>
> Interestingly, the rate of context switch increases with the patch, but
> it does not appear to be an issue performance-wise:
>
> Baseline: 454M context-switches (35.677 K/sec)
> With patch: 654M context-switches (62.290 K/sec)
>
> This was developed as part of the investigation into a weird regression
> reported by AMD where adding a raw spinlock in the scheduler context
> switch accelerated hackbench. It turned out that changing this raw
> spinlock for a loop of 10000x cpu_relax within do_idle() had similar
> benefits.
>
> This patch achieves a similar effect without the busy-waiting by
> allowing select_task_rq to favor almost idle previously used CPUs based
> on the CPU load of that CPU. The threshold of 0.1% avg_load for almost
> idle CPU load has been identified empirically using the hackbench
> workload.
>
> Feedback is welcome. I am especially interested to learn whether this
> patch has positive or detrimental effects on performance of other
> workloads.
>
> Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@xxxxxxx
> Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@xxxxxxx/
> Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@xxxxxxxxx/
> Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@xxxxxxxxx/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Ben Segall <bsegall@xxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
> Cc: Swapnil Sapkal <Swapnil.Sapkal@xxxxxxx>
> Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
> Cc: Chen Yu <yu.c.chen@xxxxxxxxx>
> Cc: Tim Chen <tim.c.chen@xxxxxxxxx>
> Cc: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Cc: Gautham R . Shenoy <gautham.shenoy@xxxxxxx>
> Cc: x86@xxxxxxxxxx
> ---
> kernel/sched/fair.c | 18 +++++++++++++-----
> kernel/sched/features.h | 6 ++++++
> 2 files changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d9c2482c5a3..65a7d923ea61 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6599,6 +6599,14 @@ static int wake_wide(struct task_struct *p)
> return 1;
> }
>
> +static bool
> +almost_idle_cpu(int cpu, struct task_struct *p)
> +{
> + if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
> + return false;
> + return cpu_load_without(cpu_rq(cpu), p) <= LOAD_AVG_MAX / 1000;

Or:
return cpu_util_without(cpu, p) * 1000 <= capacity_orig_of(cpu) ?
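
That is, something along these lines (untested, just to illustrate;
it would treat the CPU as almost idle when its utilization, with p's
contribution removed, is below 0.1% of the CPU's original capacity):

static bool
almost_idle_cpu(int cpu, struct task_struct *p)
{
	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
		return false;
	/* util_avg is capacity-scaled, so compare against capacity_orig */
	return cpu_util_without(cpu, p) * 1000 <= capacity_orig_of(cpu);
}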

thanks,
Chenyu