[RFC PATCH 1/1] sched: Extend cpu idle state for 1ms

From: Mathieu Desnoyers
Date: Tue Jul 25 2023 - 15:30:16 EST


Allow select_task_rq to consider a cpu as idle for 1ms after that cpu
has exited the idle loop.

This speeds up the following hackbench workload on a 192-core AMD EPYC
9654 96-Core Processor machine (2 sockets, 96 cores each):

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 49s to 34s, a ~30% reduction in runtime.

My working hypothesis for why this helps is that queuing more than a
single task on the runqueue of a cpu which has just exited idle, rather
than spreading work over other idle cpus, helps power efficiency on
systems with a large number of cores.

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench.

It turned out that replacing this raw spinlock with a loop of 10000
cpu_relax() calls within do_idle() had similar benefits.
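
For reference, that experiment amounts to something like the following
at the point where do_idle() is about to reschedule (the exact placement
is an assumption on my part; the point is only to delay the cpu's exit
from the idle loop):

	/*
	 * Experiment only, not part of this patch: artificially delay
	 * the exit from the idle loop by ~10000 cpu_relax() iterations.
	 */
	int i;

	for (i = 0; i < 10000; i++)
		cpu_relax();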

This patch achieves a similar effect without busy-waiting by introducing
a runqueue field which samples sched_clock() when exiting idle, allowing
select_task_rq to consider "as idle" a cpu which has recently exited
idle.
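
To illustrate the intended semantics of the idle_cpu() change, here is a
stand-alone userspace model of the check (names and timestamps are made
up for illustration; the actual kernel change is in the diff below):

#include <stdio.h>
#include <stdint.h>

#define IDLE_CPU_DELAY_NS 1000000ULL	/* 1ms, as in the patch */

struct cpu_state {
	uint64_t idle_end_time;	/* sampled at idle exit, in ns */
	int curr_is_idle;	/* is the idle task currently running? */
};

/* Idle if genuinely idle, or if idle was exited less than 1ms ago. */
static int model_idle_cpu(const struct cpu_state *cs, uint64_t now)
{
	if (now < cs->idle_end_time + IDLE_CPU_DELAY_NS)
		return 1;
	return cs->curr_is_idle;
}

int main(void)
{
	struct cpu_state cs = { .idle_end_time = 5000000, .curr_is_idle = 0 };

	printf("%d\n", model_idle_cpu(&cs, 5500000));	/* 1: 0.5ms after idle exit */
	printf("%d\n", model_idle_cpu(&cs, 7000000));	/* 0: 2ms after idle exit */
	return 0;
}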

This patch should be considered "food for thought", and I would be glad
to hear feedback on whether it causes regressions on _other_ workloads,
and whether it also helps with the hackbench workload on large Intel
systems.

Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@xxxxxxx
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Swapnil Sapkal <Swapnil.Sapkal@xxxxxxx>
Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
Cc: x86@xxxxxxxxxx
---
kernel/sched/core.c | 4 ++++
kernel/sched/sched.h | 3 +++
2 files changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..d40e3a0a5ced 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6769,6 +6769,7 @@ void __sched schedule_idle(void)
* TASK_RUNNING state.
*/
WARN_ON_ONCE(current->__state);
+ WRITE_ONCE(this_rq()->idle_end_time, sched_clock());
do {
__schedule(SM_NONE);
} while (need_resched());
@@ -7300,6 +7301,9 @@ int idle_cpu(int cpu)
{
struct rq *rq = cpu_rq(cpu);

+ if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS)
+ return 1;
+
if (rq->curr != rq->idle)
return 0;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 81ac605b9cd5..8932e198a33a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -97,6 +97,8 @@
# define SCHED_WARN_ON(x) ({ (void)(x), 0; })
#endif

+#define IDLE_CPU_DELAY_NS 1000000 /* 1ms */
+
struct rq;
struct cpuidle_state;

@@ -1010,6 +1012,7 @@ struct rq {

struct task_struct __rcu *curr;
struct task_struct *idle;
+ u64 idle_end_time;
struct task_struct *stop;
unsigned long next_balance;
struct mm_struct *prev_mm;
--
2.39.2