Re: [RFC PATCH 1/1] sched: Extend cpu idle state for 1ms

From: Mathieu Desnoyers
Date: Wed Jul 26 2023 - 14:55:45 EST


On 7/26/23 13:40, Shrikanth Hegde wrote:


On 7/26/23 7:37 PM, Mathieu Desnoyers wrote:
On 7/26/23 04:04, Shrikanth Hegde wrote:


On 7/26/23 1:00 AM, Mathieu Desnoyers wrote:
Allow select_task_rq to consider a cpu as idle for 1ms after that cpu
has exited the idle loop.

This speeds up the following hackbench workload on a 192 cores AMD EPYC
9654 96-Core Processor (over 2 sockets):

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 49s to 34s. (30% speedup)

My working hypothesis for why this helps is that queuing more than a single
task on the runqueue of a cpu which just exited idle, rather than
spreading work over other idle cpus, helps power efficiency on systems
with a large number of cores.

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench.

Do you have SMT here? What is the system utilization when you are running
this workload?

Yes, SMT is enabled, which brings the number of logical cpus to 384.

CPU utilization (through htop):

* 6.4.4: 27500%
* 6.4.4 with the extend-idle+nr_running<=4 patch: 30500%



It turned out that replacing this raw spinlock with a loop of 10000
cpu_relax() iterations within do_idle() had similar benefits.
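
For reference, a minimal sketch of what that busy-wait experiment could look
like at the end of do_idle() in kernel/sched/idle.c. The 10000 iteration
count is the one quoted above; the exact placement is an assumption, and this
is not the change that was actually benchmarked:

	/* Hypothetical experiment sketch, not the tested change. */
	static void do_idle(void)
	{
		/* ... existing idle loop ... */

		/*
		 * Busy-wait for roughly 10000 cpu_relax() iterations
		 * before leaving the idle loop through schedule_idle().
		 */
		{
			int i;

			for (i = 0; i < 10000; i++)
				cpu_relax();
		}

		schedule_idle();
	}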

This patch achieves a similar effect without the busy-waiting by
introducing a runqueue state which samples sched_clock() when exiting
idle, allowing select_task_rq to consider "as idle" a cpu which has
recently exited idle.

This patch should be considered "food for thought", and I would be glad
to hear feedback on whether it causes regressions on _other_ workloads,
and whether it helps with the hackbench workload on large Intel systems
as well.

Link:
https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@xxxxxxx
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Swapnil Sapkal <Swapnil.Sapkal@xxxxxxx>
Cc: Aaron Lu <aaron.lu@xxxxxxxxx>
Cc: x86@xxxxxxxxxx
---
  kernel/sched/core.c  | 4 ++++
  kernel/sched/sched.h | 3 +++
  2 files changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..d40e3a0a5ced 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6769,6 +6769,7 @@ void __sched schedule_idle(void)
 	 * TASK_RUNNING state.
 	 */
 	WARN_ON_ONCE(current->__state);
+	WRITE_ONCE(this_rq()->idle_end_time, sched_clock());
 	do {
 		__schedule(SM_NONE);
 	} while (need_resched());
@@ -7300,6 +7301,9 @@ int idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS)


Wouldn't this hurt latency badly, especially on a loaded system with
a workload that does a lot of wakeups?

Good point!

Can you try your benchmark replacing the if () statement above with:

+	if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS &&
+	    READ_ONCE(rq->nr_running) <= 4)
+		return 1;


Tried with this change. I think it does help in reducing latency compared to
the earlier version, especially up to the 95th percentile.

For the record, I also tried with nr_running <= 2 and still had decent performance
(32s with nr_running <= 2 instead of 30s for nr_running <= 4). It did drop with
nr_running <= 1 (40s). nr_running <= 5 was similar to 4, and performance starts
degrading with nr_running <= 8 (31s).

So it might be interesting to measure the latency with nr_running <= 2 as well.
Perhaps nr_running <= 2 would be a good compromise between throughput and tail
latency.
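
For completeness, the nr_running <= 2 variant would simply be the same check
with a lower cutoff; a sketch against the snippet above, untested:

+	if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS &&
+	    READ_ONCE(rq->nr_running) <= 2)
+		return 1;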

                 6.5-rc3    6.5-rc3 + RFC_Patch    6.5-rc3 + RFC_Patch + nr<4
4 Groups
  50.0th:          18.00                  18.50                          18.50
  75.0th:          21.50                  26.00                          23.50
  90.0th:          56.00                 940.50                         501.00
  95.0th:         678.00                1896.00                        1392.00
  99.0th:        2484.00                3756.00                        3708.00
  99.5th:        3224.00                4616.00                        5088.00
  99.9th:        4960.00                6824.00                        8068.00
8 Groups
  50.0th:          23.50                  25.50                          23.00
  75.0th:          30.50                 421.50                          30.50
  90.0th:         443.50                1722.00                         741.00
  95.0th:        1410.00                2736.00                        1670.00
  99.0th:        3942.00                5496.00                        4032.00
  99.5th:        5232.00                7016.00                        5064.00
  99.9th:        7996.00                8896.00                        8012.00
16 Groups
  50.0th:          33.50                  41.50                          32.50
  75.0th:          49.00                 752.00                          47.00
  90.0th:        1067.50                2332.00                         994.50
  95.0th:        2093.00                3468.00                        2117.00
  99.0th:        5048.00                6728.00                        5568.00
  99.5th:        6760.00                7624.00                        6960.00
  99.9th:        8592.00                9504.00                       11104.00
32 Groups
  50.0th:          60.00                  79.00                          53.00
  75.0th:         456.50                1712.00                         209.50
  90.0th:        2788.00                3996.00                        2752.00
  95.0th:        4544.00                5768.00                        5024.00
  99.0th:        8444.00                9104.00                       10352.00
  99.5th:        9168.00                9808.00                       12720.00
  99.9th:       11984.00               12448.00                       17624.00

[...]

@@ -1010,6 +1012,7 @@ struct rq {
 
 	struct task_struct __rcu	*curr;
 	struct task_struct	*idle;
+	u64			idle_end_time;

There is already clock_idle in the rq. Can that be used for the same purpose?

Good point! And I'll change my use of "sched_clock()" in idle_cpu() to a
proper "sched_clock_cpu(cpu_of(rq))", which will work better on systems
without a constant TSC.

The updated patch:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..1c7d5bd2968b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7300,6 +7300,10 @@ int idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING &&
+	    sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS)
+		return 1;
+
 	if (rq->curr != rq->idle)
 		return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 81ac605b9cd5..57a49a5524f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -97,6 +97,9 @@
 # define SCHED_WARN_ON(x)	({ (void)(x), 0; })
 #endif
 
+#define IDLE_CPU_DELAY_NS		1000000		/* 1ms */
+#define IDLE_CPU_DELAY_MAX_RUNNING	4
+
 struct rq;
 struct cpuidle_state;

And using it now brings the hackbench wall time down to 28s :)

Thanks,

Mathieu




--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com