[PATCH 24/30] sched: support preempt=none under PREEMPT_AUTO

From: Ankur Arora
Date: Tue Feb 13 2024 - 01:03:27 EST


The default preemption policy for the no forced preemption model under
PREEMPT_AUTO is to always schedule lazily for well-behaved, non-idle
tasks, preempting at exit-to-user.

We already have that, so enable it.

Comparing a scheduling/IPC workload:

# perf stat -a -e cs --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

PREEMPT_AUTO, preempt=none

3,074,466 context-switches ( +- 0.34% )
3.66437 +- 0.00494 seconds time elapsed ( +- 0.13% )

PREEMPT_DYNAMIC, preempt=none

2,954,976 context-switches ( +- 0.70% )
3.62855 +- 0.00708 seconds time elapsed ( +- 0.20% )

Both perform similarly, but we incur a slightly higher number of
context-switches with PREEMPT_AUTO.

Drilling down we see that both voluntary and involuntary
context-switches are higher for this test:

PREEMPT_AUTO, preempt=none

2115660.30 +- 20442.34 voluntary context-switches ( +- 0.960% )
784690.40 +- 19629.42 involuntary context-switches ( +- 2.500% )

PREEMPT_DYNAMIC, preempt=none

2049027.10 +- 35237.10 voluntary context-switches ( +- 1.710% )
740676.90 +- 20346.45 involuntary context-switches ( +- 2.740% )

Assuming voluntary context-switches due to explicit blocking are
similar, we expect that PREEMPT_AUTO will incur larger context
switches at exit-to-user (counted as voluntary) since that is its
default rescheduling point.

Involuntary context-switches, under PREEMPT_AUTO are seen when a
task has exceeded its time quanta by a tick. Under PREEMPT_DYNAMIC,
these are incurred when a task needs to be rescheduled and then
encounters a cond_resched().
So, these two numbers aren't directly comparable.

Comparing a kernbench workload:

# Half load (-j 32)

PREEMPT_AUTO PREEMPT_DYNAMIC

wall 74.41 +- 0.45 ( +- 0.60% ) 74.20 +- 0.33 sec ( +- 0.45% )
utime 1419.78 +- 2.04 ( +- 0.14% ) 1416.40 +- 6.07 sec ( +- 0.42% )
stime 247.70 +- 0.88 ( +- 0.35% ) 246.23 +- 1.20 sec ( +- 0.49% )
%cpu 2240.20 +- 16.03 ( +- 0.71% ) 2240.20 +- 19.34 ( +- 0.86% )
inv-csw 13056.00 +- 427.58 ( +- 3.27% ) 18750.60 +- 771.21 ( +- 4.11% )
vol-csw 191000.00 +- 1623.25 ( +- 0.84% ) 182857.00 +- 2373.12 ( +- 1.29% )

The runtimes are basically identical for both of these. Voluntary
context switches, as above (and in the optimal, maximal runs below),
are higher. Which as mentioned above, does add up.

However, unlike the sched-messaging workload, the involuntary
context-switches are generally lower (also true for the optimal,
maximal runs below.) One reason for that might be that kbuild spends
~20% time executing in the kernel, while sched-messaging spends ~95%
time in the kernel. Which means a greater likelihood of being
preempted due to exceeding its time quanta.

# Optimal load (-j 256)

PREEMPT_AUTO PREEMPT_DYNAMIC

wall 65.15 +- 0.08 ( +- 0.12% ) 65.10 +- 0.19 ( +- 0.29% )
utime 1876.56 +- 477.03 ( +- 25.42% ) 1873.63 +- 481.98 ( +- 25.72% )
stime 295.77 +- 49.17 ( +- 16.62% ) 294.41 +- 50.79 ( +- 17.25% )
%cpu 3179.30 +- 970.30 ( +- 30.51% ) 3172.90 +- 983.26 ( +- 30.98% )
inv-csw 369670.00 +- 375980.00 ( +- 101.70% ) 390848.00 +- 392231.00 ( +- 100.35% )
vol-csw 216544.00 +- 28604.60 ( +- 13.20% ) 205117.00 +- 23949.50 ( +- 11.67% )

# Maximal load (-j 0)

PREEMPT_AUTO PREEMPT_DYNAMIC

wall 66.02 +- 0.53 ( +- 0.80% ) 65.67 +- 0.55 ( +- 0.83% )
utime 2024.79 +- 439.74 ( +- 21.71% ) 2026.12 +- 446.28 ( +- 22.02% )
stime 312.13 +- 46.14 ( +- 14.78% ) 311.53 +- 47.84 ( +- 15.35% )
%cpu 3465.40 +- 883.75 ( +- 25.50% ) 3473.80 +- 903.27 ( +- 26.00% )
inv-csw 471639.00 +- 336424.00 ( +- 71.33% ) 500981.00 +- 353471.00 ( +- 70.55% )
vol-csw 190138.00 +- 44947.20 ( +- 23.63% ) 177813.00 +- 44345.50 ( +- 24.93% )

Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Peter Ziljstra <peterz@xxxxxxxxxxxxx>
Originally-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---
kernel/sched/core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5df59a4548dc..2d33f3ff51a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8968,7 +8968,9 @@ static void __sched_dynamic_update(int mode)
{
switch (mode) {
case preempt_dynamic_none:
- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: none\n", PREEMPT_MODE);
+ preempt_dynamic_mode = mode;
break;

case preempt_dynamic_voluntary:
--
2.31.1