[RFC PATCH V3 0/6] SCHED_DEADLINE server infrastructure

From: Daniel Bristot de Oliveira
Date: Thu Jun 08 2023 - 11:58:38 EST


This is RFC v3 of Peter's SCHED_DEADLINE server infrastructure
implementation [1].

SCHED_DEADLINE servers can help fix starvation issues of low-priority
tasks (e.g., SCHED_OTHER) when higher-priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that.
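
For context, the starvation case the servers target is easy to
reproduce from user space. This is just an illustration, not part of
the series: a SCHED_FIFO busy loop pinned to a CPU which, with RT
throttling disabled (sysctl kernel.sched_rt_runtime_us=-1), prevents
SCHED_OTHER tasks on that CPU from ever running:

--------------------- %< ------------------------
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 1 };
	cpu_set_t set;

	/* pin to CPU 0 */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	/* become FIFO:1 */
	if (sched_setscheduler(0, SCHED_FIFO, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}

	/* busy loop: SCHED_OTHER tasks pinned to CPU 0 never run again */
	for (;;)
		;
}
--------------------- >% ------------------------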

I rebased Peter's patches (adding changelogs where needed) on
tip/sched/core as of today and incorporated fixes to issues discussed
during RFC v1 & v2.

In v1 there was discussion about the consequences of using
deadline-based servers on fixed-priority workloads. As a demonstration,
here is the baseline timerlat** scheduling latency as-is, with a kernel
build as background workload:

# rtla timerlat top -u -d 10m

--------------------- %< ------------------------
  0 01:00:01   | IRQ Timer Latency (us)  |Thread Timer Latency (us)| Ret user Timer Latency (us)
CPU COUNT      |   cur   min   avg   max |   cur   min   avg   max |   cur   min   avg   max
  0 #3599960   |     1     0     1    31 |     6     1     6    65 |     9     2     9    86
  1 #3599972   |     1     0     1    41 |     4     1     5    54 |     7     2     7    78
  2 #3599966   |     1     0     1    36 |     6     1     6    65 |     9     2     9    81
  3 #3599945   |     0     0     1    31 |     6     1     6    55 |     9     2     9    84
  4 #3599939   |     1     0     1    32 |     4     1     6    53 |     7     2     8    85
  5 #3599944   |     0     0     1    31 |     4     1     6    50 |     6     2     9    54
  6 #3599945   |     1     0     1    38 |     5     1     6    53 |     8     2     9    88
  7 #3599944   |     0     0     1    36 |     4     1     5    62 |     6     2     8    86
--------------------- >% ------------------------

And here is the same test with the DL server activating without any delay*:
--------------------- %< ------------------------
  0 00:10:01   | IRQ Timer Latency (us)  |Thread Timer Latency (us)| Ret user Timer Latency (us)
CPU COUNT      |   cur   min   avg   max |   cur   min   avg   max |   cur   min   avg   max
  0 #595748    |     1     0     1   254 |     8     1    31  1417 |    12     2    33  1422
  1 #597951    |     1     0     1   239 |     6     1    27  1435 |     9     2    30  1438
  2 #595060    |     1     0     1    24 |     5     1    28  1437 |     7     2    30  1441
  3 #595914    |     1     0     1   218 |     6     1    29  1382 |     9     2    32  1385
  4 #597829    |     1     0     1   233 |     8     1    26  1368 |    11     2    29  1427
  5 #596314    |     2     0     1    21 |     7     1    29  1442 |    10     2    32  1447
  6 #595532    |     1     0     1   238 |     6     1    31  1389 |     9     2    34  1392
  7 #595852    |     0     0     1    34 |     6     1    30  1481 |     9     2    33  1484
--------------------- >% ------------------------

The problem with a DL-server-only implementation is that FIFO tasks
might suffer preemption from NORMAL tasks even when spare CPU cycles
are available. In fact, the fair deadline server is enqueued right away
when NORMAL tasks wake up, and they are then scheduled through the
server; since the deadline class runs above the RT class, this can
preempt a well-behaving FIFO task. This is of course not ideal.
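
That class precedence is easy to see from user space. An illustration
only, not part of the series (struct sched_attr is spelled out here
because glibc provides no sched_setattr() wrapper):

--------------------- %< ------------------------
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	=  10 * 1000 * 1000ULL,	/*  10 ms */
		.sched_deadline	= 100 * 1000 * 1000ULL,	/* 100 ms */
		.sched_period	= 100 * 1000 * 1000ULL,	/* 100 ms */
	};

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	/* While its CBS has runtime left, this loop preempts any FIFO
	 * task on the same CPU, no matter how the FIFO task behaves. */
	for (;;)
		;
}
--------------------- >% ------------------------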

We had discussions about it, and one of the possibilities would be
using a different scheduling algorithm for this. But IMHO that is
overkill. Juri and I discussed it, and that is why Juri added
patch 6/6.

Patch 6/6 adds a PoC of a starvation monitor/watchdog that delays the
enqueueing of deadline servers to the point when fair tasks might
actually start to suffer from starvation (HZ/2 was picked at random
for now).
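
The idea, roughly, looks like the sketch below. This is NOT the code in
patch 6/6, and the per-rq field is made up for illustration; called
from the scheduler tick, it starts the fair server only once fair tasks
have been runnable but starved for HZ/2:

--------------------- %< ------------------------
/* Sketch only; fair_starved_since is a hypothetical per-rq field. */
static void fair_starvation_check(struct rq *rq)
{
	/* No fair tasks queued, or one is running: no starvation. */
	if (!rq->cfs.h_nr_running ||
	    rq->curr->sched_class == &fair_sched_class) {
		rq->fair_starved_since = 0;
		return;
	}

	if (!rq->fair_starved_since)
		rq->fair_starved_since = jiffies;
	else if (time_after(jiffies, rq->fair_starved_since + HZ / 2))
		dl_server_start(&rq->fair_server);
}
--------------------- >% ------------------------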

With that in place, the results get better again*:

--------------------- %< ------------------------
  0 01:00:01   | IRQ Timer Latency (us)  |Thread Timer Latency (us)| Ret user Timer Latency (us)
CPU COUNT      |   cur   min   avg   max |   cur   min   avg   max |   cur   min   avg   max
  0 #3600004   |     1     0     1    29 |     8     1     5    50 |    11     2     8    66
  1 #3600010   |     1     0     1    30 |     7     1     5    50 |    10     2     8    58
  2 #3600010   |     0     0     1    30 |     5     1     5    43 |     7     2     7    70
  3 #3600010   |     1     0     1    25 |     8     1     6    52 |    12     2     8    74
  4 #3600010   |     1     0     1    63 |     8     1     6    72 |    12     2     8    88
  5 #3600010   |     1     0     1    26 |     8     1     6    59 |    11     2     8    94
  6 #3600010   |     1     0     1    29 |     9     1     5    55 |    12     2     8    82
  7 #3600003   |     0     0     1   117 |     6     1     5   124 |     9     2     7   127
--------------------- >% ------------------------

So, that is a step in the right direction, but we can improve it. Here
are the next steps I am taking:

- Getting parameters from the sysctl sched_rt...
- Trying to delay the start of the server to the 0-laxity time (see
  the sketch after this list)
- Maybe starting the server throttled, with the replenishment time set
  at the 0-laxity instant
- Maybe implementing a starvation monitor offload, where the DL server
  is started remotely, avoiding the overhead of its activation - like
  stalld does;
- Testing with micro-interference to measure overheads.
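
On the 0-laxity point: a server activated at time t with parameters
(runtime, period) gets deadline t + period (assuming an implicit
deadline, dl_deadline == dl_period), so the latest instant at which it
can still be started and consume its full runtime by that deadline is
t + period - runtime. A sketch, using the sched_dl_entity fields from
the series (the helper name is made up):

--------------------- %< ------------------------
/*
 * Sketch only, not part of this series. Latest activation instant at
 * which the server can still consume dl_runtime before the deadline
 * it would get if activated at @now (implicit deadline assumed).
 */
static u64 dl_server_zero_laxity(struct sched_dl_entity *dl_se, u64 now)
{
	/* latest start = absolute deadline - full runtime */
	return now + dl_se->dl_period - dl_se->dl_runtime;
}
--------------------- >% ------------------------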

Here are some osnoise measurements, with osnoise threads running as
FIFO:1 under different setups*:
- CPU 2 isolated
- CPU 3 isolated, shared with a CFS busy-loop task
- CPU 8 non-isolated
- CPU 9 non-isolated, shared with a CFS busy-loop task

--------------------- %< ------------------------
# osnoise -P f:1 -c 2,3,8,9 -T 1 -d 10m -H 1 -q
Operating System Noise
duration: 0 00:12:39 | time is in us
CPU  Period    Runtime      Noise  % CPU Aval  Max Noise  Max Single  HW  NMI      IRQ  Softirq  Thread
  2    #757  757000000         49    99.99999         14           3   0    0      106        0       0
  3    #757  757001039   39322713    94.80546      52992        1103   0    0  3657933        0   59685
  8    #757  757000000     779821    99.89698       1513           4   0  113   794899        0     189
  9    #757  757001043   39922677    94.72620      53775        1105   0  112  4361243        0   49009
--------------------- >% ------------------------

The results are promising, but there is a problem when HRTICK_DL is not
set... I am still checking it. No splat in any of these scenarios.

* Tests were run with throttling disabled, on the 6.3 stable RT kernel,
but also on 6.4 and tip/sched/core.
** timerlat's user-space support is still under development; you need
these patch series:
https://lore.kernel.org/all/cover.1686063934.git.bristot@xxxxxxxxxx
https://lore.kernel.org/all/cover.1686066600.git.bristot@xxxxxxxxxx
or just run without the -u option :-)

Changes from v2:
- rebased on 6.4-rc1 tip/sched/core

Juri Lelli (1):
sched/fair: Implement starvation monitor

Peter Zijlstra (5):
sched: Unify runtime accounting across classes
sched/deadline: Collect sched_dl_entity initialization
sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
sched/deadline: Introduce deadline servers
sched/fair: Add trivial fair server

include/linux/sched.h | 24 +-
kernel/sched/core.c | 23 +-
kernel/sched/deadline.c | 497 +++++++++++++++++++++++++--------------
kernel/sched/fair.c | 143 +++++++++--
kernel/sched/rt.c | 15 +-
kernel/sched/sched.h | 60 +++--
kernel/sched/stop_task.c | 13 +-
7 files changed, 538 insertions(+), 237 deletions(-)

--
2.40.1