Re: [RFC PATCH 00/86] Make the kernel preemptible

From: Ankur Arora
Date: Tue Nov 07 2023 - 18:45:07 EST



Steven Rostedt <rostedt@xxxxxxxxxxx> writes:

> On Tue, 7 Nov 2023 13:56:46 -0800
> Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
>> Hi,
>
> Hi Ankur,
>
> Thanks for doing this!
>
>>
>> We have two models of preemption: voluntary and full (and RT, which is
>> a fuller form of full preemption.) In this series -- which is based
>> on Thomas' PoC (see [1]) -- we try to unify the two by letting the
>> scheduler enforce policy for the voluntary preemption models as well.
>
> I would say there's "NONE", which is really just a "voluntary" but with
> fewer preemption points ;-) But it should still be mentioned, otherwise
> people may get confused.
>
>>
>> (Note that this is about preemption when executing in the kernel.
>> Userspace is always preemptible.)
>>
>
>
>> Design
>> ==
>>
>> As Thomas outlines in [1], to unify the preemption models we
>> want to: always have the preempt_count enabled and allow the scheduler
>> to drive preemption policy based on the model in effect.
>>
>> Policies:
>>
>> - preemption=none: run to completion
>> - preemption=voluntary: run to completion, unless a task of higher
>>   sched-class awaits
>> - preemption=full: optimized for low latency; preempt whenever a
>>   higher-priority task awaits.
>>
>> To do this, add a new flag, TIF_NEED_RESCHED_LAZY, which allows the
>> scheduler to mark that a reschedule is needed but is deferred until
>> the task finishes executing in the kernel -- voluntary preemption,
>> as it were.
>>
>> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
>> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
>>
>>           ret-to-user   ret-to-kernel   preempt_count()
>> none           Y              N                N
>> voluntary      Y              Y                Y
>> full           Y              Y                Y
>
> Wait. The above is for when RESCHED_LAZY is to preempt, right?
>
> Then, shouldn't voluntary be:
>
> voluntary      Y              N                N
>
> for LAZY, but
>
> voluntary      Y              Y                Y
>
> for NEED_RESCHED (without lazy)?

Yes, you are of course right. I started out talking about the
TIF_NEED_RESCHED flags and midway switched to how the voluntary model
gets what it wants.

> That is, the only difference between voluntary and none (as you describe
> above) is that when an RT task wakes up, on voluntary, it sets NEED_RESCHED,
> but on none, it still sets NEED_RESCHED_LAZY?

Yeah exactly. Just to restate without mucking it up:

The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.

                   ret-to-user   ret-to-kernel   preempt_count()
NEED_RESCHED_LAZY       Y              N                N
NEED_RESCHED            Y              Y                Y
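
In entry-code terms that works out to roughly the following. This is
only a sketch to make the table concrete -- tif_need_resched() and
preempt_schedule_irq() are the existing helpers, and the _TIF_ mask
form of the new flag just follows the usual convention; the series'
actual call sites may look different:

        /* ret-to-user (exit_to_user_mode_loop()): both flags fold
         * into a reschedule */
        if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
                schedule();

        /* ret-to-kernel (irqentry exit) and preempt_enable() dropping
         * preempt_count() to zero: only the eager flag preempts */
        if (tif_need_resched())
                preempt_schedule_irq();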

Based on how various preemption models set the flag they would cause
preemption at:

            ret-to-user   ret-to-kernel   preempt_count()
none             Y              N                N
voluntary        Y              Y                Y
full             Y              Y                Y
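
And from the wakeup side, the scheduler picks which flag to set based
on the model in effect. Again just a sketch, not the series' code --
the preempt_model_*() helpers and task_is_realtime() exist upstream,
but this exact policy helper is made up:

        static void resched_curr_policy(struct rq *rq, struct task_struct *p)
        {
                struct task_struct *curr = rq->curr;

                if (preempt_model_full() ||
                    (preempt_model_voluntary() && task_is_realtime(p)))
                        /* eager: honored at all three preemption points */
                        set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
                else
                        /* none, or voluntary with a !RT waker:
                         * defer to ret-to-user */
                        set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
        }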

>> The max-load numbers (not posted here) also behave similarly.
>
> It would be interesting to run any "latency-sensitive" benchmarks.
>
> I wonder how cyclictest would work under each model with and without this
> patch?

I didn't post these numbers because I suspect that code isn't quite
right yet, but voluntary preemption, for instance, does what it
promises:

# echo NO_FORCE_PREEMPT > sched/features
# echo NO_PREEMPT_PRIORITY > sched/features # preempt=none
# stress-ng --cyclic 1 --timeout 10
stress-ng: info: [1214172] setting to a 10 second run per stressor
stress-ng: info: [1214172] dispatching hogs: 1 cyclic
stress-ng: info: [1214174] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [1214174] cyclic: mean: 9834.56 ns, mode: 3495 ns
stress-ng: info: [1214174] cyclic: min: 2413 ns, max: 3145065 ns, std.dev. 77096.98
stress-ng: info: [1214174] cyclic: latency percentiles:
stress-ng: info: [1214174] cyclic: 25.00%: 3366 ns
stress-ng: info: [1214174] cyclic: 50.00%: 3505 ns
stress-ng: info: [1214174] cyclic: 75.00%: 3776 ns
stress-ng: info: [1214174] cyclic: 90.00%: 4316 ns
stress-ng: info: [1214174] cyclic: 95.40%: 10989 ns
stress-ng: info: [1214174] cyclic: 99.00%: 91181 ns
stress-ng: info: [1214174] cyclic: 99.50%: 290477 ns
stress-ng: info: [1214174] cyclic: 99.90%: 1360837 ns
stress-ng: info: [1214174] cyclic: 99.99%: 3145065 ns
stress-ng: info: [1214172] successful run completed in 10.00s

# echo PREEMPT_PRIORITY > sched/features # preempt=voluntary
# stress-ng --cyclic 1 --timeout 10
stress-ng: info: [916483] setting to a 10 second run per stressor
stress-ng: info: [916483] dispatching hogs: 1 cyclic
stress-ng: info: [916484] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [916484] cyclic: mean: 3682.77 ns, mode: 3185 ns
stress-ng: info: [916484] cyclic: min: 2523 ns, max: 150082 ns, std.dev. 2198.07
stress-ng: info: [916484] cyclic: latency percentiles:
stress-ng: info: [916484] cyclic: 25.00%: 3185 ns
stress-ng: info: [916484] cyclic: 50.00%: 3306 ns
stress-ng: info: [916484] cyclic: 75.00%: 3666 ns
stress-ng: info: [916484] cyclic: 90.00%: 4778 ns
stress-ng: info: [916484] cyclic: 95.40%: 5359 ns
stress-ng: info: [916484] cyclic: 99.00%: 6141 ns
stress-ng: info: [916484] cyclic: 99.50%: 7824 ns
stress-ng: info: [916484] cyclic: 99.90%: 29825 ns
stress-ng: info: [916484] cyclic: 99.99%: 150082 ns
stress-ng: info: [916483] successful run completed in 10.01s

This is with a background kernbench half-load. Going from preempt=none
to preempt=voluntary, the tail collapses: the 99th percentile drops
from ~91us to ~6us, and the max from ~3.1ms to ~150us.

Let me see if I can dig out the numbers without this series.
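
(For cyclictest itself I'd expect the usual rt-tests invocation to be
enough for an apples-to-apples comparison under each model, something
like:

 # cyclictest --smp -p95 -m -q -D 10

but I haven't run that yet.)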

--
ankur