Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

From: Thomas Gleixner
Date: Mon Sep 18 2023 - 19:42:12 EST


On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>> > The problem with the REP prefix (and Xen hypercalls) is that
>> > they're long running instructions and it becomes fundamentally
>> > impossible to put a cond_resched() in.
>> >
>> >> Yes. I'm starting to think that that the only sane solution is to
>> >> limit cases that can do this a lot, and the "instruciton pointer
>> >> region" approach would certainly work.
>> >
>> > From a code locality / I-cache POV, I think a sorted list of
>> > (non overlapping) ranges might be best.
>>
>> Yeah, agreed. There are a few problems with doing that though.
>>
>> I was thinking of using a check of this kind to schedule out when
>> it is executing in this "reschedulable" section:
>> !preempt_count() && in_resched_function(regs->rip);
>>
>> For preemption=full, this should mostly work.
>> For preemption=voluntary, though this'll only work with out-of-line
>> locks, not if the lock is inlined.
>>
>> (Both, should have problems with __this_cpu_* and the like, but
>> maybe we can handwave that away with sparse/objtool etc.)
>
> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
> thing, and then only search the range when TIF flag is set.
>
> And I'm thinking it might be a good idea to have objtool validate the
> range only contains simple instructions, the moment it contains control
> flow I'm thinking it's too complicated.

Can we take a step back and look at the problem from a scheduling
perspective?

The basic operation of a non-preemptible kernel is time slice
scheduling, which means that a task can run more or less undisturbed for
a full time slice once it gets on the CPU unless it schedules away
voluntary via a blocking operation.

This works pretty well as long as everything runs in userspace as the
preemption points in the return to user space path are independent of
the preemption model.

These preemption points handle both time slice exhaustion and priority
based preemption.

With PREEMPT=NONE these are the only available preemption points.

That means that kernel code can run more or less indefinitely until it
schedules out or returns to user space, which is obviously not possible
for kernel threads.

To prevent starvation the kernel gained voluntary preemption points,
i.e. cond_resched(), which has to be added manually to code as a
developer sees fit.

Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as
additional preemption points. might_resched() utilizes the existing
might_sched() debug points, which are in code paths which might block on
a contended resource. These debug points are mostly in core and
infrastructure code and are in code paths which can block anyway. The
only difference is that they allow preemption even when the resource is
uncontended.

Additionally we have PREEMPT=FULL which utilizes every zero transition
of preeempt_count as a potential preemption point.

Now we have the situation of long running data copies or data clear
operations which run fully in hardware, but can be interrupted. As the
interrupt return to kernel mode does not preempt in the NONE and
VOLUNTARY cases, new workarounds emerged. Mostly by defining a data
chunk size and adding cond_reched() again.

That's ugly and does not work for long lasting hardware operations so we
ended up with the suggestion of TIF_ALLOW_RESCHED to work around
that. But again this needs to be manually annotated in the same way as a
IP range based preemption scheme requires annotation.

TBH. I detest all of this.

Both cond_resched() and might_sleep/sched() are completely random
mechanisms as seen from time slice operation and the data chunk based
mechanism is just heuristics which works as good as heuristics tend to
work. allow_resched() is not any different and IP based preemption
mechanism are not going to be any better.

The approach here is: Prevent the scheduler to make decisions and then
mitigate the fallout with heuristics.

That's just backwards as it moves resource control out of the scheduler
into random code which has absolutely no business to do resource
control.

We have the reverse issue observed in PREEMPT_RT. The fact that spinlock
held sections became preemtible caused even more preemption activity
than on a PREEMPT=FULL kernel. The worst side effect of that was
extensive lock contention.

The way how we addressed that was to add a lazy preemption mode, which
tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to
preempt tasks which all belong to the SCHED_OTHER scheduling class. This
works pretty well and gains back a massive amount of performance for the
non-realtime throughput oriented tasks without affecting the
schedulability of real-time tasks at all. IOW, it does not take control
away from the scheduler. It cooperates with the scheduler and leaves the
ultimate decisions to it.

I think we can do something similar for the problem at hand, which
avoids most of these heuristic horrors and control boundary violations.

The main issue is that long running operations do not honour the time
slice and we work around that with cond_resched() and now have ideas
with this new TIF bit and IP ranges.

None of that is really well defined in respect to time slices. In fact
its not defined at all versus any aspect of scheduling behaviour.

What about the following:

1) Keep preemption count and the real preemption points enabled
unconditionally. That's not more overhead than the current
DYNAMIC_PREEMPT mechanism as long as the preemption count does not
go to zero, i.e. the folded NEED_RESCHED bit stays set.

From earlier experiments I know that the overhead of preempt_count
is minimal and only really observable with micro benchmarks.
Otherwise it ends up in the noise as long as the slow path is not
taken.

I did a quick check comparing a plain inc/dec pair vs. the
DYMANIC_PREEMPT inc/dec_and_test+NOOP mechanism and the delta is
in the non-conclusive noise.

20 years ago this was a real issue because we did not have:

- the folding of NEED_RESCHED into the preempt count

- the cacheline optimizations which make the preempt count cache
pretty much always cache hot

- the hardware was way less capable

I'm not saying that preempt_count is completely free today as it
obviously adds more text and affects branch predictors, but as the
major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
acceptable and tolerable tradeoff.

2) When the scheduler wants to set NEED_RESCHED due it sets
NEED_RESCHED_LAZY instead which is only evaluated in the return to
user space preemption points.

As NEED_RESCHED_LAZY is not folded into the preemption count the
preemption count won't become zero, so the task can continue until
it hits return to user space.

That preserves the existing behaviour.

3) When the scheduler tick observes that the time slice is exhausted,
then it folds the NEED_RESCHED bit into the preempt count which
causes the real preemption points to actually preempt including
the return from interrupt to kernel path.

That even allows the scheduler to enforce preemption for e.g. RT
class tasks without changing anything else.

I'm pretty sure that this gets rid of cond_resched(), which is an
impressive list of instances:

./drivers 392
./fs 318
./mm 189
./kernel 184
./arch 95
./net 83
./include 46
./lib 36
./crypto 16
./sound 16
./block 11
./io_uring 13
./security 11
./ipc 3

That list clearly documents that the majority of these
cond_resched() invocations is in code which neither should care
nor should have any influence on the core scheduling decision
machinery.

I think it's worth a try as it just fits into the existing preemption
scheme, solves the issue of long running kernel functions, prevents
invalid preemption and can utilize the existing instrumentation and
debug infrastructure.

Most importantly it gives control back to the scheduler and does not
make it depend on the mercy of cond_resched(), allow_resched() or
whatever heuristics sprinkled all over the kernel.

To me this makes a lot of sense, but I might be on the completely wrong
track. Se feel free to tell me that I'm completely nuts and/or just not
seeing the obvious.

Thanks,

tglx