Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

From: Ingo Molnar
Date: Tue Sep 19 2023 - 04:04:00 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >
> > What about the following:
> >
> > 1) Keep preemption count and the real preemption points enabled
> > unconditionally.
>
> Well, it's certainly the simplest solution, and gets rid of not just
> the 'rep string' issue, but gets rid of all the cond_resched() hackery
> entirely.
>
> > 20 years ago this was a real issue because we did not have:
> >
> > - the folding of NEED_RESCHED into the preempt count
> >
> > - the cacheline optimizations which make the preempt count cache
> > pretty much always cache hot
> >
> > - the hardware was way less capable
> >
> > I'm not saying that preempt_count is completely free today as it
> > obviously adds more text and affects branch predictors, but as the
> > major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> > acceptable and tolerable tradeoff.
>
> Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most
> distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY
> approach isn't actually used, and is only causing pain.

The macro-behavior of NONE/VOLUNTARY is still used & relied upon in server
distros - and that's the behavior that enterprise distros truly cared
about.

Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the 'noise'
category for all major distros I'd say.

And that's what Thomas's proposal achieves: keep the nicely execution-batched
NONE/VOLUNTARY scheduling behavior for SCHED_OTHER tasks, while having the
latency advantages of fully-preemptible kernel code for RT and critical
tasks.

So I'm fully on board with this. It would reduce the number of preemption
variants to just two: regular kernel and PREEMPT_RT. Yummie!

> > 2) When the scheduler wants to set NEED_RESCHED due it sets
> > NEED_RESCHED_LAZY instead which is only evaluated in the return to
> > user space preemption points.
>
> Is this just to try to emulate the existing PREEMPT_NONE behavior?

Yes: I'd guesstimate that the batching caused by timeslice-laziness that is
naturally part of NONE/VOLUNTARY resolves ~90%+ of observable
macro-performance regressions between NONE/VOLUNTARY and PREEMPT/RT.

> If the new world order is that the time slice is always honored, then the
> "this might be a latency issue" goes away. Good.
>
> And we'd also get better coverage for the *debug* aim of "might_sleep()"
> and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on PREEMPT_COUNT always
> existing.
>
> But because the latency argument is gone, the "might_resched()" should
> then just be removed entirely from "might_sleep()", so that might_sleep()
> would *only* be that DEBUG_ATOMIC_SLEEP thing.

Correct. And that's even a minor code generation advantage, as we wouldn't
have these additional hundreds of random/statistical preemption checks.

> That argues for your suggestion too, since we had a performance issue due
> to "might_sleep()" _not_ being just a debug thing, and pointlessly
> causing a reschedule in a place where reschedules were _allowed_, but
> certainly much less than optimal.
>
> Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
> 'might_sleep()' in get_mmap_lock_carefully()").

4542057e18ca is arguably kind of a workaround though - and with the
preempt_count + NEED_RESCHED_LAZY approach we'd have both the latency
advantages *and* the execution-batching performance advantages of
NONE/VOLUNTARY that 4542057e18ca exposed.

> However, that does bring up an issue: even with full preemption, there
> are certainly places where we are *allowed* to schedule (when the preempt
> count is zero), but there are also some places that are *better* than
> other places to schedule (for example, when we don't hold any other
> locks).
>
> So, I do think that if we just decide to go "let's just always be
> preemptible", we might still have points in the kernel where preemption
> might be *better* than in others points.

So in the broadest sense we have 3 stages of pending preemption:

NEED_RESCHED_LAZY
NEED_RESCHED_SOON
NEED_RESCHED_NOW

And we'd transition:

- from 0 -> SOON when an eligible task is woken up,
- from LAZY -> SOON when current timeslice is exhausted,
- from SOON -> NOW when no locks/resources are held.

[ With a fast-track for RT or other urgent tasks to enter NOW immediately. ]

On the regular kernels it's probably not worth modeling the SOON/NOW split,
as we'd have to track the depth of sleeping locks as well, which we don't
do right now.

On PREEMPT_RT the SOON/NOW distinction possibly makes sense, as there we
are aware of locking depth already and it would be relatively cheap to
check for it on natural 0-preempt_count boundaries.


> But none of might_resched(), might_sleep() _or_ cond_resched() are
> necessarily that kind of "this is a good point" thing. They come from a
> different background.

Correct, they come from two sources:

- They are hundreds of points that we know are 'technically correct'
preemption points, and they break up ~90% of long latencies by brute
force & chance.

- Explicitly identified problem points that added a cond_resched() or its
equivalent. These are rare and also tend to bitrot, because *removing*
them is always more risky than adding them, so they tend to accumulate.

> So what I think what you are saying is that we'd have the following
> situation:
>
> - scheduling at "return to user space" is presumably always a good thing.
>
> A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
> whatever) would cover that, and would give us basically the existing
> CONFIG_PREEMPT_NONE behavior.
>
> So a config variable (either compile-time with PREEMPT_NONE or a
> dynamic one with DYNAMIC_PREEMPT set to none) would make any external
> wakeup only set that bit.
>
> And then a "fully preemptible low-latency desktop" would set the
> preempt-count bit too.

I'd even argue that we only need two preemption modes, and that 'fully
preemptible low-latency desktop' is an artifact of poor latencies on
PREEMPT_NONE.

Ie. in the long run - after a careful period of observing performance
regressions and other dragons - we'd only have *two* preemption modes left:

!PREEMPT_RT # regular kernel. Single default behavior.
PREEMPT_RT=y # -rt kernel, because rockets, satellites & cars matter.

Any other application level preemption preferences can be expressed via
scheduling policies & priorities.

Nothing else. We don't need PREEMPT_DYNAMIC, PREEMPT_VOLUNTARY or
PREEMPT_NONE in any of their variants, probably not even as runtime knobs.

People who want shorter timeslices can set shorter timeslices, and people
who want immediate preemption of certain tasks can manage priorities.

> - but the "timeslice over" case would always set the preempt-count-bit,
> regardless of any config, and would guarantee that we have reasonable
> latencies.

Yes. Probably a higher nice-priority task becoming runnable would cause
immediate preemption too, in addition to RT tasks.

Ie. the execution batching would be for same-priority groups of SCHED_OTHER
tasks.

> This all makes cond_resched() (and might_resched()) pointless, and
> they can just go away.

Yep.

> Then the question becomes whether we'd want to introduce a *new* concept,
> which is a "if you are going to schedule, do it now rather than later,
> because I'm taking a lock, and while it's a preemptible lock, I'd rather
> not sleep while holding this resource".

Something close to this concept is naturally available on PREEMPT_RT
kernels, which only use a single central lock primitive (rt_mutex), but it
would have be added explicitly for regular kernels.

We could do the following intermediate step:

- Remove all the random cond_resched() points such as might_sleep()
- Turn all explicit cond_resched() points into 'ideal point to reschedule'.

- Maybe even rename it from cond_resched() to resched_point(), to signal
the somewhat different role.

While cond_resched() and resched_point() are not 100% matches, they are
close enough, as most existing cond_resched() points were added to places
that cause the least amount of disruption with held resources.

But I think it would be better to add resched_point() as a new API, and add
it to places where there's a performance benefit. Clean slate,
documentation, and all that.

> I suspect we want to avoid that for now, on the assumption that it's
> hopefully not a problem in practice (the recently addressed problem with
> might_sleep() was that it actively *moved* the scheduling point to a bad
> place, not that scheduling could happen there, so instead of optimizing
> scheduling, it actively pessimized it). But I thought I'd mention it.
>
> Anyway, I'm definitely not opposed. We'd get rid of a config option that
> is presumably not very widely used, and we'd simplify a lot of issues,
> and get rid of all these badly defined "cond_preempt()" things.

I think we can get rid of *all* the preemption model Kconfig knobs, except
PREEMPT_RT. :-)

Thanks,

Ingo