Re: [RFC PATCH 00/86] Make the kernel preemptible

From: Thomas Gleixner
Date: Wed Nov 08 2023 - 10:38:21 EST


On Wed, Nov 08 2023 at 11:13, Peter Zijlstra wrote:
> On Wed, Nov 08, 2023 at 02:04:02AM -0800, Ankur Arora wrote:
> I'm not understanding, those should stay obviously.
>
> The current preempt_dynamic stuff has 5 toggles:
>
> /*
>  * SC:cond_resched
>  * SC:might_resched
>  * SC:preempt_schedule
>  * SC:preempt_schedule_notrace
>  * SC:irqentry_exit_cond_resched
>  *
>  *
>  * NONE:
>  *   cond_resched               <- __cond_resched
>  *   might_resched              <- RET0
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * VOLUNTARY:
>  *   cond_resched               <- __cond_resched
>  *   might_resched              <- __cond_resched
>  *   preempt_schedule           <- NOP
>  *   preempt_schedule_notrace   <- NOP
>  *   irqentry_exit_cond_resched <- NOP
>  *
>  * FULL:
>  *   cond_resched               <- RET0
>  *   might_resched              <- RET0
>  *   preempt_schedule           <- preempt_schedule
>  *   preempt_schedule_notrace   <- preempt_schedule_notrace
>  *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>  */
>
> If you kill voluntary as we know it today, you can remove cond_resched
> and might_resched, but the remaining 3 are still needed to switch
> between NONE and FULL.

No. The whole point of LAZY is to keep preempt_schedule(),
preempt_schedule_notrace(), irqentry_exit_cond_resched() always enabled.

Look at my PoC: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/

The idea is to always enable preempt count and keep _all_ preemption
points enabled.

For NONE/VOLUNTARY mode let the scheduler set TIF_NEED_RESCHED_LAZY
instead of TIF_NEED_RESCHED. In full mode set TIF_NEED_RESCHED.
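
Roughly, the selection could look like the sketch below. The function
name and preempt_mode_full() are made up for illustration, and the real
resched path obviously also has to deal with remote CPUs and the
reschedule IPI, which is left out here:

/*
 * Sketch only, not the PoC code: pick which resched bit the scheduler
 * sets for the current task.
 */
static void resched_curr_sketch(struct task_struct *curr)
{
	if (preempt_mode_full())	/* hypothetical mode check */
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
	else				/* NONE/VOLUNTARY */
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
}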

Here is where the regular and the lazy flags are evaluated:

                   Ret2user     Ret2kernel    PreemptCnt=0   need_resched()

NEED_RESCHED          Y              Y              Y               Y
LAZY_RESCHED          Y              N              N               Y
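
Expressed as code, a sketch of that matrix could be (helper names are
invented here; the real checks live in the entry code and the scheduler):

/* Return to user space and need_resched() look at both bits ... */
static inline bool resched_pending_user(void)
{
	return test_thread_flag(TIF_NEED_RESCHED) ||
	       test_thread_flag(TIF_NEED_RESCHED_LAZY);
}

/*
 * ... while return to kernel and the preempt count 1->0 transition
 * only honor the real bit, so LAZY never preempts kernel code on its
 * own.
 */
static inline bool resched_pending_kernel(void)
{
	return test_thread_flag(TIF_NEED_RESCHED);
}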

The trick is that LAZY is not folded into preempt_count, so a 1->0
counter transition won't cause preempt_schedule() to be invoked: the
topmost bit (NEED_RESCHED, which has inverted polarity there) is still
set and keeps the counter from reading zero.
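
A sketch of that asymmetry (the helper is made up; on x86 the folding
goes through the inverted PREEMPT_NEED_RESCHED bit in the per CPU
preempt count):

/* Illustrative only: fold the real bit, never the lazy one. */
static inline void fold_need_resched(struct task_struct *t)
{
	/*
	 * Clears the inverted PREEMPT_NEED_RESCHED bit so that the
	 * decrement-and-test in preempt_enable() sees zero.
	 */
	if (test_tsk_thread_flag(t, TIF_NEED_RESCHED))
		set_preempt_need_resched();

	/*
	 * TIF_NEED_RESCHED_LAZY is deliberately not folded, so the
	 * preempt_enable() fast path ignores it.
	 */
}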

The scheduler can still decide to set TIF_NEED_RESCHED, which causes an
immediate preemption at the next preemption point.

This allows forcing out a task which loops, e.g. in a massive copy or
clear operation, and therefore has not reached a point where
TIF_NEED_RESCHED_LAZY is evaluated, after a time which the scheduler
itself defines.

For my PoC I did:

1) Set TIF_NEED_RESCHED_LAZY

2) Set TIF_NEED_RESCHED when the task did not react to
   TIF_NEED_RESCHED_LAZY within a tick

I know that's crude but it just works and obviously requires quite some
refinement.
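
As a sketch, assuming the check runs from the scheduler tick on the CPU
where the task is current (function name and placement are illustrative,
not lifted from the PoC):

/* Illustrative only: upgrade a lazy request the task ignored for a tick. */
static void escalate_lazy_resched(struct task_struct *curr)
{
	if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY) &&
	    !test_tsk_thread_flag(curr, TIF_NEED_RESCHED)) {
		set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
		/*
		 * Fold it so the next preempt count 1->0 transition or
		 * preemption point takes the task out immediately.
		 */
		set_preempt_need_resched();
	}
}

Where exactly the time limit comes from is the part that needs the
refinement mentioned above.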

So the way you switch between preemption modes is to select whether the
scheduler sets TIF_NEED_RESCHED or TIF_NEED_RESCHED_LAZY. No static call
switching at all.

In full preemption mode it always sets TIF_NEED_RESCHED. Otherwise it
uses the LAZY bit first, grants some time, and then brings out the
hammer and sets TIF_NEED_RESCHED when the task did not reach a LAZY
preemption point in time.

Which means that once the whole thing is in place, PREEMPT_DYNAMIC
along with NONE, VOLUNTARY and FULL can go away, together with the
cond_resched() hackery.

So I think this series is backwards.

It should _first_ add the LAZY muck behind a Kconfig switch, like I did
in my PoC. Once that is working and agreed on, the existing muck can be
removed.

Thanks,

tglx