Re: [RFC PATCH 00/86] Make the kernel preemptible

From: Steven Rostedt
Date: Wed Nov 08 2023 - 00:12:23 EST


On Tue, 7 Nov 2023 20:52:39 -0800 (PST)
Christoph Lameter <cl@xxxxxxxxx> wrote:

> On Tue, 7 Nov 2023, Ankur Arora wrote:
>
> > This came up in an earlier discussion (See
> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
> > that preempt_enable/_disable() overhead was relatively minimal.
> >
> > Is your point that always-on preempt_count is far too expensive?
>
> Yes, over the years distros have traditionally delivered their kernels
> without preemption by default because of these issues. If the overhead
> has been minimized, then that may have changed. Even so, there is still
> a lot of code being generated that has questionable benefit and just
> bloats the kernel.
>
> >> These are needed to avoid adding preempt_enable/disable to a lot of primitives
> >> that are used for synchronization. You cannot remove those without changing a
> >> lot of synchronization primitives to always have to consider being preempted
> >> while operating.
> >
> > I'm afraid I don't understand why you would need to change any
> > synchronization primitives. The code that does preempt_enable/_disable()
> > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
> > CONFIG_PREEMPT_COUNT.
>
> In the trivial cases it is as simple as that. But look, for example, at
> the #ifdef CONFIG_PREEMPTION section in the slub allocator. There is
> overhead added there to allow the CPU to change under us. There are
> likely other examples in the source.
>

preempt_disable() and preempt_enable() have much lower overhead today than
they used to.
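
With CONFIG_PREEMPT_COUNT they essentially boil down to a per-cpu counter
increment and decrement, plus a need-resched check on the enable side. A
minimal sketch of the usual pattern (the names below are made up for
illustration; this is not code from the series):

  #include <linux/percpu.h>
  #include <linux/preempt.h>

  static DEFINE_PER_CPU(unsigned long, stat_count);

  static void bump_stat(void)
  {
          preempt_disable();              /* roughly: inc preempt count */
          __this_cpu_inc(stat_count);     /* safe: no migration in here */
          preempt_enable();               /* dec + need_resched check */
  }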

If you are worried about changing CPUs, there is also migrate_disable().
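
It pins the task to its current CPU while leaving preemption enabled,
which is enough for a lot of per-cpu access patterns. Another
illustrative sketch, reusing the made-up stat_count from above:

  static unsigned long read_stat_twice(void)
  {
          unsigned long a, b;

          migrate_disable();      /* pinned to this CPU; still preemptible */
          a = this_cpu_read(stat_count);
          /*
           * We may be preempted here, but we resume on the same CPU,
           * so 'b' is read from the same per-cpu instance as 'a'.
           */
          b = this_cpu_read(stat_count);
          migrate_enable();

          return a + b;
  }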

> And the whole business of local data
> access via per-cpu areas suffers if we cannot rely on two accesses in a
> section being able to see consistent values.
>
> > The intent here is to always have CONFIG_PREEMPT_COUNT=y.
>
> Just for fun? Code is most efficient if it does not have to consider too
> many side conditions, like suddenly running on a different processor. This
> introduces needless complexity into the code. It would be better to remove
> PREEMPT_COUNT for good and just rely on voluntary preemption. We could
> probably reduce the complexity of the kernel source significantly.

That is what caused this thread in the first place. Randomly scattered
"preemption points" do not scale!

And I'm sorry, we have latency-sensitive use cases that require full
preemption.

>
> I have never noticed a need for preemption at every instruction in the
> kernel (if that would be possible at all... locks etc. frequently prevent
> that ideal scenario). Preemption like that is more of a pipe dream.
>
> High-performance kernel solutions usually disable
> overhead like that.
>

Please read the email from Thomas:

https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/

This is not technically getting rid of PREEMPT_NONE. It adds a new
NEED_RESCHED_LAZY flag that has the kernel reschedule only when entering
or running in user space. It behaves the same as PREEMPT_NONE, but
without the need for all the cond_resched() calls scattered randomly
throughout the kernel.
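
Conceptually, the exit-to-user path then does something like the
following (a sketch of the idea, not the actual patch; the helper name
is made up, the flag names follow the series):

  static void exit_to_user_sketch(unsigned long thread_flags)
  {
          /* either flag forces a reschedule on the way out to user space */
          if (thread_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
                  schedule();
  }

In-kernel preemption (preempt_enable(), interrupt return to kernel mode)
keys off NEED_RESCHED only, so the lazy bit by itself never preempts
kernel code.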

If the task stays in the kernel for more than one tick (1ms at 1000Hz, 4ms
at 250Hz, and 10ms at 100Hz), the kernel will then set NEED_RESCHED, and
the task will be preempted at the next available location (where
preempt_count == 0).
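
The escalation on the tick could look roughly like this (again a sketch
with a made-up function name, not the patch itself):

  static void tick_escalate_lazy(struct task_struct *curr)
  {
          /*
           * The task has spent a full tick in the kernel with the lazy
           * bit set: promote it to a real NEED_RESCHED so it gets
           * preempted at the next spot where preempt_count drops to 0.
           */
          if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
                  set_tsk_need_resched(curr);
  }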

But yes, all locations that do not explicitly disable preemption may now
be preempted (due to long-running kernel threads).

-- Steve