Preempting kernel tasks and a crazy proposal

Ingo Molnar (
Sat, 7 Sep 1996 12:31:55 +0200 (MET DST)

On Thu, 5 Sep 1996, Linus Torvalds wrote:

> Like David, I do not like the idea of pre-empting kernel tasks. It implies a
> more complex locking setup than I'd really like.
> HOWEVER, I'm all for having "user processes" in kernel mode. That's not too
> hard - it only implies that we have to change the way we test for "user vs
> kernel" a bit.

another thing we might try:

doing a reschedule even if the kernel operation wouldn't block. There are
well-defined preemption points in the kernel (the points where a process
might or might not sleep). If we break up kernel service operations into
'atoms', and guarantee that an atom might sleep because some RT process
wants to get serviced, then we would get nearly the same result.

[ and we have to keep the execution time of an atom low, but this is
already mostly true for the current code ]

The problem with current RT operations is:

Example: a non-RT process doing a system call:

[atom 1] <-------------------- here a RT process gets runnable due to an ISR
[atom 2]
[atom 3]
[atom 4]
[atom 5]
ret_from_syscall, reschedule <---- the RT process gets scheduled

In the case when atom1-5 don't sleep, the RT process gets delayed by the
execution time of the whole system call. In some cases (the a.out module
for example) this might be several jiffies (ouch). [ok, this is an unfair

If an atom could sleep only because the RT process got runnable, the
RT process could be scheduled right after atom1.
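
To put rough numbers on the diagram above, here is a toy model in C. It is only
a sketch under assumptions of mine: every atom costs one time unit, the ISR
fires while atom `wake_atom` is running, and the names `N_ATOMS` and
`rt_delay_in_atoms` are made up for illustration, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one system call made of five atoms, as in the diagram. */
enum { N_ATOMS = 5 };

/*
 * Count how many atoms finish between the moment the ISR makes the RT
 * process runnable and the moment the RT process actually gets the CPU.
 * wake_atom is the 0-based index of the atom that is running when the
 * ISR fires. Atoms are never interrupted midway.
 */
int rt_delay_in_atoms(int wake_atom, bool preempt_between_atoms)
{
    bool need_resched = false;
    int delay = 0;

    for (int atom = 0; atom < N_ATOMS; atom++) {
        if (atom == wake_atom)
            need_resched = true;        /* ISR fires during this atom */

        /* ... the atom itself runs to completion here ... */

        if (need_resched)
            delay++;                    /* RT process waited for this atom */

        if (need_resched && preempt_between_atoms)
            return delay;               /* preemption point after the atom */
    }
    return delay;                       /* ret_from_syscall, reschedule */
}
```

With the wakeup during atom 1 (`wake_atom == 0`), the model charges the RT
process five atoms of delay without preemption points and one atom with them.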

Actually, such preemption points are a clean concept, if we introduce
'CPU-locking'. Each atom should request the CPU explicitly. If a RT task
wants the CPU, then it has to lock the next CPU quantum, and the kernel
atom should 'block'. Currently we have no CPU locking, we lock only other
resources. Maybe non-RT processes could benefit from this CPU locking
thing too? [don't think so]

and the crazy idea:

Instead of introducing explicit 'checks' after each atom, i propose the
following way:

- a hash table, which shows the next preemption point for each kernel EIP
- a kernel function that fetches the current interrupted kernel EIP
and puts a breakpoint to this next preemption point ...
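
A user-space sketch of those two pieces, to make the idea concrete. Everything
here is an assumption for illustration: the addresses are invented, the "kernel
text" is an ordinary byte buffer, the code between two preemption points is
taken to be straight-line, and a sorted array with binary search stands in for
the hash table just to keep it short. The only hard fact used is that 0xCC is
the one-byte x86 breakpoint (int3) opcode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT3 0xCC   /* x86 one-byte breakpoint opcode */

/* Hypothetical preemption point addresses, sorted ascending. */
static const uintptr_t preempt_points[] = { 0x1000, 0x1040, 0x10c0, 0x1200 };
#define N_POINTS (sizeof(preempt_points) / sizeof(preempt_points[0]))

/* First preemption point at or after eip, or 0 if there is none. */
uintptr_t next_preemption_point(uintptr_t eip)
{
    size_t lo = 0, hi = N_POINTS;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (preempt_points[mid] < eip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < N_POINTS ? preempt_points[lo] : 0;
}

/* Remembers what was overwritten so it can be restored afterwards. */
struct patch {
    size_t  offset;
    uint8_t saved;
};

/*
 * text/len/base describe the simulated kernel text. Looks up the next
 * preemption point for the interrupted eip and plants a breakpoint there.
 * Returns 0 on success, -1 if no preemption point lies inside the text.
 */
int arm_breakpoint(uint8_t *text, size_t len, uintptr_t base,
                   uintptr_t eip, struct patch *p)
{
    uintptr_t target = next_preemption_point(eip);

    if (target == 0 || target < base || target - base >= len)
        return -1;

    p->offset = target - base;
    p->saved = text[p->offset];
    text[p->offset] = INT3;
    return 0;
}

/* Puts the original byte back once the RT process has been scheduled. */
void disarm_breakpoint(uint8_t *text, const struct patch *p)
{
    text[p->offset] = p->saved;
}
```

The lookup and the patching only run when an RT process becomes runnable; the
normal path through the kernel text is untouched until then.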

This way only the RT event would be penalized (a bit), and no 'normal
code' would be penalized. It's ugly, but i see no other way of doing it :)
Instead of a breakpoint, a 'jmp rt_reschedule' might be faster.

Self-modifying code, but hey :) The profiling code already looks at the
EIP value, so we aren't so far from actually changing it :)))

And if it's too ugly and nonportable, then 'rescheduling checks' could
be macroed at the preemption points. [to avoid hurting normal operation
performance] And such RT tasks should be scheduled before bottom halves
(maybe this should depend on an extended bottom half priority notion)
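
For the macro fallback, something along these lines is what i have in mind. It
is a sketch only: `rt_need_resched`, `resched_count`, `PREEMPTION_POINT()` and
`example_syscall()` are hypothetical names, the flag would be set by an ISR in
a real kernel, and `rt_reschedule()` here just counts calls instead of
switching tasks.

```c
#include <assert.h>

/* Hypothetical flag: set when an RT process becomes runnable. */
volatile int rt_need_resched;

/* Counts how often the stand-in scheduler was entered (demo only). */
int resched_count;

/* Stand-in for the real reschedule: clear the flag, note the call. */
void rt_reschedule(void)
{
    rt_need_resched = 0;
    resched_count++;
}

/* One flag test on the normal path; the call happens only for RT events. */
#define PREEMPTION_POINT()          \
    do {                            \
        if (rt_need_resched)        \
            rt_reschedule();        \
    } while (0)

/* A made-up system call with a check after each of its three atoms. */
int example_syscall(void)
{
    int work = 0;

    work += 1;              /* atom 1 */
    PREEMPTION_POINT();
    work += 1;              /* atom 2 */
    PREEMPTION_POINT();
    work += 1;              /* atom 3 */
    PREEMPTION_POINT();
    return work;
}
```

The cost on the normal path is one load and one branch per preemption point,
which is the price compared to the breakpoint variant.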

-- mingo