Re: SCO: "thread creation is about a thousand times faster than on native

From: Jamie Lokier
Date: Fri Aug 25 2000 - 12:53:09 EST

(FWIW, I know exactly how Linux scheduling works. If I'm unclear, it's bad
explanation on my part.)

Linus Torvalds wrote:
> Wrong.

I beg to differ.

> A stack page _is_ required for a thread, basically always:
> - when it is running (either in user or kernel space): interrupts happen.
> You can't share the stack page sanely with other processes, because the
> interrupt can (and often does) cause a task-switch to another
> completely unrelated process. And you have to save the register state
> somewhere.

Conclusion: each CPU needs a stack to handle interrupts. Quelle surprise.

> NOTE! Linux does NOT save the register state in the task struct. Saving
> state in the task struct is a horrible waste of time when it has been
> already saved on the stack for other reasons (ie the stack contains not
> just the register state, but the instructions about how to unwind it
> from recursive faults etc).

You only save register state in the task_struct in certain profitable
cases. In those cases, the _only_ thing to copy are the registers. No
unwind or recursive fault nonsense. 7 words on x86. Restoring doesn't
require the full copy.

I bet that's more than made up for by the cache benefits of reusing the
stack page.
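
To make the idea concrete, here is a rough user-space sketch, not kernel
code. Every name in it (toy_task, saved_regs, toy_park, toy_resume) is
made up for illustration, and the word count of 7 is just the x86 figure
mentioned above. It assumes that parking an idle task means copying only
the live register words into the task structure, and that resuming can
read them in place rather than copying a whole frame back.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names throughout; this only models the bookkeeping. */
#define SAVED_WORDS 7            /* the "7 words on x86" figure from the text */

struct saved_regs {
    unsigned long word[SAVED_WORDS];
};

struct toy_task {
    struct saved_regs regs;      /* compact copy kept while the task is idle */
    int has_stack;               /* 0 once the stack page has been given up */
};

/* Park an idle task: copy only the register words off its stack frame,
 * then mark the stack page as released so another task can reuse it. */
static void toy_park(struct toy_task *t, const unsigned long *frame)
{
    assert(t->has_stack);
    memcpy(t->regs.word, frame, sizeof t->regs.word);
    t->has_stack = 0;
}

/* Resume a parked task: hand back a pointer to the saved words so the
 * return path can load them directly, with no copy back to a stack. */
static const unsigned long *toy_resume(struct toy_task *t)
{
    assert(!t->has_stack);
    t->has_stack = 1;            /* the task now runs on whatever stack it is given */
    return t->regs.word;
}
```

The only per-task cost while idle is the saved_regs array; the stack page
itself is free to go back to whoever needs it next.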

> - when it is sleeping in kernel space: the stack contains the full path
> to where it slept, we _need_ it.
> Basically, the _only_ time you don't need a kernel stack is when the
> process is completely idle, ie it has been scheduled away because its
> time-slice ended, and it wasn't in kernel mode when this happened.

There are a couple of really obvious states where you don't need to
record the full path to where we slept.

   - The schedule() in ret_from_sys_call.

   - do_poll().

The first is used when pre-empting, such as when we have lots of threads
running. The second is used by almost every task that sleeps, including
heavy duty servers.

What you do need to save then is the stack state in do_poll. Well,
there's very little of that indeed. Not even a full register set.

> And Linux doesn't consider this a special state at all - as far as the
> kernel is concerned, that sleeping state is the same as sleeping in
> kernel mode.

Quite. But the stack frame in that case is _exactly_ the same each time
and only a few words, basically the registers.

> Quite frankly, the stack _is_ part of the "struct task_struct". It's not
> just a performance optimization to allocate them together. Sure, the fact
> that we allocate them together means one less allocation, and it allows
> the clever "current" games we play with the stack pointer, but those are
> details. Fundamentally, the two are part of the same thing.

Fundamentally they don't have to be.

> Think about WHY our system call latency beats everybody else on the
> planet. Think about WHY Linux is fast. It's because it's designed
> right.

It's fast because it does very little. Any stack page optimisation has
to keep the current scheme for system calls. That means switching
stacks only when making a task idle, and quite possibly only under
memory pressure.

There's a thought. A memory pressure hook that frees hardly-used stack
pages.

> Continuations and other crap only add overhead. 8kB of storage for the
> whole thread state (including the kernel stack) is _incredibly_ tiny.

Think of the cache benefits of sharing :-)

Think of kernel stacks as a resource worth keeping a limited pool of,
per CPU. Like the pgd cache. You don't have to save state on _every_
pre-emptive task switch, just the old ones.
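
A toy model of such a pool, again in user space and with invented names
(stack_pool, pool_get, pool_put, POOL_STACKS, STACK_BYTES), might look
like the following. It assumes one pool per CPU, so there is no locking,
and it hands stacks back out in LIFO order on the guess that the most
recently released page is the one most likely to still be in cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes and names; one such pool would exist per CPU. */
#define POOL_STACKS 4
#define STACK_BYTES 4096

struct stack_pool {
    unsigned char mem[POOL_STACKS][STACK_BYTES];  /* backing storage */
    void *free_list[POOL_STACKS];                 /* LIFO of free stacks */
    int nfree;
};

/* Put every stack in the pool on the free list. */
static void pool_init(struct stack_pool *p)
{
    p->nfree = 0;
    for (int i = 0; i < POOL_STACKS; i++)
        p->free_list[p->nfree++] = p->mem[i];
}

/* Take the most recently released stack, or NULL if the pool is empty.
 * On NULL the caller would have to park an old idle task to free one. */
static void *pool_get(struct stack_pool *p)
{
    return p->nfree > 0 ? p->free_list[--p->nfree] : NULL;
}

/* Return a stack to the pool once its task's state has been saved. */
static void pool_put(struct stack_pool *p, void *stack)
{
    assert(p->nfree < POOL_STACKS);
    p->free_list[p->nfree++] = stack;
}
```

Only when pool_get() comes back empty does anybody pay for saving state,
and then only for the oldest idle task, not on every switch.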

> Yes, you can get lower-latency threads. But they won't be the "real
> thing". They won't scale on SMP, and they'll have other limitations too
> (often they require synchronous task-switching, because that allows
> optimizations like not saving the whole register state: which can be a BIG
> win when the register state is big or slow to save like the x86 FP state).

> Trust me. You won't find faster true threads than what Linux offers. And
> they won't be using less than 8kB of memory.

Well well. I think it's possible to offer the best of user-space "fake"
threads plus the advantages of "true" kernel threads in one blindingly
fast combination, in less than 8kB per thread.

If you stop and think for a moment, I think you'll see it too.

-- Jamie
