thousands of threads & kernel continuations

Michael Callahan (mjc@stelias.com)
Sat, 26 Apr 1997 19:00:21 -0400 (EDT)


I agree with what Alan, David and others have said about 1000s of threads
being the wrong answer to the question--essentially no matter what the
question is. (This is assuming you don't have a machine with 1000s of
CPUs!)

Unfortunately, trying to multiplex bazillions of user-level threads onto
smaller numbers of kernel threads is also not wholly satisfactory.

There _have_ been people who have tried to migrate a Unix-like kernel
toward a bazillions-of-threads model. If people want to pursue this
despite the fact that Linux almost certainly won't go this way, they
should take a look at the papers in the Mach community (RIP?) which talk
about various ways of moving from the Unix kernel model of processes to
one more suitable for an aggressively thread&message-passing based model.
The first thing to go is the idea that threads have statically-allocated
kernel stacks, with context switching done, essentially, by coroutine
methods between these in-kernel stacks.

For example, the CMU Mach group introduced an explicit notion of
continuation to streamline their message-passing RPC implementation. Most
threads in this system were expected to spend most of their time sitting
in a msg_recv operation (or the receive half of a send/recv RPC call). The
Unix way would be to have some kernel msg_recv function block itself until
a message arrived, at which point it would return up the call chain to
user-mode. It was observed that keeping the whole call chain on the
kernel stack was a very inefficient way of remembering the very little
information actually necessary to carry out the steps after a message came
in. So, instead, the msg_recv routine would stuff a few bits of
information into a small structure and stuff a pointer to this structure
and to a completion function into the thread structure. Then the kernel
stack could be destroyed. When the message finally arrives, a new kernel
stack is allocated from a pool and the completion function is called on
this new stack with the struct pointer argument.

The upshot is that a system with a thousand threads might need only a few
dozen kernel stacks.

The Utah Mach project went further and observed that clients can lend
their threads to the servers which process their RPC calls. Not only do
you not have to keep all those kernel stacks around, but servers don't
even need to create service threads in the first place, because their
clients will lend them ones when the time comes. Furthermore, the latency
of cross-domain RPCs is improved a lot because there it is so much easier
to move a single thread from one task to another than it is to put one to
sleep and then wake another.

In principle, Linux could start walking down this road. The unfortunate
thing is that Linux threads block in many different places in the
kernel--the Mach people had the advantage of having one single place where
an enormous fraction of their threads would block. Having an
explicit-continuation version of select() might cover a fair fraction of
blocking threads. However, with an I/O system that is based on the
process model (unlike, say, the VMS or NT I/O system) there are dozens of
places where threads block, and I suspect that the frequency histogram has
a pretty long tail. And, of course, introducing the notion of discardable
kernel stacks has all sorts of locore complications.

Remember, Linux is trying to be a top-rate Unix-like kernel. Frankly,
Unix-like systems don't handle threads the way non-Unix systems like Mach
3 and 4 can (imho).

Michael

PS: the Utah migrating threads paper by Bryan Ford and Jay Lepreau is at
ftp://mancos.cs.utah.edu/papers/thread-migrate.ps.Z