Re: Thread implementations...

Richard Gooch (Richard.Gooch@atnf.CSIRO.AU)
Sun, 21 Jun 1998 13:20:57 +1000


Dean Gaudet writes:
>
>
> On Sat, 20 Jun 1998, Richard Gooch wrote:
[...]
> > Use 10 threads. Seems to me that would provide reasonable load
> > balancing. And increasing that to 100 threads would be even better.
>
> No it wouldn't. 100 kernel-level threads is overkill. Unless your box
> can do 100 things at a time there's no benefit from giving the kernel 100
> objects to schedule. 10 is a much more reasonable number, and even that
> may be too high. You only need as many kernel threads as there is
> parallelism to exploit in the hardware. Everything else can, and should,
> happen in userland where timeslices can be maximized and context switches
> minimized.
>
> > The aim is to ensure that, statistically, most threads will remain
> > sleeping for several clock ticks.
>
> What? If I am wasting system memory for a kernel-level thread I'm not
> going to go about ensuring that it remains asleep! no way. I'm going to
> use each and every time slice to its fullest -- because context switches
> have a non-zero cost, it may be small, but it is non-zero.

The point is that *most* FDs are inactive. If in every timeslice you
have only 5 active FDs (drawn from a uniform random distribution),
then with 10 threads at most half of those threads are woken up. Hence
only half the FDs have to be scanned when those threads have processed
the activity. For 1000 FDs, that is a saving of 500 FD scans, which is
1.5 ms. So scanning load has gone from 30% to 15% (10 ms timeslice).
Also note that only 5 threads are woken up (scheduled); the other 5
remain asleep.

Now let's look at 100 threads. With 5 active FDs, you still get at most
5 threads woken up. But now FD scanning after processing activity
drops to a total of 50 FDs. So scanning load (per timeslice!) has
dropped to 150 us. So compared with the 10 thread case, we have saved
1.35 ms of FD scanning time. Compared with the 1 thread case, we have
saved 2.85 ms of scanning time (as always, per 10 ms timeslice). In
other words, only 1.5% scanning load. And still we are only
scheduling 5 threads *this timeslice*!

I don't know why you care so much about context switches: the time
taken for select(2) or poll(2) for many FDs is dominant!

Just how much time do you think scheduling is taking???

> > With a bit of extra work you could even slowly migrate consistently
> > active FDs to one or a few threads.
>
> But migrating them costs you extra CPU time. That's time that strictly
> speaking, which does not need to be spent. NT doesn't have to spend this
> time when using completion ports (I'm sounding like a broken record).

Migration is pretty cheap: it's a matter of swapping some entries in a
table. And migration only happens upon FD activity. Adding a few extra
microseconds for migration is peanuts compared with the time taken to
process a datagram.

> Look at this another way. If I'm using poll() to implement something,
> then I typically have a structure that describes each FD and the state it
> is in. I'm always interested in whether that FD is ready for read or
> write. When it is ready I'll do some processing, modify the state,
> read/write something, and then do nothing with it until it is ready again.

Yep, fine. My conceptual model is that I call a callback for each
active FD. Same thing.

> To do this I list for the kernel all the FDs and call poll(). Then the
> kernel goes around and polls everything. For many descriptors (i.e. slow
> long haul internet clients) this is a complete waste. There are two
> approaches I've seen to deal with this:
>
> - don't poll everything as frequently, do complex migration between
> different "pools" sorted by how active the FD is. This reduces the number
> of times slow sockets are polled. This is a win, but I feel it is far too
> complex (read: easy to get wrong).

It only needs to be done "right" once. In a library. Heck, I might
even modify my own FD management library code to do this just to prove
the point. Write once, use many!
Note that even the "complex" migration is optional: simply dividing up
FDs equally between N threads is a win.
Having migration between a small number of threads is going to be a
*real* win.

> - let the kernel queue an event when the FD becomes ready. So rather than
> calling poll() with a list of 100s of FDs, we tell the kernel on a per-FD
> basis "when this is ready for read/write queue an event on this pipe, and
> could you please hand me back this void * with it? thanks". In this
> model when a write() returns EWOULDBLOCK the kernel implicitly sets that
> FD up as "waiting for write", similarly for a read(). This means that no
> matter what speed the socket is, it won't be polled and no complex
> dividing of the FDs into threads needs to be done.

I think this will be more complex to implement than a small userspace
library that uses a handful of threads.

> The latter model is a lot like completion ports... but probably far easier
> to implement. When the kernel changes an FD in a way that could cause it
> to become ready for read or write it checks if it's supposed to queue an
> event. If the event queue becomes full the kernel should queue one event
> saying "event queue full, you'll have to recover in whatever way you find
> suitable... like use poll()".

This involves kernel bloat. It seems to me that there is such a simple
userspace solution, so why bother hacking the kernel?
I'd much rather hack the kernel to speed up select(2) and poll(2) by a
few times. That benefits all existing Linux/UNIX programs.

Regards,

Richard....
