Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

From: Michael K. Edwards
Date: Thu Feb 22 2007 - 16:25:16 EST


On 2/22/07, Ingo Molnar <mingo@xxxxxxx> wrote:
> > It is not a TUX anymore - you had 1024 threads, and all of them will
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> > a machine.
>
> maybe it will, maybe it won't. Let's try? There is no true difference
> between having a 'request structure' that represents the current state
> of the HTTP connection plus a state machine that moves that request
> between various queues, and a 'kernel stack' that goes in and out of
> runnable state and carries its processing state in its stack - other
> than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> so with a million outstanding requests they would use up 4 GB of RAM.
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)

This is a fundamental misconception. The state machine doesn't have
to do anything but chase pointers through cache. Done right, it
hardly even branches (although the branch misprediction penalty is a
lot less of a worry on current x86_64 than it was in the
mega-superscalar-out-of-order-speculative-execution days). It's damn
near free -- but it's a pain in the butt to code, and it has to be
done either in-kernel or in per-CPU OS-atop-the-OS dispatch threads.
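
To make "chase pointers through cache" concrete, here is a minimal
user-space sketch of a table-driven request state machine. Everything
in it (the state names, struct request, the 1024-byte chunk size) is
made up for illustration; it is not taken from TUX, kevent, or the
syslet patches. The only point is that one dispatch step is an indexed
load of a function pointer plus a walk down a linked list.

```c
#include <stddef.h>

/* Hypothetical request states; names are illustrative only. */
enum req_state { REQ_READ_HEADERS, REQ_SEND_BODY, REQ_DONE, REQ_NSTATES };

struct request {
    enum req_state state;
    size_t bytes_left;       /* body bytes still to send */
    struct request *next;    /* link in the ready queue */
};

typedef enum req_state (*step_fn)(struct request *);

static enum req_state step_read_headers(struct request *r)
{
    (void)r;                 /* parsing elided in this sketch */
    return REQ_SEND_BODY;
}

static enum req_state step_send_body(struct request *r)
{
    /* Pretend each step pushes at most 1024 bytes (assumed chunk size). */
    size_t chunk = r->bytes_left < 1024 ? r->bytes_left : 1024;
    r->bytes_left -= chunk;
    return r->bytes_left ? REQ_SEND_BODY : REQ_DONE;
}

static enum req_state step_done(struct request *r)
{
    (void)r;
    return REQ_DONE;
}

/* One handler per state: dispatch is a table lookup, not a switch. */
static const step_fn step_table[REQ_NSTATES] = {
    [REQ_READ_HEADERS] = step_read_headers,
    [REQ_SEND_BODY]    = step_send_body,
    [REQ_DONE]         = step_done,
};

/* Advance every queued request by one step; return how many are done. */
size_t run_ready_queue(struct request *head)
{
    size_t done = 0;
    for (struct request *r = head; r != NULL; r = r->next) {
        r->state = step_table[r->state](r);
        done += (r->state == REQ_DONE);
    }
    return done;
}
```

A caller would link its requests through ->next and call
run_ready_queue() once per burst of completions. A real implementation
would move requests between queues instead of rescanning finished ones.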

The scheduler, on the other hand, has to blow away and reload all of the
hidden state associated with force-loading the PC and wherever your
architecture keeps its TLS (maybe not the whole TLB, but not nothing,
either). The only way around this that I can think of is to make
threadlets promise that they will not touch anything thread-local, and
that when the FPU is handed to them in a specific, known state, they
leave it in that same state. (Some of the flags can be
unspecified-but-don't-touch-me.) Then you can schedule threadlets in
bursts with negligible transition cost from one to the next.

There is, however, a substantial setup cost for a burst, because you
have to put the FPU in that known state and lock out TLS access (this
is user code, after all). If the wrong process is in foreground, you
also need to switch process context at the start of a burst; no
fandangos on other processes' core, please, and to be remotely useful
the threadlets need access to process-global data structures and
synchronization primitives anyway. That's why threadlets need a
separate SCHED_THREADLET priority and at least a weak ordering by PID,
so that one burst stays within one process. At which point you are
outside the feature set of
the O(1) scheduler as I understand it, and you might as well schedule
them from the next tasklet following the softirq dispatcher.
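
To illustrate why ordering by PID matters, here is a small sketch that
counts process context switches for a run queue before and after
grouping by owner. struct threadlet, count_process_switches() and
order_by_pid() are invented for this example; nothing here is
scheduler code, and a full qsort() is stronger than the weak ordering
a real scheduler would need.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical run-queue entry: which process owns this threadlet. */
struct threadlet {
    int pid;   /* owning process */
    int id;    /* illustrative threadlet id */
};

static int by_pid(const void *a, const void *b)
{
    const struct threadlet *x = a, *y = b;
    return (x->pid > y->pid) - (x->pid < y->pid);
}

/*
 * Count the process context switches needed to run the queue in
 * array order, including the initial switch into the first process.
 */
size_t count_process_switches(const struct threadlet *q, size_t n)
{
    size_t switches = 0;
    for (size_t i = 0; i < n; i++)
        if (i == 0 || q[i].pid != q[i - 1].pid)
            switches++;
    return switches;
}

/* Group threadlets of the same process together. */
void order_by_pid(struct threadlet *q, size_t n)
{
    qsort(q, n, sizeof *q, by_pid);
}
```

With threadlets from two processes interleaved, every entry costs a
switch; after order_by_pid() the cost drops to one switch per process.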

> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> yeah. Note that you can have a million sleeping threads if you want, the
> scheduler won't care. What matters more is the amount of true concurrency
> that is present at any given time. But yes, I agree that overscheduling
> can be a problem.

What matters is that a burst of I/O responses be scheduled efficiently
without taking down the rest of the box. That, and the ability to
cancel no-longer-interesting I/O requests in bulk, without leaking
memory and synchronization primitives all over the place. If you
don't have that, this scheme is UNUSABLE for network I/O.
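
What I mean by bulk cancellation, as a minimal sketch: every request
is linked into a group that owns it, so dropping the group releases
each request and its buffer in one pass. struct io_request, struct
io_group, group_submit() and group_cancel_all() are hypothetical names
for this example, not part of any proposed interface, and real
cancellation would also have to deal with requests already in flight.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request: owns one buffer, linked into its group. */
struct io_request {
    void *buf;
    struct io_request *next;
};

/* Hypothetical group: everything submitted on behalf of one client. */
struct io_group {
    struct io_request *head;
    size_t pending;
};

/* Queue one request in the group; returns 0 on success, -1 on failure. */
int group_submit(struct io_group *g, size_t buf_len)
{
    struct io_request *r = malloc(sizeof *r);
    if (r == NULL)
        return -1;
    r->buf = malloc(buf_len);
    if (r->buf == NULL) {
        free(r);             /* do not leak the request on failure */
        return -1;
    }
    r->next = g->head;
    g->head = r;
    g->pending++;
    return 0;
}

/* Cancel everything in the group in one pass; returns the count freed. */
size_t group_cancel_all(struct io_group *g)
{
    size_t cancelled = 0;
    struct io_request *r = g->head;
    while (r != NULL) {
        struct io_request *next = r->next;
        free(r->buf);
        free(r);
        cancelled++;
        r = next;
    }
    g->head = NULL;
    g->pending = 0;
    return cancelled;
}
```

The design choice is that ownership lives in the group, not in
whatever thread happened to issue the request, so a dropped connection
costs one list walk and nothing is left behind.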

> btw., what is the measurement utility you are using with kevents ('ab'
> perhaps, with a high -c concurrency count?), and which webserver are you
> using? (light-httpd?)

Do me a favor. Do some floating point math and a memcpy() in between
syscalls in the threadlet. Actually fiddle with errno and the FPU
rounding flags. Watch it slow to a crawl and/or break floating point
arithmetic horribly. Understand why no one with half a brain uses
Java, or any other language which cuts FP corners for the sake of
cheap threads, for calculations that have to be correct. (Note that
Kahan received the Turing award for contributions to IEEE 754. If his
polemic is too thick, read
http://www-128.ibm.com/developerworks/java/library/j-jtp0114/.)

Cheers,
- Michael