Re: kqueue microbenchmark results

From: Terry Lambert (tlambert@primenet.com)
Date: Wed Oct 25 2000 - 15:24:18 EST


> What applications would do better by postponing some of the reading?
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can. Doing everything in smaller
> chunks would increase overhead (but maybe reduce latencies very slightly
> -- albeit probably not much when using a get_events()-style interface).

Applications that:

o Want to limit their memory footprint by limiting the
        amount of process VM they consume, and so limit their
        buffer size to less than all the data the stacks might
        be capable of providing at one time

o Use fixed-size messages, and want to operate on one
        message at a time, without restricting the sender to
        sending only a single message, or only whole messages,
        at one time (a sketch of this case follows the list)

o Want to limit their overall system processing overhead
        for irrelevant/stale data (example: one which implements
        state delta refresh events, such as "Bolo" or "Netrek")

o Have to implement "leaky bucket" algorithms, where it
        is permissible to drop some data on the floor, and
        assume it will be retransmitted later (e.g. ATM or user
        space protocols which want to implement QoS guarantees)

o Need to take advantage of kernel strategies for protection
        from denial of service attacks, without having to redo
        those strategies themselves (particularly, flood attacks;
        this is the same reason inetd supports connection rate
        limiting on behalf of the programs it is responsible for
        starting)

o Listen on multiple data channels, some of which are
        more important than others (the Real Networks streaming
        media protocols, for example)

o Want to evaluate the contents of a security negotiation
        prior to accepting data that was sent using an expired
        certificate or otherwise bogus credentials
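
By way of illustration, the fixed-size message case might look
roughly like the sketch below (mine, not from any real protocol;
MSG_SIZE and handle_message() are made-up names). The point is
that exactly one message is dequeued per readiness event, and the
kernel keeps holding anything beyond it:

#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>
#include <stddef.h>

#define MSG_SIZE 64                     /* assumed fixed message size */

/* Application-specific handler; stubbed out for the sketch. */
static void handle_message(const unsigned char *msg, size_t len)
{
        (void)msg;
        (void)len;
}

/* Returns 1 if one whole message was consumed, 0 if a full message
 * is not available yet, -1 on EOF or a hard error.  The descriptor
 * is assumed to be a non-blocking stream socket. */
int consume_one_message(int fd)
{
        unsigned char buf[MSG_SIZE];
        ssize_t n;

        /* Peek first, so bytes are only dequeued from the kernel
         * if a whole message is actually there. */
        n = recv(fd, buf, sizeof(buf), MSG_PEEK);
        if (n == 0)
                return -1;              /* peer closed */
        if (n < 0)
                return (errno == EAGAIN || errno == EWOULDBLOCK ||
                        errno == EINTR) ? 0 : -1;
        if (n < (ssize_t)sizeof(buf))
                return 0;               /* partial message; wait for the next event */

        /* Now consume exactly one message, and no more. */
        n = recv(fd, buf, sizeof(buf), 0);
        if (n != (ssize_t)sizeof(buf))
                return -1;              /* unexpected; treat as an error */

        handle_message(buf, (size_t)n);
        return 1;
}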

There are all sorts of good reasons a programmer would want to
trust the kernel, instead of having to build ring buffers into
each and every program they write to ensure they remember data
which is irrelevant to the processing at hand, or to protect their
code against buffer overflows initiated by trusted applications.

> Isn't it probably better to keep the kernel implementation as efficient
> as possible so that the majority of applications which will read (and
> write) all data possible can do it as efficiently as possible? Queueing
> up the events, even as they are in the form received from the kernel, is
> pretty simple for a userspace program to do, and I think it's the best
> place for it.

Reading, yes. Writing, no. The buffers they are filling in
the kernel belong to the kernel, not the application, despite
what Microsoft tells you about WSOCK32 programming. The WSOCK32
model assumes that the networking is implemented in another user
space process, rather than in the kernel. People who use the
"async" WSOCK32 interface rarely understand the implications
because they rarely understand how async messages are built
using a Windows data pump, which serializes all requests through
the moral equivalent of a select loop (which is why NT supports
async notification on socket I/O, while other versions of Windows
do not [NB: actually, they could, using an fd=-1 I/O completion
port, but the WSOCK32 programmers were a bit lazy and were also
being told to keep performance under that of NT]).

In any case, it's not just a matter of queueing up kernel events;
it's also a matter of reacting to events partially instead of
completely, since if an event comes in that says you have 1k of
data, and you only read 128 bytes of it, you will have to requeue,
in LIFO instead of FIFO order, a modified event with 1k-128 bytes,
so the next read completes as expected. Very gross code, which
must then be duplicated in every user space program, and which
either requires a "tail minus one" pointer or requires doubly
linking the user space event queue.
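
Roughly, the user space bookkeeping looks like the sketch below
(mine, not code from any particular program; struct uevent and
the queue type are invented names):

#include <stddef.h>

struct uevent {
        int             fd;
        size_t          bytes;          /* bytes the event claims are readable */
        struct uevent   *prev, *next;   /* doubly linked, per the text above */
};

struct uevent_queue {
        struct uevent   *head, *tail;
};

/* Push an event at the head of the queue (LIFO order). */
static void uevq_push_head(struct uevent_queue *q, struct uevent *ev)
{
        ev->prev = NULL;
        ev->next = q->head;
        if (q->head)
                q->head->prev = ev;
        else
                q->tail = ev;
        q->head = ev;
}

/* After consuming 'consumed' of ev->bytes, put the remainder back at
 * the front of the queue so the next read completes as expected. */
void uevq_requeue_partial(struct uevent_queue *q,
                          struct uevent *ev, size_t consumed)
{
        if (consumed >= ev->bytes)
                return;                 /* fully consumed; nothing to requeue */
        ev->bytes -= consumed;          /* e.g. 1k event, 128 bytes read -> 1k-128 left */
        uevq_push_head(q, ev);          /* LIFO, not FIFO */
}

The requeued remainder jumps ahead of events that arrived later,
which is the LIFO-in-front-of-FIFO behavior described above.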

> I know nothing about any other implementations, though, and I'm speaking
> mainly from the experiences I've had with coding daemons using select().

Programs which are select-based are usually DFAs (Deterministic
Finite State Automata), which operate on non-blocking file
descriptors. This means that I/O is not interleaved, and so is
not necessarily as efficient as it could be, should there ever
occur a time when an I/O completion posts sooner after being
checked than the amount of time it takes to complete 50% of an
event processing cycle (the reasons for this involve queueing
theory algebra, and are easier to explain in terms of the relative
value of positive and negative caches in situations where a cache
miss results in the need to perform a linear traversal). A lot
of this can be "magically" alleviated using POSIX AIO calls in
the underlying implementation, instead of relying on non-blocking
I/O -- even then, don't expect better than a 50% improvement, and
be prepared for it to be closer to 30%-40%. Expect kernel threads
to max out at less than 50% (and usually be closer to 25%-35%),
and expect kernel threads on any non-commercial OS and on SVR4
systems, excluding the most recent versions of Solaris, to do
much worse than that.
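
For reference, the select-based DFA shape being discussed is
roughly the skeleton below (a bare-bones sketch of mine;
conn_table, step_connection() and the state names are placeholders,
and a real server would also track write interest, timeouts, and
the listening socket):

#include <sys/select.h>

#define MAX_CONNS 64

enum conn_state { ST_IDLE, ST_READING, ST_WRITING, ST_CLOSED };

struct conn {
        int             fd;             /* -1 marks an unused slot */
        enum conn_state state;
};

/* Slots are assumed to be initialized elsewhere with fd = -1. */
static struct conn conn_table[MAX_CONNS];

/* Application-specific state transitions, stubbed for the sketch.
 * It must never block: the descriptors are O_NONBLOCK, so a short
 * read or write just leaves the state alone until the next pass. */
static void step_connection(struct conn *c)
{
        (void)c;
}

void event_loop(void)
{
        for (;;) {
                fd_set rset;
                int maxfd = -1;
                int i;

                FD_ZERO(&rset);
                for (i = 0; i < MAX_CONNS; i++) {
                        if (conn_table[i].fd >= 0 &&
                            conn_table[i].state != ST_CLOSED) {
                                FD_SET(conn_table[i].fd, &rset);
                                if (conn_table[i].fd > maxfd)
                                        maxfd = conn_table[i].fd;
                        }
                }

                if (select(maxfd + 1, &rset, NULL, NULL, NULL) <= 0)
                        continue;       /* interrupted or nothing ready */

                for (i = 0; i < MAX_CONNS; i++)
                        if (conn_table[i].fd >= 0 &&
                            FD_ISSET(conn_table[i].fd, &rset))
                                step_connection(&conn_table[i]);
        }
}

The whole program is one loop; the mean (and worst-case) time spent
inside step_connection() is what bounds the latency of everything
else, which is the point about mean processing time below.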

Really, the performance of DFAs based on select loops has more
to do with the mean processing time for any arbitrary event than
with anything else. I was able to run logins for an entire lab full
of X terminals with an xdm replacement DFA that was very tightly
controlled in its longest and mean path processing latency, and
recover the use of a SPARCServer that was otherwise burning all
its VM and CPU resources on swapping 32 copies of xdm in and out
for no good reason.

The AfterBurner web server (John Sokol, et al.) is still, I
believe, the highest-performance HTTP server by a wide margin,
beating out anything that uses threads or is otherwise not
DFA-based, as it is.

Even kqueue and kernel threads are DFAs, in the limit: the
question is on which side of the user/kernel barrier the code
lives, not whether or not it lives somewhere. All UNIX schedulers
(well, all reasonable and deployed-for-actual-work ones) are
DFAs in disguise.

> You mention you wrote a paper discussing this issue...Where could I find
> this?

I'm pretty sure he will give you the paper reference; I personally
don't have the reference handy (sorry).

I think this is more a case of believing the weight of the
literature, rather than an individual report, so you will want
to look at his paper, but also at others in the field, before
you decide to believe anything specific (though I have no doubt
you'll agree with his findings, after you do that). Otherwise,
it's just him against Linus, and making a decision based on one
paper or an author's reputation is bad science (and bad engineering).

                                        Terry Lambert
                                        terry@lambert.org

---
Any opinions in this posting are my own and not those of my present
or previous employers.


