Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

From: Michael K. Edwards
Date: Mon Mar 05 2007 - 00:24:03 EST


On 3/4/07, Kyle Moffett <mrmacman_g4@xxxxxxx> wrote:
Well, even this far into 2.6, Linus' patch from 2003 still (mostly)
applies; the maintenance cost for this kind of code is virtually
zilch. If it matters that much to you clean it up and make it apply;
add an alarmfd() syscall (another 100 lines of code at most?) and
make a "read" return an architecture-independent siginfo-like
structure and submit it for inclusion. Adding epoll() support for
random objects is as simple as a 75-line object-filesystem and a 25-
line syscall to return an FD to a new inode. Have fun! Go wild!
Something this trivially simple could probably spend a week in -mm
and go to linus for 2.6.22.

Or, if you want to do slightly more work and produce something a great
deal more useful, you could implement additional netlink address
families for additional "event" sources. The socket - setsockopt -
bind - sendmsg/recvmsg sequence is a well understood and well
documented UNIX paradigm for multiplexing non-blocking I/O to many
destinations over one socket. Everyone who has read Stevens is
familiar with the basic UDP and "fd open server" techniques, and if
you look at Linux's IP_PKTINFO and NETLINK_W1 (bravo, Evgeniy!) you'll
see how easily they could be extended to file AIO and other kinds of
event sources.

For file AIO, you might have the application open one AIO socket per
mount point, open files indirectly via the SCM_RIGHTS mechanism, and
submit/retire read/write requests via sendmsg/recvmsg with ancillary
data consisting of an lseek64 tuple and a user-provided cookie.
Although the process still has to have one fd open per actual open
file (because trying to authenticate file accesses without opening fds
is madness), the only fds it has to manipulate directly are those
representing entire pools of outstanding requests. This is usually a
small enough set that select() will do just fine, if you're careful
with fd allocation. (You can simply punt indirectly opened fds up to
a high numerical range, where they can't be accessed directly from
userspace but still make fine cookies for use in lseek64 tuples within
cmsg headers).

The same basic approach will work for timers, signals, and just about
any other event source. Userspace is of course still stuck doing its
own state machines / thread scheduling / however you choose to think
of it. But all the important activity goes through socketcall(), and
the data and control parameters are all packaged up into a struct
msghdr instead of the bare buffer pointers of read/write. So if
someone else does come along later and design an ultralight threading
mechanism that isn't a total botch, the actual data paths won't need
much rework; the exception handling will just get a lot simpler.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/