[patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support

From: Ingo Molnar
Date: Tue Feb 13 2007 - 09:25:45 EST


I'm pleased to announce the first release of the "Syslet" kernel feature
and kernel subsystem, which provides generic asynchrous system call
support:

http://redhat.com/~mingo/syslet-patches/

Syslets are small, simple, lightweight programs (consisting of
system-calls, 'atoms') that the kernel can execute autonomously (and,
not the least, asynchronously), without having to exit back into
user-space. Syslets can be freely constructed and submitted by any
unprivileged user-space context - and they have access to all the
resources (and only those resources) that the original context has
access to.

because the proof of the pudding is eating it, here are the performance
results from async-test.c which does open()+read()+close() of 1000 small
random files (smaller is better):

synchronous IO | Syslets:
--------------------------------------------
uncached: 45.8 seconds | 34.2 seconds ( +33.9% )
cached: 31.6 msecs | 26.5 msecs ( +19.2% )

("uncached" results were done via "echo 3 > /proc/sys/vm/drop_caches".
The default IO scheduler was the deadline scheduler, the test was run on
ext3, using a single PATA IDE disk.)

So syslets, in this particular workload, are a nice speedup /both/ in
the uncached and in the cached case. (note that i used only a single
disk, so the level of parallelism in the hardware is quite limited.)

the testcode can be found at:

http://redhat.com/~mingo/syslet-patches/async-test-0.1.tar.gz

The boring details:

Syslets consist of 'syslet atoms', where each atom represents a single
system-call. These atoms can be chained to each other: serially, in
branches or in loops. The return value of an executed atom is checked
against the condition flags. So an atom can specify 'exit on nonzero' or
'loop until non-negative' kind of constructs.

Syslet atoms fundamentally execute only system calls, thus to be able to
manipulate user-space variables from syslets i've added a simple special
system call: sys_umem_add(ptr, val). This can be used to increase or
decrease the user-space variable (and to get the result), or to simply
read out the variable (if 'val' is 0).

So a single syslet (submitted and executed via a single system call) can
be arbitrarily complex. For example it can be like this:

--------------------
| accept() |-----> [ stop if returns negative ]
--------------------
|
V
-------------------------------
| setsockopt(TCP_NODELAY) |-----> [ stop if returns negative ]
-------------------------------
|
v
--------------------
| read() |<---------
-------------------- | [ loop while positive ]
| | |
| ---------------------
|
-----------------------------------------
| decrease and read user space variable |
----------------------------------------- A
| |
-------[ loop back to accept() if positive ]------

(you can find a VFS example and a hello.c example in the user-space
testcode.)

A syslet is executed opportunistically: i.e. the syslet subsystem
assumes that the syslet will not block, and it will switch to a
cachemiss kernel thread from the scheduler. This means that even a
single-atom syslet (i.e. a pure system call) is very close in
performance to a pure system call. The syslet NULL-overhead in the
cached case is roughly 10% of the SYSENTER NULL-syscall overhead. This
means that two atoms are a win already, even in the cached case.

When a 'cachemiss' occurs, i.e. if we hit schedule() and are about to
consider other threads, the syslet subsystem picks up a 'cachemiss
thread' and switches the current task's user-space context over to the
cachemiss thread, and makes the cachemiss thread available. The original
thread (which now becomes a 'busy' cachemiss thread) continues to block.
This means that user-space will still be executed without stopping -
even if user-space is single-threaded.

if the submitting user-space context /knows/ that a system call will
block, it can request immediate 'cachemiss' via the SYSLET_ASYNC flag.
This would be used if for example an O_DIRECT file is read() or
write()n.

likewise, if user-space knows (or expects) that a system call takes alot
of CPU time even in the cached case, and it wants to offload it to
another asynchronous context, it can request that via the SYSLET_ASYNC
flag too.

completions of asynchronous syslets are done via a user-space ringbuffer
that the kernel fills and user-space clears. Waiting is done via the
sys_async_wait() system call. Completion can be supressed on a per-atom
basis via the SYSLET_NO_COMPLETE flag, for atoms that include some
implicit notification mechanism. (such as sys_kill(), etc.)

As it might be obvious to some of you, the syslet subsystem takes many
ideas and experience from my Tux in-kernel webserver :) The syslet code
originates from a heavy rewrite of the Tux-atom and the Tux-cachemiss
infrastructure.

Open issues:

- the 'TID' of the 'head' thread currently varies depending on which
thread is running the user-space context.

- signal support is not fully thought through - probably the head
should be getting all of them - the cachemiss threads are not really
interested in executing signal handlers.

- sys_fork() and sys_async_exec() should be filtered out from the
syscalls that are allowed - first one only makes sense with ptregs,
second one is a nice kernel recursion thing :) I didnt want to
duplicate the sys_call_table though - maybe others have a better
idea.

See more details in Documentation/syslet-design.txt. The patchset is
against v2.6.20, but should apply to the -git head as well.

Thanks to Zach Brown for the idea to drive cachemisses via the
scheduler. Thanks to Arjan van de Ven for early review feedback.

Comments, suggestions, reports are welcome!

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/