Linux AIO status & todo

From: Suparna Bhattacharya
Date: Tue Aug 23 2005 - 02:36:12 EST



Ingo suggested putting together a summary note describing the status (e.g.
pending out-of-tree patches) and TODO items that need fixing in the mainline
linux kernel AIO implementation to get good AIO support in both kernel-space
and user-space, starting with enabling reasonably efficient and compliant
POSIX AIO on top of kernel AIO. Since Sébastien is on a longish leave, I
thought I'd go ahead and post it anyway and refine it along the way, rather
than delay the discussion further. So here is a first-cut attempt for
review and feedback.

Thoughts ?

Ulrich,
This doesn't go as far as addressing the blue-sky section of your posix
aio requirements list, but I think it tries to cover some of the major issues.
Do you see anything significant that is missing here ?

Regards
Suparna

--
Suparna Bhattacharya (suparna@xxxxxxxxxx)
Linux Technology Center
IBM Software Lab, India


Linux kernel AIO Status/Todo
----------------------------

Put together by Sébastien Dugué <sebastien.dugue@xxxxxxxx> with
inputs and additions from Ben LaHaise <bcrl@xxxxxxxxx, bcrl@xxxxxxxxxxxxxxx>
and Suparna Bhattacharya <suparna@xxxxxxxxxx>.


1. Linux kernel 2.6.13 AIO support
----------------------------------

The current 2.6.13 kernel native AIO infrastructure allows implementation of
only a subset of the POSIX AIO API.

Currently the restrictions are:

1. AIO support is only provided for files opened with O_DIRECT and
when the user buffer and size are block aligned. This means that IO
requests going through the page cache are still synchronous if the
pages are not in the cache.

2. No support for propagating IO completion events to user space
threads using RT signals. User threads need to poll the completion
queue using io_getevents. POSIX specifies that when an AIO
request completes, a signal can be delivered to the application
to indicate the completion of the IO.

3. No support for listio completion notification. POSIX specifies that
if the lio_listio mode is LIO_NOWAIT then asynchronous notification
shall occur upon completion of all the IOs on the list.

4. No support for listio LIO_WAIT. POSIX specifies that
if the lio_listio mode is LIO_WAIT then the caller blocks
until completion of all the IOs on the list.

5. No support for prioritized IO - aio_reqprio field of the aiocb.

6. No support for cancelation against a file descriptor. POSIX
specifies that if the aiocb argument to aio_cancel is NULL then
all cancelable AIO requests against the file descriptor shall be
canceled.

7. Cancellation of iocbs is not implemented (the infrastructure exists
but cancel methods haven't been implemented yet for supported
AIO operations, so cancellation returns -EAGAIN)

8. No support for aio_fsync.

9. AIO on sockets is not implemented and exhibits synchronous
behaviour

10.No support for AIO on pipes

An implementation of Linux POSIX AIO using kernel AIO, authored by
Laurent Vivier and Sébastien Dugue is available at:
http://www.bullopensource.org/posix.

The implementation uses a single ioctx for all POSIX AIO requests,
avoiding the need to wait on multiple contexts for aio_suspend, and
can take advantage of additional kernel patches described below for
providing more complete and efficient POSIX AIO.

Pradeep Padala (<ppadala@xxxxxxxxxxxxxx>) mentioned that he is working
toward some glibc patches based on the above implementation.

2. Additional support provided by patches
-----------------------------------------

Kernel patches add some missing functionality previously described.


2.1. Buffered filesystem AIO (#1)
----------------------------

This is addressed by Suparna's patches for buffered filesystem AIO.


2.2. AIO completion sigevent (#2)
-------------------------

This is addressed by Laurent and Sébastien's aioevent patch,
with modifications from Ben LaHaise. It adds an
aio_sigevent struct to the iocb. The relevant fields of the sigevent
(pid, signal number, notification type and value) are extracted and
stored in the kiocb for use upon request completion.

The sigevent structure is filled in by the user application as part of
an AIO request preparation. Upon request completion, the kernel notifies
the application using those sigevent parameters. If SIGEV_NONE has been
specified then the old behavior is retained and the application must
rely on polling the completion queue.


2.3 Listio completion event (#3)
---------------------------

There are a few alternative approaches under consideration to address
this:

(a) IOCB_CMD_EVENT marker iocbs
Laurent's and Sébastien's lioevent patch introduces an
IOCB_CMD_EVENT command. As part of listio submission,
userspace creates an empty special request with an aio_lio_opcode of
IOCB_CMD_EVENT filling up only the aio_sigevent fields.
The purpose of IOCB_CMD_EVENT is to group together the following
requests in the list up to the end of the list (or up to the next
IOCB_CMD_EVENT request in the list).

In sys_io_submit, upon detecting such a marker iocb, an lio_event is
created which contains the necessary information for
signaling a thread (signal number, pid, notify type and value) along
with a count of requests attached to this event.
Each subsequent submitted request is attached to this lio_event by
setting the request kiocb to that lio_event. When all the requests
in the group have completed then aio_complete() knows that it is time
to signal the user process.

(b) IOCB_CMD_GROUP for submitting a group of iocbs
This approach introduces a new IOCB_CMD_GROUP command iocb, which
takes as an argument a group of iocbs which must be submitted and
completed before marking the IOCB_CMD_GROUP iocb complete (the
argument may be passed in as a user-space buffer to be copied in).
Internally a struct kiocb *ki_liocb is added to the kiocb structure, to link the individual iocbs with the group command iocb, so that an
aio_complete() can be issued on the latter when all the iocbs in
the group are done. Upon request completion of the IOCB_CMD_GROUP
iocb, the kernel notifies the application using its corresponding
sigevent parameters. [Status: Patch to be developed]

(c) A new io_submit_group() or lio_submit() syscall
Similar to (b), but using an explicit system call.


2.4. Listio LIO_WAIT (#4)
--------------------

Alternative approaches under consideration include:
(a) IOCB_CMD_CHECKPOINT marker iocbs
Laurent's and Sébastien's liowait patch adds support for an in-kernel
POSIX listio LIO_WAIT mechanism. This works by adding an
IOCB_CMD_CHECKPOINT command and builds upon the lioevent
patch described in 2.3(a). As part of listio submission, userspace
prepends an empty iocb to the list with an aio_lio_opcode of
IOCB_CMD_CHECKPOINT. All iocbs following this particular CHECKPOINT
iocb are in the same group and sys_io_submit will block until all
iocbs submitted in the group have completed.

The behavior is similar to IOCB_CMD_EVENT. In sys_io_submit, upon
detecting such a marker iocb, an lio_event is created.
Each subsequent submitted request is attached to this lio_event by
setting the request kiocb to that lio_event (in io_submit_one) and
incrementing the lio_users count.

(b) IOCB_CMD_GROUP with min_nr wakeup in io_getevents
An io_submit() with IOCB_CMD_GROUP as described in 2.3(b) with
SIGEV_NONE followed by a call to io_getevents() requesting a
single wakeup for min_nr events (patch from Ben LaHaise) can
help make LIO_WAIT implementation reasonably efficient.



2.5 AIO cancellation against a file descriptor (#6)
---------------------------------------------

Laurent's and Sébastien's cancelfd patch implements this by
walking the list of active requests queued onto an IO context and trying
to cancel all those requests related to the given file descriptor.
This doesn't scale well under the presence of thousands of iocbs to
several files. A better solution (as suggested by Ben LaHaise) would
be maintain a list of iocbs in struct file, which would also be
useful for getting the queueing semantics correct for network AIO
when it is implemented. [Status: Patch to be developed]

2.6 AIO for pipes (#10)
-----------------

Chris Mason had a patch to support AIO for pipes; more recently
Ben LaHaise's git tree includes a pipe AIO implementation which is
based on his patches for async semaphore support

2.7 Thread based fallback for unimplemented AIO operations (#8 etc)
------------------------------------------------------------

Ben LaHaise has a patch for an in-kernel thread based fallback using
regular synchronous IO for AIO operations that have not been
implemented as yet, as an interim measure while AIO gets extended
more widely to additional methods like aio_fsync and drivers like
sound. This enables user space application development to proceed
independently of asyncification for more methods.


2.8 Additional Features (beyond POSIX)
---------------------------------------

- Vector AIO patches aka AIO readv, writev (from Zach Brown)
Currently included in Ben's git tree

- Patches for epoll notification through AIO (Zach Brown/Feng Zhou/wli)
Needs benchmarking and reposting with updates

3. Work to do
-------------------

- Make the existing max aio events limit a ulimit

- Add support for prioritized IO (#5). This is optional for AIO if
POSIX_PRIORITIZED_IO is not defined, but mandatory for Realtime
profiles.

Work is currently going on to add IO priority support to the CFQ IO
scheduler (2 new syscalls). This could be used to map AIO priority
levels onto the scheduler priority levels provided the CFQ elevator
is used.

- Implement IO requests cancelation support at the fs level (#7), for
various operations.

- Implement AIO for network sockets (#9)

- Implement asynchronous fsync at the fs level (#8).

- Spread AIO to more drivers etc

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/