io_uring question

From: Filipp Mikoian
Date: Thu Jul 04 2019 - 07:27:24 EST


Hi dear io_uring developers,

Recently I started playing with io_uring, and the main difference I expected
to see compared with the old AIO (io_submit(), etc.) was that the submission
syscall (io_uring_enter()) would not block when submission might take a long
time, e.g. when it has to wait for a slot in the block device request queue.
AFAIU, the 'workers' machinery exists solely to submit requests from an async
context, so that the calling thread is not forced to block for a significant
time. At worst, EAGAIN is expected.

However, when I installed a fresh 5.2.0-rc7 kernel on a machine with an HDD
that has a 64-request-deep queue, I noticed a significant increase in the time
spent in io_uring_enter() once the request queue became full. Below you can
find the output of a program that submits random (within a 1GB range) 4K read
requests in batches of 32. Though O_DIRECT is used, the same phenomenon is
observed when using the page cache. The source code can be found here:
https://github.com/Phikimon/io_uring_question
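
For readers who do not want to open the repository, the measurement boils
down to roughly the following. This is only a simplified sketch, not the code
from the repository: the helper names (random_block_offset, elapsed_us,
time_call_us) and the REQ_SIZE/RANGE_BYTES constants are made up for
illustration, and the actual io_uring_enter() call is replaced by a generic
callback so that the snippet does not depend on liburing.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define REQ_SIZE    4096ULL        /* 4K per read request */
#define RANGE_BYTES (1ULL << 30)   /* offsets are picked within 1GB */

/* Pick a random 4K-aligned offset inside the first 1GB of the device. */
uint64_t random_block_offset(void)
{
    uint64_t nr_blocks = RANGE_BYTES / REQ_SIZE;

    return ((uint64_t)rand() % nr_blocks) * REQ_SIZE;
}

/* Microseconds elapsed between two CLOCK_MONOTONIC timestamps. */
int64_t elapsed_us(const struct timespec *start, const struct timespec *end)
{
    int64_t ns = (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
               + (int64_t)(end->tv_nsec - start->tv_nsec);

    return ns / 1000;
}

/*
 * Time one call of fn(arg). In the real test, fn would wrap the
 * io_uring_enter() call that submits one batch of 32 SQEs.
 */
int64_t time_call_us(void (*fn)(void *), void *arg)
{
    struct timespec start, end;

    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_us(&start, &end);
}
```

The submit_time values in the output below correspond to what time_call_us()
would return for each batch.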

While analyzing the stack dump, I found that setting the IOCB_NOWAIT flag
does not prevent generic_file_read_iter() from calling blkdev_direct_IO(),
so the thread gets stuck for hundreds of milliseconds. However, I am not a
Linux kernel expert, so I cannot be sure this is actually related to the
issue described above.

Is it actually expected that io_uring sleeps when there is no free slot in
the block device's request queue, or is this a bug in the current
implementation?

root@localhost:~/io_uring# uname -msr
Linux 5.2.0-rc7 x86_64
root@localhost:~/io_uring# hdparm -I /dev/sda | grep Model
Model Number: Hitachi HTS541075A9E680
root@localhost:~/io_uring# cat /sys/block/sda/queue/nr_requests
64
root@localhost:~/io_uring# ./io_uring_read_blkdev /dev/sda8
submitted_already = 0, submitted_now = 32, submit_time = 246 us
submitted_already = 32, submitted_now = 32, submit_time = 130 us
submitted_already = 64, submitted_now = 32, submit_time = 189548 us
submitted_already = 96, submitted_now = 32, submit_time = 121542 us
submitted_already = 128, submitted_now = 32, submit_time = 128314 us
submitted_already = 160, submitted_now = 32, submit_time = 136345 us
submitted_already = 192, submitted_now = 32, submit_time = 162320 us
root@localhost:~/io_uring# cat pstack_output # This is where process slept
[<0>] io_schedule+0x16/0x40
[<0>] blk_mq_get_tag+0x166/0x280
[<0>] blk_mq_get_request+0xde/0x380
[<0>] blk_mq_make_request+0x11e/0x5b0
[<0>] generic_make_request+0x191/0x3c0
[<0>] submit_bio+0x75/0x140
[<0>] blkdev_direct_IO+0x3f8/0x4a0
[<0>] generic_file_read_iter+0xbf/0xdc0
[<0>] blkdev_read_iter+0x37/0x40
[<0>] io_read+0xf6/0x180
[<0>] __io_submit_sqe+0x1cd/0x6a0
[<0>] io_submit_sqe+0xea/0x4b0
[<0>] io_ring_submit+0x86/0x120
[<0>] __x64_sys_io_uring_enter+0x241/0x2d0
[<0>] do_syscall_64+0x60/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

P.S. There are also several suspicious places in liburing and in the kernel
part of io_uring. I'm not sure whether these are really bugs, so please point
out if any of them needs a patch. Among them:
1. Inaccurate handling of errors in liburing/__io_uring_submit(). Because
liburing currently ignores the queue head that the kernel sets, it cannot
know how many entries have actually been consumed. If, for example,
io_uring_enter() returns EAGAIN and consumes none of the SQEs, sq->sqe_head
still advances in __io_uring_submit(); this can eventually cause both
io_uring_submit() and io_uring_sqe() to return 0 forever.
2. There is also a related issue: when using IORING_SETUP_SQPOLL, if the
polling kernel thread has already gone to sleep (IORING_SQ_NEED_WAKEUP is
set), io_uring_enter() just wakes it up and immediately reports that all
@to_submit requests have been consumed, although this is not true until the
awakened thread gets around to handling them. At the very least, this
contradicts the man page, which states:
> When the system call returns that a certain amount of SQEs have been
> consumed and submitted, it's safe to reuse SQE entries in the ring.
It is easy to reproduce this bug: just change e.g. the ->offset field in the
SQE immediately after io_uring_enter() successfully returns, and you will see
that the IO happened at the new offset.
3. Again due to the lack of synchronization between io_sq_thread() and
io_uring_enter(), when the ring is full and IORING_SETUP_SQPOLL is used,
there seems to be no way for the application to wait for slots in the SQ to
become available other than busy-waiting for *sq->khead to advance. Thus
instead of one busy-waiting thread we get two. Is this the expected behavior?
Should the user of IORING_SETUP_SQPOLL busy-wait for slots in the SQ?
4. A minor one: if sq_thread_idle is set to a ridiculously large value (e.g.
100 sec), the kernel watchdog starts reporting this as a bug.
> Message from syslogd@centos-linux at Jun 21 20:00:04 ...
> kernel:watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [io_uring-sq:10691]
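
To make point 1 more concrete, here is a small userspace model of the
accounting problem. It is a sketch under my own assumptions, not liburing's
actual code: struct sq_model, fake_enter() and both submit functions are
hypothetical names, and fake_enter() merely stands in for io_uring_enter()
(it returns -EAGAIN when told to accept nothing). The second submit function
shows one possible way to recover, by deriving the submit count from the
kernel's head instead of from sqe_head; I am not claiming this is the right
fix.

```c
#include <assert.h>
#include <errno.h>

/* Simplified model of the SQ bookkeeping; all names are hypothetical. */
struct sq_model {
    unsigned sqe_head; /* userspace: first SQE not yet published     */
    unsigned sqe_tail; /* userspace: one past the last prepared SQE  */
    unsigned khead;    /* ring head, advanced by the "kernel"        */
    unsigned ktail;    /* ring tail, advanced by userspace           */
};

/*
 * Stand-in for io_uring_enter(): consumes at most 'accept' of the
 * 'to_submit' entries, or fails with -EAGAIN when accept == 0.
 */
int fake_enter(struct sq_model *sq, unsigned to_submit, unsigned accept)
{
    unsigned n = to_submit < accept ? to_submit : accept;

    assert(to_submit <= sq->ktail - sq->khead);
    if (accept == 0)
        return -EAGAIN;
    sq->khead += n;
    return (int)n;
}

/* Behaviour from point 1: sqe_head advances whatever the syscall returns. */
int submit_ignoring_khead(struct sq_model *sq, unsigned accept)
{
    unsigned to_submit = sq->sqe_tail - sq->sqe_head;

    if (to_submit == 0)
        return 0;                 /* nothing "new", so nothing is retried */
    sq->ktail += to_submit;       /* publish entries in the ring */
    sq->sqe_head = sq->sqe_tail;  /* advanced before the outcome is known */
    return fake_enter(sq, to_submit, accept);
}

/* Possible alternative: count everything the kernel has not consumed yet. */
int submit_tracking_khead(struct sq_model *sq, unsigned accept)
{
    unsigned to_submit;

    sq->ktail += sq->sqe_tail - sq->sqe_head;
    sq->sqe_head = sq->sqe_tail;
    to_submit = sq->ktail - sq->khead; /* includes earlier failed entries */
    if (to_submit == 0)
        return 0;
    return fake_enter(sq, to_submit, accept);
}

/* Entries published in the ring but not yet consumed. */
unsigned pending_in_ring(const struct sq_model *sq)
{
    return sq->ktail - sq->khead;
}
```

In this model, after one -EAGAIN the first variant keeps returning 0 while
the published entries stay in the ring, whereas the second variant resubmits
them on the next call.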

Looking forward to your reply and thank you in advance.