Re: INFO: task hung in __io_uring_register

From: Jens Axboe
Date: Mon Apr 15 2019 - 12:49:40 EST


On 4/15/19 7:52 AM, syzbot wrote:
> syzbot has found a reproducer for the following crash on:
>
> HEAD commit: dc4060a5 Linux 5.1-rc5
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=17eacddd200000
> kernel config: https://syzkaller.appspot.com/x/.config?x=856fc6d0fbbeede9
> dashboard link: https://syzkaller.appspot.com/bug?extid=16dc03452dee970a0c3e
> compiler: gcc (GCC) 9.0.0 20181231 (experimental)
> userspace arch: i386
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=14eb47ed200000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=112175dd200000
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+16dc03452dee970a0c3e@xxxxxxxxxxxxxxxxxxxxxxxxx
>
> INFO: task syz-executor043:8083 blocked for more than 143 seconds.
> Not tainted 5.1.0-rc5 #68
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> syz-executor043 D29912 8083 8063 0x20020004
> Call Trace:
> context_switch kernel/sched/core.c:2877 [inline]
> __schedule+0x813/0x1cc0 kernel/sched/core.c:3518
> schedule+0x92/0x180 kernel/sched/core.c:3562
> schedule_timeout+0x8ca/0xfd0 kernel/time/timer.c:1779
> do_wait_for_common kernel/sched/completion.c:83 [inline]
> __wait_for_common kernel/sched/completion.c:104 [inline]
> wait_for_common kernel/sched/completion.c:115 [inline]
> wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
> __io_uring_register+0xb6/0x1fd0 fs/io_uring.c:2929
> __do_sys_io_uring_register fs/io_uring.c:2979 [inline]
> __se_sys_io_uring_register fs/io_uring.c:2961 [inline]
> __ia32_sys_io_uring_register+0x193/0x1f0 fs/io_uring.c:2961
> do_syscall_32_irqs_on arch/x86/entry/common.c:326 [inline]
> do_fast_syscall_32+0x281/0xc98 arch/x86/entry/common.c:397
> entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> RIP: 0023:0xf7f06869
> Code: Bad RIP value.
> RSP: 002b:00000000f7f021ec EFLAGS: 00000296 ORIG_RAX: 00000000000001ab
> RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000000001
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000014a
> RBP: 00000000200000c0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> INFO: task syz-executor043:8084 blocked for more than 143 seconds.
> Not tainted 5.1.0-rc5 #68
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> syz-executor043 D30320 8084 8063 0x20020004
> Call Trace:
> context_switch kernel/sched/core.c:2877 [inline]
> __schedule+0x813/0x1cc0 kernel/sched/core.c:3518
> schedule+0x92/0x180 kernel/sched/core.c:3562
> schedule_preempt_disabled+0x13/0x20 kernel/sched/core.c:3620
> __mutex_lock_common kernel/locking/mutex.c:1002 [inline]
> __mutex_lock+0x726/0x1310 kernel/locking/mutex.c:1072
> mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1087
> __do_sys_io_uring_enter fs/io_uring.c:2678 [inline]
> __se_sys_io_uring_enter fs/io_uring.c:2637 [inline]
> __ia32_sys_io_uring_enter+0x67f/0xac0 fs/io_uring.c:2637
> do_syscall_32_irqs_on arch/x86/entry/common.c:326 [inline]
> do_fast_syscall_32+0x281/0xc98 arch/x86/entry/common.c:397
> entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> RIP: 0023:0xf7f06869
> Code: Bad RIP value.
> RSP: 002b:00000000f7ee11ec EFLAGS: 00000292 ORIG_RAX: 00000000000001aa
> RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000010005
> RDX: 0000000000000002 RSI: 0000000000000003 RDI: 0000000000000000
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>
> Showing all locks held in the system:
> 1 lock held by khungtaskd/1042:
> #0: 00000000fbdbf33e (rcu_read_lock){....}, at:
> debug_show_all_locks+0x5f/0x27e kernel/locking/lockdep.c:5056
> 2 locks held by rsyslogd/7952:
> #0: 000000001bb2c5a7 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0xee/0x110
> fs/file.c:801
> #1: 000000007392ce44 (fs_reclaim){+.+.}, at: is_bpf_text_address+0x0/0x170
> kernel/bpf/core.c:697
> 2 locks held by getty/8041:
> #0: 00000000e7d67802 (&tty->ldisc_sem){++++}, at:
> ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
> #1: 0000000026520570 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8042:
> #0: 000000004426012c (&tty->ldisc_sem){++++}, at:
> ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
> #1: 0000000004c02a9e (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8043:
> #0: 00000000e0209856 (&tty->ldisc_sem){++++}, at:
> ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
> #1: 000000009f921bda (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8044:
> #0: 0000000000de2ea5 (&tty->ldisc_sem){++++}, at:
> ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
> #1: 000000002674bf52 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8045:
> #0: 00000000abbba1ef (&tty->ldisc_sem){++++}, at:
> ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
> #1: 00000000ef32475e (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8046:
> #0: 000000008d742c83 (&tty->ldisc_sem){++++}, at:
> ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
> #1: 00000000fa255b8a (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 2 locks held by getty/8047:
> #0: 00000000d3d56b2b (&tty->ldisc_sem){++++}, at:
> ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
> #1: 0000000015242045 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156
> 1 lock held by syz-executor043/8083:
> #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at:
> __do_sys_io_uring_register fs/io_uring.c:2978 [inline]
> #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at:
> __se_sys_io_uring_register fs/io_uring.c:2961 [inline]
> #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at:
> __ia32_sys_io_uring_register+0x182/0x1f0 fs/io_uring.c:2961
> 1 lock held by syz-executor043/8084:
> #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at: __do_sys_io_uring_enter
> fs/io_uring.c:2678 [inline]
> #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at: __se_sys_io_uring_enter
> fs/io_uring.c:2637 [inline]
> #0: 000000005524ed18 (&ctx->uring_lock){+.+.}, at:
> __ia32_sys_io_uring_enter+0x67f/0xac0 fs/io_uring.c:2637
>
> =============================================
>
> NMI backtrace for cpu 1
> CPU: 1 PID: 1042 Comm: khungtaskd Not tainted 5.1.0-rc5 #68
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x172/0x1f0 lib/dump_stack.c:113
> nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101
> nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62
> arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
> trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
> check_hung_uninterruptible_tasks kernel/hung_task.c:204 [inline]
> watchdog+0x9b7/0xec0 kernel/hung_task.c:288
> kthread+0x357/0x430 kernel/kthread.c:253
> ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
> Sending NMI from CPU 1 to CPUs 0:

This is a deadlock caused by io_uring_register() waiting for existing
users to exit, while holding the ctx->uring_lock mutex. Callers that
are currently inside io_uring_enter() may need that same mutex to make
progress and drop their reference. Hence we're deadlocked - one side
holds the mutex and waits for the users to go away, the other side
waits on the mutex before it can finish up and exit.
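
For reference, the syz reproducer boils down to two threads racing
io_uring_enter() and io_uring_register() on the same ring. Below is a
rough, hypothetical userspace sketch of that pattern (raw syscalls; the
x86_64/asm-generic syscall numbers and a <linux/io_uring.h> that carries
the io_uring definitions are assumptions). It won't reliably hit the
race window, but it shows the two call sites that end up contending on
ctx->uring_lock. With the patch below it should just run to completion;
build with -pthread.

#define _GNU_SOURCE
#include <linux/io_uring.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumed syscall numbers (x86_64 / asm-generic); adjust per arch. */
#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup	425
#define __NR_io_uring_enter	426
#define __NR_io_uring_register	427
#endif

static int ring_fd;
static volatile int stop;

/* Each io_uring_enter() call takes a ctx reference and then needs
 * ctx->uring_lock for the (empty) submission pass. */
static void *enter_thread(void *arg)
{
	while (!stop)
		syscall(__NR_io_uring_enter, ring_fd, 1, 0, 0, NULL, 0);
	return NULL;
}

int main(void)
{
	struct io_uring_params p;
	static char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	pthread_t t;
	int i;

	memset(&p, 0, sizeof(p));
	ring_fd = syscall(__NR_io_uring_setup, 4, &p);
	if (ring_fd < 0) {
		perror("io_uring_setup");
		return 1;
	}

	pthread_create(&t, NULL, enter_thread, NULL);

	/*
	 * Each register call kills ctx->refs and waits for them to drain;
	 * without the fix it does so while holding ctx->uring_lock, so it
	 * can wedge against the enter thread above. Repeat -EBUSY failures
	 * after the first successful registration don't matter here.
	 */
	for (i = 0; i < 100; i++)
		syscall(__NR_io_uring_register, ring_fd,
			IORING_REGISTER_BUFFERS, &iov, 1);

	stop = 1;
	pthread_join(t, NULL);
	printf("done, no hang\n");
	return 0;
}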

This should fix it.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index f4ddb9d23241..b35300e4c9a7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2929,11 +2929,23 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 
 static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			       void __user *arg, unsigned nr_args)
+	__releases(ctx->uring_lock)
+	__acquires(ctx->uring_lock)
 {
 	int ret;
 
 	percpu_ref_kill(&ctx->refs);
+
+	/*
+	 * Drop uring mutex before waiting for references to exit. If another
+	 * thread is currently inside io_uring_enter() it might need to grab
+	 * the uring_lock to make progress. If we hold it here across the drain
+	 * wait, then we can deadlock. It's safe to drop the mutex here, since
+	 * no new references will come in after we've killed the percpu ref.
+	 */
+	mutex_unlock(&ctx->uring_lock);
 	wait_for_completion(&ctx->ctx_done);
+	mutex_lock(&ctx->uring_lock);
 
 	switch (opcode) {
 	case IORING_REGISTER_BUFFERS:

--
Jens Axboe