Re: Shutting down a VM with Kernel 4.14 will sometime hang and a reboot is the only way to recover.

From: Harald Moeller
Date: Sat Dec 02 2017 - 07:16:15 EST


Hello, my name is Harry and this is my first post here. I hope I'm doing this the right way; sorry if not.

I'm not subscribed to the full list yet, so please CC me personally on replies.

I am following this thread because I am seeing the same (or a very similar) issue with 4.14.2.

My setup is simpler: just an oVirt host shutting down some VMs. It doesn't happen every time, but I'd say in about 3 out of 10 shutdowns.

This is what I see (slightly different from what David reported):

Dec 01 16:11:53 oVirtHost01.xyz.net kernel: INFO: task qemu-kvm:1173 blocked for more than 120 seconds.
Dec 01 16:11:53 oVirtHost01.xyz.net kernel:       Tainted: G          I     4.14.2-1.el7.hakimo.x86_64 #4
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: qemu-kvm        D 0  1173      1 0x00000084
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: Call Trace:
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: __schedule+0x28d/0x880
Dec 01 16:11:53 oVirtHost01.xyz.net kernel:  schedule+0x36/0x80
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: vhost_net_ubuf_put_and_wait+0x61/0x90 [vhost_net]
Dec 01 16:11:53 oVirtHost01.xyz.net kernel:  ? remove_wait_queue+0x60/0x60
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: vhost_net_ioctl+0x317/0x8e0 [vhost_net]
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: do_vfs_ioctl+0xa7/0x5f0
Dec 01 16:11:53 oVirtHost01.xyz.net kernel:  SyS_ioctl+0x79/0x90
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: do_syscall_64+0x67/0x1b0
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: entry_SYSCALL64_slow_path+0x25/0x25
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: RIP: 0033:0x7fb8862d1107
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: RSP: 002b:00007fff4acd7e58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: RAX: ffffffffffffffda RBX: 000055abaa2d29c0 RCX: 00007fb8862d1107
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: RDX: 00007fff4acd7e60 RSI: 000000004008af30 RDI: 0000000000000028
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: RBP: 00007fff4acd7e60 R08: 000055aba805e10f R09: 00000000ffffffff
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: R10: 0000000000000004 R11: 0000000000000246 R12: 000055ababf32510
Dec 01 16:11:53 oVirtHost01.xyz.net kernel: R13: 0000000000000001 R14: 000055ababf32498 R15: 000055abaa2a0b40
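
For reference, the ioctl request number in the register dump above can be decoded by hand: ORIG_RAX is 0x10 (syscall 16, ioctl, on x86-64) and RSI holds the second argument, i.e. the request. The following is only a minimal sketch, assuming the generic _IOC bit layout from asm-generic/ioctl.h that x86-64 uses; the helper name decode_ioctl is made up for illustration.

```python
# Decode a Linux ioctl request number into its _IOC fields.
# Assumes the generic layout (asm-generic/ioctl.h), which x86-64 uses:
#   bits  0-7   nr    (command number)
#   bits  8-15  type  (driver "magic")
#   bits 16-29  size  (size of the argument struct, in bytes)
#   bits 30-31  dir   (1 = _IOC_WRITE, 2 = _IOC_READ)

def decode_ioctl(cmd):
    """Split an ioctl request number into nr/type/size/dir (hypothetical helper)."""
    return {
        "nr": cmd & 0xFF,
        "type": (cmd >> 8) & 0xFF,
        "size": (cmd >> 16) & 0x3FFF,
        "dir": (cmd >> 30) & 0x3,
    }

# RSI value taken from the register dump in the trace above.
RSI = 0x4008AF30
fields = decode_ioctl(RSI)

print("nr=0x%02x type=0x%02x size=%d dir=%d" % (
    fields["nr"], fields["type"], fields["size"], fields["dir"]))
```

These fields (type 0xAF, nr 0x30, an 8-byte argument, write direction) match VHOST_NET_SET_BACKEND, which linux/vhost.h defines as _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file) with VHOST_VIRTIO being 0xAF. That would suggest qemu-kvm is blocked while changing or clearing the vhost-net backend, which fits vhost_net_ubuf_put_and_wait showing up in the call trace.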

This still happens after reverting the three suggested commits:

1f8b977ab32dc5d148f103326e80d9097f1cefb5 ("sock: enable MSG_ZEROCOPY")

c1d1b437816f0afa99202be3cb650c9d174667bc ("net: convert (struct ubuf_info)->refcnt to refcount_t")

581fe0ea61584d88072527ae9fb9dcb9d1f2783e ("net: orphan frags on stand-alone ptype in dev_queue_xmit_nit")

Is there anything I can do to help solve this? Is there any more information I could provide?

Harry