Re: sched/fair: Hard lockup from idle_balance()/task_numa_migrate() race

From: Ed Swierk
Date: Thu Nov 02 2017 - 20:51:56 EST


Ping?

On Wed, Oct 25, 2017 at 9:35 PM, Ed Swierk <eswierk@xxxxxxxxxxxxxxxxxx> wrote:
>
> Ping?
>
> On Mon, Oct 16, 2017 at 4:11 PM, Ed Swierk <eswierk@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > Ping for Peter, Ingo and other sched maintainers:
> >
> > I'd appreciate any feedback on this hard lockup issue, which occurred
> > on a system running kernel 4.4.52-grsec.
> >
> > To recap: a dual-socket Xeon (E5 v4) server system had been running a
> > bunch of KVM workloads just fine for over 6 weeks. Then hard lockups
> > suddenly occurred on cpu 13 in task_numa_migrate() and on cpu 0 in
> > idle_balance(). The condition persisted for a couple of hours until the
> > system was rebooted; in the meantime all manner of soft lockups and
> > even SATA errors appeared.
> >
> > [3851435.777762] NMI watchdog: Watchdog detected hard LOCKUP on cpu 13
> > [3851435.784582] Modules linked in: vhost_net tun vhost macvtap 8021q garp mrp drbg ansi_cprng dm_crypt algif_skcipher af_alg ip_gre ip_tunnel gre macvlan ebtable_filter ebtables openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul raid10 crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler acpi_pad acpi_cpufreq squashfs lz4_decompress xhci_pci xhci_hcd ixgbe mdio vxlan ip6_udp_tunnel udp_tunnel ptp pps_core dca ehci_pci ehci_hcd usbcore ahci libahci usb_common tpm_tis
> > [3851435.874884] irq event stamp: 3828060294
> > [3851435.879398] hardirqs last enabled at (3828060293): [<ffffffff810db86c>] ffffffff810db86c
> > [3851435.888810] hardirqs last disabled at (3828060294): [<ffffffff8170ff74>] ffffffff8170ff74
> > [3851435.898254] softirqs last enabled at (3828060156): [<ffffffff8107f31e>] ffffffff8107f31e
> > [3851435.907666] softirqs last disabled at (3828060139): [<ffffffff8107f648>] ffffffff8107f648
> > [3851435.917094] CPU: 13 PID: 2146 Comm: pocviexg Not tainted 4.4.52-grsec #1
> > [3851435.925252] Hardware name: Intel S2600WTTR, BIOS SE5C610.86B.01.01.0016.C1.033120161139 03/31/2016
> > [3851435.937473] task: ffff881c9ff35c00 ti: ffff881c9ff372b0 task.ti: ffff881c9ff372b0
> > [3851435.946104] RIP: 0010:[<ffffffff810ad854>] [<ffffffff810ad854>] try_to_wake_up+0xe4/0x500
> > [3851435.955614] RSP: 0000:ffffc90016beba40 EFLAGS: 00000002
> > [3851435.961780] RAX: 0000000000000000 RBX: ffff88102cc8e3d8 RCX: 0000000000000001
> > [3851435.970016] RDX: 0000000000000001 RSI: 00000000000001e4 RDI: 0000000000000000
> > [3851435.978259] RBP: ffffc90016beba80 R08: 0000000000000000 R09: 0000000000000001
> > [3851435.986493] R10: 00000000000000e0 R11: ffff881c9ff35c00 R12: 0000000000000082
> > [3851435.994727] R13: ffff88102cc8dc00 R14: 00000000001d3f00 R15: ffffffff81f25060
> > [3851436.002962] FS: 00006e79227fc700(0000) GS:ffff882034000000(0000) knlGS:0000000000000000
> > [3851436.012266] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [3851436.018920] CR2: 00006e79227fba38 CR3: 000000201d2fe000 CR4: 00000000003626f0
> > [3851436.027155] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [3851436.035391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [3851436.043626] Stack:
> > [3851436.046098] ffff8820341ce620 ffff8820341ce608 0000000000000000 ffff8820341ce600
> > [3851436.054695] ffffc90016bebb48 000000000000000d 0000000000000000 ffff881033fce600
> > [3851436.063259] ffffc90016beba90 ffffffff810adc80 ffffc90016bebbe8 ffffffff8112c830
> > [3851436.071868] Call Trace:
> > [3851436.074862] [<ffffffff810adc80>] wake_up_process+0x10/0x20 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L1997
> > [3851436.081317] [<ffffffff8112c830>] stop_two_cpus+0x1b0/0x280 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L235
> > [3851436.087778] [<ffffffff8112c490>] ? cpu_stop_park+0x40/0x40
> > [3851436.094237] [<ffffffff8112c490>] ? cpu_stop_park+0x40/0x40
> > [3851436.100715] [<ffffffff810ad560>] ? __migrate_swap_task.part.94+0x70/0x70
> > [3851436.108539] [<ffffffff810ade01>] migrate_swap+0xf1/0x1f0 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L1408
> > [3851436.114800] [<ffffffff810b9d55>] task_numa_migrate+0x345/0xb60 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L1572
> > [3851436.121656] [<ffffffff811092e3>] ? futex_wait_queue_me+0xf3/0x160
> > [3851436.128801] [<ffffffff810ba5e3>] numa_migrate_preferred+0x73/0x80 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L1597
> > [3851436.135936] [<ffffffff810bb944>] task_numa_fault+0x7a4/0xee0 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L2137
> > [3851436.142604] [<ffffffff810bafe0>] ? should_numa_migrate_memory+0x50/0x130
> > [3851436.150437] [<ffffffff811b2db6>] handle_mm_fault+0xfd6/0x1e10 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/mm/memory.c#L3503
> > [3851436.157189] [<ffffffff811b1e28>] ? handle_mm_fault+0x48/0x1e10
> > [3851436.164051] [<ffffffff810a85aa>] ? ___might_sleep+0x1ea/0x220
> > [3851436.170804] [<ffffffff8106405f>] __do_page_fault+0x1bf/0x460
> > [3851436.177475] [<ffffffff8110ca44>] ? sys_futex+0x74/0x230
> > [3851436.183642] [<ffffffff8106432c>] do_page_fault+0x2c/0x40
> > [3851436.189909] [<ffffffff817128b8>] page_fault+0x28/0x30
> > [3851436.195892] Code: 0f 85 e0 02 00 00 41 8b 85 b8 16 00 00 89 45 d0 41 8b 7d 4c 49 c7 c6 00 3f 1d 00 85 ff 0f 85 68 01 00 00 41 8b 4d 30 85 c9 74 0a <f3> 90 41 8b 55 30 85 d2 75 f6 49 8b 45 00 a8 02 b8 00 00 00 00
> >
> > [3851438.906435] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
> > [3851438.913158] Modules linked in: vhost_net tun vhost macvtap 8021q garp mrp drbg ansi_cprng dm_crypt algif_skcipher af_alg ip_gre ip_tunnel gre macvlan ebtable_filter ebtables openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul raid10 crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler acpi_pad acpi_cpufreq squashfs lz4_decompress xhci_pci xhci_hcd ixgbe mdio vxlan ip6_udp_tunnel udp_tunnel ptp pps_core dca ehci_pci ehci_hcd usbcore ahci libahci usb_common tpm_tis
> > [3851439.003452] irq event stamp: 25935634
> > [3851439.007769] hardirqs last enabled at (25935633): [<ffffffff81710341>] ffffffff81710341
> > [3851439.017010] hardirqs last disabled at (25935634): [<ffffffff8170ff74>] ffffffff8170ff74
> > [3851439.026226] softirqs last enabled at (25935520): [<ffffffff8107f31e>] ffffffff8107f31e
> > [3851439.035441] softirqs last disabled at (25935445): [<ffffffff8107f648>] ffffffff8107f648
> > [3851439.044657] CPU: 0 PID: 10 Comm: migration/0 Not tainted 4.4.52-grsec #1
> > [3851439.052892] Hardware name: Intel S2600WTTR, BIOS SE5C610.86B.01.01.0016.C1.033120161139 03/31/2016
> > [3851439.065111] task: ffff88102cc8dc00 ti: ffff88102cc8f2b0 task.ti: ffff88102cc8f2b0
> > [3851439.073747] RIP: 0010:[<ffffffff813c4935>] [<ffffffff813c4935>] delay_tsc+0x25/0x70
> > [3851439.082679] RSP: 0000:ffffc900006ebb40 EFLAGS: 00000002
> > [3851439.088852] RAX: 000000002591f96b RBX: ffff8820341ce608 RCX: 0000000000000018
> > [3851439.101462] RDX: 001cab2e00000000 RSI: 001cab2e2591f96b RDI: 0000000000000001
> > [3851439.109695] RBP: ffffc900006ebb40 R08: 0000000000000000 R09: 0000000000000000
> > [3851439.117950] R10: 0000000000000038 R11: 0000000000000000 R12: 0000000021d37081
> > [3851439.126198] R13: 0000000000000001 R14: 000000007d234b28 R15: 0000000000000000
> > [3851439.134440] FS: 0000000000000000(0000) GS:ffff881033e00000(0000) knlGS:0000000000000000
> > [3851439.143741] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [3851439.150393] CR2: 000000c420148008 CR3: 0000000001f36000 CR4: 00000000003626f0
> > [3851439.158627] Stack:
> > [3851439.161096] ffffc900006ebb50 ffffffff813c487a ffffc900006ebb80 ffffffff810dba87
> > [3851439.169669] ffff8820341ce608 ffff8820341ce620 0000000000000086 00000000000006ae
> > [3851439.178228] ffffc900006ebbb0 ffffffff817101a2 ffffffff8112c1cb ffff8820341ce600
> > [3851439.186825] Call Trace:
> > [3851439.189784] [<ffffffff813c487a>] __delay+0xa/0x10
> > [3851439.195385] [<ffffffff810dba87>] do_raw_spin_lock+0x87/0x160
> > [3851439.202062] [<ffffffff817101a2>] _raw_spin_lock_irqsave+0xa2/0xb0
> > [3851439.209238] [<ffffffff8112c1cb>] ? cpu_stop_queue_work+0x2b/0x70
> > [3851439.216287] [<ffffffff8112c1cb>] cpu_stop_queue_work+0x2b/0x70 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L89
> > [3851439.223149] [<ffffffff8112c92b>] stop_one_cpu_nowait+0x2b/0x30 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L310
> > [3851439.229999] [<ffffffff810c1127>] load_balance+0x827/0x960 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L7184
> > [3851439.236371] [<ffffffff810eae3d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
> > [3851439.243995] [<ffffffff810c1608>] pick_next_task_fair+0x3a8/0x750 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L5348
> > [3851439.251062] [<ffffffff810c14db>] ? pick_next_task_fair+0x27b/0x750
> > [3851439.258289] [<ffffffff81708fdd>] __schedule+0x1ad/0xbd0 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L3196
> > [3851439.264456] [<ffffffff810d461d>] ? trace_hardirqs_on_caller+0x13d/0x1d0
> > [3851439.272204] [<ffffffff81709a37>] schedule+0x37/0x80
> > [3851439.277984] [<ffffffff810a51cb>] smpboot_thread_fn+0x16b/0x1e0
> > [3851439.284822] [<ffffffff810a5060>] ? sort_range+0x20/0x20
> > [3851439.291001] [<ffffffff810a124c>] kthread+0xfc/0x120
> > [3851439.296796] [<ffffffff8170adf1>] ? wait_for_completion+0xe1/0x110
> > [3851439.303936] [<ffffffff810a1150>] ? kthread_create_on_node+0x240/0x240
> > [3851439.311464] [<ffffffff81710aae>] ret_from_fork+0x3e/0x70
> > [3851439.317738] [<ffffffff810a1150>] ? kthread_create_on_node+0x240/0x240
> > [3851439.325281] Code: 00 5d c3 0f 1f 00 55 65 ff 05 b8 65 c4 7e 48 89 e5 65 44 8b 05 35 48 c4 7e 0f ae e8 0f 31 48 c1 e2 20 48 89 d6 48 09 c6 0f ae e8 <0f> 31 48 c1 e2 20 48 09 d0 48 89 c1 48 29 f1 48 39 cf 76 27 65
> >
> > [3851450] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [pocviexg:2136]
> > [3851450] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
> > [3851450] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [CPU 1/KVM:10302]
> >
> > [3851478] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [pocviexg:2136]
> > [3851478] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
> > [3851478] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [CPU 1/KVM:10302]
> >
> > [3851482] BUG: spinlock lockup suspected on CPU#0, migration/0/10
> >
> > [3851482] Sending NMI to all CPUs:
> > [3851482] NMI backtrace for cpu 0
> > [3851483] NMI backtrace for cpu 1
> > [3851483] NMI backtrace for cpu 2
> > [3851483] NMI backtrace for cpu 3
> > [3851483] NMI backtrace for cpu 4
> >
> > [3851484] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen **
> > [3851484] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen **
> >
> > [3851484] NMI backtrace for cpu 5
> > [3851484] NMI backtrace for cpu 6
> > ...
> > [3851489] NMI backtrace for cpu 24
> >
> > [3851489] ata8.00: qc timeout (cmd 0xec) **
> > [3851489] ata7.00: qc timeout (cmd 0xec) **
> >
> > [3851489] NMI backtrace for cpu 25
> > ...
> > [3851490] NMI backtrace for cpu 31
> >
> > [3851499] ata8.00: qc timeout (cmd 0xec) **
> > [3851499] ata7.00: qc timeout (cmd 0xec) **
> >
> > [3851510] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
> > ...and so on for 2 more hours, until the system was finally rebooted
> >
> > Based on the addresses in the stack and registers, here's what I think
> > happened.
> >
> > On cpu 13:
> >
> > - task_numa_fault() calls task_numa_migrate(), which selects the task
> > on cpu 0 as the dst_task.
> > - migrate_swap() calls stop_two_cpus(), which acquires the cpu_stopper
> > locks for the dst_cpu (cpu 0, at 0xffff881033fce600) and src_cpu
> > (cpu X, at 0xffff8820341ce600).
> > - stop_two_cpus() calls wake_up_process() on the stopper task of the
> > lower-numbered cpu first, which has to be cpu 0.
> > - wake_up_process() spins until cpu 0's stopper task, migration/0
> > (at 0xffff88102cc8dc00; see the cpu 0 trace above), is no longer
> > on_cpu.
> >
> > On cpu 0:
> >
> > - pick_next_task_fair() calls idle_balance(). According to the "This
> > is OK" comment, current is on_cpu at this point.
> > - idle_balance() calls load_balance() for dst_cpu 0.
> > - load_balance() decides to move a task from cpu X, so it calls
> > stop_one_cpu_nowait() for cpu X.
> > - stop_one_cpu_nowait() (via cpu_stop_queue_work(), per the trace)
> > spins trying to acquire the cpu_stopper lock for cpu X (at
> > 0xffff8820341ce600).
> >
> > So the two cpus are deadlocked: idle_balance() on cpu 0 is spinning
> > on the cpu_stopper lock that task_numa_migrate() on cpu 13 already
> > holds, while task_numa_migrate() is spinning until cpu 0's migration/0
> > task goes off-cpu, which can never happen because migration/0 is the
> > very task stuck inside idle_balance().
> >
> > Meanwhile the SATA controller's interrupts, which are routed to
> > cpu 0, never get serviced, hence the exception and qc timeout
> > messages from ata7 and ata8.
> >
> > Also, it appears that task_numa_fault() is trying to migrate current,
> > so the src_cpu X used by task_numa_migrate() is cpu 13 in this case.
> > But the key issue is that task_numa_migrate() and idle_balance() are
> > both trying to stop the same cpu, regardless of whether it's the cpu
> > task_numa_migrate() is running on.
> >
> > Is this a plausible theory?
> >
> > This is the only instance of a hard lockup among a number of identical
> > systems running the same kernel, so I'm not seeing an easy way to
> > reproduce this issue, but any suggestions would be welcome.
>
> --Ed
>