Re: [PATCH] smp/call: Detect stuck CSD locks

From: Chris J Arges
Date: Thu Apr 16 2015 - 11:59:19 EST


On Thu, Apr 16, 2015 at 01:04:23PM +0200, Ingo Molnar wrote:
>
> * Chris J Arges <chris.j.arges@xxxxxxxxxxxxx> wrote:
>
> > Ingo,
> >
> > Below are the patches and data I've gathered from the reproducer. My
> > methodology was as described previously; however I used gdb on the
> > qemu process in order to breakpoint L1 once we've detected the hang.
> > This made dumping the kvm_lapic structures on L0 more reliable.
>
> Thanks!
>
> So I have trouble interpreting the L1 backtrace, because it shows
> something entirely new (to me).
>
> First, let's clarify the terminology, to make sure I got the
> workload all right:
>
> - L0 is the host kernel, running native Linux. It's not locking up.
>
> - L1 is the guest kernel, running virtualized Linux. This is the one
> that is locking up.
>
> - L2 is the nested guest kernel, running whatever test workload you
> were running - this is obviously locking up together with L1.
>
> Right?

Yup, this sums it up nicely.

>
> So with that cleared up, the backtrace on L1 looks like this:
>
> > * Crash dump backtrace from L1:
> >
> > crash> bt -a
> > PID: 26 TASK: ffff88013a4f1400 CPU: 0 COMMAND: "ksmd"
> > #0 [ffff88013a5039f0] machine_kexec at ffffffff8109d3ec
> > #1 [ffff88013a503a50] crash_kexec at ffffffff8114a763
> > #2 [ffff88013a503b20] panic at ffffffff818068e0
> > #3 [ffff88013a503ba0] csd_lock_wait at ffffffff8113f1e4
> > #4 [ffff88013a503bf0] generic_exec_single at ffffffff8113f2d0
> > #5 [ffff88013a503c60] smp_call_function_single at ffffffff8113f417
> > #6 [ffff88013a503c90] smp_call_function_many at ffffffff8113f7a4
> > #7 [ffff88013a503d20] flush_tlb_page at ffffffff810b3bf9
> > #8 [ffff88013a503d50] ptep_clear_flush at ffffffff81205e5e
> > #9 [ffff88013a503d80] try_to_merge_with_ksm_page at ffffffff8121a445
> > #10 [ffff88013a503e00] ksm_scan_thread at ffffffff8121ac0e
> > #11 [ffff88013a503ec0] kthread at ffffffff810df0fb
> > #12 [ffff88013a503f50] ret_from_fork at ffffffff8180fc98
>
> So this one, VCPU0, is trying to send an IPI to VCPU1:
>
> > PID: 1674 TASK: ffff8800ba4a9e00 CPU: 1 COMMAND: "qemu-system-x86"
> > #0 [ffff88013fd05e20] crash_nmi_callback at ffffffff81091521
> > #1 [ffff88013fd05e30] nmi_handle at ffffffff81062560
> > #2 [ffff88013fd05ea0] default_do_nmi at ffffffff81062b0a
> > #3 [ffff88013fd05ed0] do_nmi at ffffffff81062c88
> > #4 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81812241
> > [exception RIP: vmx_vcpu_run+992]
> > RIP: ffffffff8104cef0 RSP: ffff88013940bcb8 RFLAGS: 00000082
> > RAX: 0000000080000202 RBX: ffff880139b30000 RCX: ffff880139b30000
> > RDX: 0000000000000200 RSI: ffff880139b30000 RDI: ffff880139b30000
> > RBP: ffff88013940bd28 R8: 00007fe192b71110 R9: 00007fe192b71140
> > R10: 00007fff66407d00 R11: 00007fe1927f0060 R12: 0000000000000000
> > R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
> > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> > --- <NMI exception stack> ---
> > #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> > #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> > #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
> > #8 [ffff88013940be00] kvm_vcpu_ioctl at ffffffff81007894
> > #9 [ffff88013940beb0] do_vfs_ioctl at ffffffff81253190
> > #10 [ffff88013940bf30] sys_ioctl at ffffffff81253411
> > #11 [ffff88013940bf80] system_call_fastpath at ffffffff8180fd4d
>
> So the problem here that I can see is that L1's VCPU1 appears to be
> looping with interrupts disabled:
>
> > RIP: ffffffff8104cef0 RSP: ffff88013940bcb8 RFLAGS: 00000082
>
> Look how RFLAGS doesn't have 0x200 set - so it's executing with
> interrupts disabled.
>

I've run this with L0: 3.13+b6b8a145 and L1: 4.0+debug patches and got a
similar backtrace with interrupts disabled; however, this _may_ be another
issue.
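
Decoding aside: the interrupt-enable flag is bit 9 of RFLAGS
(X86_EFLAGS_IF == 0x200), so RFLAGS=00000082 above means IF=0, while
RFLAGS=00000202 in the trace below means IF=1. A minimal userspace
sketch of that check, for anyone decoding these dumps by hand:

#include <stdint.h>
#include <stdio.h>

#define X86_EFLAGS_IF 0x200ULL	/* bit 9: interrupt enable flag */

/* Decode the IF bit from an RFLAGS value taken from a crash dump. */
static void decode_rflags(uint64_t rflags)
{
	printf("RFLAGS=%#llx: interrupts %s\n",
	       (unsigned long long)rflags,
	       (rflags & X86_EFLAGS_IF) ? "enabled" : "disabled");
}

int main(void)
{
	decode_rflags(0x82);	/* L1 VCPU1 above: IF=0, disabled */
	decode_rflags(0x202);	/* 3.13 trace below: IF=1, enabled */
	return 0;
}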

I ran L0/L1: 3.13+b6b8a145 and got something like this:
PID: 36 TASK: ffff8801396a9800 CPU: 0 COMMAND: "ksmd"
#0 [ffff88013fc03d18] machine_kexec at ffffffff8104ace2
#1 [ffff88013fc03d68] crash_kexec at ffffffff810e72e3
#2 [ffff88013fc03e30] panic at ffffffff8171a404
#3 [ffff88013fc03ea8] watchdog_timer_fn at ffffffff8110daa5
#4 [ffff88013fc03ed8] __run_hrtimer at ffffffff8108e7a7
#5 [ffff88013fc03f18] hrtimer_interrupt at ffffffff8108ef6f
#6 [ffff88013fc03f80] local_apic_timer_interrupt at ffffffff81043617
#7 [ffff88013fc03f98] smp_apic_timer_interrupt at ffffffff8173414f
#8 [ffff88013fc03fb0] apic_timer_interrupt at ffffffff81732add
--- <IRQ stack> ---
#9 [ffff880137fd3b08] apic_timer_interrupt at ffffffff81732add
[exception RIP: generic_exec_single+130]
RIP: ffffffff810dbf52 RSP: ffff880137fd3bb0 RFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff880137fd3b80 RCX: 0000000000000002
RDX: ffffffff8180ad80 RSI: 0000000000000000 RDI: 0000000000000282
RBP: ffff880137fd3be0 R8: ffffffff8180ad68 R9: 0000000000000001
R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000000 R15: ffff880137fd3bd0
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff880137fd3be8] smp_call_function_single at ffffffff810dc065
#11 [ffff880137fd3c60] smp_call_function_many at ffffffff810dc496
#12 [ffff880137fd3cc0] make_all_cpus_request at ffffffffa004e5ef [kvm]
#13 [ffff880137fd3cf8] kvm_flush_remote_tlbs at ffffffffa004e620 [kvm]
#14 [ffff880137fd3d18] kvm_mmu_notifier_invalidate_range_start at ffffffffa004e6a2 [kvm]
#15 [ffff880137fd3d58] __mmu_notifier_invalidate_range_start at ffffffff8119b56b
#16 [ffff880137fd3d90] try_to_merge_with_ksm_page at ffffffff8119d296
#17 [ffff880137fd3e00] ksm_do_scan at ffffffff8119d749
#18 [ffff880137fd3e78] ksm_scan_thread at ffffffff8119e0cf
#19 [ffff880137fd3ed0] kthread at ffffffff8108b592
#20 [ffff880137fd3f50] ret_from_fork at ffffffff81731ccc

PID: 1543 TASK: ffff880137861800 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88013fd05e58] crash_nmi_callback at ffffffff81040082
#1 [ffff88013fd05e68] nmi_handle at ffffffff8172aa38
#2 [ffff88013fd05ec8] do_nmi at ffffffff8172ac00
#3 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81729ea1
[exception RIP: _raw_spin_lock+50]
RIP: ffffffff817292b2 RSP: ffff8800372ab8f8 RFLAGS: 00000202
RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000202
RDX: ffff8800372ab8f8 RSI: 0000000000000018 RDI: 0000000000000001
RBP: ffffffff817292b2 R8: ffffffff817292b2 R9: 0000000000000018
R10: ffff8800372ab8f8 R11: 0000000000000202 R12: ffffffffffffffff
R13: ffff8800b8f78000 R14: 00000000000070ae R15: 00000000000070ae
ORIG_RAX: 00000000000070ae CS: 0010 SS: 0018
--- <DOUBLEFAULT exception stack> ---
#4 [ffff8800372ab8f8] _raw_spin_lock at ffffffff817292b2
#5 [ffff8800372ab900] kvm_mmu_notifier_invalidate_range_start at ffffffffa004e67e [kvm]
#6 [ffff8800372ab940] __mmu_notifier_invalidate_range_start at ffffffff8119b56b
#7 [ffff8800372ab978] do_wp_page at ffffffff81177fee

A previous backtrace from a 3.19 series kernel, showing interrupts enabled
on both CPUs on L1, is here:
https://lkml.org/lkml/2015/2/23/234
http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt
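
For context: in both of these traces the sender is spinning in
csd_lock_wait(), which in these kernels has no timeout at all; that is
the gap the patch in the subject line is aimed at. Roughly (a sketch of
the idea only, from memory of the 3.13-era kernel/smp.c; the
csd_lock_timeout_ns knob is made up for illustration, not the patch's
actual interface):

/*
 * 3.13-era csd_lock_wait(), simplified: spin until the IPI target
 * unlocks the csd. There is no timeout, so a target that never takes
 * the IPI hangs the sender forever.
 */
static void csd_lock_wait(struct call_single_data *csd)
{
	while (csd->flags & CSD_FLAG_LOCK)
		cpu_relax();
}

/*
 * Sketch of stuck-CSD detection: warn once the wait exceeds a
 * threshold instead of spinning silently.
 */
static u64 csd_lock_timeout_ns = 10 * NSEC_PER_SEC;	/* hypothetical knob */

static void csd_lock_wait_with_detect(struct call_single_data *csd)
{
	u64 start = sched_clock();

	while (csd->flags & CSD_FLAG_LOCK) {
		if (sched_clock() - start > csd_lock_timeout_ns) {
			pr_warn("csd: CSD lock stuck for over %llu ns\n",
				csd_lock_timeout_ns);
			start = sched_clock();	/* re-arm the warning */
		}
		cpu_relax();
	}
}

Detection alone obviously doesn't fix whatever keeps the target from
taking the IPI, but it would turn these silent hangs into a warning
with a backtrace.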

> That is why the IPI does not get through to it, but kdump's NMI had no
> problem getting through.
>
> This (assuming all backtraces are exact!):
>
> > #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> > #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> > #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
>
> suggests that we called vmx_vcpu_run() from
> vmx_handle_external_intr(), and that we are executing L2 guest code
> with interrupts disabled.
>
> How is this supposed to work? What mechanism does KVM have against
> an (untrusted) guest interrupt handler locking up?
>
> I might be misunderstanding how this works at the KVM level, but from
> the APIC perspective the situation appears to be pretty clear: CPU1's
> interrupts are turned off, so it cannot receive IPIs, the CSD wait
> will eventually time out.
>
> Now obviously it appears to be anomalous (assuming my analysis is
> correct) that the interrupt handler has locked up, but it's
> immaterial: a nested kernel must not allow its guest to lock it up.
>
> Thanks,
>
> Ingo
>

Yes, at this point I'll go through the various backtraces and try to
narrow things down. Overall I think we're seeing a single effect from
multiple code paths.

--chris j arges

