Re: [RFC] RHEL 7.1: soft lockup when flushing TLB

From: Nikolay Borisov
Date: Sat Dec 12 2015 - 04:28:18 EST




On 12/12/2015 10:52 AM, Xishi Qiu wrote:
> [60050.458309] kjournald starting. Commit interval 5 seconds
> [60076.821224] EXT3-fs (sda1): using internal journal
> [60098.811865] EXT3-fs (sda1): mounted filesystem with ordered data mode
> [60138.687054] kjournald starting. Commit interval 5 seconds
> [60143.888627] EXT3-fs (sda1): using internal journal
> [60143.888631] EXT3-fs (sda1): mounted filesystem with ordered data mode
> [60164.075002] BUG: soft lockup - CPU#1 stuck for 22s! [mount:3883]
> [60164.075002] Modules linked in: loop binfmt_misc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache hmem_driver(OF) kbox(OF) cirrus syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm ppdev parport_pc parport i2c_piix4 i2c_core virtio_balloon floppy serio_raw pcspkr ext3 mbcache jbd sd_mod sr_mod crc_t10dif cdrom crct10dif_common ata_generic pata_acpi virtio_scsi virtio_console ata_piix virtio_pci libata e1000 virtio_ring virtio [last unloaded: kernel_allocpage]
> [60164.075002] CPU: 1 PID: 3883 Comm: mount Tainted: GF W O -------------- 3.10.0-229.20.1.23.x86_64 #1
> [60164.075002] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
> [60164.075002] task: ffff88061d850b60 ti: ffff88061c500000 task.ti: ffff88061c500000
> [60164.075002] RIP: 0010:[<ffffffff810d7a4a>] [<ffffffff810d7a4a>] generic_exec_single+0xfa/0x1a0
> [60164.075002] RSP: 0018:ffff88061c503c50 EFLAGS: 00000202
> [60164.075002] RAX: 0000000000000004 RBX: ffff88061c503c20 RCX: 000000000000003c
> [60164.075002] RDX: 000000000000000f RSI: 0000000000000004 RDI: 0000000000000282
> [60164.075002] RBP: ffff88061c503c98 R08: ffffffff81631120 R09: 00000000000169e0
> [60164.075002] R10: ffff88063ffc5000 R11: 0000000000000000 R12: 0000000000000000
> [60164.075002] R13: ffff88063ffc5000 R14: ffff88061c503c20 R15: 0000000200000000
> [60164.075002] FS: 0000000000000000(0000) GS:ffff88063fc80000(0000) knlGS:0000000000000000
> [60164.075002] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [60164.075002] CR2: 00007fab0c4a0c40 CR3: 000000000190a000 CR4: 00000000000006e0
> [60164.075002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [60164.075002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [60164.075002] Stack:
> [60164.075002] 0000000000000000 0000000000000000 ffffffff8105fab0 ffff88061c503d20
> [60164.075002] 0000000000000003 000000009383884f 0000000000000002 ffffffff8105fab0
> [60164.075002] ffffffff8105fab0 ffff88061c503cc8 ffffffff810d7b4f ffff88061c503cc8
> [60164.075002] Call Trace:
> [60164.075002] [<ffffffff8105fab0>] ? leave_mm+0x70/0x70
> [60164.075002] [<ffffffff8105fab0>] ? leave_mm+0x70/0x70
> [60164.075002] [<ffffffff8105fab0>] ? leave_mm+0x70/0x70
> [60164.075002] [<ffffffff810d7b4f>] smp_call_function_single+0x5f/0xa0
> [60164.075002] [<ffffffff812d5c15>] ? cpumask_next_and+0x35/0x50
> [60164.075002] [<ffffffff810d80e3>] smp_call_function_many+0x223/0x260
> [60164.075002] [<ffffffff8105fc78>] native_flush_tlb_others+0xb8/0xc0
> [60164.075002] [<ffffffff8105fd3c>] flush_tlb_mm_range+0x5c/0x180
> [60164.075002] [<ffffffff8117f503>] tlb_flush_mmu.part.53+0x83/0x90
> [60164.075002] [<ffffffff81180015>] tlb_finish_mmu+0x55/0x60
> [60164.075002] [<ffffffff8118b0bb>] exit_mmap+0xdb/0x1a0
> [60164.075002] [<ffffffff8106b487>] mmput+0x67/0xf0
> [60164.075002] [<ffffffff8107458c>] do_exit+0x28c/0xa60
> [60164.075002] [<ffffffff816100c3>] ? trace_do_page_fault+0x43/0x100
> [60164.075002] [<ffffffff81074ddf>] do_group_exit+0x3f/0xa0
> [60164.075002] [<ffffffff81074e54>] SyS_exit_group+0x14/0x20
> [60164.075002] [<ffffffff816147c9>] system_call_fastpath+0x16/0x1b
>

This stack trace is not sufficient to show what the root cause of the
lockup is. More often than not, what happens is that some other core
(the offending one) is stuck with interrupts disabled, and that is why
the IPI resulting from native_flush_tlb_others() never completes. So
you need to find out which core failed to handle the IPI.
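
For context, note that the soft lockup fires on the *sender*: after
queuing the call, generic_exec_single() spin-waits for the target CPU
to acknowledge it. A simplified sketch of the 3.10-era wait path in
kernel/smp.c (paraphrased, not verbatim; check your exact source tree):

static void csd_lock_wait(struct call_single_data *csd)
{
	/*
	 * Spin until the target CPU's IPI handler clears
	 * CSD_FLAG_LOCK. If the target is looping with interrupts
	 * disabled, its handler never runs, the flag is never
	 * cleared, and the sender spins here until the soft lockup
	 * detector fires -- producing a trace like the one above.
	 */
	while (csd->flags & CSD_FLAG_LOCK)
		cpu_relax();
}

To catch the offending core in the act, dump backtraces on all CPUs
while the hang is in progress, e.g. via sysrq-l
(echo l > /proc/sysrq-trigger). On x86 that is delivered via NMI, so
it reaches even a core that is spinning with interrupts off.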


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/