Re: frequent lockups in 3.18rc4

From: Sasha Levin
Date: Tue Dec 02 2014 - 18:32:55 EST


On 12/02/2014 02:32 PM, Dave Jones wrote:
> On Mon, Dec 01, 2014 at 06:08:38PM -0500, Chris Mason wrote:
> > I'm not sure if this is related, but running trinity here, I noticed it
> > was stuck at 100% system time on every CPU. perf report tells me we are
> > spending all of our time in spin_lock under the sync system call.
> >
> > I think it's coming from contention in the bdi_queue_work() call from
> > inside sync_inodes_sb, which is spin_lock_bh().
> >
> > I wonder if we're just spinning so hard on this one bh lock that we're
> > starving the watchdog?
> >
> > Dave, do you have spinlock debugging on?
>
> That has been a constant, yes. I can try with that disabled some time.

Here's my side of the story: I was observing RCU lockups which went away when
I disabled verbose printing for fault injections. It seems that printing one
line ~10 times a second can cause that...

I've disabled lock debugging to see if anything new will show up, and hit
something that may be related:

[ 787.894288] ================================================================================
[ 787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
[ 787.898981] signed integer overflow:
[ 787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'
[ 787.900066] CPU: 18 PID: 12958 Comm: trinity-c103 Not tainted 3.18.0-rc6-next-20141201-sasha-00070-g028060a-dirty #1528
[ 787.900066] 0000000000000000 0000000000000000 ffffffff93b0f890 ffff8806e3eff918
[ 787.900066] ffffffff91f1cf26 1ffffffff3c2de73 ffffffff93b0f8a8 ffff8806e3eff938
[ 787.900066] ffffffff91f1fb90 1ffffffff3c2de73 ffffffff93b0f8a8 ffff8806e3eff9f8
[ 787.900066] Call Trace:
[ 787.900066] dump_stack (lib/dump_stack.c:52)
[ 787.900066] ubsan_epilogue (lib/ubsan.c:159)
[ 787.900066] handle_overflow (lib/ubsan.c:191)
[ 787.900066] ? __do_page_fault (arch/x86/mm/fault.c:1220)
[ 787.900066] ? local_clock (kernel/sched/clock.c:392)
[ 787.900066] __ubsan_handle_mul_overflow (lib/ubsan.c:218)
[ 787.900066] select_task_rq_fair (kernel/sched/fair.c:4541 kernel/sched/fair.c:4755)
[ 787.900066] try_to_wake_up (kernel/sched/core.c:1415 kernel/sched/core.c:1724)
[ 787.900066] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:33)
[ 787.900066] default_wake_function (kernel/sched/core.c:2979)
[ 787.900066] ? get_parent_ip (kernel/sched/core.c:2559)
[ 787.900066] autoremove_wake_function (kernel/sched/wait.c:295)
[ 787.900066] ? get_parent_ip (kernel/sched/core.c:2559)
[ 787.900066] __wake_up_common (kernel/sched/wait.c:73)
[ 787.900066] __wake_up_sync_key (include/linux/spinlock.h:364 kernel/sched/wait.c:146)
[ 787.900066] pipe_write (fs/pipe.c:466)
[ 787.900066] ? kasan_poison_shadow (mm/kasan/kasan.c:48)
[ 787.900066] ? new_sync_read (fs/read_write.c:480)
[ 787.900066] do_iter_readv_writev (fs/read_write.c:681)
[ 787.900066] compat_do_readv_writev (fs/read_write.c:1029)
[ 787.900066] ? wait_for_partner (fs/pipe.c:340)
[ 787.900066] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 787.900066] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 787.900066] ? syscall_trace_enter_phase1 (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1486)
[ 787.900066] compat_writev (fs/read_write.c:1145)
[ 787.900066] compat_SyS_writev (fs/read_write.c:1163 fs/read_write.c:1151)
[ 787.900066] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
[ 787.900066] ================================================================================

(For Linus asking himself "what the hell is this UBSan thing, I didn't merge that!" - it's an
undefined behaviour sanitizer that works with gcc5.)


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/