Re: frequent lockups in 3.18rc4

From: Frederic Weisbecker
Date: Thu Dec 04 2014 - 11:52:20 EST


On Thu, Dec 04, 2014 at 08:18:10AM -0800, Linus Torvalds wrote:
> On Thu, Dec 4, 2014 at 12:43 AM, Dâniel Fraga <fragabr@xxxxxxxxx> wrote:
> >
> > Linus, today it's your lucky day, because I think I found the
> > real bad commit (if it isn't, then it's some very close to it). I
> > managed to narrow the bisect and here's the result:
>
> Ok, that actually looks very reasonable, I had actually looked at it
> because of the whole "changes IPI" thing.
>
> One more thing to try: does a revert fix it on current git?
>
> It doesn't revert entirely cleanly, but close enough - attached a
> quick rough patch that may or may not work, but looks like a good
> revert.
>
> Dave - this might be worth testing for you too, exactly because of
> that whole "it changes how we do IPI's". It was your bug report with
> TLB IPI's that made me look at that commit originally.

I think this is a different issue. What Daniel reported is:

Dec 4 06:03:41 tux kernel: [ 737.180761] [<ffffffff810637ca>] hrtimer_cancel+0x1a/0x30
Dec 4 06:03:41 tux kernel: [ 737.180766] [<ffffffff81097842>] tick_nohz_restart+0x12/0x80
Dec 4 06:03:41 tux kernel: [ 737.180769] [<ffffffff81097c4f>] __tick_nohz_full_check+0x9f/0xb0
Dec 4 06:03:41 tux kernel: [ 737.180771] [<ffffffff81097c69>] nohz_full_kick_work_func+0x9/0x10
Dec 4 06:03:41 tux kernel: [ 737.180774] [<ffffffff810aecd4>] irq_work_run_list+0x44/0x70
Dec 4 06:03:41 tux kernel: [ 737.180777] [<ffffffff81097730>] ? tick_sched_handle.isra.20+0x40/0x40
Dec 4 06:03:41 tux kernel: [ 737.180779] [<ffffffff810aed19>] __irq_work_run+0x19/0x30
Dec 4 06:03:41 tux kernel: [ 737.180782] [<ffffffff810aed98>] irq_work_run+0x18/0x40
Dec 4 06:03:41 tux kernel: [ 737.180784] [<ffffffff8104deb6>] update_process_times+0x56/0x70
Dec 4 06:03:41 tux kernel: [ 737.180786] [<ffffffff81097721>] tick_sched_handle.isra.20+0x31/0x40
Dec 4 06:03:42 tux kernel: [ 737.180788] [<ffffffff81097769>] tick_sched_timer+0x39/0x60
Dec 4 06:03:42 tux kernel: [ 737.180790] [<ffffffff810636a1>] __run_hrtimer.isra.33+0x41/0xd0
Dec 4 06:03:42 tux kernel: [ 737.180792] [<ffffffff81063a4f>] hrtimer_interrupt+0xef/0x250
Dec 4 06:03:42 tux kernel: [ 737.180795] [<ffffffff8102db65>] local_apic_timer_interrupt+0x35/0x60
Dec 4 06:03:42 tux kernel: [ 737.180797] [<ffffffff8102e12a>] smp_apic_timer_interrupt+0x3a/0x50
Dec 4 06:03:42 tux kernel: [ 737.180799] [<ffffffff81391a3a>] apic_timer_interrupt+0x6a/0x70

And this bug has been fixed upstream with:

* nohz: nohz full depends on irq work self IPI support
* x86: Tell irq work about self IPI support
* irq_work: Force raised irq work to run on irq work interrupt
* nohz: Move nohz full init call to tick init

These patches have been backported to stable as well.
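
If it helps to check whether a given tree already carries these four fixes, here is a rough sketch in Python. It only compares commit subject lines, so it assumes the backports kept the upstream subjects unchanged. The helper names (missing_fixes, subjects_in_tree) are made up for illustration and are not part of any kernel tooling.

```python
import subprocess

# Subject lines of the four fixes listed above.
FIX_SUBJECTS = [
    "nohz: nohz full depends on irq work self IPI support",
    "x86: Tell irq work about self IPI support",
    "irq_work: Force raised irq work to run on irq work interrupt",
    "nohz: Move nohz full init call to tick init",
]


def missing_fixes(log_subjects, wanted=FIX_SUBJECTS):
    """Return the wanted subjects absent from log_subjects.

    log_subjects is plain text with one commit subject per line,
    e.g. the output of `git log --format=%s`.
    """
    seen = {line.strip() for line in log_subjects.splitlines()}
    return [subject for subject in wanted if subject not in seen]


def subjects_in_tree(repo="."):
    """Subjects of all commits reachable from HEAD in `repo` (hypothetical helper)."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--format=%s"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Calling missing_fixes(subjects_in_tree("/path/to/linux")) on the tree that was bisected to would list whichever of the four subjects are not in its history. An empty list would suggest the old bug is not what was hit there.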

I suspect Daniel rewound far enough to land on that old bug.

Daniel, did you see this very same stack trace on the latest upstream too? Or was it
a different one?
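
To compare two traces without eyeballing addresses, something like the following could reduce each one to its list of function names. This is only a sketch: the regex assumes the "[<addr>] symbol+0xoff/0xlen" layout seen in the trace above, it drops the unreliable "?" frames, and the frames() name is made up for illustration. Compiler suffixes such as .isra.20 are kept as-is and may differ between builds.

```python
import re

# Matches "] symbol+0xoff/0xlen", optionally preceded by the "? " marker
# that the kernel prints for unreliable frames.
_FRAME = re.compile(r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frames(trace_text):
    """Return the reliable function names of a backtrace, in order."""
    names = []
    for line in trace_text.splitlines():
        match = _FRAME.search(line)
        if match and not match.group(1):  # skip "?" frames
            names.append(match.group(2))
    return names
```

If frames(old_trace) == frames(new_trace) holds for the two logs, it is very likely the same call path.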