Sporadic 3.12.* panics on invalid function-by-pointer call

From: Mikhail Sennikovskii
Date: Wed May 20 2015 - 15:02:04 EST

Next message: Thomas Gleixner: "Re: [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management"
Previous message: Steven Rostedt: "Re: [PATCH 3/5] trace-cmd: lookup syscall names in profile"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Fellows,

We are having sporadic panics of our 3.12.38 and 3.12.40 kernels,
running on a 64-cpu
host with AMD Opteron(tm) 6386 SE processors, which are presumably
caused by a call on a corrupted function pointer stored in kernel
.rodata or .data sections (e.g. fair_sched_class::task_tick).
Both 3.12.38 and 3.12.40 crashes seem to have the same origin,
although the 3.12.38 ones are more representative, so I'll stick to
them for shortness, but I can provide a more detailed info on the
3.12.40 ones on request as well.
So the 3.12.38 crashes always happen in a context of the
hrtimer_interrupt, e.g. a
typical stack trace we are getting on panic is:
..
RIP: 0010:[<ffffffff81072eb7>] [<ffffffff81072eb7>] cpus_share_cache+0x7/0x30
..
[884252.542536] [<ffffffff81073548>] ? scheduler_tick+0x68/0xe0
[884252.542762] [<ffffffff81052a12>] update_process_times+0x62/0x80
[884252.542882] [<ffffffff8109f1af>] tick_sched_timer+0x4f/0x120
[884252.543003] [<ffffffff8106972c>] __run_hrtimer+0x7c/0x1c0
[884252.543268] [<ffffffff81069d4f>] hrtimer_interrupt+0xff/0x230
[884252.543511] [<ffffffff8102d906>] local_apic_timer_interrupt+0x36/0x60
[884252.543634] [<ffffffff8102df5e>] smp_apic_timer_interrupt+0x3e/0x60
[884252.543757] [<ffffffff8171dbca>] apic_timer_interrupt+0x6a/0x70

(note that we are building the kernel with frame pointer omission disabled)
where all functions offsets definitely correspond to the addresses
following the appropriate calls,
and in particular scheduler_tick+0x68 corresponds to the
curr->sched_class->task_tick(rq, curr, 0); call (
http://lxr.free-electrons.com/source/kernel/sched/core.c?v=3.12#L2164
),
apparently cpus_share_cache is not the one that should be called.
Notably that RIP is always the same on this type of crashes.

The really interesting thing is that the contents of all
sched_class'es is absolutely valid, when we check it in our panic
handler (invoked from a page fault context).

So far we do not have a 100% prove that this is really a function
pointer being incorrect, however all the registers and stack data we
have on the crash prove that this is the case.
We are now doing some instrumentations and also browsing through the
hrtimer-related code to find the issue, still since this issue is a
really rarely happening one, and we're having troubles with
reproducing it,
we decided to ask if someone else has faced something similar or knows
a fix for this.
Below I'm giving a more detailed info on the registers and stack data
we have on the crash, which indirectly prove our assumption if someone
is interested.

If it indeed turns to be the case that the pointer is invalid by the
time of a function call, but then later is absolutely valid in a page
fault handler context, as I mentioned above,
I would think that the CPU has invalid TLB entry by the time of the
call, which gets flushed when the page fault occurs, but then again,
it's not clear to me, why that invalid TLB entry might appear.

Any comments or suggestions are very welcome.

And here is a bit more detailed info I was mentioning above.
I still intended to keep it short, so only some basic things are noted
here, but I can give you a more detailed info if needed.

One thing to note is that the scheduler_tick &
curr->sched_class->task_tick(rq, curr, 0); is not the only place where
it crashes for us.
E.g. we also get crashes on a clock->read(clock); call in
timekeeping_get_ns (
http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.12#L174
)

[<ffffffff81096d9d>] ktime_get+0x4d/0xd0
[<ffffffff8109f175>] tick_sched_timer+0x25/0x120
[<ffffffff8106971c>] __run_hrtimer+0x7c/0x1c0
[<ffffffff81069d3f>] hrtimer_interrupt+0xff/0x230

The interesting thing about that crash is that hrtimer_interrupt does
the timekeeping_get_ns (and thus the clock->read(clock);) at its start
http://lxr.free-electrons.com/source/kernel/hrtimer.c?v=3.12#L1293 ,
and apparently it succeeds there, which makes me think this might be
related to some hrtimer callback misbehave.

Still to stick with the crash I've started with, here is a portion of
the stack and register values we had on the crash,
which indirectly prove that this is exactly the
fair_sched_class.task_tick pointer was called, and was incorrect by
the time of call:

==REGISTERS==

RSP: 0018:ffff882007d83df8 EFLAGS: 00010046
RAX: ffffffff8180ff40 ; should contain a pointer to the sched_class
instance by the time of the curr->sched_class->task_tick(rq, curr, 0)
call, so it does on crash:

scheduler_tick+98: callq *0x90(%rax)

(gdb) info symbol 0xffffffff8180ff40
fair_sched_class in section .rodata

RBX: ffff882007d91e00
RCX: 000000000000a888
RDX: 0000000000000000
RSI: fffffffffcd2e0b0 <- R15 should match RSI on
curr->sched_class->task_tick, so it does on the crash
scheduler_tick+2: mov %r15,%rsi

RDI: 0000000007d91e00 <- RDI should match RBX on
curr->sched_class->task_tick, here it is 0xffff882007d91e00ULL &
0xffffffff
which coresponds to
cpus_share_cache+3: movslq %edi,%rdi

[2257877.222647] RBP: ffff882007d83e38 R08: 0000000000000000 R09:
0016718e9d00e7aa
[2257877.222848] R10: 7fffffffffffffff R11: 7fffffffffffffff R12:
0000000000011e00
[2257877.223074] R13: 000000000000001e <- should contain cpu id on
curr->sched_class->task_tick

scheduler_tick+16: mov %gs:0xa0e4,%r13d

R14: 000000000000001e <- same as above: scheduler_tick+25>: movslq %r13d,%r14
R15: ffff880dfcd2e0b0 <- rsi should match R15 on
curr->sched_class->task_tick, so it does on the crash
scheduler_tick+2: mov %r15,%rsi
FS: 00007f36497178e0(0000) GS:ffff882007d80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffc094ad80 CR3: 0000001fd8ee7000 CR4: 00000000000407e0

==STACK==

(ffff882007d83df8) ffff882007d83e38 <- RSP points here, and it
contains the RBP value (cpus_share_cache+6: push %rbp )
(ffff882007d83e00) ffffffff81073548 [ scheduler_tick+0x68/0xe0 ] <-
return address
(ffff882007d83e08) ffff882007d83e28 <- rsp in scheduler_tick points
here, (scheduler_tick+4: sub $0x30,%rsp)
(ffff882007d83e10) ffff880dfcd2e0b0
(ffff882007d83e18) 0000000000000000 <- scheduler_tick+32: mov
%r12,-0x20(%rbp) , saving r12, containing "int user_tick == 0"
(ffff882007d83e20) 000000000000001e <- scheduler_tick+8: mov
%r13,-0x18(%rbp) r13 assigned with the cpu id in update_process_times,
which matches the cpu id we get on panic
(ffff882007d83e28) ffff882007d8c1c0
(ffff882007d83e30) 000803de46416dca
(ffff882007d83e38) ffff882007d83e68 <- correct rbp data pointing to
the next frame
(ffff882007d83e40) ffffffff81052a12 [ update_process_times+0x62/0x80 ]
<- return address
...

Thanks and best regards,
Mikhail
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Thomas Gleixner: "Re: [PATCH 3/7] x86/intel_rdt: Add support for cache bit mask management"
Previous message: Steven Rostedt: "Re: [PATCH 3/5] trace-cmd: lookup syscall names in profile"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]