Re: Random panic in load_balance() with 3.16-rc

From: Peter Zijlstra
Date: Wed Jul 23 2014 - 13:03:41 EST


On Wed, Jul 23, 2014 at 09:54:23AM -0700, Linus Torvalds wrote:
> On Wed, Jul 23, 2014 at 8:55 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >>
> >> I haven't seen the full oops, can you forward the screenshot? The
> >> exact register state might give some clues.
> >
> > Sure, here goes.
>
> So the length is fine, and the disassembly shows that it is fixed (16
> 32-bit words - why the heck does it use "movsl" rather than "movsq",
> whatever).
>
> The problem is %rdi, which has the value ffff10043c803e8c, which isn't
> canonical. Which is why it GP-faults.
>
> That value is loaded from the stack:
>
> mov -0x88(%rbp),%rdi
>
> so apparently the original "__get_cpu_var(load_balance_mask)" is
> already corrupted, or something has corrupted it on the stack since
> loading (but that looks unlikely).
>
> And I wonder if I have a clue. Look, load_balance_mask is a
> "cpumask_var_t", but I don't see an "alloc_cpumask_var()" for it.
> That's broken with CONFIG_CPUMASK_OFFSTACK.
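
For reference, the "isn't canonical" observation about %rdi in the quoted text can be checked with a small userspace sketch. It assumes 48-bit virtual addresses (the common 4-level paging case); the helper name is_canonical() is made up for illustration and is not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if addr is canonical on an x86-64 CPU that implements
 * va_bits of virtual address. Canonical means bits 63..(va_bits - 1)
 * are all identical (all zeros or all ones).
 */
static bool is_canonical(uint64_t addr, unsigned int va_bits)
{
    assert(va_bits >= 2 && va_bits <= 64);

    uint64_t top = addr >> (va_bits - 1);            /* bits 63..va_bits-1 */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1); /* same width, all set */

    return top == 0 || top == all_ones;
}
```

With va_bits = 48, the value ffff10043c803e8c has bits 63..48 set but bit 47 clear, so the check fails, which is consistent with the GP fault described above.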

kernel/sched/core.c:sched_init()

plays horrible allocation tricks, which I suppose we should clean up;
sched_init() appears to be called late enough to use regular per-cpu
allocations.
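
A rough userspace model of the kind of trick meant here, assuming the OFFSTACK masks are carved out of one larger zeroed allocation and the per-cpu slots are pointed into it by hand. Every name and size below (model_*, MODEL_*) is a stand-in for illustration, not the actual sched_init() code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_CPUS   4   /* hypothetical number of possible CPUs */
#define MODEL_MASK_SIZE 8   /* hypothetical stand-in for cpumask_size() */

struct model_cpumask {
    unsigned char bits[MODEL_MASK_SIZE];
};

/* Stand-in for the per-cpu load_balance_mask slots in the OFFSTACK case. */
static struct model_cpumask *model_load_balance_mask[MODEL_NR_CPUS];

/*
 * Carve one mask per CPU out of a single zeroed block.
 * Returns the block (so the caller can free it) or NULL on failure.
 */
static void *model_init_masks(void)
{
    unsigned char *block = calloc(MODEL_NR_CPUS, sizeof(struct model_cpumask));
    unsigned char *ptr = block;

    if (!block)
        return NULL;

    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        model_load_balance_mask[cpu] = (struct model_cpumask *)ptr;
        ptr += sizeof(struct model_cpumask);
    }
    return block;
}

/* Look up one CPU's mask; the slot is only valid after model_init_masks(). */
static struct model_cpumask *model_mask_for(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return model_load_balance_mask[cpu];
}
```

In this model every slot is a valid pointer once model_init_masks() has run, so a bogus pointer would have to come from somewhere else.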

> I think you actually want "load_balance_mask" to be a "struct cpumask *", no?
>
> Alternatively, keep it a "cpumask_var_t", but then you need to use
> __get_cpu_pointer() to get the address of it, and use
> "alloc_cpumask_var()" to allocate area for the OFFSTACK case.

I'm always terminally confused by that interface, but this code hasn't
changed in a long while, and I would expect other crashes if this was
really funky like that.
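
For readers equally confused by the interface, here is a minimal userspace sketch of why a cpumask_var_t behaves differently under CONFIG_CPUMASK_OFFSTACK: in one configuration the type is a bare pointer that needs storage allocated, in the other it is a one-element array with storage built in. The demo_* names and the DEMO_CPUMASK_OFFSTACK switch are invented for this sketch; only the pointer-versus-array split mirrors the kernel's typedef.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct demo_cpumask {
    unsigned long bits[1];
};

#ifdef DEMO_CPUMASK_OFFSTACK
/* OFFSTACK: the variable is only a pointer; storage must be allocated. */
typedef struct demo_cpumask *demo_cpumask_var_t;

static bool demo_zalloc_cpumask_var(demo_cpumask_var_t *mask)
{
    *mask = calloc(1, sizeof(**mask));
    return *mask != NULL;
}

static void demo_free_cpumask_var(demo_cpumask_var_t mask)
{
    free(mask);
}
#else
/* !OFFSTACK: the variable is a one-element array, so storage is built in. */
typedef struct demo_cpumask demo_cpumask_var_t[1];

static bool demo_zalloc_cpumask_var(demo_cpumask_var_t *mask)
{
    memset(*mask, 0, sizeof(*mask));   /* nothing to allocate, just clear */
    return true;
}

static void demo_free_cpumask_var(demo_cpumask_var_t mask)
{
    (void)mask;                        /* nothing to free */
}
#endif

/* Works in both configurations because the array form decays to a pointer. */
static void demo_set_cpu(unsigned int cpu, struct demo_cpumask *mask)
{
    assert(cpu < 8 * sizeof(mask->bits));
    mask->bits[0] |= 1UL << cpu;
}
```

Code that skips the allocation step compiles and works in the array configuration but dereferences an unset pointer in the pointer configuration, which is the failure mode being asked about in the quoted text.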