Re: Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

From: Lennart Sorensen
Date: Tue Dec 20 2016 - 11:59:47 EST


On Mon, Dec 19, 2016 at 01:09:39PM -0500, Lennart Sorensen wrote:
> I am trying to debug a problem that has been happening occasionally for
> years on some of our systems running a 3.0.101 kernel (yes I know it is
> old, we are moving to 4.9 at the moment but I would like older releases
> to be fixed too, assuming 4.9 makes this problem disappear).
>
> What is happening is that once in a while a process does something wrong
> and segfaults, and dumps core. We have a handler to process the core dump
> to name it and compress it and make sure we don't keep too many around,
> so the core_pattern uses the pipe option to pipe the dump to a shell
> script that saves it with the pid and current timestamp and gzips it.
>
> Once in a while when this happens, the kernel hits a null pointer
> dereference in fpu.state->xsave while doing __switch_to.
>
> The system is x86_64 with dual E5-2620 CPUs (6 cores each with
> hyperthreading). Some people think they have seen it on other systems,
> but are not sure. I have not been able to trigger it on other systems
> yet.
>
> It used to take about a week of running tests to trigger it, but I have
> now managed to hit it in a few minutes pretty reliably.

If core_pattern is not set to use a pipe, but just saves the dump to a
file named core.%e.%p, then the problem does not happen.
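
For anyone trying to reproduce this, here is a minimal sketch of the kind
of pipe handler described above. It is not the actual script (that one is
a shell script); this is a Python stand-in, and the install path, the
/var/crash directory, the file naming and the keep count are all
assumptions made for illustration. It reads the core image from stdin,
gzips it under a name built from the pid and timestamp, and removes the
oldest dumps beyond a limit.

```python
#!/usr/bin/env python3
# Hypothetical core_pattern pipe handler (illustrative, not the real script).
# Assumed setup, with %p = pid and %t = dump time in seconds since the epoch:
#   echo '|/usr/local/bin/save_core.py %p %t' > /proc/sys/kernel/core_pattern
import gzip
import os
import re
import shutil
import sys

# Assumed naming scheme: core.<pid>.<timestamp>.gz
CORE_NAME = re.compile(r"^core\.(\d+)\.(\d+)\.gz$")


def save_core(stream, directory, pid, timestamp, keep=5):
    """Gzip the core image from `stream`, then keep only the newest `keep` dumps."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "core.%d.%d.gz" % (pid, timestamp))
    with gzip.open(path, "wb") as out:
        shutil.copyfileobj(stream, out)

    # Order existing dumps by the timestamp embedded in the file name.
    dumps = []
    for name in os.listdir(directory):
        match = CORE_NAME.match(name)
        if match:
            dumps.append((int(match.group(2)), int(match.group(1)), name))
    dumps.sort()

    # Remove the oldest dumps so that at most `keep` remain.
    for _, _, name in dumps[:max(len(dumps) - keep, 0)]:
        os.remove(os.path.join(directory, name))
    return path


if __name__ == "__main__" and len(sys.argv) == 3:
    # The kernel supplies the core image on stdin; /var/crash is an assumed location.
    save_core(sys.stdin.buffer, "/var/crash", int(sys.argv[1]), int(sys.argv[2]))
```

The relevant difference from the plain core.%e.%p case is that with a
pipe the kernel starts a user-mode helper and streams the dump to it,
rather than writing the file itself.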

--
Len Sorensen