Re: State of kgdb on x86-64

From: Jason Wessel
Date: Tue Jan 15 2008 - 23:10:26 EST


Jan Kiszka wrote:
> Jason Wessel wrote:
>
>> Jan Kiszka wrote:
>>
>>> Jason Wessel wrote:
>>>
>>>
>>>> It was working at the point that I tested it with the 2.6.24-rc5 on
>>>> x86_64. However I suspect my kernel config may differ drastically from
>>>> what you are using.
>>>>
>>>> Without any other context provided than the generic message, it is hard
>>>> to know what might have happened.
>>>>
>>>>
>>> Here is the promised .config. I could also dig out the backtrace of the
>>> panic as kgdb sees it if that helps, just let me know.
>>>
>>> Jan
>>>
>>>
>>>
>> The backtrace might be very telling as to what happened. More
>> information is always better than less :-)
>>
>>
>
> My primary test box is again out of reach, but meanwhile I was able to
> reproduce some kind of problem under QEMU - that one at least is
> triggered by SMP. With only one CPU -> all apparently fine. Once booting
> QEMU with "-smp 2" -> this happens:
>
> (gdb) tar remote /dev/pts/6
> Remote debugging using /dev/pts/6
> Not all CPUs have been synced for KGDB
> breakpoint () at kernel/kgdb.c:1895
> 1895 wmb(); /* Sync point after breakpoint */
> (gdb) c
> Continuing.
> Not all CPUs have been synced for KGDB
> [New Thread 32769]
>
> Program received signal SIGFPE, Arithmetic exception.
> [Switching to Thread 32769]
> 0xffffffff8020adb7 in default_idle () at include/asm/irqflags_64.h:140
> 140 __asm__ __volatile__("sti; hlt" : : : "memory");
> (gdb) bt
> #0 0xffffffff8020adb7 in default_idle () at include/asm/irqflags_64.h:140
> #1 0xffffffff8020ae65 in cpu_idle () at arch/x86/kernel/process_64.c:225
> #2 0xffffffff8021ccb9 in start_secondary () at arch/x86/kernel/smpboot_64.c:375
> #3 0x0000000000000000 in ?? ()
> (gdb)
>
> The problem seems to be related to continuing SMP boxes. I'm able to
> boot my box up if I leave kgdb unattached. But when I then later attach
> and continue execution, I get the same crash. Any ideas what goes wrong,
> any suggestion where to start digging? Maybe at "Not all CPUs have been
> synched"?
>

Generally speaking, when you get an error that the CPUs have not been
synced, it means that the IPI which was sent to all the non-master
processors failed. I took a quick look and it appears that a DIE_TRAP
is occurring after kgdb sends the IPI to the non-master cores with the call:

send_IPI_allbutself(APIC_DM_NMI);

In prior kernels that ultimately resulted in an NMI trap. I am not sure
what causes the DIE_TRAP as a result of the IPI. For now, if you
add the statement "case DIE_TRAP:" right before "case
DIE_NMIWATCHDOG:" in arch/x86/kernel/kgdb_64.c, it will sync the
processors. However, the kernel should not be trapping with this error
code from the IPI event. I suspect there has been some kind of change
to the way the IPI/NMI handling is being done in the latest kernels.
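
To make the shape of that workaround concrete, here is a rough
user-space sketch of the case fall-through. It is not the actual code
from kgdb_64.c: the enum values and the function name
kgdb_treats_as_sync_nmi() are made up for illustration, and the real
notifier does more than return a flag.

```c
#include <stdio.h>

/*
 * Hypothetical stand-ins for the kernel's die notifier codes. The real
 * definitions live in the kernel headers; these values are assumptions
 * chosen only so the sketch compiles on its own.
 */
enum die_val_sketch {
	DIE_OOPS = 1,
	DIE_TRAP,
	DIE_NMI,
	DIE_NMIWATCHDOG,
};

/*
 * Hypothetical helper: returns 1 if the event would be handled as the
 * CPU sync request sent by the kgdb master, 0 otherwise.
 */
static int kgdb_treats_as_sync_nmi(enum die_val_sketch cmd)
{
	switch (cmd) {
	case DIE_TRAP:		/* workaround: added label, falls through */
	case DIE_NMIWATCHDOG:
		return 1;
	default:
		return 0;
	}
}
```

With the extra label in place, kgdb_treats_as_sync_nmi(DIE_TRAP) and
kgdb_treats_as_sync_nmi(DIE_NMIWATCHDOG) both return 1, while an
unrelated code such as DIE_OOPS returns 0. Without the label, DIE_TRAP
would land in the default branch and the slave CPU would never be
counted as synced.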

Jason.