Re: 4.4-rc5 Setting hardware breakpoint in int_ret_from_sys_call causes triple fault/reboot

From: Jeff Merkey
Date: Thu Dec 17 2015 - 13:57:31 EST


On 12/17/15, Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
> On 12/16/15, Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
>> On 12/16/15, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>> On Wed, Dec 16, 2015 at 4:31 PM, Jeff Merkey <linux.mdb@xxxxxxxxx>
>>> wrote:
>>>> On 12/16/15, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>> On Dec 16, 2015 3:12 PM, "Jeff Merkey" <linux.mdb@xxxxxxxxx> wrote:
>>>>>>
>>>>>> Setting a hardware breakpoint at the
>>>>>>
>>>>>> rex64 sysret
>>>>>>
>>>>>> instruction at the end of int_ret_from_sys_call causes the system to
>>>>>> triple fault and reboot when the breakpoint is triggered. It appears
>>>>>> to be related to the same problem as the lockup.
>>>>>>
>>>>>> This function can be stepped over and traced through with the TRAP
>>>>>> FLAG set so long as a hardware breakpoint is set somewhere in the
>>>>>> function. Otherwise, upon exit, the system hard hangs. If you break
>>>>>> exactly on that instruction -- reboot. If you break a few
>>>>>> instructions before it and single step through the call it works. If
>>>>>> you step through the call with no breakpoint the system hard hangs.
>>>>>> Same behavior as when you try to step from inside an nmi handler.
>>>>>> Looks related.
>>>>>
>>>>> You're probably encountering the user mode RSP when SYSRET happens.
>>>>>
>>>>> --Andy
>>>>>
>>>>
>>>> Hi Andy,
>>>>
>>>> Could be, but I am getting a double fault message with an error code
>>>> of 0 that then scrolls off the screen when the triple fault hits. It
>>>> flashes too quickly to get the function address -- wish I had a logic
>>>> analyzer with an inverse assembler -- would already be there. A
>>>> user-mode RSP would, I assume, clear the TRAP flag, and that does not
>>>> explain why it works if I set a breakpoint right above the instruction
>>>> and then step over it, which I can do without the triple fault.
>>>>
>>>> Easy to reproduce: download the mdb debugger for 4.3.3 and apply it to
>>>> 4.4-rc5, modprobe mdb, echo a > /proc/sysrq-trigger, u
>>>> int_ret_from_sys_call (scroll until you get to the swapgs followed by
>>>> rex64 sysret), set a hardware breakpoint at that address, i.e. b
>>>> ffffffff81673ae1 (or whatever address the swapgs instruction is at),
>>>> then step through with t a few times (it should just return after rex64
>>>> sysret since it returns to user space). Then set a breakpoint at the
>>>> rex64 sysret instruction, b <address>, let it break at the
>>>> instruction, then hit g for go and watch the fireworks -- it will try
>>>> to print a double fault message then reboot.
>>>>
>>>> I handle the whole user RSP thing, I just return if I see regs set to
>>>> user space. This looks like some sort of problem in the exception
>>>> handlers.
>>>
>>> It's kernel regs but user RSP.
>>>
>>> --Andy
>>>
>>
>> Right, I handle that case and have handled it since about 2001.
>> Before all the changes, I could step from user space to kernel
>> space with mdb. I have not been able to do that for a while, since
>> Linus fixed the VM in about 2002.
>>
>> So I handle that case.
>>
>> Jeff
>>
>
> It looks like this bug is the result of an architectural decision,
> and I don't think there is anything I can do about it without a very
> large, very ugly patch that alters the architecture of Linux. Linux
> has loaded an MSR value into the processor and called swapgs, gets a
> breakpoint exception, the MSR gets changed again and swapped somewhere
> else, then hits the next instruction. The triple fault is a GP, SS,
> and UD.
>
> This is a case where Linux was not designed for a debugger, and to fix
> this is a BIG job. It will require lots of changes in places we probably
> shouldn't be changing, including all exception handlers, and possibly
> removal of the swapgs instruction. This one I will document as a
> known limitation of Linux and move on.
>
> There will be no patch unless someone asks me to try to fix this.
> Bottom line, linux is debugger hostile and not designed for one. What
> tools there are will have problems on linux for debugging until Linus
> decides Linux will become a more debugger friendly place. I've
> written several commercial operating systems in my 35 years of
> programming, and the first item I always write before a kernel,
> drivers, or anything else is a debugger. The OS is then built on top
> of it.
>
> Linus read a book and decided to write an OS and his system reflects
> that -- no thought of debuggers and his development process operates a
> lot like a public library. It's not all bad -- look how far he got.
>
> This bug is closed since I know what it is. The probability of this
> occurring during normal operations is very low unless you debug and
> break between a swapgs instruction and a rex64 sysret or set a breakpoint
> anywhere near this instruction.
>
> Linux Documentation
>
> https://www.kernel.org/doc/Documentation/x86/entry_64.txt
>
> "... Dealing with the swapgs instruction is especially tricky. Swapgs
> toggles whether gs is the kernel gs or the user gs. The swapgs
> instruction is rather fragile: it must nest perfectly and only in
> single depth, it should only be used if entering from user mode to
> kernel mode and then when returning to user-space, and precisely
> so. If we mess that up even slightly, we crash.
>
> So when we have a secondary entry, already in kernel mode, we *must
> not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
> not switched/swapped yet. ..."
>
> :-)
>
> Jeff
>

Added to the MDB website and project pages to explain this problem.

Limitations of Linux with Kernel Debuggers

Linux was not architected to support kernel debuggers, and several
areas of Linux are blacked out to kernel debuggers because of how
Linux is designed. In x86_64 mode, Linux uses the swapgs instruction
to swap the gs base on transitions between user space and kernel
space. You can set breakpoints on and around a swapgs instruction;
however, because of how the instruction works, the system may crash if
you attempt to step after swapgs has executed and before the
instruction that performs the sysret. This is a very rare case that
typically will not be encountered, but don't try to step over a section
of code with a swapgs instruction that is followed by some sort of
system return. On Linux you will see something like this in the
disassembly:

swapgs
rex64 sysret

Don't try to step in between these two instructions. It's safe to do
so after the sysret executes but not between them. Debugging NMI
handlers in Linux can be done, but the system may not be recoverable
after you have debugged these sections of code, due to how Linux
designed its NMI callbacks. If you want to debug Linux without many of
these limitations, use MDB in Direct Mode when you compile it. Direct
Mode allows MDB to take control of the debugger hardware from the
operating system, which removes many of the blacked-out areas and
allows you to debug them. Direct Mode will not help you with the
swapgs instruction problem, however.