Re: [PATCH 1/1] Fix int1 recursion when no perf_bp_event is registeredy

From: Jeff Merkey
Date: Mon Dec 14 2015 - 13:16:58 EST


On 12/14/15, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On Mon, Dec 14, 2015 at 9:52 AM, Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
>> On 12/14/15, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>>>
>>> * Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
>>>
>>>> On 12/14/15, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>>>> >
>>>> > A: Because it messes up the order in which people normally read text.
>>>> > Q: Why is top-posting such a bad thing?
>>>> > A: Top-posting.
>>>> > Q: What is the most annoying thing in e-mail?
>>>> >
>>>> > * Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
>>>> >
>>>> >> I trigger it by writing to the dr7 and dr1, 2, 3 or four register
>>>> >> and
>>>> >> set an
>>>> >> execute breakpoint without going through arch_install_hw_breakpoint.
>>>> >> When
>>>> >> the breakpoint fires, the system crashes and hangs on the processor
>>>> >> stuck in
>>>> >> an endless loop inside the int1 handler in hw_breakpoint.c --
>>>> >
>>>> > What is still not clear to me, can you trigger the hang not via some
>>>> > special
>>>> >
>>>> > kernel driver that goes outside regular APIs and messes with the
>>>> > state
>>>> > of the
>>>> > debug registers, but via the proper access methods, i.e. various
>>>> > user-space
>>>> > ABIs?
>>>>
>>>> Any process that can get access to the debug registers can trigger this
>>>> condition. [...]
>>>
>>> A process on an unmodified Linux kernel can only modify debug registers
>>> via
>>> the
>>> proper APIs:
>>>
>>>> [...] As it stands, if restricted to the established API in
>>>> hw_breakpoint.c
>>>> this bug should not occur unless someone triggers an errant breakpoint.
>>>> [...]
>>>
>>> So am I interpreting your report correctly:
>>>
>>> "If the Linux kernel is modified to change debug registers without
>>> using
>>> the
>>> proper APIs (such as loading a module that changes hardware registers
>>> in
>>> a raw
>>> fashion), things may break and a difficult to debug hang may occur."
>>>
>>> right?
>>>
>>> This key piece of information should have been part of the original
>>> report.
>>>
>>> So I'm wondering, why does your module modify debug registers in a raw
>>> fashion?
>>> Why doesn't it use the proper APIs?
>>>
>>> Thanks,
>>>
>>> Ingo
>>>
>>
>> Hi Ingo,
>>
>> This will be a lengthy reply to properly explain this to you. First
>> some fundamental assumptions to clear up.
>>
>> 1. The MDB Debugger Module does not cause this problem. This is an
>> existing bug in the kernel in an exception code path.
>
>> 8. If any process or module sets a breakpoint outside of linux
>> breakpoint API in this code path the system will crash. Its A BUG,
>> and it's been in linux 13 years. I am certain people have seen it
>> while running perf stuff but since it provides no diagnostic info,
>> someone would just reset the system.
>
> Putting "BUG" in caps doesn't make it so. What's wrong with it?

Sorry about that.

>
>> 9. This breakpoint API needs to be rewritten to be global breakpoint
>> aware, have an on/off switch so when a debugger enters an int1
>> exception, breakpoints are globally disabled (a requirement), among
>> other things.
>
> A "requirement" for what?
>

When you enter a debugger console you must disable all breakpoints
globally or you will get nested breakpoints inside the debugger. kgdb
and kdb do not handle this well. MDB handles it but sooner or later
you will run out of stack space if you allow it. Intel documentation
says it's a no no based on how they designed the debug facilities in
their processors ...

>>
>> The patch simply fixes the bug in the int handler that will cause a
>> lockup. The perf event system, kgdb, kdb, and any one of a number of
>> programs can trigger this bug, and probably have. People would blame
>> the debugger when its a bug in the int handler.
>
> You ignored feedback from me and from tglx, and you still haven't
> explained why this is a bug in the first place. Maybe the code could
> degrade more gracefully if you use it wrong, but the int1 handler and
> the rest of the kernel are very much aware of each other, and the int1
> handler's failure to do what something that isn't in the kernel wants
> it to do isn't a bug.
>
> If you submit a clean patch to improve robustness of the handler and
> if the new code is at least as clean as the old code, that might be a
> different story.
>
> --Andy
>

Responded to the rest of it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/