Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack

From: Andy Lutomirski
Date: Fri Jul 22 2016 - 19:18:37 EST


On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
>> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
>> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>> >> >> > Now that we can find pt_regs registers in the middle of the stack due to
>> >> >> > an interrupt or exception, we can print them. Here's what it looks
>> >> >> > like:
>> >> >> >
>> >> >> > ...
>> >> >> > [<ffffffff8106f7dc>] do_async_page_fault+0x2c/0xa0
>> >> >> > [<ffffffff8189f558>] async_page_fault+0x28/0x30
>> >> >> > RIP: 0010:[<ffffffff814529e2>] [<ffffffff814529e2>] __clear_user+0x42/0x70
>> >> >> > RSP: 0018:ffff88007876fd38 EFLAGS: 00010202
>> >> >> > RAX: 0000000000000000 RBX: 0000000000000138 RCX: 0000000000000138
>> >> >> > RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000061b640
>> >> >> > RBP: ffff88007876fd48 R08: 0000000dc2ced0d0 R09: 0000000000000000
>> >> >> > R10: 0000000000000001 R11: 0000000000000000 R12: 000000000061b640
>> >> >> > R13: 0000000000000000 R14: ffff880078770000 R15: ffff880079947200
>> >> >> > [<ffffffff814529e2>] ? __clear_user+0x42/0x70
>> >> >> > [<ffffffff814529c3>] ? __clear_user+0x23/0x70
>> >> >> > [<ffffffff81452a7b>] clear_user+0x2b/0x40
>> >> >> > ...
>> >> >>
>> >> >> This looks wrong. Here are some theories:
>> >> >>
>> >> >> (a) __clear_user is a reliable address that is indicated by RIP: ....
>> >> >> Then it's found again as an unreliable address as "?
>> >> >> __clear_user+0x42/0x70" by scanning the stack. "?
>> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
>> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
>> >> >> suppressed because it matched a reliable address?
>> >> >>
>> >> >> (b) You actually intended for all the addresses to be printed, in
>> >> >> which case "? __clear_user+0x42/0x70" should have been
>> >> >> "__clear_user+0x42/0x70" and you have a bug. In this case, it's
>> >> >> plausible that your state machine got a bit lost leading to "?
>> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
>> >> >> it's a real frame and you didn't find it).
>> >> >>
>> >> >> (c) Something else and I'm confused.
>> >> >
>> >> > So there's a subtle difference between addresses reported by regs->ip
>> >> > and normal return addresses. For example:
>> >> >
>> >> > ...
>> >> > [<ffffffff8189ff4d>] smp_apic_timer_interrupt+0x3d/0x50
>> >> > [<ffffffff8189de6e>] apic_timer_interrupt+0x9e/0xb0
>> >> > RIP: 0010:[<ffffffff8129b350>] [<ffffffff8129b350>] path_init+0x0/0x750
>> >> > RSP: 0018:ffff880036a3fd80 EFLAGS: 00000296
>> >> > RAX: ffff88003691aa40 RBX: ffff880036a3ff08 RCX: ffff880036a3ff08
>> >> > RDX: ffff880036a3ff08 RSI: 0000000000000041 RDI: ffff880036a3fdb0
>> >> > RBP: ffff880036a3fda0 R08: 0000000000000000 R09: 0000000000000010
>> >> > R10: 8080808080808080 R11: fefefefefefefeff R12: ffff880036a3fdb0
>> >> > R13: 0000000000000001 R14: ffff880036a3ff08 R15: 0000000000000000
>> >> > <EOI>
>> >> > [<ffffffff8129b350>] ? lookup_fast+0x3d0/0x3d0
>> >> > [<ffffffff8129c81b>] ? path_lookupat+0x1b/0x120
>> >> > [<ffffffff8129ddd1>] filename_lookup+0xb1/0x180
>> >> > ...
>> >> >
>> >> > In this case the irq hit right after path_lookupat() called into
>> >> > path_init(). So the "path_init+0x0" printed by __show_regs() is right.
>> >> >
>> >> > Note the backtrace reports the same address, but it instead describes it
>> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast(). That's
>> >> > because normally, such an address after a call instruction at the end of
>> >> > a function would indicate a tail call (e.g., to a noreturn function).
>> >> > If that were the case, printing "path_init+0x0" would be completely
>> >> > wrong, because path_init() just happens to be the function located
>> >> > immediately after lookup_fast().
>> >> >
>> >> > Maybe I could add some special logic to say: "if this return address was
>> >> > from a call, use printk_stack_address(); else if it was from a pt_regs,
>> >> > use printk_address()." (The former prints the preceding function, the
>> >> > latter prints the current function.) Then we could remove the question
>> >> > mark.
>> >> >
>> >> > There's also the question of whether or not the address should be
>> >> > printed again, after it's already been printed by __show_regs(). I
>> >> > don't have a strong opinion either way.
>> >> >
>> >>
>> >> IIRC we don't show the actual faulting function in the call trace, so
>> >> we probably shouldn't duplicate the entry after the show_regs.
>> >
>> > Just to clarify, that's true today for cases where the stack dump starts
>> > from a handler which has regs. It starts dumping based on regs->ip and
>> > regs->bp, so the regs themselves aren't dumped.
>> >
>> > But for cases where regs are in the middle of the stack, they aren't
>> > detected today, and you'll still see the value of regs->ip dumped with a
>> > question mark.
>> >
>> > That said, with this patch, now that regs in the middle of the stack
>> > *are* being printed, I can't think of a good reason to print the return
>> > address twice: both in regs and the stack trace. So removing it from
>> > the stack trace is fine with me.
>> >
>> >> That being said, I'm still confused by the question marks. What
>> >> exactly is going on? Is the code really doing the right thing wrt
>> >> resuming the unwind? Is there a git tree with these patches applied
>> >> somewhere so I can look at it easily in context?
>> >
>> > show_trace_log_lvl() is doing two things in parallel: scanning all
>> > kernel text addresses on the stack while simultaneously using the
>> > unwinder to walk the frame pointers. Only those scanned addresses which
>> > are also found by the unwinder are printed without question marks.
>> >
>> > The pt_regs aren't in a frame of their own; they're just data inside of
>> > a bigger frame. (You may recall that you objected to my proposal to put
>> > them in their own frame :-)) So that's why the address stored in
>> > regs->ip was printed with a question mark: it's not in the header of a
>> > real frame; it's just data.
>>
>> It wasn't the separate frame part I was objecting to -- it was their
>> encoding on the stack. Maybe they should unwind as though they're a
>> separate frame. For example, the unwind API could give the frame that
>> returns to apic_timer_interrupt+0x9e/0xb0 and then the next frame
>> could literally list regs->ip as its retaddr (and maybe that frame or
>> even the following one should be the one with non-NULL
>> unwind_get_entry_regs).
>
> Having the unwinder treat the pt_regs as a "fake" frame is problematic:
>
> - As I described above, you can't treat regs->ip as a normal return
> value anyway.
>
> - Also, for exceptions and nested interrupts, the regs are stored on the
> interrupting stack. But for non-nested interrupts, they're stored on
> the thread stack. So the regs aren't always on the same stack as the
> corresponding encoded pt_regs pointer. Another issue is that there's
> not always a frame after the regs. For those reasons, creating a
> "fake" frame abstraction in the state machine is quite a bit trickier
> than just dealing with those details in the only place that cares
> about them: show_trace_log_lvl().
>
>> In some sense, the regs belong to the frame that got interrupted, not
>> the frame that did the interrupting. But maybe that's backwards -- if
>> we have DWARF, then the regs correspond to the regs at the time of a
>> call, and those regs are reasonably likely to contain the arguments to
>> the called function.
>>
>> But regardless of which way this goes, it seems quite awkward to me
>> that regs->ip never shows up as the return addr of any frame as
>> exposed by the unwind API.
>
> Again, regs->ip is special. It's not a call return address and we
> shouldn't force it to be.

This is only mostly true. If the exception was a trap, then it is
(e.g. if a function ends in int3, then regs->ip will be off the end).
But that's just me being pedantic.

More relevantly, regs->ip is a reliable address indicating a function
that will be returned to if we ever return, and both
show_trace_log_lvl() and the livepatch stuff should interpret it as
such. Whether this means the unwinder should change or
show_trace_log_lvl() should change isn't a big deal, but I think one
of them should change so we get this right.

>
> --
> Josh



--
Andy Lutomirski
AMA Capital Management, LLC