Re: çåï[PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs

From: Andy Lutomirski
Date: Thu Dec 25 2014 - 10:49:06 EST


On Thu, Dec 25, 2014 at 4:13 AM, çæå(æå) <chenggang.qcg@xxxxxxxxxxxxxxx> wrote:
> The context is NMI (PMU) or IRQ (hrtimer). It is a bit complex. The process
> we want to sample is the current, so it is always running.
> We need to distinguish between IRQ context, syscall or user context that are
> interrupted by NMI.

Oh. So you really are trying to get the user regs from NMI context
even if you interrupted the kernel. This will be an unpleasant thing
to do correctly.

> Syscall: sp = current->thread.usersp;

Nope. For x86_64 at least, sp is in old_rsp, not usersp, *if* the
syscall you interrupted was syscall64 instead of one of the ia32
entries, which are differently strange. If you've context switched
out and back, then usersp will match it. If not, then you're looking
at some random stale sp value.

Keep in mind that there is absolutely no guarantee that TIF_IA32
matches the syscall entry type.

> old_rsp always point to current->thread.usersp. May be we
> shouldn't use current_user_stack_pointer();
> User: sp = task_pt_regs(task)->sp;
> current's pt_regs are stored in kernel stack while NMI or IRQ
> occured.

This is the only easy case.

> IRQ: sp = task_pt_regs(task)->sp;
> current's pt_regs are stored in kernel stack while IRQ which was
> interrupted occured.

Sort of. It's true by the time you actually execute the IRQ handler.

I think that trying to do this is doomed to either failure or extreme
complexity. You're in an NMI, so you could be part-way through a
context switch or you could be in the very first instruction of the
syscall handler.

On a quick look, there are plenty of other bugs in there besides just
the stack pointer issue. The ABI check that uses TIF_IA32 in the perf
core is completely wrong. TIF_IA32 may be equal to the actual
userspace bitness by luck, but, if so, that's more or less just luck.
And there's a user_mode test that should be user_mode_vm.

Also, it's not just sp that's wrong. There are various places that
you can interrupt in which many of the registers have confusing
locations. You could try using the cfi unwind data, but that's
unlikely to work for regs like cs and ss, and, during context switch,
this has very little chance of working.

What's the point of this feature? Honestly, my suggestion would be to
delete it instead of trying to fix it. It's also not clear to me that
there aren't serious security problems here -- it's entirely possible
for sensitive *kernel* values to and up in task_pt_regs at certain
times, and if you run during context switch and there's no code to
suppress this dump during context switch, then you could be showing
regs that belong to the wrong task.

--Andy

>
> Regards
> Chenggang
>
> ------------------------------------------------------------------
> åääïAndy Lutomirski <luto@xxxxxxxxxxxxxx>
> åéæéï2014å12æ23æ(ææä) 16:30
> æääïroot <chenggang.qin@xxxxxxxxx>ïlinux-kernel
> <linux-kernel@xxxxxxxxxxxxxxx>
> æãéïçæå(æå) <chenggang.qcg@xxxxxxxxxx>ïAndrew Morton
> <akpm@xxxxxxxxxxxxxxxxxxxx>ïArjan van de Ven <arjan@xxxxxxxxxxxxxxx>ïDavid
> Ahern <dsahern@xxxxxxxxx>ïIngo Molnar <mingo@xxxxxxxxxx>ïMike Galbraith
> <efault@xxxxxx>ïNamhyung Kim <namhyung@xxxxxxxxx>ïPaul Mackerras
> <paulus@xxxxxxxxx>ïPeter Zijlstra <a.p.zijlstra@xxxxxxxxx>ïWu Fengguang
> <fengguang.wu@xxxxxxxxx>ïYanmin Zhang <yanmin.zhang@xxxxxxxxx>
> äãéïRe: [PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while
> output user regs
>
> On 12/22/2014 10:22 PM, root wrote:
>> From: Chenggang Qin <chenggang.qcg@xxxxxxxxxx>
>>
>> For x86_64, the exact value of user stack's esp should be got by
>> KSTK_ESP(current).
>> current->thread.usersp is copied from PDA while enter ring0.
>> Now, we output the value of sp from pt_regs. But pt_regs->sp has changed
>> before
>> it was pushed into kernel stack.
>>
>> So, we cannot get the correct callchain while unwind some user stacks.
>> For example, if the stack contains __lll_unlock_wake()/__lll_lock_wait(),
>> the
>> callchain will break some times with the latest version of libunwind.
>> The root cause is the sp that is used by libunwind may be wrong.
>>
>> If we use KSTK_ESP(current), the correct callchain can be got everytime.
>> Other architectures also have KSTK_ESP() macro.
>>
>> Signed-off-by: Chenggang Qin <chenggang.qcg@xxxxxxxxxx>
>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> Cc: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
>> Cc: David Ahern <dsahern@xxxxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> Cc: Mike Galbraith <efault@xxxxxx>
>> Cc: Namhyung Kim <namhyung@xxxxxxxxx>
>> Cc: Paul Mackerras <paulus@xxxxxxxxx>
>> Cc: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
>> Cc: Wu Fengguang <fengguang.wu@xxxxxxxxx>
>> Cc: Yanmin Zhang <yanmin.zhang@xxxxxxxxx>
>>
>> ---
>> arch/x86/kernel/perf_regs.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
>> index e309cc5..5da8df8 100644
>> --- a/arch/x86/kernel/perf_regs.c
>> +++ b/arch/x86/kernel/perf_regs.c
>> @@ -60,6 +60,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>> if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
>> return 0;
>>
>> + if (idx == PERF_REG_X86_SP)
>> + return KSTK_ESP(current);
>> +
>
> This patch is probably fine, but KSTK_ESP seems to be bogus:
>
> unsigned long KSTK_ESP(struct task_struct *task)
> {
> return (test_tsk_thread_flag(task, TIF_IA32)) ?
> (task_pt_regs(task)->sp) : ((task)->thread.usersp);
> }
>
> I swear that every time I've looked at anything that references TIF_IA32
> in the last two weeks, it's been wrong. This should be something like:
>
> if (task_thread_info(task)->status & TS_COMPAT)
> return task_pt_regs(task)->sp;
> else if (task == current && task is in a syscall)
> return current_user_stack_pointer();
> else if (task is not running && task is in a syscall)
> return task->thread.usersp;
> else if (task is not in a syscall)
> return task_pt_regs(task)->sp;
> else
> we're confused; give up.
>
> What context are you using KSTK_ESP in?
>
> --Andy
>
>> return regs_get_register(regs, pt_regs_offset[idx]);
>> }
>>
>>



--
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/