RE: [PATCH v2] [LBR] Dump LBRs on Exception

From: Berthier, Emmanuel
Date: Wed Dec 03 2014 - 14:24:21 EST


> From: Andy Lutomirski [mailto:luto@xxxxxxxxxxxxxx]
> Sent: Tuesday, December 2, 2014 9:12 PM
> To: Thomas Gleixner
> Cc: Berthier, Emmanuel; H. Peter Anvin; X86 ML; Jarzmik, Robert; LKML
> Subject: Re: [PATCH v2] [LBR] Dump LBRs on Exception
>
> On Tue, Dec 2, 2014 at 11:56 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> wrote:
> > On Tue, 2 Dec 2014, Andy Lutomirski wrote:
> >> TBH, I'm wondering whether this is actually a good idea. It might be
> >> more valuable and less scary to try to make this work for BUG instead.
> >> To get the most impact, it might be worth allocating a new exception
> >> vector for BUG and using 'int 0xwhatever', and the prologue to that
> >> could read out all the MSRs without any branches.
> >
> > BUG is pretty uninteresting. We usually know how we got there. Now
> > where LBR might be useful is if you have stack corruption and branch
> > into nirvana or access random crap via a few hoops. There the LBR data
> > might help, because the corrupted stack does not tell anything.

Right, I remember a strange panic where the register contents were not coherent with the EIP.
That's what I call a "jump into space".

> > So yes, this is a corner case debugging scenario, but given the
> > complexity of coordination with perf and the possible intrusiveness in
> > the low level entry path, we really need to see a few real world
> > examples where this helps, aside of the constructed case.

Here is such a "jump into space":

<1>[ 533.394885] BUG: unable to handle kernel paging request at c12329c0
<1>[ 533.394930] IP: [<c12329ed>] s0ix_complete+0x2d/0x30
<4>[ 533.394973] *pde = 362c8063 *pte = 01232161
<4>[ 533.395001] Oops: 0003 [#1] PREEMPT SMP
<4>[ 533.395030] Modules linked in: atomisp lm3554 mt9m114 ov8830 tty_hci wl12xx_sdio(O) wl12xx(O) mac80211(O) cfg80211(O) compat(O) rmi4(C) st_drv videobuf_vmalloc videobuf_core matrix(C) sgx(C)
<4>[ 533.395120]
<4>[ 533.395135] Pid: 0, comm: swapper/1 Tainted: G C O 3.4.21-183401-g34a7d6b #1 Intel Corporation CloverTrail/FFRD
<4>[ 533.395165] EIP: 0060:[<c12329ed>] EFLAGS: 00011ed6 CPU: 1
<4>[ 533.395184] EIP is at s0ix_complete+0x2d/0x30
<4>[ 533.395198] EAX: c12329c0 EBX: c1d120a0 ECX: 00000001 EDX: 00000000
<4>[ 533.395211] ESI: e0d26b2b EDI: 12e6492c EBP: f60a9eec ESP: f60a9ef0
<4>[ 533.395225] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
<4>[ 533.395239] CR0: 8005003b CR2: c12329c0 CR3: 01dfe000 CR4: 000007d0
<4>[ 533.395253] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
<4>[ 533.395265] DR6: ffff0ff0 DR7: 00000400
<0>[ 533.395279] Process swapper/1 (pid: 0, ti=f60a8000 task=f607b520 task.ti=f60a8000)
<0>[ 533.395290] Stack:
<4>[ 533.395299] c1230b36 f60a9f40 c153d73a 00000767 f60a9f10 c123b908 f60a9f0c 00000000
<4>[ 533.395342] 00000052 00000006 f67eb45c 00000001 00000007 00000000 c1d120a0 00000052
<4>[ 533.395384] 00000001 c1d120a0 00000214 f67eb45c f60a9f50 c174ccd5 fffffff0 00000001
<0>[ 533.395427] Call Trace:
<4>[ 533.395445] [<c1230b36>] pmu_set_s0ix_complete+0x16/0x20
<4>[ 533.395468] [<c153d73a>] ? soc_s0ix_idle+0x13a/0x4b0
<4>[ 533.395488] [<c123b908>] ? ns_to_timespec+0x28/0x40
<4>[ 533.395510] [<c174ccd5>] ? cpuidle_enter+0x15/0x20
<4>[ 533.395529] [<c174d2ef>] ? cpuidle_idle_call+0x9f/0x320
<4>[ 533.395549] [<c1209795>] ? cpu_idle+0x75/0xd0
<4>[ 533.395569] [<c19955bd>] ? start_secondary+0x23c/0x23e
<0>[ 533.395583] Code: e5 3e 8d 74 26 00 a1 44 91 e0 c1 8b 90 38 0a 00 00 85 d2 75 02 5d c3 8b 90 34 1d 00 00 31 c0 89 42 58 5d c3 8d 76 00 8d bc 27 00 <00> 00 00 55 89 e5 53 3e 8d 74 26 00 89 c1 31 c0 83 f9 3f 77 24
<0>[ 533.395862] EIP: [<c12329ed>] s0ix_complete+0x2d/0x30 SS:ESP 0068:f60a9ef0
<4>[ 533.395888] CR2: 00000000c12329c0

Analysis:

After replaying the scenario with T32 (Lauterbach TRACE32) in a normal situation and comparing the snapshots, it appears that, to get the same call stack, we must have come from pmu_set_s0ix_complete():

|void pmu_set_s0ix_complete(void)
437 |{
438 | if (pmu_ops->set_s0ix_complete)
NP:0000:C1230B28 |A14071E0C1 mov eax,[0xC1E07140] ; eax,pmu_ops
NP:0000:C1230B2D |8B401C mov eax,[eax+0x1C]
NP:0000:C1230B30 |85C0 test eax,eax
NP:0000:C1230B32 |7402 je 0xC1230B36
439 | pmu_ops->set_s0ix_complete();
NP:0000:C1230B34 |FFD0 call eax
440 |}
NP:0000:C1230B36 |5D pop ebp
NP:0000:C1230B37 |C3 ret

and we had not yet executed the beginning of s0ix_complete(), as EBP had not yet been stored on the stack and EAX had not been modified:

|void s0ix_complete(void)
257 |{
NP:0000:C12329C0 |55 s0ix_complete: push ebp
NP:0000:C12329C1 |89E5 mov ebp,esp
NP:0000:C12329C3 |3E8D742600 lea esi,ds:[esi+0x0]
258 | if (unlikely(mid_pmu_cxt->s0ix_entered))
NP:0000:C12329C8 |A14471E0C1 mov eax,[0xC1E07144] ; eax,mid_pmu_cxt
NP:0000:C12329CD |8B90380A0000 mov edx,[eax+0xA38]
NP:0000:C12329D3 |85D2 test edx,edx
NP:0000:C12329D5 |7502 jne 0xC12329D9
260 |}
NP:0000:C12329D7 |5D pop ebp
NP:0000:C12329D8 |C3 ret
259 | writel(0, &mid_pmu_cxt->pmu_reg->pm_msic);
NP:0000:C12329D9 |8B90341D0000 mov edx,[eax+0x1D34]
63 |build_mmio_write(writel, "l", unsigned int, "r", :"memory")
NP:0000:C12329DF |31C0 xor eax,eax
NP:0000:C12329E1 |894258 mov [edx+0x58],eax
260 |}
NP:0000:C12329E4 |5D pop ebp
NP:0000:C12329E5 |C3 ret
NP:0000:C12329E6 |8D7600 lea esi,[esi+0x0]
NP:0000:C12329E9 |8DBC2700000000 lea edi,[edi+0x0]

So the execution of "call eax" jumped directly to eax+0x2d instead of eax.

> One option would be to treat this as the debugging feature that it is.

Agree.

> Don't change the exception entry code at all. Instead add a run-time switch
> that will ask perf to set up the LBR stuff appropriately and will change the IDT
> entries to stubs that first use rdmsr to copy all the LBRs to some per-cpu, per-
> vector area and then jump to the real entry point.
>
> Yes, performance will suck, and there will be some artifacts if the exception
> nests inside itself before reporting the error (which can happen for page
> faults, I think), but this isn't the end of the world if it's an opt-in feature.
>
> And the whole pile of rdmsrs can be generated at run-time to match the
> running CPU.

The final patch will bypass the new code for user-space page faults, so the performance impact will be very low.
Copying the LBRs takes much more time than simply stopping/restarting LBR recording.

The simpler, the better:

.macro STOP_LBR
#ifdef CONFIG_LBR_DUMP_ON_EXCEPTION
testl $3,CS(%rsp) /* Kernel Space? */
jnz 1f
testl $3, PER_CPU_VAR(lbr_dump_state) /* Disabled? */
jnz 1f
push %rax
push %rcx
push %rdx
movl $MSR_IA32_DEBUGCTLMSR, %ecx
rdmsr
and $~1, %eax /* Clear bit 0 (LBR): disable LBR recording */
wrmsr
pop %rdx
pop %rcx
pop %rax
1:
#endif
.endm

Thx,

Emmanuel.