Re: [PATCH 1/5] x86, perf: Fix LBR call stack save/restore

From: Ingo Molnar
Date: Wed Oct 21 2015 - 12:24:14 EST



* Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> > mask = x86_pmu.lbr_nr - 1;
> > - tos = intel_pmu_lbr_tos();
> > + tos = task_ctx->tos;
> > for (i = 0; i < tos; i++) {
> > lbr_idx = (tos - i) & mask;
> > wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
> > @@ -247,6 +247,7 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
> > if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
> > wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
> > }
> > + wrmsrl(x86_pmu.lbr_tos, tos);
> > task_ctx->lbr_stack_state = LBR_NONE;
> > }
>
> Any idea how much more expensive that wrmsr() is compared to the rdmsr() it
> replaces?
>
> If it's significant we could think about having this behaviour depend on
> callstacks.

The WRMSR extra cost is probably rather significant - here is a typical Intel
WRMSR vs. RDMSR (non-hardwired) cache-hot/cache-cold cost difference:

[ 170.798574] x86/bench: -------------------------------------------------------------------
[ 170.807258] x86/bench: | RDTSC-cycles: hot (±noise) / cold (±noise)
[ 170.816115] x86/bench: -------------------------------------------------------------------
[ 212.146982] x86/bench: rdtsc : 16 / 60
[ 213.725998] x86/bench: rdmsr : 100 / 148
[ 215.469958] x86/bench: wrmsr : 456 / 708

That's on a Xeon E7-4890 (22nm IvyBridge-EX).

So a WRMSR costs roughly 350-550 RDTSC cycles more than the RDMSR it replaces ...
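
For reference, the per-call delta can be derived directly from the hot and cold
columns of the table above. The following is only a minimal sketch of that
arithmetic, not kernel code: the struct and helper names (msr_cost,
wrmsr_extra_cost, print_wrmsr_extra_cost) are made up for illustration, and the
cycle counts are simply copied from the x86/bench output for this one machine.

```c
#include <stdio.h>

/* Cycle counts copied from the x86/bench output quoted above. */
struct msr_cost {
	unsigned int hot;	/* cache-hot cost, in RDTSC cycles */
	unsigned int cold;	/* cache-cold cost, in RDTSC cycles */
};

static const struct msr_cost rdmsr_cost = { 100, 148 };
static const struct msr_cost wrmsr_cost = { 456, 708 };

/*
 * Hypothetical helper: extra cycles paid when one RDMSR is replaced
 * by one WRMSR, computed separately for the hot and cold cases.
 */
static struct msr_cost wrmsr_extra_cost(void)
{
	struct msr_cost delta = {
		wrmsr_cost.hot - rdmsr_cost.hot,
		wrmsr_cost.cold - rdmsr_cost.cold,
	};

	return delta;
}

/* Hypothetical helper: print the delta in a human-readable form. */
static void print_wrmsr_extra_cost(void)
{
	struct msr_cost delta = wrmsr_extra_cost();

	printf("extra cost per WRMSR: %u (hot) / %u (cold) RDTSC cycles\n",
	       delta.hot, delta.cold);
}
```

Calling print_wrmsr_extra_cost() from a small main() reports the two deltas,
which is where the approximate 350-550 cycle range comes from.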

Thanks,

Ingo