Re: [tip:perfcounters/core] perf_counter: x86: Fix call-chain support to use NMI-safe methods

From: Mathieu Desnoyers
Date: Mon Jun 15 2009 - 17:02:56 EST


* Ingo Molnar (mingo@xxxxxxx) wrote:
>
> * Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxx> wrote:
>
> > In the category "crazy ideas one should never express out loud", I
> > could add the following. We could choose to save/restore the cr2
> > register on the local stack at every interrupt entry/exit, and
> > therefore allow the page fault handler to execute with interrupts
> > enabled.
> >
> > I have not benchmarked the interrupt disabling overhead of the
> > page fault handler handled by starting an interrupt-gated handler
> > rather than trap-gated handler, but cli/sti instructions are known
> > to take quite a few cycles on some architectures. e.g. 131 cycles
> > for the pair on P4, 23 cycles on AMD Athlon X2 64, 43 cycles on
> > Intel Core2.
>
> The cost on Nehalem (1 billion local_irq_save()+restore() pairs):
>
> aldebaran:~> perf stat --repeat 5 ./prctl 0 0
>
> Performance counter stats for './prctl 0 0' (5 runs):
>
> 10950.813461 task-clock-msecs # 0.997 CPUs ( +- 1.594% )
> 3 context-switches # 0.000 M/sec ( +- 0.000% )
> 1 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 145 page-faults # 0.000 M/sec ( +- 0.000% )
> 33946294720 cycles # 3099.888 M/sec ( +- 1.132% )
> 8030365827 instructions # 0.237 IPC ( +- 0.006% )
> 100933 cache-references # 0.009 M/sec ( +- 12.568% )
> 27250 cache-misses # 0.002 M/sec ( +- 3.897% )
>
> 10.985768499 seconds time elapsed.
>
> That's 33.9 cycles per iteration, with a 1.1% confidence factor.
>
> Annotation gives this result:
>
> 2.24 : ffffffff810535e5: 9c pushfq
> 8.58 : ffffffff810535e6: 58 pop %rax
> 10.99 : ffffffff810535e7: fa cli
> 20.38 : ffffffff810535e8: 50 push %rax
> 0.00 : ffffffff810535e9: 9d popfq
> 46.71 : ffffffff810535ea: ff c6 inc %esi
> 0.42 : ffffffff810535ec: 3b 35 72 31 76 00 cmp 0x763172(%rip),%e
> 10.69 : ffffffff810535f2: 7c f1 jl ffffffff810535e5
> 0.00 : ffffffff810535f4: e9 7c 01 00 00 jmpq ffffffff81053775
>
> i.e. pushfq+cli is roughly 42.19% or 14 cycles, the popfq is 46.71%
> or 16 cycles. So the combo cost is 30 cycles, +- 1 cycle.
>
> (Actual effective cost in a real critical section can be better than
> this, dependent on surrounding instructions.)
>
> It got quite a bit faster than Core2 - but still not as fast as AMD.
>
> Ingo
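
For anyone who wants to re-derive the per-instruction figures quoted above, here is a minimal sketch of the arithmetic. The numbers are copied from the perf output; the function and variable names are made up for illustration, and the grouping of annotation lines (samples for cli landing on the following push, samples for popfq landing on the following inc) is an assumption that simply follows the way the percentages are summed in the quoted text.

```python
# Figures copied from the quoted perf stat / annotate output.
TOTAL_CYCLES = 33_946_294_720   # "cycles" line of perf stat
ITERATIONS = 1_000_000_000      # 1 billion local_irq_save()+restore() pairs

# Sample percentages per annotated instruction, in program order.
SAVE_AND_DISABLE = [2.24, 8.58, 10.99, 20.38]  # pushfq, pop, cli, push lines
RESTORE = [46.71]                              # charged to popfq (shown on inc)


def cycles_per_iteration(total_cycles, iterations):
    """Average cycle cost of one loop iteration."""
    return total_cycles / iterations


def cost_in_cycles(percentages, cycles_per_iter):
    """Convert a group of sample percentages into a cycle estimate."""
    return sum(percentages) / 100.0 * cycles_per_iter


if __name__ == "__main__":
    per_iter = cycles_per_iteration(TOTAL_CYCLES, ITERATIONS)
    save = cost_in_cycles(SAVE_AND_DISABLE, per_iter)
    restore = cost_in_cycles(RESTORE, per_iter)
    print(f"cycles per iteration: {per_iter:.1f}")
    print(f"pushfq+cli:           {save:.1f}")
    print(f"popfq:                {restore:.1f}")
    print(f"combined:             {save + restore:.1f}")
```

Rounded to whole cycles this gives 14 + 16 = 30 out of roughly 34 per iteration; the rest is the loop overhead (cmp/jl).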

Interesting, but in our specific case, what would be even more
interesting to know is how many trap-gate entries/s vs. interrupt-gate
entries/s can be handled. This would tell us whether it is worth trying
to make the page fault handler interrupt-safe by means of atomicity and
context save/restore by the interrupt handlers (which would let us run
the PF handler with interrupts enabled).

Mathieu



--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68