Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support

From: Ingo Molnar
Date: Wed Dec 21 2011 - 09:33:12 EST



* Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:

> > I used the mmap_read_self() routine from your example as the
> > "read" performance that I measured.
>
> Yeah that's about it, if you want to discard the overload
> scenario, eg you use pinned counters or so, you can optimize
> it further by stripping out the tsc and scaling muck.

With the TSC scaling muck it's a bit above 50 cycles here.

Without that, by optimizing it further for pinned counters, the
overhead of mmap_read_self() gets down to 36 cycles on a Nehalem
box.
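For comparison, the full variant with the TSC scaling muck looks roughly
like this. Treat it as a sketch: the time_* fields are the ones this
patch series adds to struct perf_event_mmap_page, rdtsc() is the obvious
counterpart to the rdpmc() helper shown further below, and the final
scaling multiply is simplified (a robust version splits the arithmetic
to avoid 64-bit overflow):

static u64 mmap_read_self_scaled(void *addr)
{
	struct perf_event_mmap_page *pc = addr;
	u32 seq, idx, time_mult = 0, time_shift = 0;
	u64 count, cyc = 0, time_offset = 0;
	u64 enabled, running, delta;

	do {
		seq = pc->lock;
		barrier();

		enabled = pc->time_enabled;
		running = pc->time_running;

		if (enabled != running) {
			/* event was not on the PMU the whole time */
			cyc = rdtsc();
			time_offset = pc->time_offset;
			time_mult = pc->time_mult;
			time_shift = pc->time_shift;
		}

		idx = pc->index;
		count = pc->offset;
		if (idx)
			count += rdpmc(idx - 1);

		barrier();
	} while (pc->lock != seq);

	if (enabled != running) {
		/* extend enabled/running up to "now" via the TSC */
		delta = time_offset + ((cyc * time_mult) >> time_shift);
		enabled += delta;
		if (idx)
			running += delta;

		/* scale: count * enabled / running (simplified, can overflow) */
		count = count * enabled / running;
	}

	return count;
}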

A PEBS profile run of the pinned variant shows that 90% of the
overhead is in the RDPMC instruction:

 2.92 :      42a919:       lea    -0x1(%rax),%ecx
      :
      :      static u64 rdpmc(unsigned int counter)
      :      {
      :              unsigned int low, high;
      :
      :              asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
89.20 :      42a91c:       rdpmc
      :              count = pc->offset;
      :              if (idx)
      :                      count += rdpmc(idx - 1);
      :

So the RDPMC instruction alone is 32 cycles. The perf way of
reading it adds another 4 cycles to it.

So the measured 'perf overhead' is so ridiculously low that it's
virtually non-existent.
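For anyone wanting to reproduce this from userspace, the helpers the
examples here rely on are straightforward - roughly (the rdpmc() body
is the one from the annotation above, the rest are the obvious
stand-ins for the kernel types and the compiler barrier):

typedef unsigned int u32;
typedef unsigned long long u64;

#define barrier() asm volatile("" ::: "memory")

static u64 rdpmc(unsigned int counter)
{
	unsigned int low, high;

	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));

	return low | ((u64)high) << 32;
}

static u64 rdtsc(void)
{
	unsigned int low, high;

	asm volatile("rdtsc" : "=a" (low), "=d" (high));

	return low | ((u64)high) << 32;
}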

Here's the "pinned events" variant I've measured:

static u64 mmap_read_self(void *addr)
{
	struct perf_event_mmap_page *pc = addr;
	u32 seq, idx;
	u64 count;

	do {
		seq = pc->lock;
		barrier();

		idx = pc->index;
		count = pc->offset;
		if (idx)
			count += rdpmc(idx - 1);

		barrier();
	} while (pc->lock != seq);

	return count;
}
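To actually exercise it you need a self-monitoring event mapped in; a
minimal setup sketch along these lines (hypothetical, error handling
and portability glue omitted - .pinned = 1 is what makes the
stripped-down read above safe):

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *setup_pinned_counter(void)
{
	struct perf_event_attr attr = {
		.type	= PERF_TYPE_HARDWARE,
		.config	= PERF_COUNT_HW_CPU_CYCLES,
		.size	= sizeof(attr),
		.pinned	= 1,	/* always on the PMU, no multiplexing */
	};
	void *addr;
	int fd;

	/* measure ourselves: pid 0, any CPU, no group */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		return NULL;

	/* the first page is the control page (struct perf_event_mmap_page) */
	addr = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return NULL;

	return addr;
}

mmap_read_self() on the returned page then gives the raw count -
whether RDPMC is permitted from userspace at all is of course up to
the kernel side of this series.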

Thanks,

Ingo