Re: [patch][x86][hardcntr] Hardware counter per process support

Dean Gaudet (dgaudet-list-linux-kernel@arctic.org)
Thu, 27 Nov 1997 02:33:35 -0800 (PST)


On 26 Nov 1997, David Mentre wrote:

> Per task/process hardware counters support for PPro and
> PII. /proc/<pid> interface.

In addition to the comments from other folks, here's some more:

You should implement lazy setting of the counters, much like the way the
fpu is handled lazily. For example, on the ppro this means you only
rdpmc twice when switching away from a task with counters, and rdpmc
twice again before switching back into it the next time. Then you don't
pay wrmsr costs unless two tasks are using the counters with different
event selections. This may or may not help on other processors.

Note that rdtsc is available no matter what you do with ctr0 and ctr1,
so if you abstract event types, be sure to include a "time stamp" event,
and allow up to three counters to run, one of which is the tsc.
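
Reading the tsc needs no MSR setup at all; something like this (just a
sketch, the name rdtsc64 is made up, and the .byte sequence is the rdtsc
opcode in case the assembler doesn't know the mnemonic) would do for a
64-bit read:

/* sketch: read the 64-bit time stamp counter.  rdtsc returns the
 * low 32 bits in eax and the high 32 bits in edx. */
static inline unsigned long long rdtsc64(void)
{
	unsigned long lo, hi;

	__asm__ __volatile__(".byte 0x0f,0x31"	/* rdtsc */
			     : "=a" (lo), "=d" (hi));
	return ((unsigned long long) hi << 32) | lo;
}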

The intel counters are 40 bits (the tsc is 64), which means for some
settings you'll wrap them in around an hour (I think 20 minutes is about
the fastest you could wrap one, if you're counting "uops retired" and
somehow manage to retire 3 uops every cycle ;). So keeping a running
total is a good thing... and you should be able to do it such that you
never have to zero out the hardware counters, I think.

Yeah... try this. Ignore for the moment that gcc is really bad at
64-bit arithmetic on 32-bit boxes; this would have to be rewritten
in assembly.

... in schedule ...

#if CONFIG_HARDCNTR
	/* switching away: fold the current hardware counts into the
	 * outgoing task's running totals */
	if (prev->tss.hardcntr.is_counting) {
		prev->tss.hardcntr.cntr0_total += rdpmc(0);
		prev->tss.hardcntr.cntr1_total += rdpmc(1);
	}
#endif
	switch_to(prev,next);
#if CONFIG_HARDCNTR
	/* we resume here in the context of the task being switched
	 * back in, so prev now refers to that task.  ctrl0/ctrl1
	 * cache what's currently loaded in the EVNTSEL MSRs; change
	 * events only if necessary.  Subtracting the current hardware
	 * counts makes a later read (rdpmc() + total) come out right. */
	if (prev->tss.hardcntr.is_counting) {
		if (prev->tss.hardcntr.ctrl1 != ctrl1) {
			write_msr(EVNTSEL1, 0, prev->tss.hardcntr.ctrl1);
			ctrl1 = prev->tss.hardcntr.ctrl1;
		}
		if (prev->tss.hardcntr.ctrl0 != ctrl0) {
			write_msr(EVNTSEL0, 0, prev->tss.hardcntr.ctrl0);
			ctrl0 = prev->tss.hardcntr.ctrl0;
		}
		prev->tss.hardcntr.cntr0_total -= rdpmc(0);
		prev->tss.hardcntr.cntr1_total -= rdpmc(1);
	}
#endif

Then to read the counters you just do an rdpmc() and add the running
total to it. This means you pay 4 rdpmc's per context switch in the best
case, and an extra four wrmsr()s if you're running multiple tasks whose
counters are programmed differently.
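
i.e. something along these lines (a sketch only; hardcntr_read0 is a
made-up name, and adding in rdpmc(0) is only meaningful for the task
currently on the cpu):

#include <linux/sched.h>	/* task_struct, current */

/* sketch: return a running 64-bit count for counter 0 of a task,
 * given the bookkeeping done in schedule() above */
static unsigned long long hardcntr_read0(struct task_struct *tsk)
{
	unsigned long long total = tsk->tss.hardcntr.cntr0_total;

	if (tsk == current && tsk->tss.hardcntr.is_counting)
		total += rdpmc(0);
	return total;
}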

Oh yeah, on the ppro you can only set the lower 32 bits of a counter;
bits 32..39 get sign-extended from bit 31 when you write it. I seem to
recall the upper 24 bits coming back from rdpmc being the same as the
upper 24 bits of the tsc... which means they're useless and you need to
mask them off on read.
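
So the rdpmc() used above probably wants to do the masking itself. A
sketch (the .byte sequence is the rdpmc opcode, for assemblers that
don't know the mnemonic; the counter number goes in ecx):

/* sketch: read performance counter 'ctr' (0 or 1) and mask the
 * result down to the 40 architecturally defined bits */
static inline unsigned long long rdpmc(int ctr)
{
	unsigned long lo, hi;

	__asm__ __volatile__(".byte 0x0f,0x33"	/* rdpmc */
			     : "=a" (lo), "=d" (hi)
			     : "c" (ctr));
	return (((unsigned long long) hi << 32) | lo)
		& (((unsigned long long) 1 << 40) - 1);
}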

The kernel interface abstraction shouldn't worry about allocating one
counter or another to particular events... user libraries should deal with
getting that right. So, for example, the kernel abstraction should just
provide a way to configure counter 0, 1, 2, ... and a way to read each.
On intel, you *could* let user programs read via "rdpmc" and "rdtsc"...
but you'd have to fiddle cr4 so that they both trap and then emulate them.
This probably isn't a good idea. I think the interface would be better
done via a syscall than via /proc, for less overhead.
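
Roughly something like this, maybe (purely hypothetical names, just to
show the shape of the abstraction):

/* hypothetical interface sketch -- none of these names exist.  The
 * kernel just exposes numbered counters plus a "time stamp" event;
 * which hardware counter gets which event is a user-library problem. */
struct hardcntr_req {
	int		counter;	/* 0, 1, 2, ... */
	unsigned int	event;		/* abstract event type, incl. time stamp */
};

int hardcntr_config(pid_t pid, struct hardcntr_req *req);
int hardcntr_read(pid_t pid, int counter, unsigned long long *total);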

A way to hook up with SIGPROF would be cool... that way you could
build a modified gprof which shows you where in your program the
problem events occur. That works fine for single-process, single-threaded
programs; you need to abstract counting across clone() too :)

Dean
