Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

From: John Stultz
Date: Mon Nov 12 2012 - 13:53:58 EST


On 11/11/2012 12:32 PM, Stephane Eranian wrote:
On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <john.stultz@xxxxxxxxxx> wrote:
On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
Hi,

There are many situations where we want to correlate events happening at
the user level with samples recorded in the perf_event kernel sampling
buffer. For instance, we might want to correlate the call to a function or
the creation of a file with samples. Similarly, when we want to monitor a
JVM with jitted code, we need to be able to correlate jitted code mappings
with perf_event samples for symbolization.

Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
That causes each PERF_RECORD_SAMPLE to include a timestamp
generated by calling the local_clock() -> sched_clock_cpu() function.

To make correlating user vs. kernel samples easy, we would need to
access that sched_clock() functionality. However, none of the existing
clock calls permit this at this point. They all return timestamps which
do not use the same source and/or offset as sched_clock().

I believe a similar issue exists with the ftrace subsystem.

The problem needs to be addressed in a portable manner. Solutions
based on reading TSC for the user level to reconstruct sched_clock()
don't seem appropriate to me.

One possibility to address this limitation would be to extend
clock_gettime() with a new clock ID, e.g., CLOCK_PERF.
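
For illustration only, here is a minimal sketch of how user-level code might
timestamp its own events if such a clock ID existed. CLOCK_PERF is the
proposal under discussion, not an existing clock, so the sketch falls back to
CLOCK_MONOTONIC in order to build and run. The names clock_ns(),
user_event and stamp_user_event() are hypothetical helpers, not part of any
kernel or libc interface.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * ASSUMPTION: CLOCK_PERF is only a proposal in this thread. Fall back to
 * CLOCK_MONOTONIC so the sketch compiles. With the fallback, timestamps
 * are NOT in the same time domain as perf samples.
 */
#ifndef CLOCK_PERF
#define CLOCK_PERF CLOCK_MONOTONIC
#endif

/* Read the given clock and return its value in nanoseconds. */
static uint64_t clock_ns(clockid_t id)
{
	struct timespec ts;
	int ret = clock_gettime(id, &ts);

	assert(ret == 0);
	(void)ret;
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Hypothetical user-level event record, e.g. "jitted code mapped". */
struct user_event {
	uint64_t time_ns;	/* timestamp taken when the event happened */
	const char *what;	/* free-form description of the event */
};

/* Timestamp a user-level event so it can later be matched to samples. */
static struct user_event stamp_user_event(const char *what)
{
	struct user_event ev = { clock_ns(CLOCK_PERF), what };

	return ev;
}
```

A tool would call stamp_user_event() at the point of interest and later
compare time_ns with the timestamps carried by PERF_RECORD_SAMPLE records,
which is only meaningful if both come from the same time source.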

However, I understand that sched_clock_cpu() provides ordering guarantees
only when invoked on the same CPU repeatedly, i.e., it's not globally
synchronized. But we already have to deal with this problem when merging
samples obtained from different CPU sampling buffers in per-thread mode.
So this is not necessarily a showstopper.
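
To make the merging step concrete, here is a minimal sketch of a timestamp
merge across per-CPU buffers. It assumes each buffer is already ordered by
time on its own CPU, which is the guarantee described above. The struct
layouts and merge_samples() are hypothetical and much simpler than the real
perf ring buffer format. Because the clocks are not globally synchronized,
the resulting cross-CPU order is only approximate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified sample: a timestamp plus the CPU it came from. */
struct sample {
	uint64_t time_ns;
	int cpu;
};

/* One per-CPU buffer: samples sorted by time_ns, plus a read cursor. */
struct cpu_buf {
	const struct sample *s;
	size_t n;
	size_t pos;
};

/*
 * Merge nbufs per-CPU buffers into out, always taking the buffer whose
 * next sample has the smallest timestamp. Returns the number of samples
 * written. out must have room for all samples.
 */
static size_t merge_samples(struct cpu_buf *bufs, size_t nbufs,
			    struct sample *out, size_t out_cap)
{
	size_t written = 0;

	for (;;) {
		size_t best = nbufs;	/* nbufs means "nothing found yet" */

		for (size_t i = 0; i < nbufs; i++) {
			if (bufs[i].pos >= bufs[i].n)
				continue;	/* this buffer is drained */
			if (best == nbufs ||
			    bufs[i].s[bufs[i].pos].time_ns <
			    bufs[best].s[bufs[best].pos].time_ns)
				best = i;
		}
		if (best == nbufs)
			break;		/* all buffers drained */

		assert(written < out_cap);
		out[written++] = bufs[best].s[bufs[best].pos++];
	}
	return written;
}
```

User-level events timestamped with the same clock could be treated as one
more sorted input to the same merge.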

An alternative could be to use uprobes, but that's less practical to set up.

Anyone with better ideas?
You forgot to CC the time people ;-)

I've no problem with adding CLOCK_PERF (or another/better name).
Hrm. I'm not excited about exporting that sort of internal kernel details to
userland.

The behavior and expectations of sched_clock() have changed over the years,
so I'm not sure it's wise to export it, since we'd have to preserve its
behavior from then on.

It's not about just exposing sched_clock(). We need to expose a time source
that is exactly equivalent to what perf_event uses internally. If sched_clock()
changes, then perf_event clock will change too and so would that new time
source for clock_gettime(). As long as everything remains consistent, we are
good.

Sure, but I'm just hesitant to expose that sort of internal detail. If we change it later, it's not just perf_events that is affected, but any other applications that have come to depend on the particular behavior we expose. We can claim "that was never promised" but it still leads to a bad situation.

Also I worry that it will be abused in the same way that direct TSC access
is, where the seemingly better performance compared with the more
careful/correct CLOCK_MONOTONIC would cause developers to write fragile
userland code that will break when moved from one machine to the next.

The only goal for this new time source is correlating user-level samples
with kernel-level samples, i.e., application-level events with a PMU
counter overflow for instance. Anybody trying anything else would be on
their own.

clock_gettime(CLOCK_PERF): guaranteed to return the same time source as
that used by the perf_event subsystem to timestamp samples when
PERF_SAMPLE_TIME is requested in attr->sample_type.

I'm not familiar enough with perf's interfaces, but if you are going to make this clockid bound so tightly to perf, could you maybe export a perf timestamp from one of perf's interfaces rather than using the more generic clock_gettime() interface?



I'd probably rather perf output timestamps to userland using sane clocks
(CLOCK_MONOTONIC), rather than trying to introduce a new time domain to
userland. But I probably could be convinced I'm wrong.

Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances without
grabbing any locks because that would need to run from NMI context?
No, of course, which is why we have sched_clock. But I'm suggesting we consider changing what perf exports (via maybe interpolation/translation) to be CLOCK_MONOTONIC-ish.
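
As a rough illustration of what such a translation could look like (this is
only a sketch under stated assumptions, not a description of anything perf
does), one option is linear interpolation between two reference points where
both clocks were read at the same moment. The clock_pair struct and
perf_to_mono() are hypothetical names, and the sketch assumes the two clocks
drift linearly between the reference points.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical reference point: the perf clock and CLOCK_MONOTONIC
 * read at (approximately) the same moment, both in nanoseconds.
 */
struct clock_pair {
	uint64_t perf_ns;
	uint64_t mono_ns;
};

/*
 * Translate perf timestamp t into the CLOCK_MONOTONIC domain by linear
 * interpolation between reference points a (earlier) and b (later).
 * t must lie within [a->perf_ns, b->perf_ns].
 */
static uint64_t perf_to_mono(const struct clock_pair *a,
			     const struct clock_pair *b, uint64_t t)
{
	long double frac;

	assert(b->perf_ns > a->perf_ns);
	assert(b->mono_ns >= a->mono_ns);
	assert(t >= a->perf_ns && t <= b->perf_ns);

	/* Fraction of the perf interval that has elapsed at time t. */
	frac = (long double)(t - a->perf_ns) /
	       (long double)(b->perf_ns - a->perf_ns);

	/* Apply the same fraction to the monotonic interval, rounded. */
	return a->mono_ns +
	       (uint64_t)(frac * (long double)(b->mono_ns - a->mono_ns) + 0.5L);
}
```

The reference points would have to be captured outside NMI context, so the
translation could only happen after the fact, when samples are read back.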


I'm not strongly objecting here, I just want to make sure other alternatives are explored before we start giving applications another interface that depends on internal kernel behavior to hang themselves with. :)

thanks
-john
