Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

From: Brendan Gregg
Date: Fri Aug 05 2016 - 00:42:35 EST


On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
> On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
>> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
>>
>> > As for pmu tracepoints: if I were to instrument it (although I wasn't
>> > planning to), I'd put a tracepoint in perf_event_overflow() called
>> > "perf:perf_overflow", with the same arguments. That could then be used
>> > for all PMU overflow events, without needing to add specific
>> > tracepoints.
>>
>> Could we not teach BPF to replace event->overflow_handler and inject
>> itself there?
>>
>> We don't currently have nice interfaces for doing that, but it should be
>> possible to do I think. We already have the indirect function call, so
>> injecting ourself there has 0 overhead.

Sounds like a good idea, especially for things like struct
file_operations so that we can statically instrument file system
read/writes with zero non-enabled overhead, and not worry about high
frequency workloads (>10M events/sec).

These perf probes aren't high frequency, though, and the code is not
normally in use, so overhead should be much less of a concern.
Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
tracepoint code is still adding a mem read, conditional, and branch,
then that's not many instructions, especially considering the normal
use case of these perf functions: creating records and writing to a
perf ring buffer, then picking that up in user space by perf, then
either processing it live or writing to perf.data, back to the file
system, etc. It would be hard to benchmark the effect of adding a few
instructions to that path (and any results may be more sensitive to
cache line placement than the instructions).

The perf:perf_hrtimer probe point is also reading state mid-way
through a function, so it's not quite as simple as wrapping the
function pointer. I do like that idea, though, but for things like
struct file_operations.

>
> you're right. All makes sense. I guess I was too lazy to look into
> how to do it properly. Adding a tracepoint looked like quick and
> easy way to achieve the same.
> As far as api goes probably existing IOC_SET_BPF ioctl will do too.
> Currently overflow_handler is set at event alloc time. If we start
> changing it on the fly with atomic xchg(), afaik things shouldn't
> break, since each overflow_handler is run to completion and doesn't
> change global state, right?
>

How would it be implemented? I was thinking of adding explicit wrappers, eg:

static ssize_t
__ext4_file_write_iter_tracepoint(struct kiocb *iocb, struct iov_iter *from)
{
trace_ext4_write_iter(iocb, from);
ext4_file_write_iter(iocb, from);
}

Then switching in __ext4_file_write_iter_tracepoint() as needed.

I was wondering about doing this dynamically, but wouldn't that then
create another problem of maintenance -- we'd need to decorate such
code with a comment, at least, to say "this function is exposed as a
tracepoint".

If a dynamic approach is still desired, and the goal is zero
non-enabled overhead, then I'd also wonder if we could leverage
kprobes to do this. Imagine walking a file_operations to find the
addresses, and then kprobe-ing the addresses. Not the same (or
probably as efficient) as setting the function pointer, but, using
existing kprobes.

Brendan