Re: [PATCH 0/1] Support PERF_SAMPLE_READ with inherit_stat

From: Ben Gainey
Date: Sat Jan 20 2024 - 11:15:29 EST


On Fri, 2024-01-19 at 16:49 -0800, Namhyung Kim wrote:
> Hello,
>
> On Fri, Jan 19, 2024 at 8:39 AM Ben Gainey <ben.gainey@xxxxxxx>
> wrote:
> >
> > This change allows events to use PERF_SAMPLE READ with inherit so
> > long
> > as both inherit_stat and PERF_SAMPLE_TID are set.
> >
> > Currently it is not possible to use PERF_SAMPLE_READ with inherit.
> > This
> > restriction assumes the user is interested in collecting aggregate
> > statistics as per `perf stat`. It prevents a user from collecting
> > per-thread samples using counter groups from a multi-threaded or
> > multi-process application, as with `perf record -e '{....}:S'`.
> > Instead
> > users must use system-wide mode, or forgo the ability to sample
> > counter
> > groups. System-wide mode is often problematic as it requires
> > specific
> > permissions (no CAP_PERFMON / root access), or may lead to capture
> > of
> > significant amounts of extra data from other processes running on
> > the
> > system.
> >
> > Perf already supports the ability to collect per-thread counts with
> > `inherit` via the `inherit_stat` flag. This patch changes
> > `perf_event_alloc` relaxing the restriction to combine `inherit`
> > with
> > `PERF_SAMPLE_READ` so that the combination will be allowed so long
> > as
> > `inherit_stat` and `PERF_SAMPLE_TID` are enabled.
>
> I'm not sure if it's correct. Maybe I misunderstand inherit_stat but
> AFAIK it's just to use prev_task's events when next_task has the
> compatible event context. So the event values it sees in samples
> would depend on the timing or scheduler behavior.

I think you are referring to __perf_event_sync_stat as called from
perf_event_context_sched_out, but isn't the point that
perf_event_context_sched_out will just swap the tasks and RCU pointers
as an optmisation when the incoming context is the same as the outgoing
context (rather than stopping and restarting), and so when inherit_stat
is used, along with swaping the outgong and incoming task data in the
event, it also reads and then swaps the outgoing and incoming counter
values in the event so that the counter values that belong to the
outgoing event are correctly associated to that task.

Looking again at perf_event_context_sched_out though, one thing i note
is that the call to perf_event_sync_stat happens after the call to
perf_ctx_enable, which I think means two things:
* The pmu->read call in the sync will be operating on an actively
counting PMU, so the counts recorded in each event in the context may
be skewed relative to each other. This is not the case elsewhere so
that the events on the PMU are read in the disabled state.
* perc_ctx_enable (at least for arm_pmu) will not reload the the
incoming "remaining" period for any sampling events so its possible
that an overflow would happen sooner than expected in the incoming
context (though i don't think this will leave a corrupted count, just a
smaller than expected count for that sample).


>
> Also event counts and time values PERF_SAMPLE_READ sees
> include child event's so the values of the parent event can be
> updated even if it's inactive. And the values will vary for the
> next_task whether prev_task is the parent or not. I think it
> would return consistent values only if it iterates all child events
> and sums up the values like it does for read(2). But it cannot
> do that in the NMI handler.
>
> Frankly I don't understand how inherit_stat supports per-thread
> counts properly. Also it doesn't seem to be used by default in
> the perf tools.

Hmmm, ok rereading through core.c, i think the thing I have missed is
the interaction between perf_event_count and sync_child_event, which I
guess I missed when testing this because IIRC sync_child_event is only
called on migration, hotplug and task exit events.

I don't think it would be sensible to skip the update to child_count in
sync_child_event as this would break the behaviour of the `read()` on
an event's fd. I suppose the better thing to do would be to have
perf_output_read_group/_one avoid reading child_count for these kinds
of events. That would ensure the mmap sample is correct.


> IIUC per-thread count is supported when you
> don't set the inherit bit and open separate events for each
> thread but I guess that's not what you want.

You can take that approach... but I don't think it is pleasent :-).
If you have an application that spawns threads at runtime the tool must
somehow track them (perhaps by polling proc or using ptrace). Polling
is not ideal as it can miss things or introduce high overhead in the
tool. More importantly, creating a new event per thread can consume a
lot of file descriptors, particularly if you open per-thread-per-cpu
events to have them write to the same mmap.

FWIW I recently prototyped a version of this in Arm's profiler tools,
where we are also prototyping support for per-function metrics... On
something like a Gravaton 3, having 64 cores, where the full set of
published metrics for Neoverse-V1 contains something like 60 PMU
events, tracing a database application that creates a new thread per
connection, it spawned ~30 threads... the tool tried to create ~100k
perf events... hit ulimit and terminated. I realize this annecdote is
on the extreme end of things, but it is possible this would be a common
issue for large core-count server systems.

>
> Anyway, I'm ok with the idea of using PERF_SAMPLE_READ to
> improve per-thread profiling especially with event groups.
> But I think it should not use inherit_stat and it needs a way to
> not include child stats in the samples.
>
> What do you think?

I'm happy to find a different way to opt in to this instead of
`inherit_stat` if people prefer. Though I think if my understanding of
__perf_event_sync_stat described above is correct, then so long as i
fix the behaviour of perf_output_read_group/_one then I think the
sampled counts would be correct.


I can look at adding a new flag bit to opt in... this would probably
also eleviate some of Andi's concerns in the other part of this thread.
Open to suggestions otherwise...

Thanks
Ben


>
> Thanks,
> Namhyung
>
> >
> > In this configuration stream ids (such as may appear in the
> > read_format
> > field of a PERF_RECORD_SAMPLE) are no longer globally unique,
> > rather
> > the pair of (stream id, tid) uniquely identify each event. Tools
> > that
> > rely on this, for example to calculate a delta between samples,
> > would
> > need updating to take this into account. Previously valid event
> > configurations (system-wide, no-inherit and so on) where each
> > stream id
> > is the identifier are unaffected.
> >
> > This patch has been tested on aarch64 both my manual inspection of
> > the
> > output of `perf script -D` and through a modified version of Arm's
> > commercial profiling tools and the numbers appear to line up as one
> > would expect, but some further validation across other
> > architectures
> > and/or edge cases would be welcome.
> >
> > This patch was developed and tested on top of v6.7.
> >
> >
> > Ben Gainey (1):
> > perf: Support PERF_SAMPLE_READ with inherit_stat
> >
> > kernel/events/core.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > --
> > 2.43.0
> >

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.