perf/x86/intel: Collecting CPU-local performance counters from all cores in parallel

From: Michael Edwards
Date: Tue May 23 2017 - 01:42:58 EST


I'm working on a system-wide profiling tool that uses perf_event to
gather CPU-local performance counters (L2/L3 cache misses, etc.)
across all CPUs (hyperthreads) of a multi-socket system. We'd like
the monitoring process to run on a single core and to sample at
frequent, regular intervals (sub-millisecond), with minimal impact on
the tasks running on other CPUs. I've prototyped this using
perf_events (with one event group per CPU), and on a two-socket,
32-(logical)-CPU system the prototype reaches about 2,700 samples per
second per CPU, at which point it's spending about 30% of its time
inside the read() syscall. Optimizing the other 70% (the prototype
userland) looks fairly routine, so I'm looking at what it would take
to get beyond 10K samples per second.
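
For concreteness, here is a minimal sketch of the kind of per-CPU
setup described above. It is not the prototype's actual code: the
helper names (init_hw_attr, open_cpu_event, parse_group_read) are
made up for illustration, and a generic hardware event such as
PERF_COUNT_HW_CACHE_MISSES stands in for whatever L2/L3 events the
tool really programs. It assumes read_format is exactly
PERF_FORMAT_GROUP, so that one read() on a group leader returns a
count followed by that many values. Opening CPU-wide events (pid ==
-1) typically needs elevated privileges or a permissive
perf_event_paranoid setting.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: describe one generic hardware counter whose
 * group is read back in PERF_FORMAT_GROUP layout. */
void init_hw_attr(struct perf_event_attr *attr, uint64_t config)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = config;
    attr->read_format = PERF_FORMAT_GROUP;
}

/* Hypothetical helper: open a CPU-wide event (pid == -1) on `cpu`.
 * Pass group_fd == -1 to create a group leader, or the leader's fd to
 * add a sibling. Returns the new fd, or -1 with errno set. */
int open_cpu_event(int cpu, uint64_t config, int group_fd)
{
    struct perf_event_attr attr;

    init_hw_attr(&attr, config);
    return (int)syscall(SYS_perf_event_open, &attr, (pid_t)-1, cpu,
                        group_fd, 0UL);
}

/* Decode one group read: { u64 nr; u64 values[nr]; }. This layout
 * holds only when read_format == PERF_FORMAT_GROUP with no other
 * flags. Returns the number of values copied, or -1 if the buffer is
 * too short or `out` is too small. */
int parse_group_read(const uint64_t *buf, size_t len_bytes,
                     uint64_t *out, size_t max_out)
{
    assert(buf != NULL && out != NULL);

    if (len_bytes < sizeof(uint64_t))
        return -1;

    uint64_t nr = buf[0];
    if (nr > max_out || len_bytes / sizeof(uint64_t) < nr + 1)
        return -1;

    for (uint64_t i = 0; i < nr; i++)
        out[i] = buf[1 + i];
    return (int)nr;
}
```

With this shape, each sample costs one read() per CPU: the monitor
calls read() on every group leader fd in turn and feeds each buffer to
parse_group_read().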

I'm aware of the mmap()/RDPMC path to sampling counters from userland,
but I'd prefer not to go down that road: it involves mmap()ing all the
individual perf_event fds and reading each one from a userland task
running on the relevant core, which is needlessly intrusive on the
actual workload. The measured overhead of the IPI-dispatched
__perf_event_read() is acceptable; what's missing is a way to dispatch
it in parallel to all CPUs from a single read() syscall.

I've dug through the perf_event code and think I have a fair idea of
what it would take to implement a sort of "event meta-group" file.
Its read() handler would be equivalent to concatenating the read()
output of its member fds (per-CPU event group leaders), except that it
would only take the syscall / VFS indirection / locking / copy_to_user
overhead once, and would dispatch a single round of IPIs (with a
per-cpu array of cache-line-aligned struct perf_read_data arguments)
via on_each_cpu_mask(), thus effectively waiting in parallel on all
the responses. Implementing that is a bit tedious but it's just
plumbing -- except for the small matter of taking all the
perf_event_context::mutex locks in the right order. There is a natural
order (ascending mutex address, as in mutex_lock_double()), but
acquiring several dozen mutexes in every read() call may be
problematic.
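
To illustrate the address-ordered acquisition, here is a userspace
analogy using pthread mutexes. It is only a sketch of the ordering
idea, not kernel code: sort_locks, lock_all and unlock_all are
hypothetical names, and the kernel side would use struct mutex and
the lockdep nesting annotations instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: order mutex pointers by ascending address. */
static int cmp_mutex_addr(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)*(pthread_mutex_t *const *)a;
    uintptr_t y = (uintptr_t)*(pthread_mutex_t *const *)b;

    return (x > y) - (x < y);
}

/* Sort once, when the set of member locks changes, so the per-read
 * path does not have to. */
void sort_locks(pthread_mutex_t **locks, size_t n)
{
    qsort(locks, n, sizeof(*locks), cmp_mutex_addr);
}

/* Take every lock in ascending address order. The array must already
 * be sorted and free of duplicates. On failure, drops whatever was
 * taken and returns the pthread error code; returns 0 on success. */
int lock_all(pthread_mutex_t **locks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(i == 0 ||
               (uintptr_t)locks[i - 1] < (uintptr_t)locks[i]);

        int err = pthread_mutex_lock(locks[i]);
        if (err) {
            while (i--)
                pthread_mutex_unlock(locks[i]);
            return err;
        }
    }
    return 0;
}

/* Release in reverse order of acquisition. */
void unlock_all(pthread_mutex_t **locks, size_t n)
{
    while (n--)
        pthread_mutex_unlock(locks[n]);
}
```

Because every path that takes more than one of these locks uses the
same global order, no two such paths can deadlock against each other.
The cost is still n lock/unlock pairs per call, which is the concern
raised above.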

One could add a per-meta-group mutex, and add code to
perf_event_ctx_lock() (and other callers / variants of
perf_event_ctx_lock_nested()) that checks for meta-group membership
and takes the per-meta-group mutex before taking the ctx mutex. Then
the meta-group read() path only has to take this one mutex. That
means an event group can only be attached to one meta-group, but
that's probably okay. Still, it's fiddly code, what with the lock
nesting -- though I think it helps that we're dealing exclusively
with the group leaders for hardware events, so the move_group code
path in perf_event_open() isn't relevant.
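
The nesting I have in mind looks roughly like the following userspace
analogy (pthread mutexes again). The struct and function names are
hypothetical stand-ins, not existing kernel symbols, and the sketch
assumes meta-group membership does not change while a lock is held.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-in for the proposed meta-group object. */
struct meta_group {
    pthread_mutex_t mutex;
};

/* Stand-in for a per-CPU event context. */
struct ctx {
    pthread_mutex_t mutex;
    struct meta_group *meta;    /* NULL if not in a meta-group */
};

/* Analogue of a modified perf_event_ctx_lock(): take the outer
 * meta-group mutex first (if any), then the ctx mutex. */
void ctx_lock(struct ctx *c)
{
    assert(c != NULL);
    if (c->meta)
        pthread_mutex_lock(&c->meta->mutex);
    pthread_mutex_lock(&c->mutex);
}

/* Release in the opposite order: inner ctx mutex, then outer. */
void ctx_unlock(struct ctx *c)
{
    assert(c != NULL);
    pthread_mutex_unlock(&c->mutex);
    if (c->meta)
        pthread_mutex_unlock(&c->meta->mutex);
}

/* The meta-group read() path takes only the outer mutex. Every
 * ctx_lock() on a member context must pass through it first, so
 * holding it keeps all of them out. */
void meta_read_lock(struct meta_group *g)
{
    assert(g != NULL);
    pthread_mutex_lock(&g->mutex);
}

void meta_read_unlock(struct meta_group *g)
{
    assert(g != NULL);
    pthread_mutex_unlock(&g->mutex);
}
```

One consequence visible in this sketch: every ctx_lock() on a member
context now serializes on the shared meta-group mutex, even when the
contexts belong to different CPUs.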

Am I going about this wrong? Is there some better way to pursue the
high-level goal of gathering PMC-based statistics frequently and
efficiently from all cores, without breaking everything else that uses
perf_events?