Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for Family 19h support

From: Stephane Eranian
Date: Wed Mar 18 2020 - 17:26:32 EST


On Wed, Mar 18, 2020 at 1:43 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Mar 18, 2020 at 09:46:41AM -0500, Kim Phillips wrote:
>
> > > But this does not work with the cpumask programmed for the amd_l3 PMU. This mask
> > > shows, as it should, one CPU/CCX. So that means that when I do:
> > >
> > > $ perf stat -a amd_l3/event=llc_event/
> > >
> > > This only collects on the CPUs listed in the cpumask: 0,4,8,12 ....
> > > That means that L3 events generated by the other CPUs on the CCX are
> > > not monitored.
> > > I can easily see the problem by pinning a memory bound program to
> > > CPU64, for instance.
> >
> > Right, the higher level code calls the driver with a single cpu==0
> > call if the perf tool is invoked with a simple -a style system-wide.

No, it does not.

With -a, when -C is not passed, the perf tool picks up the cpumask for
the PMU from sysfs:
$ cat /sys/devices/amd_l3/cpumask

You can easily verify this by running:
$ strace -etrace=perf_event_open perf stat -a -e amd_l3/event=0x00/
This is the default, common mode.
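
To make the mechanism concrete, here is a rough sketch of the lookup
described above: read the PMU's cpumask from sysfs and expand it into
the list of CPUs on which events would be opened. This is not the perf
tool's actual code; the function names are hypothetical, and the path
assumes the standard /sys/bus/event_source/devices/<pmu>/cpumask layout.

```python
def parse_cpulist(text):
    """Expand a sysfs-style CPU list such as "0,4,8-10" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a trailing comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_pmu_cpumask(pmu="amd_l3", root="/sys/bus/event_source/devices"):
    """Return the CPUs listed in the PMU's cpumask file (hypothetical helper)."""
    with open(f"{root}/{pmu}/cpumask") as f:
        return parse_cpulist(f.read())
```

With -a and no -C, events are opened only on the CPUs such a list
contains, which is why only one CPU per CCX shows up in the strace
output.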

The problem is that, to get any meaningful result here, you need to
force a -C. The CPU in the cpumask is just the CPU to which the event is
attached in order to access the correct uncore PMU. Here, you have one
CPU per CCX, which is expected and perfectly fine.

The thread_mask is a hardware filter on the uncore L3 PMU. If you set
the thread_mask to 0xff by default, then you obtain a full system view
with a simple -a, or a per-socket view with --per-socket. So we need to
find a way to make this common case work properly first. Expecting
users to know that for some amd_l3 events they need to force -C 0-255
is not practical. I also think that forcing the cpumask to 0-255 is not
the right solution. This is not how it is done for any other uncore PMU
I know of, and some do have a thread filter, such as the Skylake CHA.
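
As a toy illustration of the filtering argument (not the real register
layout), assume each CCX has 8 hardware threads, that CPUs are numbered
contiguously within a CCX, and that bit i of the mask enables counting
for thread i of every CCX. All names and the numbering scheme below are
assumptions made for the sketch.

```python
THREADS_PER_CCX = 8  # assumption for this toy model, not read from hardware


def counted_cpus(num_cpus, thread_mask):
    """Return the CPUs whose L3 events pass the filter in the toy model.

    Bit i of thread_mask is assumed to enable thread i of each CCX, with
    CPUs numbered contiguously inside a CCX (a simplification).
    """
    return [
        cpu
        for cpu in range(num_cpus)
        if thread_mask & (1 << (cpu % THREADS_PER_CCX))
    ]
```

In this model a mask that selects only the attaching thread (0x01)
counts one CPU per CCX, while 0xff counts every CPU, so a plain -a
already gives the full system view without forcing -C.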



> > If the tool is invoked with supplemental switches to -a, like -C 0-255,
> > and -A, the driver gets called multiple times with all the unique cpu
> > values. The latter is the expected invocation style when measuring
> > a benchmark pinned on a subset of cpus, i.e., when evaluating
> > the driver, and is the more deterministic behaviour for the driver
> > to have, given it cannot tell the difference otherwise.
>
> That seems to suggest it is all horribly broken.