Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for Family 19h support

From: Kim Phillips
Date: Mon Mar 23 2020 - 16:50:44 EST




On 3/18/20 4:26 PM, Stephane Eranian wrote:
> On Wed, Mar 18, 2020 at 1:43 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>
>> On Wed, Mar 18, 2020 at 09:46:41AM -0500, Kim Phillips wrote:
>>
>>>> But this does not work with the cpumask programmed for the amd_l3 PMU. This mask
>>>> shows, as it should, one CPU/CCX. So that means that when I do:
>>>>
>>>> $ perf stat -a -e amd_l3/event=llc_event/
>>>>
>>>> This only collects on the CPUs listed in the cpumask: 0,4,8,12 ....
>>>> That means that L3 events generated by the other CPUs on the CCX are
>>>> not monitored.
>>>> I can easily see the problem by pinning a memory bound program to
>>>> CPU64, for instance.
>>>
>>> Right, the higher level code calls the driver with a single cpu==0
>>> call if the perf tool is invoked with a simple -a style system-wide.
>
> No, it does not.
>
> With -a, when -C is not passed, the perf tool picks up the cpumask for
> the PMU from sysfs:
> $ cat /sys/devices/amd_l3/cpumask
>
> You can easily verify this by running: strace -etrace=perf_event_open
> perf stat -a -e amd_l3/event=0x00/.
> This is the default common mode.

What I meant was that with -a, the driver only gets called with the
'base' cpu for each L3 PMU domain, i.e., 0, 4, 8, and so on. With -C, it
gets called with all the CPUs the user specifies. These are different
behaviours, yet the driver can't tell the difference between, e.g., -a
and -C 0,4,8, etc.
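
To make the point above concrete, here is a minimal sketch in Python. It is
not perf's actual code: parse_cpu_list() and cpus_seen_by_driver() are
hypothetical names, and the selection rule (PMU cpumask for plain -a, the
user's list when -C is given) is only a model of the behaviour described
in this thread.

```python
def parse_cpu_list(text):
    """Parse a sysfs-style CPU list such as "0,4,8-9" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_seen_by_driver(pmu_cpumask, user_cpu_list=None):
    """Return the CPUs for which the driver would be asked to set up an event.

    Assumption: plain -a uses the PMU's cpumask, while -C replaces it with
    the list supplied by the user.
    """
    if user_cpu_list is None:
        return parse_cpu_list(pmu_cpumask)
    return parse_cpu_list(user_cpu_list)


# Plain -a and an explicit -C 0,4,8,12 yield the same sequence of CPUs,
# so the driver has nothing to tell the two invocations apart.
same = cpus_seen_by_driver("0,4,8,12") == cpus_seen_by_driver("0,4,8,12", "0,4,8,12")
```

In this model the driver only ever sees a cpu number per call, which is why
it cannot infer whether the user asked for a system-wide view or for
specific CPUs.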

> The problem is that here to get any meaningful result, you need to force a -C.
> The CPU in the cpumask is just the CPU to which to attach the event in
> order to access the correct uncore PMU.
> Here, you have one CPU per CCX which is expected and perfectly fine.
>
> The thread_mask is a hardware filter on the uncore L3 PMU. If you set
> by default the thread_mask to 0xff, then
> you obtain a full system view with a simple -a, or per socket with
> --per-socket. So we need to find a way to
> make this common case work properly first. Expecting the users to know

OK, I'll send a patch to revert the thread filter feature until the above
issue is addressed.
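
For reference, the effect of the thread_mask filter discussed above can be
illustrated with a toy model in Python. Everything here is an assumption
made for illustration: the 8-thread CCX, the function names, and the idea
that bit t of the mask enables counting for thread t within the CCX. It
does not reflect the real register layout.

```python
THREADS_PER_CCX = 8  # assumption: 4 cores x 2 SMT threads per CCX


def counted_events(per_thread_events, thread_mask):
    """Sum the events of the threads whose bit is set in thread_mask."""
    return sum(
        count
        for thread, count in enumerate(per_thread_events)
        if thread_mask & (1 << thread)
    )


def mask_for_thread(thread):
    """Hypothetical filter that selects a single thread of the CCX."""
    return 1 << thread


# A memory-bound program pinned to the last thread of the CCX generates
# most of the L3 traffic; thread 0 is the one the event is attached to.
events = [10, 0, 0, 0, 0, 0, 0, 500]

full_view = counted_events(events, 0xFF)               # all threads counted
one_thread = counted_events(events, mask_for_thread(0))  # only thread 0
```

With the mask at 0xff the count covers the whole CCX, while a mask derived
from the single attached thread misses the traffic generated elsewhere on
the CCX, which is the undercounting described above.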

> that for some amd_l3 events you need
> to force -C 0-255 is not practical. I also think that forcing the
> cpumask to 0-255 is not the right solution. This is not how
> this is done for any other uncore PMU I know of and some do have the
> thread filter, such as the Skylake CHA.

Odd, the Intel uncore driver's cpumask is 0, so I'm not sure whether
AMD's driver is right to set it any more...

Thanks,

Kim

>>> If the tool is invoked with supplemental switches to -a, like -C 0-255,
>>> and -A, the driver gets called multiple times with all the unique cpu
>>> values. The latter is the expected invocation style when measuring
>>> a benchmark pinned on a subset of cpus, i.e., when evaluating
>>> the driver, and is the more deterministic behaviour for the driver
>>> to have, given it cannot tell the difference otherwise.
>>
>> That seems to suggest it is all horribly broken.