Re: [PATCH 1/3 v2] perf/amd/uncore: Prepare L3 thread mask code for Family 19h support

From: Kim Phillips
Date: Wed Mar 18 2020 - 10:47:05 EST


On 3/17/20 9:09 PM, Stephane Eranian wrote:
> On Fri, Mar 13, 2020 at 4:10 PM Kim Phillips <kim.phillips@xxxxxxx> wrote:
>> +++ b/arch/x86/events/amd/uncore.c
>> @@ -180,6 +180,20 @@ static void amd_uncore_del(struct perf_event *event, int flags)
>>  	hwc->idx = -1;
>>  }
>>
>> +/*
>> + * Convert logical cpu number to L3 PMC Config ThreadMask format
>> + */
>> +static u64 l3_thread_slice_mask(int cpu)
>> +{
>> +	int thread = 2 * (cpu_data(cpu).cpu_core_id % 4);
>> +
>> +	if (smp_num_siblings > 1)
>> +		thread += cpu_data(cpu).apicid & 1;
>> +
>> +	return (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
>> +		AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
>> +}
>> +
>>  static int amd_uncore_event_init(struct perf_event *event)
>>  {
>>  	struct amd_uncore *uncore;
>> @@ -209,15 +223,8 @@ static int amd_uncore_event_init(struct perf_event *event)
>>  	 * SliceMask and ThreadMask need to be set for certain L3 events in
>>  	 * Family 17h. For other events, the two fields do not affect the count.
>>  	 */
>> -	if (l3_mask && is_llc_event(event)) {
>> -		int thread = 2 * (cpu_data(event->cpu).cpu_core_id % 4);
>> -
>> -		if (smp_num_siblings > 1)
>> -			thread += cpu_data(event->cpu).apicid & 1;
>> -
>> -		hwc->config |= (1ULL << (AMD64_L3_THREAD_SHIFT + thread) &
>> -				AMD64_L3_THREAD_MASK) | AMD64_L3_SLICE_MASK;
>> -	}
>> +	if (l3_mask && is_llc_event(event))
>> +		hwc->config |= l3_thread_slice_mask(event->cpu);
>>
> By looking at this code, I realized that even on Zen2 this is wrong.
> It does not work well.
> You are basically saying that the L3 event is tied to the CPU the
> event is programmed to.
> But this does not work with the cpumask programmed for the amd_l3 PMU. This mask
> shows, as it should, one CPU/CCX. So that means that when I do:
>
> $ perf stat -a amd_l3/event=llc_event/
>
> This only collects on the CPUs listed in the cpumask: 0,4,8,12 ....
> That means that L3 events generated by the other CPUs on the CCX are
> not monitored.
> I can easily see the problem by pinning a memory bound program to
> CPU64, for instance.

Right, the higher-level code calls the driver once, with cpu == 0,
if the perf tool is invoked with a plain system-wide -a. If the tool
is invoked with switches in addition to -a, like -C 0-255 and -A,
the driver gets called multiple times, once for each unique cpu
value. The latter is the expected invocation style when measuring
a benchmark pinned on a subset of cpus, i.e., when evaluating
the driver, and is the more deterministic behaviour for the driver
to have, given that it cannot otherwise tell the two cases apart.

> I think the thread mask should be exposed to the user. If not
> specified, then set the mask to cover all CPUs of the CCX. That way
> you can pick and choose what you want. And with one event/CCX you
> can monitor for all CPUs. I can send a patch that does that.

Do you mean something that will allow the user to do something
like this?:

perf stat -a amd_l3/event=llc_event,core=X,thread_mask={1,2,3}/

Wouldn't users rather specify cpus using -C etc.?

> With what you have now, you have to force the list of CPUs with -C to
> work around the cpumask. And forcing the cpumask to 0-255 does not
> make sense because not all L3 events necessarily need the L3 mask, so
> you don't want to program them on all CPUs especially with 8 cpus/CCX
> and only 6 counters.

Is it not possible for those to be run in separate invocations
that use the simple system-wide case, e.g., -a?

How would adding core=X,thread_mask={1,2,3} specification
change the -C invocation behaviour?

I thought of having the driver set all CPUs in the threadmask
if invoked with cpu == 0, but that means one can no longer
specify -C 0,4,8, etc.
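
To make the two options concrete, here is a rough user-space sketch, not
driver code. The L3_* constants are my assumption of the Family 17h
SliceMask/ThreadMask field positions and should be checked against the
AMD64_L3_* definitions in the kernel headers. Both function names are
hypothetical, and the core id, APIC id and SMT state are passed in
explicitly instead of being read from cpu_data().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout; verify against the kernel's AMD64_L3_* macros. */
#define L3_SLICE_SHIFT	48
#define L3_SLICE_MASK	(0xFULL << L3_SLICE_SHIFT)
#define L3_THREAD_SHIFT	56
#define L3_THREAD_MASK	(0xFFULL << L3_THREAD_SHIFT)

/*
 * Same arithmetic as l3_thread_slice_mask() in the patch: select the
 * single ThreadMask bit of the given cpu, and enable all slices.
 */
static uint64_t l3_mask_one_thread(int core_id, int apicid, int smt_enabled)
{
	int thread;

	assert(core_id >= 0 && apicid >= 0);

	/* Two ThreadMask bits per core, four cores per CCX. */
	thread = 2 * (core_id % 4);

	/* With SMT, the low APIC id bit selects the sibling thread. */
	if (smt_enabled)
		thread += apicid & 1;

	return ((1ULL << (L3_THREAD_SHIFT + thread)) & L3_THREAD_MASK) |
		L3_SLICE_MASK;
}

/*
 * Hypothetical alternative: set every ThreadMask bit so that a single
 * event counts for all threads sharing the L3.
 */
static uint64_t l3_mask_all_threads(void)
{
	return L3_THREAD_MASK | L3_SLICE_MASK;
}
```

With the first helper, each event only counts for the one cpu it was
opened on. With the second, one event per CCX covers all of its threads,
but the driver then has no way to honour a narrower -C list.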

Thanks,

Kim