Re: [PATCH 2/6] perf: Support branch events logging

From: Liang, Kan
Date: Fri Apr 14 2023 - 09:35:45 EST




On 2023-04-14 6:38 a.m., Peter Zijlstra wrote:
> On Mon, Apr 10, 2023 at 01:43:48PM -0700, kan.liang@xxxxxxxxxxxxxxx wrote:
>> From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
>>
>> With the cycle time information between branches, stalls can be easily
>> observed. But it's difficult to explain what causes the long delay.
>>
>> Add a new field to collect the occurrences of events since the last
>> branch entry, which can be used to provide some causality information
>> for the cycle time values currently recorded in branches.
>>
>> Add a new branch sample type to indicate whether include occurrences of
>> events in branch info.
>>
>> Only support up to 4 events with saturating at value 3.
>> In the current kernel, the events are ordered by either the counter
>> index or the enabling sequence. But none of the order information is
>> available to the user space tool.
>> Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, and generic
>> support to dump the event IDs of the branch events.
>> Add a helper function to detect the branch event flag.
>> These will be used in the following patch.
>
> I'm having trouble reverse engineering this. Can you more coherently
> explain this feature and how you've implemented it?

Sorry for that.

The feature is an enhancement of ARCH LBR. It adds new fields in the
LBR_INFO MSRs to log the occurrences of events on the first 4 GP
counters. Working together with the existing timed LBR feature, the
user can understand not only the latency between two LBR blocks, but
also which events cause the stall.

The spec can be found in the latest Intel® Architecture Instruction Set
Extensions and Future Features, v048, Chapter 8.4.
https://cdrdv2.intel.com/v1/dl/getContent/671368

To support the feature, there are three main ABI changes.
- A new branch sample type, PERF_SAMPLE_BRANCH_EVENT, is used as a knob
to enable the feature.
- Extend the struct perf_branch_entry layout, because we have to save
and pass the occurrences of events to user space. Since the feature
only covers 4 counters and each count saturates at value 3, the new
field only occupies 8 bits (2 bits per counter). For the current Intel
implementation, the order is the order of counters.
- Add a new PERF_SAMPLE format, PERF_SAMPLE_BRANCH_EVENT_IDS, to dump
the order information. The user space tool doesn't know the order of
the counters, so it cannot map the new fields in struct
perf_branch_entry to a specific event. We have to dump the order
information.
I once considered using the enabling order to avoid this new sample
format. It works for some cases, e.g., a group. But it doesn't work for
some complex cases, e.g., multiplexing, in which the enabling order
keeps changing.
Ideally, we should dump the order information for each LBR entry. But
that would include too much duplicate information. So the order
information is only dumped once per sample. The drawback is that we
have to flush/update old LBR entries once the events are rescheduled
between samples, e.g., by multiplexing, because the new sample could
otherwise still see the stale LBR entries. That's specially handled in
the next Intel specific patch.
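
To make the layout above more concrete, here is a rough user-space
sketch (not the code from the patch) of how a tool could unpack the
8-bit field of one branch entry and pair it with the event IDs dumped
once per sample. The helper names and macros are made up for the
example, and putting counter 0 in the lowest 2 bits is an assumption;
the real bit placement is whatever the patch and the spec define.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: 4 counter slots, 2 bits each, slot 0 in the low bits. */
#define BR_EVENT_NR	4
#define BR_EVENT_BITS	2
#define BR_EVENT_MASK	((1u << BR_EVENT_BITS) - 1)	/* 0x3, the saturation value */

/* Hypothetical helper: occurrence count logged for counter slot idx. */
static unsigned int br_event_count(uint8_t events, unsigned int idx)
{
	return (events >> (idx * BR_EVENT_BITS)) & BR_EVENT_MASK;
}

/*
 * Hypothetical helper: print the counts of one branch entry next to the
 * per-sample event IDs. ids[i] is assumed to describe counter slot i.
 */
static void br_event_print(uint8_t events, const uint64_t *ids,
			   unsigned int nr_ids)
{
	for (unsigned int i = 0; i < nr_ids && i < BR_EVENT_NR; i++) {
		unsigned int cnt = br_event_count(events, i);

		/* A saturated count only tells us "3 or more". */
		printf("event id %llu: %s%u\n",
		       (unsigned long long)ids[i],
		       cnt == BR_EVENT_MASK ? ">=" : "", cnt);
	}
}
```

Since the IDs are only dumped once per sample, every branch entry in
that sample is decoded against the same ids[] array, which is why old
entries must not survive a reschedule.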

For the current implementation, the perf tool has to set both
PERF_SAMPLE_BRANCH_EVENT and PERF_SAMPLE_BRANCH_EVENT_IDS to enable the
feature.

Thanks,
Kan