Re: [PATCH V16 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework

From: Mark Rutland
Date: Wed Feb 21 2024 - 12:25:23 EST


Hi Anshuman,

On Thu, Jan 25, 2024 at 03:11:14PM +0530, Anshuman Khandual wrote:
> Branch stack sampling support i.e capturing branch records during execution
> in core perf, rides along with normal HW events being scheduled on the PMU.
> This prepares ARMV8 PMU framework for branch stack support on relevant PMUs
> with required HW implementation.

Please can we start a bit more clearly, e.g.

| drivers: perf: arm_pmu: add instructure for branch stack sampling
|
| In order to support the Branch Record Buffer Extension (BRBE), we need to
| extend the arm_pmu framework with some basic infrastructure for branch stack
| sampling which arm_pmu drivers can opt-in to using. Subsequent patches will
| use this to add support for BRBE in the PMUv3 driver.

> ARMV8 PMU hardware support for branch stack sampling is indicated via a new
> feature flag called 'has_branch_stack' that can be ascertained via probing.
> This modifies current gate in armpmu_event_init() which blocks branch stack
> sampling based perf events unconditionally. Instead allows such perf events
> getting initialized on supporting PMU hardware.

This paragraph can be deleted. The addition of 'has_branch_stack' and its use
in armpmu_event_init() is trivial and obvious in-context, and this distracts
from the important parts of this patch.

> Branch stack sampling is enabled and disabled along with regular PMU events
> . This adds required function callbacks in armv8pmu_branch_xxx() format, to
> drive the PMU branch stack hardware when supported. This also adds fallback
> stub definitions for these callbacks for PMUs which would not have required
> support.

Those additions to the PMUv3 driver should all be in the next patch.

We don't add anything for the other PMU drivers that don't support branch
sampling, so why do we need to do *anything* to the PMUv3 driver here, given we
add the support in the next patch? Those additions only make this patch bigger
and more confusing (and hence more painful to review).

> If a task gets scheduled out, the current branch records get saved in the
> task's context data, which can be later used to fill in the records upon an
> event overflow. Hence, we enable PERF_ATTACH_TASK_DATA (event->attach_state
> based flag) for branch stack requesting perf events. But this also requires
> adding support for pmu::sched_task() callback to arm_pmu.

I think what this is trying to say is:

| With BRBE, the hardware records branches into a hardware FIFO, which will be
| sampled by software when perf events overflow. A task may be context-switched
| an arbitrary number of times between overflows, and to avoid losing samples
| we need to save the current records when a task is context-switched out. To
| do these we'll need to use the pmu::sched_task() callback, and we'll need to
| allocate some per-task storage space using PERF_ATTACH_TASK_DATA.

> Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
> Cc: Will Deacon <will@xxxxxxxxxx>
> Cc: Mark Rutland <mark.rutland@xxxxxxx>
> Cc: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Signed-off-by: Anshuman Khandual <anshuman.khandual@xxxxxxx>
> ---
> Changes in V16:
>
> - Renamed arm_brbe.h as arm_pmuv3_branch.h
> - Updated perf_sample_save_brstack()'s new argument requirements with NULL
>
> drivers/perf/arm_pmu.c | 57 ++++++++++++-
> drivers/perf/arm_pmuv3.c | 141 +++++++++++++++++++++++++++++++-
> drivers/perf/arm_pmuv3_branch.h | 50 +++++++++++
> include/linux/perf/arm_pmu.h | 29 ++++++-
> include/linux/perf/arm_pmuv3.h | 1 -
> 5 files changed, 273 insertions(+), 5 deletions(-)
> create mode 100644 drivers/perf/arm_pmuv3_branch.h
>
> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> index 8458fe2cebb4..16f488ae7747 100644
> --- a/drivers/perf/arm_pmu.c
> +++ b/drivers/perf/arm_pmu.c
> @@ -317,6 +317,15 @@ armpmu_del(struct perf_event *event, int flags)
> struct hw_perf_event *hwc = &event->hw;
> int idx = hwc->idx;
>
> + if (has_branch_stack(event)) {
> + WARN_ON_ONCE(!hw_events->brbe_users);
> + hw_events->brbe_users--;
> + if (!hw_events->brbe_users) {
> + hw_events->brbe_context = NULL;
> + hw_events->brbe_sample_type = 0;
> + }
> + }
> +

If this is going to leak into the core arm_pmu code, use "branch_stack" rather
than "brbe" for these field names.

However, I reckon we could just have two new callbacks on arm_pmu:

branch_stack_add(struct perf_event *event, ...);
branch_stack_del(struct perf_event *event, ...);

.. and hide all of the details in the PMUv3 (or BRBE) driver for now, and the
code above can just do:

if (has_branch_stack(event))
branch_stack_del(event, ...);

.. and likewise in armpmu_add().

That way the actuel management logic for the context and so on can be added in
the next patch, where the lifetime would be *much* clearer.

> armpmu_stop(event, PERF_EF_UPDATE);
> hw_events->events[idx] = NULL;
> armpmu->clear_event_idx(hw_events, event);
> @@ -333,6 +342,38 @@ armpmu_add(struct perf_event *event, int flags)
> struct hw_perf_event *hwc = &event->hw;
> int idx;
>
> + if (has_branch_stack(event)) {
> + /*
> + * Reset branch records buffer if a new CPU bound event
> + * gets scheduled on a PMU. Otherwise existing branch
> + * records present in the buffer might just leak into
> + * such events.
> + *
> + * Also reset current 'hw_events->brbe_context' because
> + * any previous task bound event now would have lost an
> + * opportunity for continuous branch records.
> + */

Doesn't this mean some user silently loses events? Why is that ok?

> + if (!event->ctx->task) {
> + hw_events->brbe_context = NULL;
> + if (armpmu->branch_reset)
> + armpmu->branch_reset();
> + }
> +
> + /*
> + * Reset branch records buffer if a new task event gets
> + * scheduled on a PMU which might have existing records.
> + * Otherwise older branch records present in the buffer
> + * might leak into the new task event.
> + */
> + if (event->ctx->task && hw_events->brbe_context != event->ctx) {
> + hw_events->brbe_context = event->ctx;
> + if (armpmu->branch_reset)
> + armpmu->branch_reset();
> + }

Same question here.

How does this work on other architectures?

What do we do if the CPU-bound and task-bound events want different filters,
etc?

This is the sort of gnarly detail that should be explained (or at least
introduced) in the commit message.

> + hw_events->brbe_users++;
> + hw_events->brbe_sample_type = event->attr.branch_sample_type;

What exactly is brbe_sample_type, and why does it get overriden *every time* we
add a new event? What happens when events have different values for
brbe_sample_type? Or is that forbidden somehow?

> + }
> +
> /* An event following a process won't be stopped earlier */
> if (!cpumask_test_cpu(smp_processor_id(), &armpmu->supported_cpus))
> return -ENOENT;

Unless this cpumask check has been made redundant, it means that the code above
it is obviously wrong, since that pokes the BRBE HW and increments brbe_users
*before* we decide whether the event can be installed on this CPU. That'll blow
up on big.LITTLE, e.g. we try and install a 'big' CPU event on a 'little' CPU,
poke the BRBE HW and increment brbe_users, then *after* that we abort
installing the event.

Even ignoring big.LITTLE, we can fail immediately after this when we don't have
enough counters, since the following code is:

| /* An event following a process won't be stopped earlier */
| if (!cpumask_test_cpu(smp_processor_id(), &armpmu->supported_cpus))
| return -ENOENT;
|
| /* If we don't have a space for the counter then finish early. */
| idx = armpmu->get_event_idx(hw_events, event);
| if (idx < 0)
| return idx;

.. which'll go wrong if you try to open 1 more event than the CPU has
counters.

> @@ -511,13 +552,24 @@ static int armpmu_event_init(struct perf_event *event)
> !cpumask_test_cpu(event->cpu, &armpmu->supported_cpus))
> return -ENOENT;
>
> - /* does not support taken branch sampling */
> - if (has_branch_stack(event))
> + /*
> + * Branch stack sampling events are allowed
> + * only on PMU which has required support.
> + */
> + if (has_branch_stack(event) && !armpmu->has_branch_stack)
> return -EOPNOTSUPP;
> return __hw_perf_event_init(event);
> }
>

I think we can delete the comment entirely here, but the code itself looks
fine here.

> +static void armpmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
> +{
> + struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
> +
> + if (armpmu->sched_task)
> + armpmu->sched_task(pmu_ctx, sched_in);
> +}

This looks fine.

> static void armpmu_enable(struct pmu *pmu)
> {
> struct arm_pmu *armpmu = to_arm_pmu(pmu);
> @@ -864,6 +916,7 @@ struct arm_pmu *armpmu_alloc(void)
> }
>
> pmu->pmu = (struct pmu) {
> + .sched_task = armpmu_sched_task,
> .pmu_enable = armpmu_enable,
> .pmu_disable = armpmu_disable,
> .event_init = armpmu_event_init,
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index 23fa6c5da82c..9e17764a0929 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -26,6 +26,7 @@
> #include <linux/nmi.h>
>
> #include <asm/arm_pmuv3.h>
> +#include "arm_pmuv3_branch.h"

As above, I do not thing that the PMUv3 driver should change at all in this
patch. As of this patch it achieves nothing, and it makes it really hard to
understand what's going on because the important aspects are spread randomly
across this patch and the next patch which actually adds the BRBE management.

Please factor the PMUv3 changes out into the patch adding the actual BRBE code.

[...]

> diff --git a/include/linux/perf/arm_pmuv3.h b/include/linux/perf/arm_pmuv3.h
> index 46377e134d67..c3e7d2cfb737 100644
> --- a/include/linux/perf/arm_pmuv3.h
> +++ b/include/linux/perf/arm_pmuv3.h
> @@ -308,5 +308,4 @@
> default: WARN(1, "Invalid PMEV* index\n"); \
> } \
> } while (0)
> -
> #endif

Unrelated whitespace change.

Mark.