Re: [PATCH V4 01/11] perf/core: Add aux_pause, aux_resume, aux_start_paused

From: Adrian Hunter
Date: Tue Jan 23 2024 - 05:09:34 EST


On 19/01/24 23:40, Namhyung Kim wrote:
> On Thu, Jan 11, 2024 at 12:19 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>
>> Hardware traces, such as instruction traces, can produce a vast amount of
>> trace data, so being able to reduce tracing to more specific circumstances
>> can be useful.
>>
>> The ability to pause or resume tracing when another event happens, can do
>> that.
>>
>> Add ability for an event to "pause" or "resume" AUX area tracing.
>>
>> Add aux_pause bit to perf_event_attr to indicate that, if the event
>> happens, the associated AUX area tracing should be paused. Ditto
>> aux_resume. Do not allow aux_pause and aux_resume to be set together.
>>
>> Add aux_start_paused bit to perf_event_attr to indicate to an AUX area
>> event that it should start in a "paused" state.
>>
>> Add aux_paused to struct perf_event for AUX area events to keep track of
>> the "paused" state. aux_paused is initialized to aux_start_paused.
>>
>> Add PERF_EF_PAUSE and PERF_EF_RESUME modes for ->stop() and ->start()
>> callbacks. Call as needed, during __perf_event_output(). Add
>> aux_in_pause_resume to struct perf_buffer to prevent races with the NMI
>> handler. Pause/resume in NMI context will miss out if it coincides with
>> another pause/resume.
>>
>> To use aux_pause or aux_resume, an event must be in a group with the AUX
>> area event as the group leader.
>>
>> Example (requires Intel PT and tools patches also):
>>
>> $ perf record --kcore -e intel_pt/aux-action=start-paused/k,syscalls:sys_enter_newuname/aux-action=resume/,syscalls:sys_exit_newuname/aux-action=pause/ uname
>> Linux
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0.043 MB perf.data ]
>> $ perf script --call-trace
>> uname 30805 [000] 24001.058782799: name: 0x7ffc9c1865b0
>> uname 30805 [000] 24001.058784424: psb offs: 0
>> uname 30805 [000] 24001.058784424: cbr: 39 freq: 3904 MHz (139%)
>> uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) debug_smp_processor_id
>> uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) __x64_sys_newuname
>> uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) down_read
>> uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) __cond_resched
>> uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_add
>> uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) in_lock_functions
>> uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_sub
>> uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) up_read
>> uname 30805 [000] 24001.058784629: ([kernel.kallsyms]) preempt_count_add
>> uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) in_lock_functions
>> uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) preempt_count_sub
>> uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) _copy_to_user
>> uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) syscall_exit_to_user_mode
>> uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) syscall_exit_work
>> uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) perf_syscall_exit
>> uname 30805 [000] 24001.058784838: ([kernel.kallsyms]) debug_smp_processor_id
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_trace_buf_alloc
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_swevent_get_recursion_context
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) debug_smp_processor_id
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) debug_smp_processor_id
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_tp_event
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_trace_buf_update
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) tracing_gen_ctx_irq_test
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_swevent_event
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __perf_event_account_interrupt
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __this_cpu_preempt_check
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_event_output_forward
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) perf_event_aux_pause
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) ring_buffer_get
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __rcu_read_lock
>> uname 30805 [000] 24001.058785046: ([kernel.kallsyms]) __rcu_read_unlock
>> uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) pt_event_stop
>> uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) debug_smp_processor_id
>> uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) debug_smp_processor_id
>> uname 30805 [000] 24001.058785254: ([kernel.kallsyms]) native_write_msr
>> uname 30805 [000] 24001.058785463: ([kernel.kallsyms]) native_write_msr
>> uname 30805 [000] 24001.058785639: 0x0
>
> Looks great! I think this is very similar to what Kees asked in
>
> https://lore.kernel.org/linux-perf-users/202401091452.B73E21B6C@keescook/

Sometimes a precisely-defined workload is needed, just so that
running it repeatedly does not produce results that vary too much
to tell whether one software version is better than another.

>
> I have a couple of basic questions:
> * Can we do that for regular events too?

That would be much more complicated. The current implementation
can only pause / resume 1 event, the group leader, and it has to
be supported by the PMU callbacks.

> * What's the difference between start/stop and pause/resume?
> (IOW can we do that just using start/stop callbacks?)

It is using start / stop callbacks, albeit with a different mode
parameter. However pause / resume is not allowed unless the event
has been started and not stopped, so it is a different state.

>
> Actually I was thinking about dropping samples using a BPF filter
> outside the target scope (e.g. a syscall) but it'd be nice if we can
> have builtin support for that.

In general, I would have thought that capturing samples does not
produce so much data that it cannot be filtered in post-processing.
Looking at the email thread from above, that seems to be what
Arnaldo has proposed.

AUX area tracing is different in this regard. Intel PT can produce
more trace data than can be written out in time, so data will be
lost for large traces. Also post-processing takes a long time, so
less data captured helps a lot there also.