Re: [RFC PATCH v3 1/1] trace,smp: Add tracepoints around remotelly called functions

From: Valentin Schneider
Date: Thu May 11 2023 - 06:57:27 EST


On 10/05/23 20:01, Leonardo Bras wrote:
> When running RT workloads in isolated CPUs, many cases of deadline misses
> are caused by remote CPU requests such as smp_call_function*().
>
> For those cases, having the names of those functions running around the
> deadline miss moment could help (a lot) finding a target for the next
> improvements.
>
> Add tracepoints for acquiring the function name & csd before entry and
> after returning from the remote-cpu requested function.
>
> Also, add tracepoints on the remote cpus requesting them.

Per the v1 thread, I got some context as to why we want this, but this
changelog is quite sparse on the subject. How about:

"""
The recently added ipi_send_{cpu,cpumask} tracepoints allow finding sources
of IPIs targeting CPUs running latency-sensitive applications.

For NOHZ_FULL CPUs, all IPIs are interference, and those tracepoints are
sufficient to find them and work on getting rid of them. In some setups
however, not *all* IPIs are to be suppressed, but long-running IPI
callbacks can still be problematic.

Add a pair of tracepoints to mark the start and end of processing a CSD IPI
callback, similar to what exists for softirq, workqueue or timer callbacks.
"""

And you can probably split the csd_queue_cpu TP into a separate commit,
with something along the lines of:

"""
Add a tracepoint for when a CSD is queued to a remote CPU's
call_single_queue. This allows finding exactly which CPU queued a given CSD
when looking at a csd_function_{entry,exit} event, and also enables us to
accurately measure IPI delivery time with e.g. a synthetic event:

$ echo 'hist:keys=cpu,csd.hex:ts=common_timestamp.usecs' >\
/sys/kernel/tracing/events/smp/csd_queue_cpu/trigger
$ echo 'csd_latency unsigned int dst_cpu; unsigned long csd; u64 time' >\
/sys/kernel/tracing/synthetic_events
$ echo \
'hist:keys=common_cpu,csd.hex:'\
'time=common_timestamp.usecs-$ts:'\
'onmatch(smp.csd_queue_cpu).trace(csd_latency,common_cpu,csd,$time)' >\
/sys/kernel/tracing/events/smp/csd_function_entry/trigger

$ trace-cmd record -e 'synthetic:csd_latency' hackbench
$ trace-cmd report
<...>-467 [001] 21.824263: csd_queue_cpu: cpu=0 callsite=try_to_wake_up+0x2ea func=sched_ttwu_pending csd=0xffff8880076148b8
<...>-467 [001] 21.824280: ipi_send_cpu: cpu=0 callsite=try_to_wake_up+0x2ea callback=generic_smp_call_function_single_interrupt+0x0
<...>-489 [000] 21.824299: csd_function_entry: func=sched_ttwu_pending csd=0xffff8880076148b8
<...>-489 [000] 21.824320: csd_latency: dst_cpu=0, csd=18446612682193848504, time=36
"""

>
> Signed-off-by: Leonardo Bras <leobras@xxxxxxxxxx>

Overall I like the CSD vs IPI split, it's a saner approach than logging it
all as IPIs, even if it does generate more events as we have to emit an
event for every csd queued (i.e. no _cpumask() variant is possible).

Some nitpicks below, and one other thing: are we happy with the smp event
namespace, and with the new TP names? Should the namespace be csd instead,
to match the trace_<namespace>_* nomenclature?

> ---
>
> Changes since RFCv2:
> - Fixed some spacing issues and trace calls
>
> Changes since RFCv1:
> - Implemented trace_csd_queue_cpu() as suggested by Valentin Schneider
> - Using EVENT_CLASS in order to avoid duplication
> - Introduced new helper: csd_do_func()
> - Name change from smp_call_function_* to csd_function_*
> - Rebased on top of torvalds/master
>
> include/trace/events/smp.h | 72 ++++++++++++++++++++++++++++++++++++++
> kernel/smp.c | 41 +++++++++++++---------
> 2 files changed, 96 insertions(+), 17 deletions(-)
> create mode 100644 include/trace/events/smp.h
>
> diff --git a/include/trace/events/smp.h b/include/trace/events/smp.h
> new file mode 100644
> index 000000000000..c304318a0203
> --- /dev/null
> +++ b/include/trace/events/smp.h
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM smp
> +
> +#if !defined(_TRACE_SMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_SMP_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(csd_queue_cpu,
> +
> + TP_PROTO(const unsigned int cpu,
> + unsigned long callsite,
> + smp_call_func_t func,
> + call_single_data_t *csd),
> +
> + TP_ARGS(cpu, callsite, func, csd),
> +
> + TP_STRUCT__entry(
> + __field(unsigned int, cpu)
> + __field(void *, callsite)
> + __field(void *, func)
> + __field(void *, csd)
> + ),
> +
> + TP_fast_assign(
> + __entry->cpu = cpu;
> + __entry->callsite = (void *)callsite;
> + __entry->func = func;
> + __entry->csd = csd;
> + ),
> +
> + TP_printk("cpu=%u callsite=%pS func=%pS csd=%p",
^^^^^^^^
This can be func=%ps, offsets should always be 0 so not very useful (I know
I put %pS for the IPI TPs, that's a force of habit...)

> + __entry->cpu, __entry->callsite, __entry->func, __entry->csd)
> +);

> +
> +/*
> + * Tracepoints for a function which is called as an effect of smp_call_function.*
> + */
> +DECLARE_EVENT_CLASS(csd_function,
> +
> + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
> +
> + TP_ARGS(func, csd),
> +
> + TP_STRUCT__entry(
> + __field(void *, func)
> + __field(void *, csd)
> + ),
> +
> + TP_fast_assign(
> + __entry->func = func;
> + __entry->csd = csd;
> + ),
> +
> + TP_printk("function %ps, csd = %p", __entry->func, __entry->csd)

To match the style of the other TPs, that should be:

TP_printk("func=%ps csd=%p", __entry->func, __entry->csd)