[PATCH 1/2] tracing: Simplify and fix "buffered event" synchronization

From: Petr Pavlu
Date: Mon Nov 27 2023 - 10:13:15 EST


The following warning appears when using buffered events:
[ 203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
[...]
[ 203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G E 6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
[ 203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
[ 203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
[ 203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
[ 203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202
[ 203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000
[ 203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400
[ 203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000
[ 203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008
[ 203.781846] FS: 00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000
[ 203.781851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0
[ 203.781862] Call Trace:
[ 203.781870] <TASK>
[ 203.851949] trace_event_buffer_commit+0x1ea/0x250
[ 203.851967] trace_event_raw_event_sys_enter+0x83/0xe0
[ 203.851983] syscall_trace_enter.isra.0+0x182/0x1a0
[ 203.851990] do_syscall_64+0x3a/0xe0
[ 203.852075] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 203.852090] RIP: 0033:0x7f4cd870fa77
[ 203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
[ 203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
[ 203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77
[ 203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0
[ 203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0
[ 203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40
[ 204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0
[ 204.049256] </TASK>

For instance, it can be triggered by running these two commands in
parallel:
$ while true; do
echo hist:key=id.syscall:val=hitcount > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
done
$ stress-ng --sysinfo $(nproc)

The warning indicates that the current ring_buffer_per_cpu is not in the
committing state. It happens because the active ring_buffer_event
doesn't actually come from the ring_buffer_per_cpu but is allocated from
trace_buffered_event.

The bug is in function trace_buffered_event_disable() where the
following normally happens:
* The code invokes disable_trace_buffered_event() via
smp_call_function_many() and follows it by synchronize_rcu(). This
increments the per-CPU variable trace_buffered_event_cnt on each
target CPU and grants trace_buffered_event_disable() the exclusive
access to the per-CPU variable trace_buffered_event.
* Maintenance is performed on trace_buffered_event, all per-CPU event
buffers get freed.
* The code invokes enable_trace_buffered_event() via
smp_call_function_many(). This decrements trace_buffered_event_cnt and
releases the access to trace_buffered_event.

A problem is that smp_call_function_many() runs a given function on all
target CPUs except on the current one. The following can then occur:
* Task X executing trace_buffered_event_disable() runs on CPU A.
* The control reaches synchronize_rcu() and the task gets rescheduled on
another CPU B.
* The RCU synchronization finishes. At this point,
trace_buffered_event_disable() has the exclusive access to all
trace_buffered_event variables except trace_buffered_event[A] because
trace_buffered_event_cnt[A] is never incremented and if the buffer is
currently unused, remains set to 0.
* A different task Y is scheduled on CPU A and hits a trace event. The
code in trace_event_buffer_lock_reserve() sees that
trace_buffered_event_cnt[A] is set to 0 and decides the use the buffer
provided by trace_buffered_event[A].
* Task X continues its execution in trace_buffered_event_disable(). The
code incorrectly frees the event buffer pointed by
trace_buffered_event[A] and resets the variable to NULL.
* Task Y writes event data to the now freed buffer and later detects the
created inconsistency.

The issue is observable since commit dea499781a11 ("tracing: Fix warning
in trace_buffered_event_disable()") which moved the call of
trace_buffered_event_disable() in __ftrace_event_enable_disable()
earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..).
The underlying problem in trace_buffered_event_disable() is however
present since the original implementation in commit 0fc1b09ff1ff
("tracing: Use temp buffer when filtering events").

The bug is simply fixable by replacing smp_call_function_many() by
on_each_cpu_mask(), but the code has other issues as well:
* Function trace_event_buffer_lock_reserve() reads trace_buffered_event
and trace_buffered_event_cnt in reverse order than they are written in
trace_buffered_event_disable() and without any memory barrier. It
could happen that trace_event_buffer_lock_reserve() still finds
a valid pointer in trace_buffered_event which is being freed by
trace_buffered_event_disable() but later already sees the decremented
value of trace_buffered_event_cnt back to 0 and incorrectly decides to
use the provided buffer.
* Function trace_buffered_event_enable() initializes memory of each
allocated trace_buffered_event to zero but no memory barrier is
present to ensure this operation is completed before publishing its
pointer for use in trace_event_buffer_lock_reserve().
* Calling function trace_buffered_event_enable() normally requires
pairing it with trace_buffered_event_disable(). However, the function
has no error return code and in case of a failure decrements back
trace_buffered_event_ref. This results in underflow of the counter
when trace_buffered_event_disable() gets called later.

These problems can be addressed individually but it looks better to
rework the synchronization mechanism and simplify it in the first place.

Instead of using the per-CPU variable trace_buffered_event_cnt for the
exclusive access during the disable operation, introduce a new variable
trace_buffered_event_enabled to reflect the current state and
appropriately adjust the code. The variable indicates whether buffered
events are currently enabled and trace_buffered_event together with
trace_buffered_event_cnt are ok to use.

The updated synchronization mechanism fixes the mentioned problems and
avoids also sending IPIs across the system.

Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events")
Fixes: dea499781a11 ("tracing: Fix warning in trace_buffered_event_disable()")
Signed-off-by: Petr Pavlu <petr.pavlu@xxxxxxxx>
---
kernel/trace/trace.c | 66 ++++++++++++++++++--------------------------
1 file changed, 27 insertions(+), 39 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9aebf904ff97..123acb9105dd 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2741,6 +2741,7 @@ trace_buffer_lock_reserve(struct trace_buffer *buffer,
DEFINE_PER_CPU(struct ring_buffer_event *, trace_buffered_event);
DEFINE_PER_CPU(int, trace_buffered_event_cnt);
static int trace_buffered_event_ref;
+static bool trace_buffered_event_enabled;

/**
* trace_buffered_event_enable - enable buffering events
@@ -2764,7 +2765,9 @@ void trace_buffered_event_enable(void)

WARN_ON_ONCE(!mutex_is_locked(&event_mutex));

- if (trace_buffered_event_ref++)
+ trace_buffered_event_ref++;
+
+ if (trace_buffered_event_enabled)
return;

for_each_tracing_cpu(cpu) {
@@ -2777,30 +2780,21 @@ void trace_buffered_event_enable(void)
memset(event, 0, sizeof(*event));

per_cpu(trace_buffered_event, cpu) = event;
-
- preempt_disable();
- if (cpu == smp_processor_id() &&
- __this_cpu_read(trace_buffered_event) !=
- per_cpu(trace_buffered_event, cpu))
- WARN_ON_ONCE(1);
- preempt_enable();
}

+ /*
+ * Ensure all initialization changes are visible and publish
+ * availability of trace_buffered_event.
+ */
+ smp_wmb();
+ WRITE_ONCE(trace_buffered_event_enabled, true);
return;
- failed:
- trace_buffered_event_disable();
-}

-static void enable_trace_buffered_event(void *data)
-{
- /* Probably not needed, but do it anyway */
- smp_rmb();
- this_cpu_dec(trace_buffered_event_cnt);
-}
-
-static void disable_trace_buffered_event(void *data)
-{
- this_cpu_inc(trace_buffered_event_cnt);
+ failed:
+ for_each_tracing_cpu(cpu) {
+ free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
+ per_cpu(trace_buffered_event, cpu) = NULL;
+ }
}

/**
@@ -2820,33 +2814,19 @@ void trace_buffered_event_disable(void)
if (WARN_ON_ONCE(!trace_buffered_event_ref))
return;

- if (--trace_buffered_event_ref)
+ if (--trace_buffered_event_ref || !trace_buffered_event_enabled)
return;

- preempt_disable();
- /* For each CPU, set the buffer as used. */
- smp_call_function_many(tracing_buffer_mask,
- disable_trace_buffered_event, NULL, 1);
- preempt_enable();
+ WRITE_ONCE(trace_buffered_event_enabled, false);

/* Wait for all current users to finish */
synchronize_rcu();

for_each_tracing_cpu(cpu) {
+ WARN_ON_ONCE(per_cpu(trace_buffered_event_cnt, cpu) != 0);
free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
per_cpu(trace_buffered_event, cpu) = NULL;
}
- /*
- * Make sure trace_buffered_event is NULL before clearing
- * trace_buffered_event_cnt.
- */
- smp_wmb();
-
- preempt_disable();
- /* Do the work on each cpu */
- smp_call_function_many(tracing_buffer_mask,
- enable_trace_buffered_event, NULL, 1);
- preempt_enable();
}

static struct trace_buffer *temp_buffer;
@@ -2883,9 +2863,16 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb,
* is still quicker than no copy on match, but having
* to discard out of the ring buffer on a failed match.
*/
- if ((entry = __this_cpu_read(trace_buffered_event))) {
+ if (READ_ONCE(trace_buffered_event_enabled)) {
int max_len = PAGE_SIZE - struct_size(entry, array, 1);

+ /*
+ * Ensure per-CPU variables trace_buffered_event(_cnt)
+ * are read after trace_buffered_event_enabled, pairs
+ * with smp_wmb() in trace_buffered_event_enable().
+ */
+ smp_rmb();
+
val = this_cpu_inc_return(trace_buffered_event_cnt);

/*
@@ -2906,6 +2893,7 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb,
* handle the failure in that case.
*/
if (val == 1 && likely(len <= max_len)) {
+ entry = __this_cpu_read(trace_buffered_event);
trace_event_setup(entry, type, trace_ctx);
entry->array[0] = len;
/* Return with preemption disabled */
--
2.35.3