Re: [PATCH 1/2] tracing: Simplify and fix "buffered event" synchronization

From: Petr Pavlu
Date: Wed Nov 29 2023 - 04:22:38 EST


On 11/28/23 16:27, Steven Rostedt wrote:
> On Tue, 28 Nov 2023 16:05:00 +0100
> Petr Pavlu <petr.pavlu@xxxxxxxx> wrote:
>
>> On 11/27/23 18:41, Steven Rostedt wrote:
>>> On Mon, 27 Nov 2023 16:12:47 +0100
>>> Petr Pavlu <petr.pavlu@xxxxxxxx> wrote:
>>>
>>>> The following warning appears when using buffered events:
>>>> [ 203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
>>>
>>> Hmm, I don't have a waring on line 3912, do you have extra code (debugging)
>>> in your version?
>>
>> The series is based on 2cc14f52aeb7 (tag: v6.7-rc3). It is the following
>> code and RB_WARN_ON():
>
> Interesting.
>
>> 3895 void ring_buffer_discard_commit(struct trace_buffer *buffer,
>> 3896 struct ring_buffer_event *event)
>> 3897 {
>> [...]
>> 3908 * This must only be called if the event has not been
>> 3909 * committed yet. Thus we can assume that preemption
>> 3910 * is still disabled.
>> 3911 */
>> 3912 RB_WARN_ON(buffer, !local_read(&cpu_buffer->committing));
>> 3913
>> 3914 rb_decrement_entry(cpu_buffer, event);
>> 3915 if (rb_try_to_discard(cpu_buffer, event))
>> 3916 goto out;
>>
>> https://github.com/torvalds/linux/blob/2cc14f52aeb78ce3f29677c2de1f06c0e91471ab/kernel/trace/ring_buffer.c#L3912
>>
>>>> [...]
>>>> [ 203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G E 6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
>
> I was looking at the above 6.7.0-rc2-default, which made me somewhat confused.
>
>
>>>> [ 203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
>>>> [ 203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
>>>> [ 203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77
>
>>>> * Function trace_buffered_event_enable() initializes memory of each
>>>> allocated trace_buffered_event to zero but no memory barrier is
>>>> present to ensure this operation is completed before publishing its
>>>> pointer for use in trace_event_buffer_lock_reserve().
>>>
>>> Yes, a wmb() may be helpful, but as the reader will see either the
>>> allocated page or NULL, and can handle both cases. I decided not to add it
>>> because it really doesn't matter if we do.
>>
>> My concern is about the earlier call "memset(event, 0, sizeof(*event));"
>> in trace_buffered_event_enable() which writes the obtained event page
>> prior to publishing it via "per_cpu(trace_buffered_event, cpu) = event;".
>> I think another CPU running trace_event_buffer_lock_reserve() could see
>> already the valid event pointer, write to the event page and only then
>> the memory could see the memset() write coming from
>> trace_buffered_event_enable(). The event data would then become
>> corrupted.
>
> OK, I'll buy that. Feel free to send an patch that adds that wmb with a
> comment explaining the above.

Sure, I can do that. A nice thing about my proposed patch is that it
requires only one smp_wmb() in this case, while the current code will
require adding this barrier in the allocation loop prior to writing
trace_buffered_event.

>>>
>>>> * Calling function trace_buffered_event_enable() normally requires
>>>> pairing it with trace_buffered_event_disable(). However, the function
>>>> has no error return code and in case of a failure decrements back
>>>> trace_buffered_event_ref. This results in underflow of the counter
>>>> when trace_buffered_event_disable() gets called later.
>>>
>>> Not an issue.
>>>
>>> Even on failure the ref count will be greater than zero, where
>>> trace_buffered_event_disable() handles it properly, and the freeing can
>>> handle no pages being allocated.
>>>
>>> That is:
>>>
>>> free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
>>>
>>> would be the same as: free_page((unsigned long)NULL);
>>>
>>> Which is perfectly fine to do.
>>
>> The situation that I have on my mind is the following:
>> * The counter trace_buffered_event_ref is at 0.
>> * The soft mode gets enabled for some event and
>> trace_buffered_event_enable() is called. The function increments
>> trace_buffered_event_ref to 1 and starts allocating event pages.
>> * The allocation fails for some page and trace_buffered_event_disable()
>> is called for cleanup.
>> * Function trace_buffered_event_disable() decrements
>> trace_buffered_event_ref back to 0, recognizes that it was the last
>> use of buffered events and frees all allocated pages.
>> * The control goes back to trace_buffered_event_enable() which returns.
>> The caller of trace_buffered_event_enable() has no information that
>> the function actually failed.
>> * Some time later, the soft mode is disabled for the same event.
>> Function trace_buffered_event_disable() is called. It warns on
>> "WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.
>>
>> The end result is only an ugly warning but I think it is still good to
>> fix.
>
> Ah, I missed the call to trace_buffered_event_disable() in the failure
> path. Actually, since the trace_buffered_event is only an optimization, I
> think the best thing to do is to just return without the clean up,
> expecting that the other caller will do the clean up. We could even not
> even stop the loop!
>
> That is, just have:
>
> void trace_buffered_event_enable(void)
> {
> struct ring_buffer_event *event;
> struct page *page;
> int cpu;
>
> WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
>
> if (trace_buffered_event_ref++)
> return;
>
> for_each_tracing_cpu(cpu) {
> page = alloc_pages_node(cpu_to_node(cpu),
> GFP_KERNEL | __GFP_NORETRY, 0);
>
> /* This is just an optimization and can handle failures */
> if (!page) {
> pr_err("Failed to create event buffer\n");
> break;
> }
>
> event = page_address(page);
> memset(event, 0, sizeof(*event));
>
> per_cpu(trace_buffered_event, cpu) = event;
>
> preempt_disable();
> if (cpu == smp_processor_id() &&
> __this_cpu_read(trace_buffered_event) !=
> per_cpu(trace_buffered_event, cpu))
> WARN_ON_ONCE(1);
> preempt_enable();
> }
> }

The current behavior is that if the allocation in one call to
trace_buffered_event_enable() fails then another call to the same
function results in a new attempt to allocate the event pages.

The code above changes it. As long trace_buffered_event_ref doesn't go
back to 0 and then 1 again, no new allocation attempt is made. It means
that trace_buffered_event can be left in this "half-initialized" state
for a while.

This is possibly ok, it is an unlikely corner case anyway. My patch
nonetheless preserves the current behavior of retrying the allocation.

>
>>>
>>> Now, the reason you found the crash was from the first issue you mentioned.
>>> That is, we didn't disable the current CPU and if we migrated, things would
>>> get really messed up. The only fix here is to make sure all CPUs have their
>>> trace_buffered_event_cnt get incremented.
>>
>> The potential race that I see is the following:
>> * Function trace_buffered_event_disable() is called on CPU 0. It
>> increments trace_buffered_event_cnt on each CPU and waits via
>> synchronize_rcu() for each user of trace_buffered_event to complete.
>
>> * After synchronize_rcu() is already finished, the situation is that all
>> counters trace_buffered_event_cnt are at 1 and all pointers
>> trace_buffered_event are still valid.
>
>> * At this point, on a different CPU 1, the execution reaches
>> trace_event_buffer_lock_reserve(). The function calls
>> preempt_disable_notrace() and enters an RCU read-side critical section
>> but that is not important because all RCU stuff is already performed
>> and completed in trace_buffered_event_disable(). The function proceeds
>> and reads a still valid pointer from trace_buffered_event[CPU1] into
>> the local variable "entry". However, it doesn't yet read
>> trace_buffered_event_cnt[CPU1] which happens later.
>
>> * Function trace_buffered_event_disable() continues. It frees
>> trace_buffered_event[CPU1] and decrements
>> trace_buffered_event_cnt[CPU1] back to 0.
>
>> * Function trace_event_buffer_lock_reserve() continues. It reads and
>> increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
>> believe that it can use the "entry" that it already obtained but the
>> pointer is now invalid and any access results in a use-after-free.
>
> OK, so you are saying we need another synchronize_rcu() after we set the
> trace_buffered_event to NULL?
>
>
> void trace_buffered_event_disable(void)
> {
> int cpu;
>
> WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
>
> if (WARN_ON_ONCE(!trace_buffered_event_ref))
> return;
>
> if (--trace_buffered_event_ref)
> return;
>
> preempt_disable();
> /* For each CPU, set the buffer as used. */
> smp_call_function_many(tracing_buffer_mask,
> disable_trace_buffered_event, NULL, 1);
> preempt_enable();
>
> /* Wait for all current users to finish */
> synchronize_rcu();
>
> for_each_tracing_cpu(cpu) {
> free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> per_cpu(trace_buffered_event, cpu) = NULL;
> }
> /*
> * Make sure trace_buffered_event is NULL before clearing
> * trace_buffered_event_cnt.
> */
> - smp_wmb();
> + synchronize_rcu();
>
> preempt_disable();
> /* Do the work on each cpu */
> smp_call_function_many(tracing_buffer_mask,
> enable_trace_buffered_event, NULL, 1);
> preempt_enable();
> }
>
> I agree with that.

Yes, I believe this should address this potential race condition.

An alternative would be instead to update
trace_event_buffer_lock_reserve() as follows:

if (this_cpu_inc_return(trace_buffered_event_cnt) == 1) {
smp_rmb();
if ((entry = __this_cpu_read(trace_buffered_event))) {
[...]

That saves the synchronize_rcu() call but additionally modifies
trace_buffered_event_cnt even if trace_buffered_event is currently NULL.

Another alternative is the approach taken by my patch which avoids more
RCU work and unnecessary memory modifications.

I'd be interested if you could have a look again at what I'm proposing
in my patch. I think it simplifies the code while addressing these
problems as well. However, if you have reservations about that approach
then it is ok, I can fix the found problems individually as discussed.

Thanks,
Petr