Re: [PATCH] ring-buffer: Fix slowpath of interrupted event

From: Google
Date: Tue Dec 19 2023 - 09:38:50 EST


On Mon, 18 Dec 2023 23:07:12 -0500
Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

> From: "Steven Rostedt (Google)" <rostedt@xxxxxxxxxxx>
>
> To synchronize the timestamps with the ring buffer reservation, there are
> two timestamps that are saved in the buffer meta data.
>
> 1. before_stamp
> 2. write_stamp
>
> When the two are equal, the write_stamp is considered valid, as in, it may
> be used to calculate the delta of the next event as the write_stamp is the
> timestamp of the previous reserved event on the buffer.
>
> This is done by the following:
>
> /*A*/ w = current position on the ring buffer
> before = before_stamp
> after = write_stamp
> ts = read current timestamp
>
> if (before != after) {
> write_stamp is not valid, force adding an absolute
> timestamp.

(additional comment)
This happens if this caller interrupts another reservation process's B to E.
(thus the original one only updates "before_stamp", but "write_stamp" is old.)

> }
>
> /*B*/ before_stamp = ts
>
> /*C*/ write = local_add_return(event length, position on ring buffer)
>
> if (w == write - event length) {
> /* Nothing interrupted between A and C */
> /*E*/ write_stamp = ts;
> delta = ts - after
> /*
> * If nothing interrupted again,
> * before_stamp == write_stamp and write_stamp
> * can be used to calculate the delta for
> * events that come in after this one.
> */

(additional comment)
If something interrupts between C and E, the write_stamp goes backward
because interrupted one updates the before_stamp and write_stamp with
new timestamp. But in this case, write_stamp(=ts) != before_stamp(=new ts).
Thus anyway the next entry will use absolute time stamp forcibly.

> } else {
>
> /*
> * The slow path!
> * Was interrupted between A and C.
> */
>
> This is the place that there's a bug. We currently have:
>
> after = write_stamp
> ts = read current timestamp
>
> /*F*/ if (write == current position on the ring buffer &&
> after < ts && cmpxchg(write_stamp, after, ts)) {
>
> delta = ts - after;
>
> } else {
> delta = 0;
> }
>
> The assumption is that if the current position on the ring buffer hasn't
> moved between C and F, then it also was not interrupted, and that the last
> event written has a timestamp that matches the write_stamp. That is the
> write_stamp is valid.
>
> But this may not be the case:
>
> If a task context event was interrupted by softirq between B and C.
>
> And the softirq wrote an event that got interrupted by a hard irq between
> C and E.
>
> and the hard irq wrote an event (does not need to be interrupted)
>
> We have:
>
> /*B*/ before_stamp = ts of normal context
>
> ---> interrupted by softirq
>
> /*B*/ before_stamp = ts of softirq context
>
> ---> interrupted by hardirq
>
> /*B*/ before_stamp = ts of hard irq context
> /*E*/ write_stamp = ts of hard irq context
>
> /* matches and write_stamp valid */
> <----
>
> /*E*/ write_stamp = ts of softirq context
>
> /* No longer matches before_stamp, write_stamp is not valid! */
>
> <---
>
> w != write - length, go to slow path
>
> // Right now the order of events in the ring buffer is:
> //
> // |-- softirq event --|-- hard irq event --|-- normal context event --|
> //
>
> after = write_stamp (this is the ts of softirq)
> ts = read current timestamp
>
> if (write == current position on the ring buffer [true] &&
> after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
>
> delta = ts - after [Wrong!]
>
> The delta is to be between the hard irq event and the normal context
> event, but the above logic made the delta between the softirq event and
> the normal context event, where the hard irq event is between the two. This
> will shift all the remaining event timestamps on the sub-buffer
> incorrectly.
>
> The write_stamp is only valid if it matches the before_stamp. The cmpxchg
> does nothing to help this.

That's right. Even if someone interrupts between A and C, we can write
the ts to write_stamp. In this case, write_stamp(old) != before_stamp(new)
so the next entry will forcibly record the absolute timestamp.
In this case, what we need to do here is using the absolute timestamp instead
of delta.

>
> Instead, the following logic can be done to fix this:
>
> before = before_stamp
> ts = read current timestamp
> before_stamp = ts
>
> after = write_stamp
>
> if (write == current position on the ring buffer &&
> after == before && after < ts) {
>
> delta = ts - after
>
> } else {
> delta = 0;
> }

Ah, this is a good idea. Instead of using the old timestamp, use
delta = 0, thus it will reuse the interrupted timestamp if someone
interrupts between A and C.

Yeah the above works, but my question is, do we really need this
really slow path? I mean;

> if (w == write - event length) {
> /* Nothing interrupted between A and C */
> /*E*/ write_stamp = ts;
> delta = ts - after

} else {
/*
Something interrupted between A and C, which should record
a new entry before this reserved entry with newer timestamp.
we reuse it.
*/
ts = after = write_stamp;
delta = 0;
}

Isn't this enough?

Thank you,

>
> The above will only use the write_stamp if it still matches before_stamp
> and was tested to not have changed since C.
>
> As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
>
> This mean the 32-bit rb_time_t workaround can finally be removed. But
> that's for a later time.
>
> Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@xxxxxxxxxxxxxxxxxx/
>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: dd93942570789 ("ring-buffer: Do not try to put back write_stamp")
> Signed-off-by: Steven Rostedt (Google) <rostedt@xxxxxxxxxxx>
> ---
> kernel/trace/ring_buffer.c | 79 ++++++++++++--------------------------
> 1 file changed, 24 insertions(+), 55 deletions(-)
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 5a114e752f11..e7055f52afe0 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -700,48 +700,6 @@ rb_time_read_cmpxchg(local_t *l, unsigned long expect, unsigned long set)
> return local_try_cmpxchg(l, &expect, set);
> }
>
> -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
> -{
> - unsigned long cnt, top, bottom, msb;
> - unsigned long cnt2, top2, bottom2, msb2;
> - u64 val;
> -
> - /* Any interruptions in this function should cause a failure */
> - cnt = local_read(&t->cnt);
> -
> - /* The cmpxchg always fails if it interrupted an update */
> - if (!__rb_time_read(t, &val, &cnt2))
> - return false;
> -
> - if (val != expect)
> - return false;
> -
> - if ((cnt & 3) != cnt2)
> - return false;
> -
> - cnt2 = cnt + 1;
> -
> - rb_time_split(val, &top, &bottom, &msb);
> - msb = rb_time_val_cnt(msb, cnt);
> - top = rb_time_val_cnt(top, cnt);
> - bottom = rb_time_val_cnt(bottom, cnt);
> -
> - rb_time_split(set, &top2, &bottom2, &msb2);
> - msb2 = rb_time_val_cnt(msb2, cnt);
> - top2 = rb_time_val_cnt(top2, cnt2);
> - bottom2 = rb_time_val_cnt(bottom2, cnt2);
> -
> - if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
> - return false;
> - if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
> - return false;
> - if (!rb_time_read_cmpxchg(&t->top, top, top2))
> - return false;
> - if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
> - return false;
> - return true;
> -}
> -
> #else /* 64 bits */
>
> /* local64_t always succeeds */
> @@ -755,11 +713,6 @@ static void rb_time_set(rb_time_t *t, u64 val)
> {
> local64_set(&t->time, val);
> }
> -
> -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
> -{
> - return local64_try_cmpxchg(&t->time, &expect, set);
> -}
> #endif
>
> /*
> @@ -3610,20 +3563,36 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
> } else {
> u64 ts;
> /* SLOW PATH - Interrupted between A and C */
> - a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
> - /* Was interrupted before here, write_stamp must be valid */
> +
> + /* Save the old before_stamp */
> + a_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
> RB_WARN_ON(cpu_buffer, !a_ok);
> +
> + /*
> + * Read a new timestamp and update the before_stamp to make
> + * the next event after this one force using an absolute
> + * timestamp. This is in case an interrupt were to come in
> + * between E and F.
> + */
> ts = rb_time_stamp(cpu_buffer->buffer);
> + rb_time_set(&cpu_buffer->before_stamp, ts);
> +
> + barrier();
> + /*E*/ a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
> + /* Was interrupted before here, write_stamp must be valid */
> + RB_WARN_ON(cpu_buffer, !a_ok);
> barrier();
> - /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
> - info->after < ts &&
> - rb_time_cmpxchg(&cpu_buffer->write_stamp,
> - info->after, ts)) {
> - /* Nothing came after this event between C and E */
> + /*F*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
> + info->after == info->before && info->after < ts) {
> + /*
> + * Nothing came after this event between C and F, it is
> + * safe to use info->after for the delta as it
> + * matched info->before and is still valid.
> + */
> info->delta = ts - info->after;
> } else {
> /*
> - * Interrupted between C and E:
> + * Interrupted between C and F:
> * Lost the previous events time stamp. Just set the
> * delta to zero, and this will be the same time as
> * the event this event interrupted. And the events that
> --
> 2.42.0
>


--
Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>