Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop

From: Steven Rostedt
Date: Fri Jan 19 2024 - 10:36:46 EST


On Fri, 19 Jan 2024 09:40:27 -0500
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:

> On 2024-01-18 18:12, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" <rostedt@xxxxxxxxxxx>
> >
> > Instead of using local_add_return() to reserve the ring buffer data,
> > Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
> > reservation with the time keeping code.
>
> I admire the effort of trying to simplify the Ftrace ring buffer by bringing
> over ideas that worked well for LTTng. :-) As reviewer of the tracing subsystem,
> I certainly welcome the simplifications.
>

The idea itself wasn't new. It was you convincing me that
local_add_return() is no faster than local_cmpxchg(). As I would have done
it this way from the start if I wasn't dead set against adding any new
cmpxchg() in the fast path.

Testing showed that local_add_return() is pretty much just as bad, so the
added complexity to avoid just slapping in a cmpxchg() was useless.

> > Although, it does not get rid of the double time stamps (before_stamp and
> > write_stamp), using cmpxchg() does get rid of the more complex case when
> > an interrupting event occurs between getting the timestamps and reserving
> > the data, as when that happens, it just tries again instead of dealing
> > with it.
>
> I understand that the reason why you need the before/after stamps and their
> associated complexity is because the Ftrace ring buffer ABI encodes event
> timestamps as delta from the previous event within the buffer as a mean of
> compressing the timestamp fields. If the delta cannot be represented in a
> given number of bits, then it inserts a 64-bit timestamp (not sure if that
> one is absolute or a delta from previous event).

There's both. An extended timestamp, which is added when the delta is too
big, and that too is just a delta from the previous event. And there is the
absolute timestamp as well. I could always just use the absolute one. That
event came much later.

>
> This timestamp encoding as delta between events introduce a strong
> inter-dependency between consecutive (nested) events, and is the reason
> why you are stuck with all this timestamp before/after complexity.
>
> The Common Trace Format specifies (and LTTng implements) a different way
> to achieve the same ring buffer space-savings achieved with timestamp deltas
> while keeping the timestamps semantically absolute from a given reference,
> hence without all the before/after timestamp complexity. You can see the
> clock value decoding procedure in the CTF2 SPEC RC9 [1] document. The basic

That points to this:

---------------------8<-------------------------
6.3. Clock value update procedure
To update DEF_CLK_VAL from an unsigned integer field F having the unsigned integer value V and the class C:

Let L be an unsigned integer initialized to, depending on the type property of C:

"fixed-length-unsigned-integer"
The value of the length property of C.

"variable-length-unsigned-integer"
S ×7, where S is the number of bytes which F occupies with the data stream.

Let MASK be an unsigned integer initialized to 2L − 1.

Let H be an unsigned integer initialized to DEF_CLK_VAL & ~MASK, where “&” is the bitwise AND operator and “~” is the bitwise NOT operator.

Let CUR be an unsigned integer initialized to DEF_CLK_VAL & MASK, where “&” is the bitwise AND operator.

Set DEF_CLK_VAL to:

If V ≥ CUR
H + V

Else
H + MASK + 1 + V
--------------------->8-------------------------

There's a lot of missing context there, so I don't see how it relates.


> idea on the producer side is to record the low-order bits of the current
> timestamp in the event header (truncating the higher order bits), and
> fall back on a full 64-bit value if the number of low-order bits overflows
> from the previous timestamp is more than 1, or if it is impossible to figure
> out precisely the timestamp of the previous event due to a race. This achieves
> the same space savings as delta timestamp encoding without introducing the
> strong event inter-dependency.

So when an overflow happens, you just insert a timestamp, and then events
after that is based on that?

>
> The fact that Ftrace exposes this ring buffer binary layout as a user-space
> ABI makes it tricky to move to the Common Trace Format timestamp encoding.
> There are clearly huge simplifications that could be made by moving to this
> scheme though. Is there any way to introduce a different timestamp encoding
> scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to
> introduce this simpler scheme and gradually phase out the more complex delta
> encoding when no users are left.

I'm not sure if there's a path forward. The infrastructure can easily swap
in and out a new implementation. That is, there's not much dependency on
the way the ring buffer works outside the ring buffer itself.

If we were to change the layout, it would likely require a new interface
file to read. The trace_pipe_raw is the only file that exposes the current
ring buffer. We could create a trace_out_raw or some other named file that
has a completely different API and it wouldn't break any existing API.

Although, if we want to change the "default" way, it may need some other
knobs or something, which wouldn't be hard.

Now I have to ask, what's the motivation for this. The code isn't that
complex anymore. Yes it still has the before/after timestamps, but the
most complexity in that code was due to what happens in the race of
updating the reserved data. But that's no longer the case with the
cmpxchg(). If you look at this patch, that entire else statement was
deleted. And that deleted code was what made me sick to my stomach ;-)
Good riddance!

-- Steve