Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop

From: Mathieu Desnoyers
Date: Thu Jan 25 2024 - 16:18:58 EST


On 2024-01-20 08:47, Steven Rostedt wrote:
On Fri, 19 Jan 2024 20:49:36 -0500
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:

Let's say we have the following ktime_get() values (monotonic timestamp value) for
a sequence of events:

Timestamp (Hex) Encoding in the trace

Packet header timestamp begin 0x12345678 64-bit: 0x12345678

Event 1 0x12345678 16-bit: 0x5678
(When decoded, same value as previous timestamp, no overflow)
Event 2 0x12347777 16-bit: 0x7777
(When decoded, going from "0x5678" to "0x7777" does not overflow 16-bit)
Event 3 0x12350000 16-bit: 0x0000
(When decoded, going from "0x7777" to "0x0000" overflow 16-bit exactly once
which allows the trace reader to reconstruct timestamp 0x12350000 from the
previous timestamp and the 16-bit timestamp encoding.)
Event 4 0x12370000 64-bit: 0x12370000
(Encoding over 16-bit not possible because going from 0x12350000 to
0x12370000 would overflow 16-bit twice, which cannot be detected
by a trace reader. Therefore use the full 64-bit timestamp in the
"large" event header representation.)

I think that's basically what I said, but you are just looking at it
differently ;-) Or should I say, you are using bits for optimization.

Based on your explanation below, we are really talking about different things
here. Let me try to reply to your explanation to try to show where what I am
doing completely differs from what you have in mind. This will help explain
how I handle 16-bit overflow as well.

The events are based off of the last injected timestamp.

Incorrect. There is no "injected timestamp". There is only a concept
of the "current timestamp" as we either write to or read from the
event stream. I will take the point of view of the trace reader for
the rest of the discussion.

The above example,
starts with an timestamp injection into the packet header: 0x12345678, with
the lsb 16bits ignore.

Wrong again. The 16 least significant bits are not ignored. The "current timestamp"
is really 0x12345678 when the packet header is read.

In the packet header you have 0x12345678 in the first event you have
0x5678 how does that get you the timestamp? If that event had 0x8888,
when the reader reads this packet, it would take the header 0x12345678
chop off (ignore) the 5678, and add the 8888, right?

We need to consider not only what happens when the 16 low bits increase, but
also what happens when they end up with a value smaller than the previous
16 low bits.

As a summary from our video meeting discussion:

There are 3 cases we care about here:

packet header timestamp: 0x12345678

followed by either:

A) first event delta from packet header timestamp is 0: 16-bit value 0x5678

B) first event delta from packet header timestamp is <= 0xFFFF:
B.1) 16-bit value example 0x5699 (no 16-bit overflow from previous value)
B.2) 16-bit value example 0x2222 (exactly one 16-bit overflow from previous value)

C) first event delta from packet header timestamp is larger than 0xFFFF, which would
cause the low-order 16 bits to have more than one 16-bit overflow from the previous
value. The tracer detects this and uses a full 64-bit timestamp instead of the compact
16 bits.

[...]





But how do you detect the overflow? That last timestamps to know if
the tsc overflowed or not needs to be saved and compared. I would
assume you have a similar race that I have.

Yes, I save a "last timestamp" per buffer, but the race does not
matter because of the order in which it is saved wrt local cmpxchg
updating the reserved position.

The algorithm looks like:

do {
- read current reserved position (old pos)
- read time
- compute new reserved position (new pos)
} while (cmpxchg(reserved pos, old pos, new pos) != old pos);

[A]

save_last_tsc()

So the last_tsc that is saved is from the timestamp read before the
cmpxchg.

Yes.



If interrupted at [A] by another trace producer, it will compare with
an older "last tsc" than the tsc of the event physically located just
before the nested event. This stale "last tsc" has a value which is
necessarily lower than the one we would be saving with the
save_last_tsc immediately afterwards, which means in the worse case
we end up using a full 64-bit timestamp when in fact we could use a
more compact representation. But this race is rare and therefore it
does not matter for size.

That's equivalent to me "injecting" an absolute value for the same race.

Yes.



The fact that I only need this last_tsc value for the sake of
optimization, and not for computation of a time delta from a previous
injected timestamp, makes it possible to handle the race gracefully
without requiring anything more than a single last_tsc value per
buffer and a single comparison for 16-bit overflow.

If you have:

do {
- read current reserved position (old pos)
- read time
- compute new reserved position (new pos)
} while (cmpxchg(reserved pos, old pos, new pos) != old pos);

[A]

And here you have an interrupt that runs for several minutes (yeah bad
code ;-) And it does the same thing

do {
- read current reserved position (old pos)
- read time
- compute new reserved position (new pos)
} while (cmpxchg(reserved pos, old pos, new pos) != old pos);

save_last_tsc()

But that last_tsc from before the cmpxchg is much later than the one it
interrupted.



save_last_tsc()

Ah, so if you put the clock back here, it will just cause the next
event to inject a full timestamp again because the delta is too big.

Yes.


This is how you are able to avoid the "before/after" logic I have, as
the race is automatically detected. The least significant bits of the
timestamp is ignored for the event delta calculation.

Not quite, as I explained at the beginning of this email. All bits from the
previous timestamp, including its low bits, are useful to know how many
overflows happened since the last tsc.

And if a race
happens where the interrupting event saves a later timestamp and comes
back here, if the interrupted event writes the older timestamp, it just
causes that delta calculation to overflow again and you inject another
64bit timestamp into the buffer.

This part is correct: in the race you describe, we end up with the
possibility of bringing the last_tsc backwards, which can only cause
the tracer to use the full 64-bit timestamp when in fact it could use
the compact representation. But it's rare and should not matter in
practice.

And by the way this algorithm is designed to work with preemption/migration
enabled as well, not just interrupts. So the race can come from a thread
running concurrently on another CPU and it should work as well.

[...]


Going through a transition of changing it could end up being just as
complex. I'm not sure the complexity in that transition is better than
the complexity of the current code, as this code has been there for 15
years, and I know of at least 2 other projects that depend on this
format as is.

I agree with you that it's not clear-cut whether introducing this change
would be a benefit at this stage considering the extra complexity of
extending the ABI while keeping backward compatibility.

But it's something we can keep in mind if we ever have to do major ABI
extensions for other reasons.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com