Re: [PATCH printk v3 09/14] printk: Wait for all reserved records with pr_flush()

From: Petr Mladek
Date: Wed Jan 31 2024 - 06:37:21 EST


On Thu 2023-12-14 22:47:56, John Ogness wrote:
> Currently pr_flush() will only wait for records that were
> available to readers at the time of the call (using
> prb_next_seq()). But there may be more records (non-finalized)
> that have following finalized records. pr_flush() should wait
> for these to print as well. Particularly because any trailing
> finalized records may be the messages that the calling context
> wants to ensure are printed.
>
> Add a new ringbuffer function prb_next_reserve_seq() to return
> the sequence number following the most recently reserved record.
> This guarantees that pr_flush() will wait until all current
> printk() messages (completed or in progress) have been printed.
>
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -1986,6 +1986,119 @@ u64 prb_first_seq(struct printk_ringbuffer *rb)
> return seq;
> }
>
> +/**
> + * prb_next_reserve_seq() - Get the sequence number after the most recently
> + * reserved record.
> + *
> + * @rb: The ringbuffer to get the sequence number from.
> + *
> + * This is the public function available to readers to see what sequence
> + * number will be assigned to the next reserved record.
> + *
> + * Note that depending on the situation, this value can be equal to or
> + * higher than the sequence number returned by prb_next_seq().
> + *
> + * Context: Any context.
> + * Return: The sequence number that will be assigned to the next record
> + * reserved.
> + */
> +u64 prb_next_reserve_seq(struct printk_ringbuffer *rb)
> +{
> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> +	unsigned long last_finalized_id;
> +	atomic_long_t *state_var;
> +	u64 last_finalized_seq;
> +	unsigned long head_id;
> +	struct prb_desc desc;
> +	unsigned long diff;
> +	struct prb_desc *d;
> +	int err;
> +
> +	/*
> +	 * It may not be possible to read a sequence number for @head_id.
> +	 * So the ID of @last_finalized_seq is used to calculate what the
> +	 * sequence number of @head_id will be.
> +	 */
> +
> +try_again:
> +	last_finalized_seq = desc_last_finalized_seq(rb);
> +
> +	/*
> +	 * @head_id is loaded after @last_finalized_seq to ensure that it is
> +	 * at or beyond @last_finalized_seq.
> +	 *
> +	 * Memory barrier involvement:
> +	 *
> +	 * If desc_last_finalized_seq:A reads from
> +	 * desc_update_last_finalized:A, then
> +	 * prb_next_reserve_seq:A reads from desc_reserve:D.
> +	 *
> +	 * Relies on:
> +	 *
> +	 * RELEASE from desc_reserve:D to desc_update_last_finalized:A
> +	 * matching
> +	 * ACQUIRE from desc_last_finalized_seq:A to prb_next_reserve_seq:A
> +	 *
> +	 * Note: desc_reserve:D and desc_update_last_finalized:A can be
> +	 * different CPUs. However, the desc_update_last_finalized:A CPU
> +	 * (which performs the release) must have previously seen
> +	 * desc_read:C, which implies desc_reserve:D can be seen.

Huh, it must have been a huge effort to pull all the pieces together.
It took me a long time to check and understand it. And it looks right
to me.

The most tricky part is desc_reserve:D. It is a super complicated
barrier which guarantees many things. But I think that there are
many more write barriers before the allocated descriptor reaches
the finalized state, so it should be easier to prove the correctness
here using other, simpler barriers.

To make it clear: I am all for keeping the above precise description
as is. I am just thinking about adding one more human-friendly note
which might help future generations understand the correctness more
easily. Something like:

* Note: The above description might be hard to follow because
* desc_reserve:D is a multi-purpose barrier. But this is
* just the first barrier which makes sure that @head_id
* is updated before the reserved descriptor gets finalized
* and updates @last_finalized_seq.
*
* There are more release barriers in between, especially
* desc_reserve:F and desc_update_last_finalized:A. These also make
* sure that the desc_last_finalized_seq:A acquire barrier allows
* reading a @head_id related to @last_finalized_seq or newer.

In fact, the desc_update_last_finalized:A release and
desc_last_finalized_seq:A acquire barriers are enough to prove
that the @head_id read here relates to @last_finalized_seq or newer.

Or maybe it is exactly what you described and my "human friendly"
description is too naive. I am still not completely familiar
with the "Memory barrier involvement:" and "Relies on:"
format.

> +	 */
> +	head_id = atomic_long_read(&desc_ring->head_id); /* LMM(prb_next_reserve_seq:A) */
> +
> +	d = to_desc(desc_ring, last_finalized_seq);
> +	state_var = &d->state_var;
> +
> +	/* Extract the ID, used to specify the descriptor to read. */
> +	last_finalized_id = DESC_ID(atomic_long_read(state_var));
> +
> +	/* Ensure @last_finalized_id is correct. */
> +	err = desc_read_finalized_seq(desc_ring, last_finalized_id, last_finalized_seq, &desc);
> +
> +	if (err == -EINVAL) {
> +		if (last_finalized_seq == 0) {
> +			/*
> +			 * @last_finalized_seq still contains its initial
> +			 * value. Probably no record has been finalized yet.
> +			 * This means the ringbuffer is not yet full and the
> +			 * @head_id value can be used directly (subtracting
> +			 * off the id value corresponding to seq=0).
> +			 */
> +
> +			/*
> +			 * Because of hack#2 of the bootstrapping phase, the
> +			 * @head_id initial value must be handled separately.
> +			 */
> +			if (head_id == DESC0_ID(desc_ring->count_bits))
> +				return 0;
> +
> +			/*
> +			 * The @head_id is initialized such that the first
> +			 * increment will yield the first record (seq=0).
> +			 * Therefore use the initial value +1 as the base to
> +			 * subtract from @head_id.
> +			 */
> +			last_finalized_id = DESC0_ID(desc_ring->count_bits) + 1;

It took me a long time to understand the above code and comments. I
wonder if the following reads more easily, even for you:

	if (err == -EINVAL) {
		if (last_finalized_seq == 0) {
			/*
			 * No record has been finalized or even reserved yet.
			 *
			 * The @head_id is initialized such that the first
			 * increment will yield the first record (seq=0).
			 * Handle it separately to avoid a negative @diff below.
			 */
			if (head_id == DESC0_ID(desc_ring->count_bits))
				return 0;

			/*
			 * One or more descriptors are already reserved. Use
			 * the descriptor ID of the first one (@seq=0) for
			 * the @diff below.
			 */
			last_finalized_id = DESC0_ID(desc_ring->count_bits) + 1;


> +		} else {
> +			/* Record must have been overwritten. Try again. */
> +			goto try_again;
> +		}
> +	}
> +
> +	/*
> +	 * @diff is the number of records beyond the last record available
> +	 * to readers.
> +	 */

This is kind of obvious from the code. Also, it is not true when the
first record has not been finalized yet. The following might
be more useful:

/* Diff of known descriptor IDs to compute related sequence numbers. */

> +	diff = head_id - last_finalized_id;
> +
> +	/*
> +	 * @head_id points to the most recently reserved record, but this
> +	 * function returns the sequence number that will be assigned to the
> +	 * next (not yet reserved) record. Thus +1 is needed.
> +	 */
> +	return (last_finalized_seq + diff + 1);
> +}
> +
> /*
> * Non-blocking read of a record.
> *

BTW: It came to my mind whether the logic might be more straightforward
if we stored desc_ring->next_finalized_seq instead of @last_finalized_seq.

Even the initial value would be valid. Also, the value is
used only in prb_next_reserve_seq() and prb_next_seq(), where
we need to start with this value anyway.

Note that prb_next_seq() actually does not need to try _prb_read_valid()
for @last_finalized_seq. It should always be valid unless
the record has already been overwritten, in which case
there should be a newer readable record.

Well, it is just an idea. I am not completely sure that it
would make the logic easier to follow. The current code is perfectly fine.


The patch is correct. Feel free to add:

Reviewed-by: Petr Mladek <pmladek@xxxxxxxx>

Well, it would be nice to update the comments if you liked the proposals.

Best Regards,
Petr