Re: [PATCH printk v3 02/14] printk: Adjust mapping for 32bit seq macros

From: John Ogness
Date: Mon Jan 15 2024 - 05:52:56 EST


On 2024-01-12, Petr Mladek <pmladek@xxxxxxxx> wrote:
>> > Reported-by: Francesco Dolcini <francesco@xxxxxxxxxx>
>> > Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
>> > Closes: https://lore.kernel.org/oe-lkp/202311171611.78d41dbe-oliver.sang@xxxxxxxxx
>> > Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
>> > Closes: https://lore.kernel.org/oe-lkp/202311161555.3ee16fc9-oliver.sang@xxxxxxxxx
>> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
>> > Signed-off-by: John Ogness <john.ogness@xxxxxxxxxxxxx>
>>
>> Great catch! It must have been complicated to debug this.
>>
>> Reviewed-by: Petr Mladek <pmladek@xxxxxxxx>
>
> That said, I am a bit nervous that a bug like this might cause
> workqueue stall and panic() the kernel.
>
> At least, this is how I read
> https://lore.kernel.org/oe-lkp/202311171611.78d41dbe-oliver.sang@xxxxxxxxx/

Note that the bug is reported for code that mainline will never
have. This patch "fixes" the problem before it is a problem. From the
perspective of mainline, the problem never existed. Perhaps it is
inappropriate to list these with the Closes tag.

> It looks like it caused some loop and refcout overlow or so.
> But I might be wrong.
>
> I would like to better understand this and check if we could prevent
> it somehow.

The code in question can be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/kernel/printk/printk_ringbuffer.c?id=768c33033e41ad195a9304ebb183afb730d5ae98#n2031

The URL above points to prb_next_reserve_seq(), which is in an infinite
loop.

desc_last_finalized_seq() is returning a huge garbage value due to the
32bit sequence value 1 being incorrectly mapped to 0xffffffff00000001
(which is 18446744069414584321). When desc_read_finalized_seq() is
called (line 2082), it obviously returns -EINVAL because there is no
record with this huge sequence number.

The code interrupts -EINVAL to mean that the record has been
overwritten, and so it tries again with an updated last finalized
sequence number => infinite loop.

With this patch applied, the 32bit value 1 is correctly mapped to
sequence number 1 and all is OK.

The problem was that desc_last_finalized_seq() returned a sequence
number that was higher than the highest record. That was the bug. As
long as desc_last_finalized_seq() always returns a sequence number that
exists or used to exist, the retry loop is fine. And since the tail
record is always in the finalized state, there will always be a
finalized record available.

John