Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

From: Petr Mladek
Date: Tue Sep 26 2023 - 08:14:53 EST


On Mon 2023-09-25 15:43:03, John Ogness wrote:
> On 2023-09-22, Petr Mladek <pmladek@xxxxxxxx> wrote:
> >> console_flush_on_panic() - Called from several call sites to
> >> trigger ringbuffer dumping in an urgent situation.
> >>
> >> console_flush_on_panic() - Typically the panic() function will
> >
> > This is a second description of console_flush_on_panic() which
> > looks like a mistake.
>
> Oops. The first one should not be there.
>
> >> take care of atomic flushing the nbcon consoles on
> >> panic. However, there are several users of
> >> console_flush_on_panic() outside of panic().
> >
> > The generic panic() seems to use console_flush_on_panic() correctly
> > at the very end.
> >
> > Hmm, I see that console_flush_on_panic() is called also in
>
> [...]
>
> > Anyway, we should make clear that console_flush_on_panic() might break
> > the system and should be called as the last attempt to flush consoles.
> > The above arch-specific users are worth review.
>
> In an upcoming series you will see that console_flush_on_panic() only
> takes the console lock if there are legacy consoles. Ideally, eventually
> there will only be nbcon consoles, so your concern would disappear.

The legacy consoles have two risk levels:

1. port->lock is ignored after bust_spinlocks()
2. even console_lock is ignored in console_flush_on_panic()

The nbcon consoles have only one risk level:

1. unsafe takeover is allowed

At first, I thought that we wanted to allow the unsafe takeover in
console_flush_on_panic(). In that case, this function would
be dangerous even for nbcon consoles.

Now, I remember that we wanted to allow it only before entering
the infinite loop (blinking diodes). In this case,
console_flush_on_panic() would be really safe for nbcon consoles.
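
To illustrate the rule I have in mind, here is a toy userspace sketch.
It is not the nbcon code. All names (toy_con_state, toy_flush_stage,
toy_may_take_over) are hypothetical and the ownership state is reduced
to two flags. It only shows that an unsafe takeover would be refused
from console_flush_on_panic() and allowed only in the very last flush
before the infinite loop:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: where the flush is called from. */
enum toy_flush_stage {
	TOY_FLUSH_ON_PANIC,	/* stands in for console_flush_on_panic() */
	TOY_FLUSH_FINAL,	/* last attempt, right before the infinite loop */
};

/* Hypothetical, heavily simplified console ownership state. */
struct toy_con_state {
	bool owned;	/* another context currently owns the console */
	bool unsafe;	/* the owner is inside an unsafe section */
};

/* Decide whether the caller may take over the console. */
static bool toy_may_take_over(const struct toy_con_state *st,
			      enum toy_flush_stage stage)
{
	assert(st != NULL);

	/* Nobody owns it, or the owner is at a safe point. */
	if (!st->owned || !st->unsafe)
		return true;

	/* Unsafe takeover: only as the very last attempt. */
	return stage == TOY_FLUSH_FINAL;
}
```

With this rule, the only dangerous path is the final flush, so
callers of console_flush_on_panic() outside of panic() would not be
able to trigger an unsafe takeover.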


> And if those users continue to use legacy consoles, then the risks will
> be the same as now.
>
> >> * Return: The previous priority that needs to be fed into
> >> * the corresponding nbcon_atomic_exit()
> >> * Context: Any context. Disables migration.
> >> + *
> >> + * When within an atomic printing section, no atomic printing occurs. This
> >> + * is to allow all emergency messages to be dumped into the ringbuffer before
> >> + * flushing the ringbuffer.
> >
> > The comment sounds like it is an advantage. But I think that it would be
> > a disadvantage.
>
> Please explain. At LPC2022 we agreed it was an advantage and it is
> implemented on purpose using the atomic printing sections.

I am sorry but I do not remember the details. Do you remember
the motivation, please?

In any case, we can't just say that this works by design
because someone somewhere agreed on it. We must explain
why this is better and I do not see it at the moment.

I am terribly sorry if I agreed with this and I disagree now.
I have never been good at live discussions because there is
not enough time to think through all the consequences.

Anyway, the proposed behavior (agreed on at LPC2022) seems
to contradict the original plan from LPC 2019, see
https://lore.kernel.org/all/87k1acz5rx.fsf@xxxxxxxxxxxxx/
Namely:

--- cut ---
3. Rather than defining emergency _messages_, we define an emergency
_state_ where the kernel wants to flush the messages immediately before
dying. Unlike oops_in_progress, this state will not be visible to
anything outside of the printk infrastructure.

4. When in emergency state, the kernel will use a new console callback
write_atomic() to flush the messages in whatever context the CPU is in
at that moment. Only consoles that implement the NMI-safe write_atomic()
will be able to flush in this state.
--- cut ---

We wanted to flush the messages ASAP.

I wonder if we discussed some limitations where the messages
could not be flushed immediately. Maybe, we discussed a scenario
when there are many pending messages which would delay
flushing the emergency ones. But we need to flush them anyway.

Now, I do not see any real advantage in first storing all messages
and flushing them later in the same context.

OK, flushing them immediately might delay flushing of the
first emergency message. But storing all of them first might cause
the first emergency messages to be overwritten.
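
The overwriting risk can be shown with a toy model. This is not the
printk ringbuffer, just a sequence-number sketch with hypothetical
names (toy_rb, toy_store, toy_flush, toy_emit) and an artificially
small buffer of 4 records. It counts how many records get overwritten
before any console had a chance to print them:

```c
#include <assert.h>

#define TOY_RB_SIZE 4	/* tiny on purpose, so overwriting is easy to hit */

/* Hypothetical ringbuffer model that tracks sequence numbers only. */
struct toy_rb {
	unsigned long head_seq;		/* seq of the next record to store */
	unsigned long flushed_seq;	/* seq of the next record to print */
	unsigned long printed;		/* records that reached the console */
	unsigned long lost;		/* records overwritten before printing */
};

/* Store one record, overwriting the oldest one when the buffer is full. */
static void toy_store(struct toy_rb *rb)
{
	if (rb->head_seq - rb->flushed_seq >= TOY_RB_SIZE) {
		/* The oldest unprinted record is reused and lost. */
		rb->flushed_seq++;
		rb->lost++;
	}
	rb->head_seq++;
}

/* Print everything that is still available in the buffer. */
static void toy_flush(struct toy_rb *rb)
{
	rb->printed += rb->head_seq - rb->flushed_seq;
	rb->flushed_seq = rb->head_seq;
}

/*
 * Emit nr_msgs emergency messages. Either flush after each message or
 * only once at the end of the section. Returns the number of lost records.
 */
static unsigned long toy_emit(unsigned long nr_msgs, int flush_immediately)
{
	struct toy_rb rb = { 0 };
	unsigned long i;

	for (i = 0; i < nr_msgs; i++) {
		toy_store(&rb);
		if (flush_immediately)
			toy_flush(&rb);
	}
	toy_flush(&rb);	/* end of the emergency section */

	/* Every record was either printed or lost. */
	assert(rb.printed + rb.lost == nr_msgs);
	return rb.lost;
}
```

In this model, toy_emit(10, 1) loses nothing, while toy_emit(10, 0)
loses the first 6 records, which are exactly the oldest (and usually
the most interesting) ones. The real buffer is much bigger, of
course, but the direction of the trade-off is the same.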

I hope that my proposed change would actually make things easier
and would not affect the upcoming patchsets that much.

Best Regards,
Petr