Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

From: Dave Young
Date: Mon Oct 16 2023 - 05:00:11 EST


[Added more people in cc]

On 10/08/23 at 12:19pm, John Ogness wrote:
> Hi Petr,
>
> On 2023-10-06, Petr Mladek <pmladek@xxxxxxxx> wrote:
> >> During the demo at LPC2022 we had the situation that there was a large
> >> backlog when a WARN was hit. With current mainline the first line of the
> >> WARN is put into the ringbuffer and then the entire backlog is flushed
> >> before storing the rest of the WARN into the ringbuffer. At the time it
> >> was obvious that we should finish storing the WARN message and then
> >> start flushing the backlog.
> >
> > This talks about the "emergency" context (WARN/OOPS/watchdog).
> > The system might be in big troubles but it would still try to continue.
> >
> > Do we really want to defer the flush also for panic() context?
>
> We can start flushing right after the backtrace is in the
> ringbuffer. But flushing the backlog _before_ putting the backtrace into
> the ringbuffer was not desired because if there is a large backlog, the
> machine may not survive to get the backtrace out. And in that case it
> won't even be in the ringbuffer to be used by other debugging
> tools.
>
> > I ask because I was not on LPC 2022 in person and I do not remember
> > all details.
>
> The LPC2022 demo/talk was recorded:
>
> https://www.youtube.com/watch?v=TVhNcKQvzxI
>
> At 55:55 is where the situation occurred and triggered the conversation,
> ultimately leading to this new feature.
>
> You may also want to reread my summary:
>
> https://lore.kernel.org/lkml/875yheqh6v.fsf@xxxxxxxxxxxxxxxxxxxxx
>
> as well as Linus' follow-up message:
>
> https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@xxxxxxxxxxxxxx
>
> > But it is tricky in panic(), see 8th patch at
> > https://lore.kernel.org/r/20230919230856.661435-9-john.ogness@xxxxxxxxxxxxx
> >
> > + nbcon_atomic_exit() is called only in one code path.
>
> Correct. When panic() is complete and the machine goes into its infinite
> loop. This is also the point where it will attempt an unsafe flush, if
> it could not get the messages out yet.
>
> > + nbcon_atomic_flush_all() is used in other paths. It looks like
> > a "Whack a mole" game to me.
>
> Several different outputs occur during panic(). The flush is everywhere
> where something significant has been put into the ringbuffer and now it
> would be OK to flush it.
>
> > + messages are never emitted by printk kthread either because
> > CPUs are stopped or the kthread is not allowed to get the lock
>
> Correct.
>
> > I see only one positive of the explicit flush. The consoles would
> > not delay crash_exec() and the crash dump might be closer to
> > the point where panic() was called.
>
> It's only about getting the critical messages into the ringbuffer before
> flushing. And since various things can go wrong during the many actions
> within panic(), it makes sense to flush in between those actions.
>
> > Otherwise I see only negatives => IMHO, we want to flush atomic
> > consoles synchronously from printk() in panic().
> >
> > Does anyone really want explicit flushes in panic()?
>
> So far you are the only one speaking against it. I expect as time goes
> on it will get even more complex as it becomes tunable (also something
> we talked about during the demo).

Flush consoles in panic kexec case sounds not good, but I have no
deep understanding about the atomic printk series, added kexec list and
reviewers in cc.

>
> John
>

Thanks
Dave