Re: [PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing

From: Petr Mladek
Date: Fri Sep 22 2023 - 13:41:59 EST

Next message: Alex Deucher: "Re: [PATCH 1/9] drm/amd/pm: Annotate struct smu10_voltage_dependency_table with __counted_by"
Previous message: Doug Anderson: "Re: [RFC PATCH] of: device: Support 2nd sources of probeable but undiscoverable devices"
In reply to: John Ogness: "[PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing"
Next in thread: John Ogness: "[PATCH printk v2 07/11] printk: nbcon: Wire up nbcon into console_flush_all()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed 2023-09-20 01:14:51, John Ogness wrote:
> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>
> Perform nbcon console atomic flushing at key call sites:
>
> nbcon_atomic_exit() - When exiting from the outermost atomic
> printing section.

IMHO, it would not help because it all depends
whether nbcon_context_try_acquire() succeeds or now, see below.

> If the priority was NBCON_PRIO_PANIC,
> attempt a second flush allowing unsafe hostile
> takeovers.

I was first scared that this would be called by any printk()
in panic(). But it seems that nbcon_atomic_exit() is called
only one after disabling printk(). It might deserve a comment
where it is supposed to be used. See a proposal below.

> console_flush_on_panic() - Called from several call sites to
> trigger ringbuffer dumping in an urgent situation.
>
> console_flush_on_panic() - Typically the panic() function will

This is a second description of console_flush_of_panic() which
looks like a mistake.

> take care of atomic flushing the nbcon consoles on
> panic. However, there are several users of
> console_flush_on_panic() outside of panic().

The generic panic() seems to use console_flush_on_panic() correctly
at the very end.

Hmm, I see that console_flush_on_panic() is called also in
arch/loongarch/kernel/reset.c and arch/powerpc/kernel/traps.c.

The loongarch code uses it in halt(). IMHO, it would be
better to use the normal flush. But maybe they accept the risk.
I know nothing about this architecture and who uses it.

The function defined on powerpc is then used in:

+ arch/powerpc/platforms/powernv/opal.c:

IMHO, it should be replaced there by normal flush. It seems
to call the generic panic() later.

+ arch/powerpc/platforms/ps3/setup.c:

This seems to be used as a panic notifier ppc_panic_platform_handler().
I am not completely sure but it does not look like the final flush.

+ arch/powerpc/platforms/pseries/setup.c:

Same as ps3/setup.c. Also "just" a panic notifier.

Anyway, we should make clear that console_flush_on_panic() might break
the system and should be called as the last attempt to flush consoles.
The above arch-specific users are worth review.

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -1092,6 +1092,11 @@ void nbcon_atomic_flush_all(void)
> * Return: The previous priority that needs to be fed into
> * the corresponding nbcon_atomic_exit()
> * Context: Any context. Disables migration.
> + *
> + * When within an atomic printing section, no atomic printing occurs. This
> + * is to allow all emergency messages to be dumped into the ringbuffer before
> + * flushing the ringbuffer.

The comment sounds like it is an advantage. But I think that it would be
a disadvantage.

Fortunately, this is not true. Even the atomic context tries
to flush the messages immediately when it is able to get
the per-cpu lock. It happes via console_flush_all() called
from console_unlock().

> + * The actual atomic printing occurs when exiting
> + * the outermost atomic printing section.
> */
> enum nbcon_prio nbcon_atomic_enter(enum nbcon_prio prio)
> {
> @@ -1130,8 +1135,13 @@ void nbcon_atomic_exit(enum nbcon_prio prio, enum nbcon_prio prev_prio)
> {
> struct nbcon_cpu_state *cpu_state;
>
> + __nbcon_atomic_flush_all(false);

IMHO, this would not help. Either this context was able to acquire the
lock and flush each message directly. Or it would fail to get the lock
even here.

> +
> cpu_state = nbcon_get_cpu_state();
>
> + if (cpu_state->prio == NBCON_PRIO_PANIC)
> + __nbcon_atomic_flush_all(true);

It would deserve a comment that nbcon_atomic_exit() is used
in panic() to calm down the consoles completely, similar effect
as setting the variable "suppress_printk".

> +
> /*
> * Undo the nesting of nbcon_atomic_enter() at the CPU state
> * priority.
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 6ef33cefa4d0..419c0fabc481 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3159,6 +3159,8 @@ void console_flush_on_panic(enum con_flush_mode mode)
> console_srcu_read_unlock(cookie);
> }
>
> + nbcon_atomic_flush_all();
> +
> console_flush_all(false, &next_seq, &handover);

It seems that console_flush_all() tries to flush nbcon conosoles
as well. And nbcon_atomic_flush_all() is explicitely called
close above this flush_on_panic(). This is from panic()
after this patchset is applied.

void panic(const char *fmt, ...)
{
[...]
nbcon_atomic_flush_all();

console_unblank();

/*
* We may have ended up stopping the CPU holding the lock (in
* smp_send_stop()) while still having some valuable data in the console
* buffer. Try to acquire the lock then release it regardless of the
* result. The release will also print the buffers out. Locks debug
* should be disabled to avoid reporting bad unlock balance when
* panic() is not being callled from OOPS.
*/
debug_locks_off();
console_flush_on_panic(CONSOLE_FLUSH_PENDING);

By other words, we try to flush nbcon consoles 3 times almost
immediately after each other. I believe that there is a motivation
to do so. Anyway, I want to make sure that it was on purpose.

It would deserve some comment what is the purpose. Otherwise, people
would tend to remove the "redundant" calls.

> }
>
> @@ -3903,6 +3905,10 @@ void defer_console_output(void)
>
> void printk_trigger_flush(void)
> {
> + migrate_disable();
> + nbcon_atomic_flush_all();
> + migrate_enable();

Can this be actually called in NMI?

> +
> defer_console_output();
> }

This function would deserve some description because it looks strange.
There are three names which are contradicting each other:

+ trigger_flush: it sounds like it tells someone to do the flush

+ nbcon_atomic_flush_all: does the flush immediately

+ defer_console_output: sounds like it prevents the current context
to flush the consoles immediately

We should either choose better names and/or add comments:

/**
* console_flush_or_trigger() - Make sure that the consoles will get flushed.
*
* Try to flush consoles when possible or queue flushing consoles like
* in the deferred printk.
*
* Context: Can be used in any context (including NMI?)
*/
void printk_flush_or_trigger(void)
{
/*
* Try to flush consoles which do not depend on console_lock()
* and support safe atomic_write()
*/
migrate_disable();
nbcon_atomic_flush_all();
migrate_enable();

/* Try flushing legacy consoles or queue the deferred handling. */
if (!in_nmi() && console_trylock())
console_unlock();
else
defer_console_output();
}

All in all. I feel a bit overwhelmed by the many *flush*() functions at
the moment. Some flush only nbcon consoles. Some flush all. Some
are using only the safe takeover/handover and some allow also
harsh takeover. Some ignore port->lock because of bust_spinlock().
Some are ignoring console_lock. They are called on different
locations. The nbcon variants are called explicitly and also
inside the generic *flush*() functions.

It is partly because it is Friday evening. Anyway, I need to think
more about it.

Best Regards,
Petr

Next message: Alex Deucher: "Re: [PATCH 1/9] drm/amd/pm: Annotate struct smu10_voltage_dependency_table with __counted_by"
Previous message: Doug Anderson: "Re: [RFC PATCH] of: device: Support 2nd sources of probeable but undiscoverable devices"
In reply to: John Ogness: "[PATCH printk v2 06/11] printk: nbcon: Wire up nbcon console atomic flushing"
Next in thread: John Ogness: "[PATCH printk v2 07/11] printk: nbcon: Wire up nbcon into console_flush_all()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]