Re: [PATCH printk v2 2/5] printk: Add NMI safety to console_flush_on_panic() and console_unblank()

From: Petr Mladek
Date: Tue Jul 11 2023 - 11:44:17 EST


On Mon 2023-07-10 15:51:21, John Ogness wrote:
> The printk path is NMI safe because it only adds content to the
> buffer and then triggers the delayed output via irq_work. If the
> console is flushed or unblanked on panic (from NMI context) then it
> can deadlock in down_trylock_console_sem() because the semaphore is
> not NMI safe.

<thinking loudly>

Just to be sure. The semaphore is not NMI safe because even the
trylock takes an internal spin lock. Am I right, please?

Alternative solution would be to make down_trylock() NMI safe
by using raw_spin_trylock_irqsave() for the internal lock.

But this actually would not solve the whole problem. If the NMI safe
down_trylock() succeeded then up() would need to be called
in NMI as well. And up() really needs to take the spin lock
which might get blocked in the meantime.


Another question is whether we want to call c->unblank()
in NMI even when down_trylock() was NMI safe. It seems that it
is implemented only for struct console vt_console_driver.
I am pretty sure that it takes more internal locks which
are not NMI safe either.

On the other hand, if we would risk calling c->write() then
we might risk calling c->unblank() either.


Finally, it is not only about NMI. Any locks might cause a deadlock
in panic() in any context. It is because other CPUs are stopped
and might block some locks.

</thinking loudly>


In my opinion, we should handle c->unblank() in panic() the same way
as c->write() in panic().

I suggest to create

void __console_unblank(void)
{
struct console *c;
int cookie;

cookie = console_srcu_read_lock();
for_each_console_srcu(c) {
if ((console_srcu_read_flags(c) & CON_ENABLED) && c->unblank)
c->unblank();
}
console_srcu_read_unlock(cookie);
}

and call this in console_flush_on_panic() without the console_lock().

We still need to take the lock during Oops when the system tries
to continue. In this case, the NMI check makes perfect sense.
NMI might cause a deadlock. Other contexts should be safe
because the CPUs are not stopped.


> Avoid taking the console lock when flushing in panic. To prevent
> other CPUs from taking the console lock while flushing, have
> console_lock() block and console_trylock() fail for non-panic CPUs
> during panic.

I really like the trick that console_lock() and console_trylock()
would start failing on non-panic CPUs. It should prevent races
when the other CPUs were not stopped for some reasons.

I am still slightly afraid to do this even before stopping other CPUs.
But I do not have any real scenario where it might be a problem.
And it is only about console_lock() which should preferably be
available for the panic-CPU. Also we should _not_ rely on the other
CPUs during panic() anyway. So, it should be fine after all.


Well, would you mind to split this into two patches?

1st patch would split __console_unblank(), call it from
console_flush_on_panic() after the trylock().

Also it would add the NMI check into the original
console_unblank() which would still be called in
bust_spinlocks().

The commit message should explain the motivation
(primary the internal spinlock in the semaphore
implementation). Also it should explain why only NMI
is a problem when called in the Oops path.
And why the locks are problem in any context
when called in panic() after CPUs were stopped.


2nd patch would prevent taking console_lock on non-panic CPUs.
And it would remove console_trylock()/console_unlock() from
console_flush_on_panic().

The commit message should explain the motivation
(the internal spinlock in the semaphore code).
Also it would solve a problem with a potential
double unlock. And it should mention that it should
not be worse then before when the trylock() result
was ignored.

IMHO, both patches has a potential to cause regressions.
And it is better to do it in smaller steps.

Best Regards,
Petr