Re: [PATCH printk v2 2/5] printk: Add NMI safety to console_flush_on_panic() and console_unblank()

From: Petr Mladek
Date: Thu Jul 13 2023 - 10:43:27 EST

Next message: Michael Kelley (LINUX): "RE: [PATCHv3 0/3] x86/tdx: Fix one more load_unaligned_zeropad() issue"
Previous message: David Hildenbrand: "Re: [PATCH v1 6/9] selftests/mm: va_high_addr_switch should skip unsupported arm64 configs"
In reply to: John Ogness: "Re: [PATCH printk v2 2/5] printk: Add NMI safety to console_flush_on_panic() and console_unblank()"
Next in thread: Sergey Senozhatsky: "Re: [PATCH printk v2 2/5] printk: Add NMI safety to console_flush_on_panic() and console_unblank()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed 2023-07-12 23:17:49, John Ogness wrote:
> On 2023-07-11, Petr Mladek <pmladek@xxxxxxxx> wrote:
> > Just to be sure. The semaphore is not NMI safe because even the
> > trylock takes an internal spin lock. Am I right, please?
>
> Yes, that is one of the reasons. Sergey mentioned another (waking a task
> on up()).

I see.

> > Alternative solution would be to make down_trylock() NMI safe
> > by using raw_spin_trylock_irqsave() for the internal lock.
>
> NMI contexts are only allowed to take raw spinlocks if those spinlocks
> are only used from NMI context. Otherwise you could have deadlock:
>
> raw_spin_lock()
> --- NMI ---
> raw_spin_lock()
>
> Using a trylock does not avoid the deadlock danger.
>
> > Another question is whether we want to call c->unblank()
> > in NMI even when down_trylock() was NMI safe. It seems that it
> > is implemented only for struct console vt_console_driver.
> > I am pretty sure that it takes more internal locks which
> > are not NMI safe either.
>
> Yes, it does. As an example, it calls mod_timer(), which is also not NMI
> safe. Clearly the unblank() callback must not be called in NMI context.
>
> > Finally, it is not only about NMI. Any locks might cause a deadlock
> > in panic() in any context. It is because other CPUs are stopped
> > and might block some locks.
>
> With the atomic/threaded model this is not true. The port ownership can
> be safely taken over from stopped CPUs.

Right. But it would mean using the special lock also in c->unblank()
code. And it is only tty console which is one of the most complicated
consoles.

> > In my opinion, we should handle c->unblank() in panic() the same way
> > as c->write() in panic().
>
> I do not agree. Clearly unblank() is not NMI safe. Also, in current
> mainline code, console_unblank() will already give up if the trylock
> failed (rather than ignoring the lock, like write() does). So
> console_unblank() might as well also give up if in NMI context.

You are right.

OK, could we at least improve the commit message, please?

Something like:

<proposal>
console_sem() is not NMI safe even when using down_trylock(). Both
down_trylock() and up() are using an interal spinlock. up() might even
call wake_up_process() with another locks.

It is even worse in the panic() code path where the locks might be blocked
by stopped CPUs.

The sepaphore is used in two code paths, in console_unblank() and when
flushing console messages. On the low level, it is needed when calling
c->unblank() and c->write() callbacks.

Both code paths are not safe in panic() but they are handled differently.
c->unblank() is called only when console_trylock() succeeded. c->write()
is called in console_flush_on_panic() even when console_trylock().
The risk of a deadlock in c->write() callbacks is reduced by using trylock()
for the internal locks when oops_in_progess variable is set.

Reduce the risk of deadlocks caused the console semapthore by:

+ bailing out from console_unblank() in NMI
+ not taking the console_sem() in console_flush_on_panic() at all

Simple removal of console_trylock() in console_flush_on_panic() would
cause that other CPUs might still be able to take it and race.
The problem is avoided by checking panic_in_progress() in console_lock()
and console_trylock(). They will never succeed on non-panic CPUs.

The change is a preparation step for introducing printk kthreads and
atomic console write callbacks. It would make the panic() code path
completely safe for consoles without c->unblank() callback.
</proposal>

Wait, the last paragraph is not true. console_trylock() is still
called in console_unblank() in non-NMI context. But the lock
might be blocked by a stopped CPU.

It can be solved by checking whether there is any registered console
with c->unblank() callback first. As a result, console_trylock()
would be called only when a tty console is registered. The panic() path
really might be completely safe where only safe consoles are
registered.

It would make sense to do separate patch for console_unblank()
and console_flush_on_panic().

Of course, we might also improve console_unblank() later. But I
would still like to improve the commit message. You know, the original
commit title and message is talking about NMI. But the patch
has effects even in non-NMI context.

Best Regards,
Petr

Next message: Michael Kelley (LINUX): "RE: [PATCHv3 0/3] x86/tdx: Fix one more load_unaligned_zeropad() issue"
Previous message: David Hildenbrand: "Re: [PATCH v1 6/9] selftests/mm: va_high_addr_switch should skip unsupported arm64 configs"
In reply to: John Ogness: "Re: [PATCH printk v2 2/5] printk: Add NMI safety to console_flush_on_panic() and console_unblank()"
Next in thread: Sergey Senozhatsky: "Re: [PATCH printk v2 2/5] printk: Add NMI safety to console_flush_on_panic() and console_unblank()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]