Re: [resend][bug] low-probability console lockups since 5.19

From: Petr Mladek
Date: Thu Sep 29 2022 - 06:12:48 EST


On Thu 2022-09-29 10:29:05, Conor Dooley wrote:
> On Thu, Sep 29, 2022 at 11:06:01AM +0200, Thorsten Leemhuis wrote:
> > Hi Conor
> >
> > On 28.09.22 18:55, Conor Dooley wrote:
> > > On Fri, Sep 23, 2022 at 05:24:17PM +0100, Conor Dooley wrote:
> > >>
> > >> Been bisecting a bug that is causing a boot failure in my CI & have
> > >> ended up here.. The bug in question is a low(ish) probability lock up
> > >> of the serial console, I would estimate about 1-in-5 chance on the
> > >> boards I could actually trigger it on which it has taken me so long
> > >> to realise that this was an actual problem. Thinking back on it, there
> > >> were other failures that I would retroactively attribute to this
> > >> problem too, but I had earlycon disabled
> >
> > There is one thing I wonder when skimming this thread: was there maybe
> > some other change somewhere in the kernel between the introduction and
> > the revert of the printk console kthreads patches that is the real
> > culprit here that makes existing, older races easier to hit? But I guess
> > in the end that would be very hard to find and it's easier to fix the
> > problem in the console driver... :-/
>
> Entirely possible that something arrived in the middle, yeah. I've done
> 100s of reboots on that interim section, albeit with the threaded
> printers enabled, as I restarted the bisection several times & never hit
> this failure then.

Interesting. I wonder if the console driver in use was fixed during
the window when the kthreads were enabled.

> I don't know anything about console/printk/serial drivers unfortunately
> so I will almost certainly not be able to find the problem by
> inspection. I'd rather submit patches than send reports, but I really
> really need some help here. I looked at the two patterns Petr suggested,
> but the former I am not sure applies since the issue is present even
> when earlycon is disabled & the latter appears (to my untrained eye) to
> be accounted for in the 8250 driver.

The problem with the missing port->lock is visible only when the
early console is enabled. But it is really hard to hit without
the kthreads.

The problem with enabled IRQs was visible only with kthreads. The
original code always called the console->write() callback with IRQs
already disabled.

The kthreads called the console->write() callback with IRQs enabled.
That made sense: IRQs need to be disabled only when really needed,
and the tested drivers did this correctly.
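
To illustrate the difference between the two calling conventions, here
is a toy user-space model. It is not kernel code and every name in it
is made up for the illustration. It only shows why a write() callback
that relies on the caller to disable IRQs misbehaves once the caller
stops doing so, while a callback that saves and restores the IRQ state
itself works either way. A real driver would typically get the same
effect by taking port->lock with spin_lock_irqsave().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all names below are hypothetical, not kernel APIs. */
static bool irqs_enabled = true;  /* models the CPU interrupt flag */
static int unsafe_writes;         /* chars emitted while IRQs were enabled */

/* Models local_irq_save(): disable IRQs and return the previous state. */
static bool model_irq_save(void)
{
	bool was_enabled = irqs_enabled;

	irqs_enabled = false;
	return was_enabled;
}

/* Models local_irq_restore(): put back whatever state was saved. */
static void model_irq_restore(bool was_enabled)
{
	irqs_enabled = was_enabled;
}

/* Models writing one character to the UART; counts unprotected writes. */
static void model_putchar(char c)
{
	(void)c;
	if (irqs_enabled)
		unsafe_writes++;
}

/* A write() callback that assumes the caller already disabled IRQs. */
static void write_assumes_irqs_off(const char *s, size_t n)
{
	for (size_t i = 0; i < n; i++)
		model_putchar(s[i]);
}

/* A write() callback that disables IRQs itself, only for as long as needed. */
static void write_self_protecting(const char *s, size_t n)
{
	bool flags = model_irq_save();

	for (size_t i = 0; i < n; i++)
		model_putchar(s[i]);
	model_irq_restore(flags);
}
```

Calling write_assumes_irqs_off() while irqs_enabled is true models the
kthread case with a driver that does not protect itself: every
character is counted as unsafe. write_self_protecting() counts none
and leaves the IRQ state as it found it.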

Best Regards,
Petr