Re: Serial related oops

From: Russell King
Date: Mon Feb 19 2007 - 09:35:48 EST


On Tue, Feb 20, 2007 at 02:24:42PM +0000, Frederik Deweerdt wrote:
> On Mon, Feb 19, 2007 at 01:45:39PM +0000, Russell King wrote:
> > On Tue, Feb 20, 2007 at 01:29:09PM +0000, Frederik Deweerdt wrote:
> > > (Sorry for the resend, I forgot to cc the list)
> > > Hi Russell,
> > >
> > > It seems that the following change in drivers/serial/8250.c
> > >
> > > +
> > > + /*
> > > + * Do a quick test to see if we receive an
> > > + * interrupt when we enable the TX irq.
> > > + */
> > > + serial_outp(up, UART_IER, UART_IER_THRI);
> > > + lsr = serial_in(up, UART_LSR);
> > > + iir = serial_in(up, UART_IIR);
> > > + serial_outp(up, UART_IER, 0);
> > > +
> > > + if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
> > > + if (!(up->capabilities & UART_BUG_TXEN)) {
> > > + up->capabilities |= UART_BUG_TXEN;
> > > + pr_debug("ttyS%d - enabling bad tx status workarounds\n",
> > > + port->line);
> > > + }
> > > + } else {
> > > + up->capabilities &= ~UART_BUG_TXEN;
> > > + }
> > > +
> > >
> > > that was introduced in 2.6.12[1], is causing oopses on some hardware. In
> > > particular Jose Goncalves reported[2] an oops in 2.6.16.38 reproducible
> >
> > I don't see that. The oops you're referring to is a NULL pointer
> > dereference. The only dereferences the above code does is via
> > 'up' and 'port' both of which are provably always non-null here.
>
> Neither did I, but by adding printk's throughout the function, we narrowed
> the problem to this part of the code. And removing it makes the problem
> go away. We inserted 37 printk's in the function body, and Jose bisected
> those until the problem went away.

Well, there's still little clue about why this is causing a NULL pointer
dereference. The only thing I can think is that somehow performing
this test is causing a power glitch to your CPU, causing its registers
to get corrupted, and which results in it doing a NULL pointer deref.

Are you saying that the NULL pointer dereference occurred while executing
this code? If not, where does it occur?
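A minimal sketch of the decision the quoted probe makes, written as standalone C so the logic can be read without hardware. The register bit values follow <linux/serial_reg.h>. SKETCH_BUG_TXEN, txen_is_broken() and txen_probe() are hypothetical stand-ins, not driver code, and the bit chosen for SKETCH_BUG_TXEN is an assumption. The logic only examines two register values and updates a per-port mask. This is consistent with the point above that the test itself dereferences nothing beyond the port structure.

```c
#include <stdbool.h>

/* 16550 register bits, values as in <linux/serial_reg.h>. */
#define UART_IIR_NO_INT 0x01 /* IIR: no interrupt pending */
#define UART_LSR_TEMT   0x40 /* LSR: transmitter empty */

/* Hypothetical stand-in for UART_BUG_TXEN; bit value assumed. */
#define SKETCH_BUG_TXEN (1u << 1)

/*
 * lsr and iir are the values read back right after UART_IER_THRI
 * was written to IER. If the transmitter is empty yet no interrupt
 * is pending, the THRE interrupt did not fire.
 */
static bool txen_is_broken(unsigned char lsr, unsigned char iir)
{
	return (lsr & UART_LSR_TEMT) && (iir & UART_IIR_NO_INT);
}

/* Returns the port's capability mask with the TX workaround flag set or cleared. */
static unsigned int txen_probe(unsigned int caps,
			       unsigned char lsr, unsigned char iir)
{
	if (txen_is_broken(lsr, iir))
		return caps | SKETCH_BUG_TXEN;
	return caps & ~SKETCH_BUG_TXEN;
}
```

For example, a port that reads back LSR 0x60 (THRE and TEMT set) with IIR 0x01 gets the flag. A port that reads IIR 0x02 (THRE interrupt pending) has it cleared. Because the mask is kept per port, one system can hold a mixture of affected and unaffected ports.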

> > No, it's only runtime because you can't tell which ports might be
> > affected, and you might have a mixture of ports which are affected
> > and those which aren't.
> Hmm, ok. And what about a CONFIG_I_KNOW_MY_SERIAL_IS_BROKEN option?

Andrew's said no (in the thread you refer to) and suggested an
alternative, I've said no, how many more 'no's do you need to turn
you away from the wrong approach?

> > > PS: CCing Andrew and Zang Roy-r61911 as they seemed to discuss this in
> > > http://lkml.org/lkml/2006/6/13/21
> >
> > I don't see any reference to this problem there.
>
> Sorry, I suck, I got that mixed up with this one:
> http://lkml.org/lkml/2006/12/26/63
> "probing for UART_BUG_TXEN in 8250 driver leads to weird effects on some
> ARM boards"

The "weird effects" were never quantified, so that's one of the reasons
I ignored that report (another being that I stopped being the serial
maintainer a while ago, and now serial is maintainerless.)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: