Re: Serial related oops

From: Jose Goncalves
Date: Mon Feb 19 2007 - 11:31:35 EST


Russell King wrote:
> On Tue, Feb 20, 2007 at 02:48:14PM +0000, Frederik Deweerdt wrote:
>
>> (trimmed tie-fei.zang from the CC, added by mistake)
>> On Mon, Feb 19, 2007 at 02:35:20PM +0000, Russell King wrote:
>>
>>>> Neither did I, but introducing printk's through the function, we narrowed
>>>> the problem to this part of the code. And removing it makes the problem
>>>> go away. We inserted 37 printk's in the function body, and Jose bisected
>>>> those until the problem went away.
>>>>
>>> Well, there's still little clue about why this is causing a NULL pointer
>>> dereference. The only thing I can think is that somehow performing
>>> this test is causing a power glitch to your CPU, causing its registers
>>> to get corrupted, and which results in it doing a NULL pointer deref.
>>>
>> That may be the case, indeed.
>>

But if the problem was a power glitch I should get Oops with or without
printk() inserted, shouldn't I?

>>> Are you saying that the NULL pointer occurred while executing this code?
>>> If not, where does the NULL pointer occur?
>>>
>> The thing is, the NULL pointer deref dissapeared as soon as we
>> instrumented (printk'ed) the code. So it's seems to be triggered by
>> check+timing+hardware.
>>
>
> So to summarise, we have some code somewhere which is causing a NULL
> pointer deref in uart_startup(). If we remove some code, the NULL
> pointer deref stops happening.
>
> And that's about the sum total of the information we know. We don't
> know precisely where the NULL pointer deref occurs, and we don't know
> what's causing it.
>
> It doesn't sound like there's much understanding of the problem at hand. ;(
>
>
>>> Andrew's said no (in that the thread you refer to) and suggested an
>>> alternative, I've said no, how many more 'no's do you need to turn
>>> you away from the wrong approach?
>>>
>> One is usually sufficient once I've understood :). I missed the module
>> option approach. Is it ok with you? If yes, I'll put up a patch to do
>> this.
>>
>
> I guess so, but how does the user know whether they need this enabled or
> disabled?
>
>
>> The problem appears to be reproducible on Jose's hardware within 2-3 days.
>>

In a kernel without instrumentation I get problems within a 1 day period.

>> If you see other tests to be performed...
>>
>
> Maybe adding some delays in that bit of code? I'm sure you've already
> thought of that though. Since no one has a proper understanding of the
> problem, the only suggestions possible are mere shots in the dark.
>

I'm no kernel expert, but it's not possible to trace what is the
instruction that is causing the NULL pointer dereference? The kernel
dump does not show this?

I have no clue on what is causing this problem but, what I know, is that
I can always reproduce it, and it always happens in the same code
section of serial8250_startup().

Regards,
José Gonçalves
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/