Re: 2.4.22-pre lockups (now decoded oops for pre10)

From: Willy Tarreau (willy@w.ods.org)
Date: Wed Aug 06 2003 - 07:45:16 EST


> Hm, the hardware may not be that widespread. I guess not many people are really
> using SMP, 64 bit PCI network, 3 GB RAM, 3ware RAID5 and serverworks board
> altogether in one box. I can't fight the impression it has something to do with
> locking issues. It doesn't look exactly like a hardware problem, you would not
> expect crashes on the same type of code then.

Well, it depends... I once had an overclocked CPU which failed in only one
case: a car simulator, where it always crashed on the same race, at exactly the
same position in the lap! I even knew that if I could get past that position,
it was OK for another lap! So I later used that game as a reliability test when
I was not sure about the origin of a crash :-)
It seems that a particular sequence of data and/or code could reliably trigger
the failure, although parallel makes never did.

> The question is: what additional information is needed to find the underlying
> problem?

Perhaps cache poisoning could help. Alan has already used this technique
extensively in the past, and might still have a patch which could apply to your
kernel without too many changes. Alan?

On the other hand, you could also do it by hand, but it's a little harder. You
have to find every place where memory is freed and write a recognizable pattern
just before the free, if possible one which identifies who freed the page.

Then, after the next crash, you can identify who last used the page. That can
sometimes lead you to a driver which is missing a lock, but there is no
guarantee.
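
To make the idea concrete, here is a minimal user-space sketch of poisoning by
hand. It is not Alan's patch and not kernel code: the marker value, the
16-bit site id and all function names (poison_fill, poison_site,
poison_first_damage, poisoned_free) are assumptions made up for illustration.
Each place that frees memory gets its own site id, and the buffer is filled
with a word carrying that id just before the free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical marker: the high 16 bits say "this was freed",
 * the low 16 bits identify the call site that freed it. */
#define POISON_MAGIC 0xDEAD0000u
#define POISON_MASK  0xFFFF0000u

/* Overwrite the whole buffer with a word encoding the freeing site. */
void poison_fill(void *buf, size_t size, uint16_t site_id)
{
    uint32_t word = POISON_MAGIC | site_id;
    unsigned char *p = buf;
    size_t i;

    assert(buf != NULL);
    for (i = 0; i + sizeof(word) <= size; i += sizeof(word))
        memcpy(p + i, &word, sizeof(word));
    if (i < size)                       /* partial word at the tail */
        memcpy(p + i, &word, size - i);
}

/* Return the site id stored in the first word, or -1 if not poisoned. */
int poison_site(const void *buf, size_t size)
{
    uint32_t word;

    if (size < sizeof(word))
        return -1;
    memcpy(&word, buf, sizeof(word));
    if ((word & POISON_MASK) != POISON_MAGIC)
        return -1;
    return (int)(word & 0xFFFFu);
}

/* Return the offset of the first byte that no longer matches the
 * poison pattern, or -1 if the pattern is intact. */
long poison_first_damage(const void *buf, size_t size, uint16_t site_id)
{
    uint32_t word = POISON_MAGIC | site_id;
    const unsigned char *expect = (const unsigned char *)&word;
    const unsigned char *p = buf;
    size_t i;

    for (i = 0; i < size; i++)
        if (p[i] != expect[i % sizeof(word)])
            return (long)i;
    return -1;
}

/* Wrapper to call instead of free() at each call site. */
void poisoned_free(void *buf, size_t size, uint16_t site_id)
{
    poison_fill(buf, size, site_id);
    free(buf);
}
```

In a real kernel you would read the pattern out of the oops or a memory dump
rather than call a checker; the two lookup functions only show what you would
be looking for. A damaged pattern suggests somebody wrote to the memory after
it was freed, and the surviving site id tells you which free path released it.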

Cheers,
Willy



