Re: 2.4.22-pre lockups (now decoded oops for pre10)

From: Stephan von Krawczynski
Date: Wed Aug 13 2003 - 11:08:23 EST


On Wed, 13 Aug 2003 19:30:09 +0400
Oleg Drokin <green@xxxxxxxxxxx> wrote:

> Hello!
>
> On Wed, Aug 13, 2003 at 05:12:24PM +0200, Stephan von Krawczynski wrote:
>
> > Well, that's exactly the reason why I am awaiting some more days of
> > up-and-running ext3. After how many days will you be convinced that a
> > random memory corruption should have hit the ext3 system that bad, that it
> > should have crashed?
>
> Well, I'd prefer that you spend time to figure out at which exact
> 2.4.21-pre version the crashes in reiserfs started to appear. ;)

Well, Oleg, I'd love to, but there is an inherent problem with that. If
I check pre-X and it crashes, everything is fine, because I have a definite
test result. If it does not crash within 3 days, then I have a problem:
how long do I wait before declaring the pre good? It could take months to test
10 pres ... That cannot be the way to find out what is going on.
On the other hand:
- no UP kernel ever crashed. So we can at least talk about an SMP race.
- 2.4.20 does not crash
- 2.4.21 does crash
If we can add "ext3 does not crash" to the list, then I really hope we can use
our brains and come up with a good selection of patches between 2.4.20 and
2.4.21 that may be causing the trouble.
How many suspects do we have? We can at least begin to create a list of things
that went in between .20 and .21, can't we?
If possible I can then back out all of them and retry, so that much less
time is spent on testing.
I mean, have you looked at the length of this thread already?
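
A rough sketch of the arithmetic behind the "how long do I wait" question.
Everything numeric here is an assumption for illustration, not measured data:
it supposes that a bad kernel crashes at random with a mean time of 3 days
(hypothetical), that a 5% chance of wrongly calling a bad kernel good is
acceptable (hypothetical), and that there are 10 candidate versions. The
function names are made up for this example.

```python
import math


def bisect_steps(n_candidates: int) -> int:
    """Worst-case number of kernels to test when bisecting
    n_candidates possible 'first bad' versions."""
    if n_candidates < 1:
        raise ValueError("need at least one candidate")
    # Integer form of ceil(log2(n)), avoids floating-point rounding.
    return (n_candidates - 1).bit_length()


def wait_days(mean_days_to_crash: float, max_false_good: float) -> float:
    """Crash-free days needed so that a bad kernel survives that long
    with probability <= max_false_good.

    Assumes crashes follow a Poisson process, i.e. the time to the
    first crash is exponentially distributed: P(no crash by t) = exp(-t/mean).
    """
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be between 0 and 1")
    return -mean_days_to_crash * math.log(max_false_good)


if __name__ == "__main__":
    steps = bisect_steps(10)           # 10 candidate pre-releases (assumed)
    per_test = wait_days(3.0, 0.05)    # assumed mean 3 days, 5% error
    print(f"bisection steps: {steps}")
    print(f"days per 'good' verdict: {per_test:.1f}")
    print(f"worst-case total days: {steps * per_test:.1f}")
```

Under these assumptions even a bisection, rather than testing every version in
turn, costs several crash-free waits of more than a week each, and the wait
grows linearly with the mean time to crash.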

> > I can add another week if you want me to, just tell me. The only thing I
> > don't want is that any doubts are left after testing ...
>
> It would be interesting to look at fsck results on the fs after some time of
> testing.

You mean I should do an fsck on Sunday?

> Probably it would be easier for you to make it crash (if there is any crash
> possibility at all) if you enable JBD debugging.

I have never seen JBD debugging used in real life. Is it possible to turn it
on when handling >100 GB of data, or will the debug output flood the box?

Regards,
Stephan