Re: Possible dcache BUG

From: Linus Torvalds
Date: Thu Aug 05 2004 - 11:28:33 EST




On Thu, 5 Aug 2004, Denis Vlasenko wrote:
>
> It is not a BUG().

Oh yes it is. The 2.6.8-rc2-mm2 report definitely was a BUG().

Earlier ones may not have been, but on the other hand, earlier ones may
not have had the BUG()-check for corrupted list_del() usage - it's not in
the standard kernel, and I don't know when it was added to -mm. (We used
to have it a _long_ time ago, but then we removed it because there were no
reports of problems).

> It's an oops (dereferencing a d_op pointer with value 0x00000900+14
> IIRC, Gene has complete disassembly with location of that event).

.. and that must be because of some kind of pointer corruption, where the
dentry was either free'd twice or the dentry simply isn't a dentry at all,
it just got to be used as such because of some bug.

> It is not reproducible on request, but happens for him from time
> to time in the same place with the same bogus value of d_op.

I've followed the discussion. You may not have noticed that the last one
was different. (And I _think_ it may hav ebeen the first time Gene did a
-mm kernel, so I do believe that the list_del() debugging was the thing
that caught it).

Anyway, one other thing that makes me worry is the fact that Gene
apparently has a K7. One of the things AMD has gotten wrong several times
is prefetching, and it so happens that the dcache code is one of the users
of the prefetch instruction. prude_dcache() in particular.

So I'm also entertaining the notion that there's an actual prefetch data
corruption, not just the known AMD bug with occasional spurious page
faults. Who else has seen the problem? What CPU's are involved?

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/