Re: 2.4: kernel BUG at inode.c:334!

From: Jaco Kroon
Date: Tue Mar 30 2004 - 09:54:00 EST


Hello

We were having similar problems on two oldish machines (about 5 years old, old prolines) and since the dpt_i2o driver isn't ported officially yet we were stuck with 2.4 for some time before we decided to just stuff it and switch to 2.6 using some patch for dpt_i2o. Now we are still having the same problem, but less regularly - seems to die shortly after we run a addusers script that causes intensive io on /, which is using ext3. Unfortunately the stack traces doesn't get sent to a log file (how can I quickly rig this?) and both machines are production machines -> ie, it goes down and we run for all we are worth to hit that reset button.

I do however have a small machine at home that seem to be giving similar problems, but I'm not sure. I can't get stack traces in this case at all (APM kicks in and I can't get it back out after it crashes). I've now recompiled with full kernel debugging (everything under kernel hacking) and the only thing I get in the kernel logs are ??? suppressed messages from the kernel. It still dies. It also has periods where it just slows down to a stop (doesn't respond to pings for up to a minute at a time). Usually dies whilst compiling (heavy disk io).

One of the production machines and my machine at home currently runs 2.6.4 and the other 2.4.25.

So this seems to be a more general problem (My co-worker suspects ext3 - since this bug report started with xfs that might not be the case). The only pattern we are seeing between all of these is that they serve as nfs servers (but on mine at home it still dies, even when not serving nfs - it still is a nfs client when it dies though), are not the newest and greatest machines and all of them use ext3 as their root file system. Oh, also, usually shortly after, or during, intensive disk io - which match up with what Mika mentioned. I've also tried disabling IO-APIC (which we're not even sure is supported, but APIC is), as well as pre-empting.

We don't suspect nfs on the production machines anymore since we managed to trash the nfs exported dir for about an hour (keeping the server at load average 8.5) which makes use of reiserfs - we might've been lucky though. In almost all the cases these exports are relatively big though, and I noticed there is a problem there as well (We don't get the magical 1000 number quite yet).

Is there anything else I should/can take a look at? Is there any other way in which I can help find the problem? If I can just get somewhere to start ... (The patch below doesn't apply to 2.6 as far as I can see).

Apologies for the essay.

Jaco

Marcelo Tosatti wrote:

On Fri, Mar 26, 2004 at 04:40:00PM +0100, Fredrik Steen wrote:


On [040326 16:20] Marcelo Tosatti <marcelo.tosatti@xxxxxxxxxxxx> wrote:


On Thu, Mar 25, 2004 at 09:32:22AM -0800, Martin J. Bligh wrote:
> http://bugme.osdl.org/show_bug.cgi?id=2367

This is the second bug report of "BUG at inode.c:334" I have seen. The other one reported by Mika Fischer.

Its indeed not valid for I_LOCK or I_FREEING inode's to be on the superblock dirty list. I cannot see how this is happening.

Martin, Mika, can you please apply the attached patch and rerun the tests?

It might give a bit more clue. Thanks.

--- fs/inode.c.orig 2004-03-26 12:30:01.961087616 -0300


[...]

I ran the patch and got this:
inode->i_istate:f
Kernel BUG at inode.c:340!
[...]



Hi Fredik,

It seems Trond already figured it out, we are erroneously moving locked inodes to the dirty list. He attached the following patch in the bugzilla to fix the problem. Can you please give it a try?

--- linux-2.4.26-up/fs/inode.c.orig 2004-03-19 17:12:46.000000000 -0500
+++ linux-2.4.26-up/fs/inode.c 2004-03-26 13:01:23.000000000 -0500
@@ -319,7 +319,8 @@ void refile_inode(struct inode *inode)
if (!inode)
return;
spin_lock(&inode_lock);
- __refile_inode(inode);
+ if (!(inode->i_state & I_LOCK))
+ __refile_inode(inode);
spin_unlock(&inode_lock);
}


===========================================This message and attachments are subject to a disclaimer. Please refer to www.it.up.ac.za/documentation/governance/disclaimer/ for full details.

Hierdie boodskap en aanhangsels is aan 'n vrywaringsklousule onderhewig. Volledige besonderhede is by www.it.up.ac.za/documentation/governance/disclaimer/ beskikbaar.
===========================================

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/