Re: EXT4-fs error, kernel BUG

From: Theodore Ts'o
Date: Tue Aug 05 2014 - 08:51:33 EST

Next message: Matt Fleming: "Re: 3.12 to 3.13 boot regression bisected - still applies to 3.16"
Previous message: Wei Liu: "Re: [PATCH net-next 2/2] xen-netback: Turn off the carrier if the guest is not able to receive"
In reply to: martin f krafft: "EXT4-fs error, kernel BUG"
Next in thread: martin f krafft: "Re: EXT4-fs error, kernel BUG"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Aug 05, 2014 at 12:34:36PM +0200, martin f krafft wrote:
> Dear kernel people,
>
> Yesterday, I encountered something weird on one of our NAS machines:
>
> Aug 4 20:09:40 julia kernel: [342873.007709] EXT4-fs error (device dm-6): ext4_ext_check_inode:481: inode #30414321: comm du: pblk 0 bad header/extent: invalid extent entries - magic f30a, entries 1, max 4(4), depth 0(0)
>
> but a fsck -f of the filesystem revealed no problems.

One likely cause of this issue is that the hardware hiccuped on a
read and returned garbage, which is what triggered the "EXT4-fs
error" message (which is really a report of a detected file system
inconsistency). A common cause of this is the block address getting
corrupted, so that the hard drive read valid data, but from the wrong
location.
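
For readers who want to pick apart the quoted error line, here is a
minimal Python sketch that splits it into fields. The regular
expression is written against the one message quoted above and is an
assumption about its layout; other EXT4-fs messages will not match.
The function name parse_extent_error is hypothetical. 0xF30A is the
ext4 extent header magic, and the values in parentheses are taken here
to be the values the kernel expected (also an assumption).

```python
import re

# ext4 extent header magic; the quoted log line reports "magic f30a".
EXT4_EXT_MAGIC = 0xF30A

# Layout assumed from the single "bad header/extent" line quoted above.
PATTERN = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<function>\w+):(?P<line>\d+): "
    r"inode #(?P<inode>\d+): comm (?P<comm>\S+): "
    r"pblk (?P<pblk>\d+) bad header/extent: (?P<reason>.+?) - "
    r"magic (?P<magic>[0-9a-f]+), entries (?P<entries>\d+), "
    r"max (?P<max>\d+)\((?P<max_expected>\d+)\), "
    r"depth (?P<depth>\d+)\((?P<depth_expected>\d+)\)"
)

INT_FIELDS = ("line", "inode", "pblk", "entries",
              "max", "max_expected", "depth", "depth_expected")


def parse_extent_error(log_line):
    """Return a dict of fields from the log line, or None if it does not match."""
    match = PATTERN.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    fields["magic"] = int(fields["magic"], 16)
    # True when the header magic itself looks sane.
    fields["magic_ok"] = fields["magic"] == EXT4_EXT_MAGIC
    return fields
```

Run against the quoted line, the magic matches and the entry count is
within the maximum, which is consistent with the complaint being about
the extent entries rather than the header fields themselves.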

The other likely cause is that you are using something like RAID1, and
one of the copies of the disk block really is corrupted, and the
kernel read the bad version of the block, but fsck managed to read the
good version of the block.
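
If the mirror happens to be Linux md software RAID (an assumption; the
original report does not say which RAID implementation is in use), the
array exposes a mismatch counter in sysfs that is updated by a "check"
scrub. The sketch below only reads the counter; the array name "md0"
in the usage note and the helper name read_md_mismatch are
hypothetical.

```python
from pathlib import Path


def read_md_mismatch(md_name, sysfs_root="/sys/block"):
    """Return (sync_action, mismatch_cnt) for an md array, or None if unreadable.

    mismatch_cnt is only meaningful after a scrub has run, for example
    after root has written "check" to the array's sync_action file.
    """
    md_dir = Path(sysfs_root) / md_name / "md"
    try:
        action = (md_dir / "sync_action").read_text().strip()
        mismatches = int((md_dir / "mismatch_cnt").read_text().strip())
    except (OSError, ValueError):
        # Not an md array, no permission, or unexpected file contents.
        return None
    return action, mismatches
```

Calling read_md_mismatch("md0") once a check has finished and getting a
non-zero count back would suggest the mirror halves disagree somewhere,
which fits the "kernel read the bad copy, fsck read the good copy"
theory without proving it.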

It's possible that this was caused by a memory corruption, but it
wouldn't have been high on my suspect list. Still, if this is a new
machine, it might not be a bad idea to run memtest86+ for 24-48 hours.

> So I set up another filesystem and tried to copy over the data from
> /dev/dm-6, using tar.
>
> Shortly afterwards, there was a wall message like
>
> BUG: soft lockup - CPU#0 stuck for 23s! [kswapd0:28]

From the stack traces, it looks like the system was thrashing while
trying to free memory to make forward progress (i.e., it was under
high memory pressure). Exactly why this happened is not something I
can determine from the stack traces, sorry. It could be that when the
soft lockup happened, you had more processes running, or that some of
the processes (samba? apache?) were using more memory, and this was a
factor. Why the OOM killer didn't kill the processes I can't tell you.

> Is there anything in the following back traces that would help me
> identify the source of the problem with greater confidence?

Sorry, that's about all that can be divined from your kernel stack
traces.

It might be worth checking the system logs for any suspicious error
messages beyond just the EXT4-fs error message, but you may have done
that already.
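
One way to do that log check is a small filter like the Python sketch
below. The pattern list is illustrative only and is an assumption
about what counts as "suspicious"; it is not an exhaustive or official
list of kernel messages, and the log path in the usage note is a guess
that varies by distribution.

```python
import re

# Illustrative patterns only; extend for the hardware and drivers in use.
SUSPICIOUS = [
    re.compile(r"EXT4-fs (error|warning)"),
    re.compile(r"I/O error"),
    re.compile(r"soft lockup"),
    re.compile(r"out of memory|oom-killer", re.IGNORECASE),
    re.compile(r"\bata\d+(\.\d+)?: .*(error|failed|exception)", re.IGNORECASE),
]


def suspicious_lines(lines):
    """Return the log lines that match at least one pattern, in order."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in SUSPICIOUS)
    ]
```

Typical use would be something like
suspicious_lines(open("/var/log/kern.log")), then reading the hits
around the timestamp of the EXT4-fs error to see whether the disk or
controller complained first.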

Good luck,

- Ted
