Re: [BUG]: Ext2 Corruption in test10pre3 (incl. Oops)

From: Linus Torvalds (torvalds@transmeta.com)
Date: Tue Oct 17 2000 - 13:22:12 EST


Hi,

On Tue, 17 Oct 2000, Udo A. Steinberg wrote:
>
> It seems that we were all wrong in assuming that ext2 was fixed
> wrt. filesystem corruption. test10pre3 once again has the potential
> to eat files (not sure about earlier versions).
>
> I finally managed to capture an oops (by hand), so bear with me that
> I didn't typo anywhere.
>
> Find attached the decoded oops:
>
> Kernel bug at ll_rw_blk.c: 713!

Ok, so it claims that we're doing IO on an unmapped buffer. So far so
good..

> invalid operand: 0000
> CPU: 0
> EIP: 0010:[<c0184546>]
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010282
> eax: 0000001f ebx: 00cc0008 ecx: c4433500 edx: 00000007
> esi: c2c650c0 edi: c02fd160 ebp: 00000000 esp: cd28dd90
> ds: 0018 es: 0018 ss: 0018
> Process netscape (pid: 6456, stackpage=cd28d000)
> Stack: c0250fc5 c0251262 000002c9 c2c650c0 00000008 0000000c 00cc0008
> 0000d00a 00000000 cee143c0 c02fd170 c0300ac0 c02fd178 c02fd170
> 00000000 00000008 00cc0008 00000000 c0183f24 000000fe c0184b84
> c02fd160 00000000 c2c650c0
>
> Call Trace: [<c0250fc5>] [<c0251262>] [<c0183f24>] [<c0184b84>]
> [<c0184d53>] [<c012fa31>] [<c014efde>] [<c014f240>]
> [<c014f6af>] [<c021e87e>] [<c01523af>] [<c0138db7>]
> [<c0138f59>] [<c012d6bb>] [<c012d9d8>] [<c010a9d7>]
> Code: 0f 0b 83 c4 0c 90 0f b7 4e 14 66 89 4c 24 16 0f b6 46 15 8b
>
> >>EIP; c0184546 <__make_request+a6/630> <=====
> Trace; c0183f24 <blk_get_queue+34/50>
> Trace; c0184b84 <generic_make_request+b4/120>
> Trace; c0184d53 <ll_rw_block+163/1e0>
> Trace; c012fa31 <bread+31/70>
> Trace; c014efde <read_inode_bitmap+3e/90>
> Trace; c014f240 <load_inode_bitmap+210/230>
> Trace; c014f6af <ext2_new_inode+29f/700>

[ etc ]

and the above is a perfectly fine backtrace, makes tons of sense, looks
good.

HOWEVER. What doesn't make any sense at all is that bread() calls getblk()
to find the buffer, which in turn certainly makes sure that the buffer it
tries to read is mapped. In fact, there are two paths to the read: one
finds the buffer off the hash queue, and the other creates it. The one
that creates the buffer explicitly marks it BH_Mapped, so the only
apparent source of problems would be the hash queue.

Except for the fact that the only thing that adds buffers to the hash
queue is __insert_into_queues(), and the only thing that calls THAT is
getblk() itself - again after having marked the buffer mapped.

In short, the debug trace looks fine, but it also looks completely
incomprehensible. The only thing that would strike me is
 - memory corruption
 - somebody calls "unmap_buffer()" in a buffer that is hashed. Which we
   used to have as a bug, but we definitely don't do that any more.
 - we have buffer head list corruption going on.

Now, I don't see any recent code that has touched anything like this,
which obviously doesn't mean anything at all. It might be a very old bug
that just hasn't reared its head before now.

Al, do you see anything wrong?

Udo, any idea what you are doing differently than anybody else to see
this thing? Any special usage patterns that seem to bring on the trouble?

                Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Mon Oct 23 2000 - 21:00:11 EST