Re: Possible dcache BUG

From: Marcelo Tosatti
Date: Thu Aug 19 2004 - 14:54:58 EST



Gene,

That is:

/*
 * The buffer's backing address_space's private_lock must be held
 */
static inline void __remove_assoc_queue(struct buffer_head *bh)
{
	BUG_ON(bh->b_assoc_buffers.next == NULL);	<----------
	BUG_ON(bh->b_assoc_buffers.prev == NULL);
	list_del_init(&bh->b_assoc_buffers);
}

Viro, Linus, Andrew, don't you have any idea what could cause such mapping->b_assoc_mapping
corruption?

I can't see how that could be caused by flaky hardware.

Maybe we should include those BUGs into the official kernel, or -mm's tree?


On Thu, Aug 19, 2004 at 05:41:13AM -0400, Gene Heskett wrote:
> On Tuesday 17 August 2004 07:57, Nick Piggin wrote:
> >Gene Heskett wrote:
> >> On Tuesday 17 August 2004 00:58, Nick Piggin wrote:
> >>>Gene Heskett wrote:
> >>>>Reboot time I guess :(((
> >>>
> >>>All your low memory has been used by dentry and inode caches. This
> >>>isn't very interesting because this would be no doubt caused by
> >>>something oopsing while holding the shrinker semaphore as Andrew
> >>>pointed out.
> >>>
> >>>What is interesting is that first Oops message (I wonder if you
> >>>don't have bad hardware though, I don't think anyone else is
> >>>seeing it).
> >>
> >> What 'first Oops message'? One I posted before?
> >
> >Well, the first Oops that your running kernel raises. Usually you
> >don't bother about subsequent oopses and misbehaviour because the
> >first one can cause the system to go into a funny state - this is
> >a prime example.
> >
> >> That comment caused me to go back in the log to well above where I
> >> had been channel surfing with tvtime, and I did find an Oops:
> >>
> >> Aug 16 21:15:46 coyote kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
> >> Aug 16 21:15:46 coyote kernel: printing eip:
> >> Aug 16 21:15:46 coyote kernel: c015c8db
> >> Aug 16 21:15:46 coyote kernel: *pde = 00000000
> >> Aug 16 21:15:46 coyote kernel: Oops: 0002 [#1]
> >> Aug 16 21:15:46 coyote kernel: Modules linked in: tuner tvaudio bttv
> >> video_buf btcx_risc eeprom snd_seq_oss snd_seq_midi_event snd_seq
> >> snd_pcm_oss snd_mixer_oss snd_bt87x snd_intel8x0 snd_ac97_codec
> >> snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi
> >> snd_seq_device snd forcedeth sg
> >> Aug 16 21:15:46 coyote kernel: CPU: 0
> >> Aug 16 21:15:46 coyote kernel: EIP: 0060:[<c015c8db>] Not tainted
> >> Aug 16 21:15:46 coyote kernel: EFLAGS: 00210206 (2.6.8-rc4)
> >> Aug 16 21:15:46 coyote kernel: EIP is at prune_icache+0x6b/0x1b0
> >> Aug 16 21:15:46 coyote kernel: eax: 00000000 ebx: dffe0fd0 ecx: d3eb8b80 edx: c0341660
> >> Aug 16 21:15:46 coyote kernel: esi: dffe0fc8 edi: 0000005a ebp: d3eb8b94 esp: d3eb8b74
> >> Aug 16 21:15:46 coyote kernel: ds: 007b es: 007b ss: 0068
> >> Aug 16 21:15:46 coyote kernel: Process yum (pid: 30892, threadinfo=d3eb8000 task=cf6bf7b0)
> >> Aug 16 21:15:46 coyote kernel: Stack: dffe0448 00000000 00000059 dffe0450 df58d0d0 00000080 00000000 d3eb8000
> >> Aug 16 21:15:46 coyote kernel: d3eb8ba0 c015ca5f 00000080 d3eb8bd4 c0135b14 00000080 000000d2 0108bf00
> >> Aug 16 21:15:46 coyote kernel: 00000000 00021087 00000080 00000000 f7ffea20 0000000a d3eb8c50 00000000
> >> Aug 16 21:15:46 coyote kernel: Call Trace:
> >> Aug 16 21:15:46 coyote kernel: [<c01044ef>] show_stack+0x7f/0xa0
> >> Aug 16 21:15:46 coyote kernel: [<c0104688>] show_registers+0x158/0x1b0
> >> Aug 16 21:15:46 coyote kernel: [<c01047e6>] die+0x66/0xd0
> >> Aug 16 21:15:46 coyote kernel: [<c01109de>] do_page_fault+0x28e/0x548
> >> Aug 16 21:15:46 coyote kernel: [<c010415d>] error_code+0x2d/0x38
> >> Aug 16 21:15:46 coyote kernel: [<c015ca5f>] shrink_icache_memory+0x3f/0x50
> >> Aug 16 21:15:46 coyote kernel: [<c0135b14>] shrink_slab+0x134/0x170
> >> Aug 16 21:15:46 coyote kernel: [<c0136954>] try_to_free_pages+0xa4/0x160
> >> Aug 16 21:15:46 coyote kernel: [<c012fc23>] __alloc_pages+0x1b3/0x320
> >> Aug 16 21:15:46 coyote kernel: [<c0139a8f>] do_anonymous_page+0x5f/0x180
> >> Aug 16 21:15:46 coyote kernel: [<c0139c11>] do_no_page+0x61/0x310
> >> Aug 16 21:15:46 coyote kernel: [<c013a097>] handle_mm_fault+0xd7/0x160
> >> Aug 16 21:15:46 coyote kernel: [<c01108a0>] do_page_fault+0x150/0x548
> >> Aug 16 21:15:46 coyote kernel: [<c010415d>] error_code+0x2d/0x38
> >> Aug 16 21:15:46 coyote kernel: [<c012c279>] do_generic_mapping_read+0x129/0x430
> >> Aug 16 21:15:46 coyote kernel: [<c012c836>] __generic_file_aio_read+0x1b6/0x1f0
> >> Aug 16 21:15:46 coyote kernel: [<c012c8c2>] generic_file_aio_read+0x52/0x70
> >> Aug 16 21:15:46 coyote kernel: [<c0145898>] do_sync_read+0x78/0xa0
> >> Aug 16 21:15:46 coyote kernel: [<c014598a>] vfs_read+0xca/0x140
> >> Aug 16 21:15:46 coyote kernel: [<c0145c2b>] sys_read+0x4b/0x80
> >> Aug 16 21:15:46 coyote kernel: [<c0103f61>] sysenter_past_esp+0x52/0x71
> >> Aug 16 21:15:46 coyote kernel: Code: 89 10 a1 60 16 34 c0 89 58 04 89 03 c7 43 04 60 16 34 c0 89
> >>
> >> yum did a segfault about that time. yum is nice code, when
> >> it fscking works, which is maybe half the time on 2 different
> >> FC2 machines here now.
> >
> >Although an Oops is always the kernel's (or bad hardware's) fault.
> >So in this case you can let yum off the hook :)
> >
> >> So we're back to the dentry_cache thing... Duh, NO!, this is in
> >> prune_icache, not prune_dcache, presumably slightly different.
> >
> >Yeah, both are going to cause cache shrinking to stop working.
> >
> >> As far as bad hardware is concerned, warranty time is running out.
> >> I need something plausible to take back to tcwo as a good reason
> >> for requesting a 'blanket rma' on the whole thing, would they
> >> please send me another.
> >
> >Not too sure really. At this stage keep trying patches that you get
> >sent :P
>
> I just had another, but this one's a bit different:
>
> Aug 19 04:22:11 coyote kernel: ------------[ cut here ]------------
> Aug 19 04:22:11 coyote kernel: kernel BUG at fs/buffer.c:805!
> Aug 19 04:22:11 coyote kernel: invalid operand: 0000 [#1]
> Aug 19 04:22:11 coyote kernel: Modules linked in: eeprom snd_seq_oss
> snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss snd_bt87x
> snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd_page_alloc
> snd_mpu401_uart snd_rawmidi snd_seq_device snd forcedeth sg
> Aug 19 04:22:11 coyote kernel: CPU: 0
> Aug 19 04:22:11 coyote kernel: EIP: 0060:[<c0147d77>] Not
> tainted
> Aug 19 04:22:11 coyote kernel: EFLAGS: 00010246 (2.6.8-rc4)
> Aug 19 04:22:11 coyote kernel: EIP is at
> remove_inode_buffers+0x77/0x90
> Aug 19 04:22:11 coyote kernel: eax: 00000000 ebx: d7de519c ecx:
> d7deb99c edx: d7deb974
> Aug 19 04:22:11 coyote kernel: esi: d7de50c8 edi: 00000001 ebp:
> c198bedc esp: c198becc
> Aug 19 04:22:11 coyote kernel: ds: 007b es: 007b ss: 0068
> Aug 19 04:22:11 coyote kernel: Process kswapd0 (pid: 66,
> threadinfo=c198b000 task=c1978050)
> Aug 19 04:22:11 coyote kernel: Stack: d7de50c8 d7de50d0 d7de50c8
> 00000057 c198bf04 c015c985 d7de50c8 00000000
> Aug 19 04:22:11 coyote kernel: 00000057 d7de5290 e50ac0d0
> 00000080 00000000 c198b000 c198bf10 c015ca5f
> Aug 19 04:22:11 coyote kernel: 00000080 c198bf44 c0135b14
> 00000080 000000d0 01779600 00000000 0002d1f3
> Aug 19 04:22:11 coyote kernel: Call Trace:
> Aug 19 04:22:11 coyote kernel: [<c01044ef>] show_stack+0x7f/0xa0
> Aug 19 04:22:11 coyote kernel: [<c0104688>]
> show_registers+0x158/0x1b0
> Aug 19 04:22:11 coyote kernel: [<c01047e6>] die+0x66/0xd0
> Aug 19 04:22:12 coyote kernel: [<c0104bc3>] do_invalid_op+0xb3/0xc0
> Aug 19 04:22:12 coyote kernel: [<c010415d>] error_code+0x2d/0x38
> Aug 19 04:22:12 coyote kernel: [<c015c985>] prune_icache+0x115/0x1b0
> Aug 19 04:22:12 coyote kernel: [<c015ca5f>]
> shrink_icache_memory+0x3f/0x50
> Aug 19 04:22:12 coyote kernel: [<c0135b14>] shrink_slab+0x134/0x170
> Aug 19 04:22:12 coyote kernel: [<c0136bb9>] balance_pgdat+0x1a9/0x1f0
> Aug 19 04:22:12 coyote kernel: [<c0136cbf>] kswapd+0xbf/0xd0
> Aug 19 04:22:12 coyote kernel: [<c01023f1>]
> kernel_thread_helper+0x5/0x14
> Aug 19 04:22:12 coyote kernel: Code: 0f 0b 25 03 e5 0b 30 c0 eb c4 31
> ff eb de 0f 0b 36 04 e5 0b
>
> The system is still up but it's 100 megs into swap, so I'm going to
> reboot without changing anything. Is this one traceable?