Re: [OOPS] amrestore dies in kmem_cache_free 2.6.16.18 - cannot restore backups!

From: Mike Christie
Date: Sat May 27 2006 - 11:22:37 EST


Chuck Ebbert wrote:
>
> On Tue, 23 May 2006 18:24:14 -0700, James Lamanna wrote:
>
>> So I was able to recreate this problem on a vanilla 2.6.16.18 with the
>> following oops..
>> I'd say this is a serious regression since I cannot restore backups
>> anymore (I could with 2.6.14.x, but that kernel series had other
>> issues...)
>
>> Unable to handle kernel paging request at ffff82bc81000030 RIP: <ffffffff801657d9>{kmem_cache_free+82}
>> PGD 0
>> Oops: 0000 [1] SMP
>> CPU 1
>> Modules linked in:
>> Pid: 5814, comm: amrestore Not tainted 2.6.16.18 #2
>> RIP: 0010:[<ffffffff801657d9>] <ffffffff801657d9>{kmem_cache_free+82}
>> RSP: 0018:ffff81007d4afcd8 EFLAGS: 00010086
>> RAX: ffff82bc81000000 RBX: ffff81004119d800 RCX: 000000000000001e
>> RDX: ffff81000000c000 RSI: 0000000000000000 RDI: 00000007f0000000
>> RBP: ffff81007ff0c800 R08: 0000000000000000 R09: 0000000000000400
>> R10: 0000000000000000 R11: ffffffff8014b3d6 R12: ffff810041311480
>> R13: 0000000000000400 R14: 0000000000000400 R15: ffff81007e676748
>> FS: 00002b7f39708020(0000) GS:ffff810041173bc0(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> CR2: ffff82bc81000030 CR3: 000000007de09000 CR4: 00000000000006e0
>> Process amrestore (pid: 5814, threadinfo ffff81007d4ae000, task ffff81007e2f8ae0)
>> Stack: 0000000000000000 0000000000000246 ffff8100413c9bc0 ffff81007ff0c800
>> ffff8100413c9bc0 ffffffff8016dfdc ffff8100413c9bc0 ffff81007fe25408
>> 00000000ffffffea ffffffff803187e7
>> Call Trace: <ffffffff8016dfdc>{bio_free+48} <ffffffff803187e7>{scsi_execute_async+640}
>> <ffffffff8035d8d2>{st_do_scsi+422} <ffffffff8035d6e2>{st_sleep_done+0}
>> <ffffffff80362950>{st_read+855} <ffffffff8013e1ca>{autoremove_wake_function+0}
>> <ffffffff80169d7c>{vfs_read+171} <ffffffff8016a0af>{sys_read+69}
>> <ffffffff8010a93e>{system_call+126}
>>
>> Code: 48 8b 48 30 0f b7 51 28 65 8b 04 25 30 00 00 00 39 c2 0f 84
>> RIP <ffffffff801657d9>{kmem_cache_free+82} RSP <ffff81007d4afcd8>
>> CR2: ffff82bc81000030
>
> First of all, to really see what is happening you need to recompile your kernel
> after adding some debug options:
>
> Kernel Hacking --->
> [*] Kernel debugging
> [*] Debug memory allocations
> [*] Compile the kernel with frame pointers
>
> (Frame pointers won't give an exact trace but they'll prevent the tail merging
> that makes it so hard to follow.)
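(For reference, and assuming I'm matching the prompts to the right 2.6.16 Kconfig symbols -- worth double-checking in menuconfig -- that should amount to:

    CONFIG_DEBUG_KERNEL=y
    CONFIG_DEBUG_SLAB=y
    CONFIG_FRAME_POINTER=y

CONFIG_DEBUG_SLAB poisons freed objects and checks red zones, so a double free or use-after-free of the bio should then trip much closer to the real culprit.)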
>
> Then reproduce the error and send the oops and any new error messages you see.
> Don't send the whole boot log and .config again -- we have them already.
>
> The bug is happening here, in __cache_free, in code that's only included
> on NUMA machines:
>
> static inline void __cache_free(struct kmem_cache *cachep, void *objp)
> {
>         struct array_cache *ac = cpu_cache_get(cachep);
>
>         check_irq_off();
>         objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
>
>         /* Make sure we are not freeing a object from another
>          * node to the array cache on this cpu.
>          */
> #ifdef CONFIG_NUMA
>         {
>                 struct slab *slabp;
>                 slabp = virt_to_slab(objp);                     <==== OOPS
>                 if (unlikely(slabp->nodeid != numa_node_id())) {
>                         struct array_cache *alien = NULL;
>                         int nodeid = slabp->nodeid;
>
>
> Tracing through the nested inline functions, we have:
>
> static inline struct slab *virt_to_slab(const void *obj)
> {
>         struct page *page = virt_to_page(obj);
>         return page_get_slab(page);                     <==== OOPS
> }
>
> static inline struct slab *page_get_slab(struct page *page)
> {
>         return (struct slab *)page->lru.prev;           <==== OOPS
> }
>
>
> virt_to_page() returned a struct page * that pointed to unmapped memory.
>
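That matches the registers too, if I'm reading the oops right: CR2 (ffff82bc81000030) is a small offset from RAX (ffff82bc81000000), i.e. the fault is the read of a field of that bogus struct page. A rough sketch of the address arithmetic involved -- simplified, since the real NUMA configs go through pfn_to_page() for whichever memory model is selected rather than a flat mem_map:

/*
 * Illustration only: how a garbage object pointer turns into a garbage
 * struct page pointer.  The names mirror the kernel's, but the types and
 * constants here are simplified stand-ins, not the 2.6.16 definitions.
 */
#define PAGE_SHIFT   12
#define PAGE_OFFSET  0xffff810000000000UL       /* x86-64 direct-mapping base */

struct page {
        struct { struct page *next, *prev; } lru;  /* page_get_slab() reads lru.prev */
        /* ... the real struct page has more fields ... */
};

extern struct page mem_map[];                   /* flat page array, for illustration */

static struct page *virt_to_page_sketch(const void *obj)
{
        /* __pa(obj) is only meaningful if obj lies inside the direct mapping */
        unsigned long pfn = ((unsigned long)obj - PAGE_OFFSET) >> PAGE_SHIFT;

        /*
         * If obj is stale or was overwritten, pfn is garbage, so the struct
         * page computed from it points at unmapped memory and the later read
         * of page->lru.prev in page_get_slab() is what faults.
         */
        return &mem_map[pfn];
}

So the slab code is almost certainly the victim here rather than the culprit: something handed kmem_cache_free() a pointer that is not (or is no longer) a valid slab object.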
>
> This all came from scsi_execute_async, possibly through this path:
>
> scsi_execute_async
> scsi_rq_map_sg: some kind of error occurred?
> bio_endio
> bio->bi_end_io ==> scsi_bi_end_io
> bio_put
> bio->bi_destructor ==> bio_fs_destructor
> bio_free
> mempool_free
> kmem_cache_free
>
> scsi_execute_async and scsi_rq_map_sg were rewritten last December, so they
> may have new bugs.
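
For anyone following along, the tail of that chain is just a reference count plus a destructor callback. Roughly -- this is a paraphrase from memory against the 2.6.16 definitions, not verbatim fs/bio.c, so check the real source:

/*
 * Sketch of the release path named above (renamed *_sketch to make clear
 * it is a paraphrase, not the kernel's own definition).
 */
#include <linux/bio.h>

static void bio_put_sketch(struct bio *bio)
{
        /* dropping the last reference runs the destructor ... */
        if (atomic_dec_and_test(&bio->bi_cnt))
                bio->bi_destructor(bio);        /* bio_fs_destructor for a bio_alloc() bio */
}

/*
 * ... and bio_fs_destructor() calls bio_free(), which hands the biovec and
 * the bio itself back to their mempools; with the pools above their reserve,
 * both mempool_free() calls fall through to kmem_cache_free(), which is
 * where this oops fires.
 */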
>
>

Sorry for the late reply. I have been traveling.

Maybe I messed up on the bounce code usage. Are you using st's direct IO
feature?