Re: iommu_alloc failure and panic

From: Mark Haverkamp
Date: Fri Jan 27 2006 - 18:59:11 EST

Next message: Ravikiran G Thirumalai: "Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated"
Previous message: Andrew Morton: "Re: [Patch] sched: new sched domain for representing multi-core"
In reply to: Anton Blanchard: "Re: iommu_alloc failure and panic"
Next in thread: Anton Blanchard: "Re: iommu_alloc failure and panic"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, 2006-01-28 at 10:48 +1100, Anton Blanchard wrote:
> Hi,
>
> > I have been testing large numbers of storage devices. I have 16000 scsi
> > LUNs configured. (4000 fiberchannel LUNS seen 4 times). They are
> > configured as 4000 multipath devices using device mapper. I have 100 of
> > those devices configured as a logical volume using LVM2. Once I start
> > bonnie++ using one of those logical volumes I start seeing iommu_alloc
> > failures and then a panic. I don't have this issue when accessing the
> > individual devices or the individual multipath devices. Only when
> > conglomerated into a logical volume.
> >
> > Here is the syslog output:
> >
> > Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000000d7967000 npages 10
>
> This stuff should be OK since the lpfc driver should handle the iommu
> filling up. Im guessing since you have so many LUNs you can get a whole
> lot of outstanding IO, enough to fill up the TCE table.
>
> > DAR: C000000600711710
>
> > NIP [C00000000000F7D0] .validate_sp+0x30/0xc0
> > LR [C00000000000FA2C] .show_stack+0xec/0x1d0
> > Call Trace:
> > [C0000000076DBBC0] [C00000000000FA18] .show_stack+0xd8/0x1d0 (unreliable)
> > [C0000000076DBC60] [C000000000433838] .schedule+0xd78/0xfb0
> > [C0000000076DBDE0] [C000000000079FC0] .worker_thread+0x1b0/0x1c0
>
> This is interesting, an oops in validate_sp which is a pretty boring
> function. We have seen this in the past when you overflow your kernel
> stack. The bottom of the kernel stack contains the threadinfo struct and
> in that we have the task_cpu() data.
>
> When we overflow the stack and overwrite the threadinfo we end up with
> a crazy number for task_cpu() and then oops when doing
> hardirq_ctx[task_cpu(p)]. Can you turn on DEBUG_STACKOVERFLOW and
> DEBUG_STACK_USAGE and see if you get any stack overflow warnings?

It looks like they are already on:

CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACK_USAGE=y

> The
> second option will also allow you to do sysrq t and dump the most stack
> each process has used at that point.

The machine is frozen after the panic.

Mark.

>
> Anton
--
Mark Haverkamp <markh@xxxxxxxx>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Ravikiran G Thirumalai: "Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated"
Previous message: Andrew Morton: "Re: [Patch] sched: new sched domain for representing multi-core"
In reply to: Anton Blanchard: "Re: iommu_alloc failure and panic"
Next in thread: Anton Blanchard: "Re: iommu_alloc failure and panic"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]