Re: 2.6.24-rc8-mm1 Kernel oops will running kernbench

From: Kamalesh Babulal
Date: Fri Jan 18 2008 - 04:35:22 EST


Paul Mackerras wrote:
> Andrew Morton writes:
>
>> On Fri, 18 Jan 2008 14:06:00 +0530 Kamalesh Babulal <kamalesh@xxxxxxxxxxxxxxxxxx> wrote:
>>
>>> Hi Andrew,
>>>
>>> Following oops was seen while running kernbench on one of test machine
>>> (power4+ box). I tried reproducing the oops but was unsuccessful.
>>> I will try to reproduce the oops with debug info compiled.
>>>
>>>
>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>> SMP NR_CPUS=32 NUMA pSeries
>>> Modules linked in:
>>> NIP: 0000000000004570 LR: 000000000fc42dc0 CTR: 0000000000000000
>>> REGS: c00000077b6bf8c0 TRAP: 0300 Not tainted (2.6.24-rc8-mm1-autotest)
>>> MSR: 8000000000001000 <ME> CR: 28022422 XER: 00000000
>>> DAR: c00000077b6bfce0, DSISR: 000000000a000000
>>> TASK = c000000773164c40[19588] 'as' THREAD: c00000077b6bc000 CPU: 1
>>> GPR00: 0000000000004000 c00000077b6bfb40 0000000000007346 000000000000d032
>>> GPR04: 000000000000043a 0000000000000000 000000000000000c 0000000000000004
>>> GPR08: 000000000fd278c8 0000000048022424 c00000077b6bfe30 0000998be2321500
>>> GPR12: 8000000000001030 c0000000005f6280 0000000010030000 0000000010030000
>>> GPR16: 0000000010030000 0000000010050000 000000001006aac0 0000000010053cd0
>>> GPR20: 0000000000000000 0000000000000fe0 0000000010050000 0000000010050000
>>> GPR24: 0000000000000ff8 0000000000000fe8 0000000000000062 000000000fd27490
>>> GPR28: 000000000fd274c8 0000000010099420 000000000fd25ff4 000000001009a400
>>> NIP [0000000000004570] 0x4570
>>> LR [000000000fc42dc0] 0xfc42dc0
>>> Call Trace:
>>> [c00000077b6bfb40] [c00000077b292000] 0xc00000077b292000 (unreliable)
>>> Instruction dump:
>>> 48000000 XXXXXXXX XXXXXXXX XXXXXXXX 41820008 XXXXXXXX XXXXXXXX XXXXXXXX
>>> 48000010 XXXXXXXX XXXXXXXX XXXXXXXX f92101a0 XXXXXXXX XXXXXXXX XXXXXXXX
>>>
>> odd. Where did the stack trace go?

Only this much was captured in the serial console.
>
> It's there, it's just really really short (one line). The link
> register is in userspace and the stack pointer looks to be right at
> the top of a kernel stack area.
>
> The trap was a data access exception which is very odd given that the
> machine is in real mode (MMU off) with the pc at 0x4570. Actually it
> looks like the machine probably got a data access exception somewhere
> (probably in userspace, probably a page fault or similar) and then got
> another exception before it had finished saving the state from the
> first exception.
>
> Kamalesh, do you still have the vmlinux? If so could you disassemble
> the area from say 0x4500 to 0x4600, and find out what is the closest
> symbol before 0xc000000000004570 from System.map, and show us those?
>
> Paul.
> --
I tried reproducing the problem and was successful with following trace
in which the pc is at 0x4570 as the above one

Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in:
NIP: 0000000000004570 LR: 000000000ff0288c CTR: 000000000ff013e0
REGS: c00000077e61f8c0 TRAP: 0300 Not tainted (2.6.24-rc8-mm1-autotest)
MSR: 8000000000001000 <ME> CR: 28000422 XER: 00000000
DAR: c00000077e61fce0, DSISR: 000000000a000000
TASK = c00000077207f880[23480] 'cc1' THREAD: c00000077e61c000 CPU: 3
GPR00: 0000000000004000 c00000077e61fb40 0000000000000088 000000000000d032
GPR04: 0000000000000088 000000000000030c 00000000fefefeff 000000007f7f7f7f
GPR08: 0000000000008000 0000000044000428 c00000077e61fe30 0000998be2321500
GPR12: 8000000000001030 c0000000005f6680 0000000010030000 0000000010030000
GPR16: 00000000105b0000 00000000105b0000 0000000010440000 00000000105b0000
GPR20: 00000000105b0000 00000000105b0000 00000000105b0000 00000000105b0000
GPR24: 00000000105b0000 00000000105b0000 00000000105b0000 00000000ffa11b24
GPR28: 0000000000000000 00000000ffffffff 000000000ffebff4 000000000ffec408
NIP [0000000000004570] 0x4570
LR [000000000ff0288c] 0xff0288c
Call Trace:
[c00000077e61fb40] [c00000077e61fcf0] 0xc00000077e61fcf0 (unreliable)
[c00000077e61fbd0] [0000000010440000] 0x10440000
Instruction dump:
48000000 XXXXXXXX XXXXXXXX XXXXXXXX 41820008 XXXXXXXX XXXXXXXX XXXXXXXX
48000010 XXXXXXXX XXXXXXXX XXXXXXXX f92101a0 XXXXXXXX XXXXXXXX XXXXXXXX

The disassembled vmlinux from 0x4500 to 0x4600

c000000000004500: f9 4d 01 68 std r10,360(r13)
c000000000004504: 48 02 89 f9 bl c00000000002cefc <.slb_allocate_realmode>
c000000000004508: e9 4d 01 68 ld r10,360(r13)
c00000000000450c: e8 6d 01 60 ld r3,352(r13)
c000000000004510: 81 2d 01 5c lwz r9,348(r13)
c000000000004514: 7d 48 03 a6 mtlr r10
c000000000004518: 71 8a 00 02 andi. r10,r12,2
c00000000000451c: 41 82 00 28 beq- c000000000004544 <unrecov_slb>
c000000000004520: 7d 38 01 20 mtocrf 128,r9
c000000000004524: 7d 30 11 20 mtocrf 1,r9
c000000000004528: e9 2d 01 20 ld r9,288(r13)
c00000000000452c: e9 4d 01 28 ld r10,296(r13)
c000000000004530: e9 6d 01 30 ld r11,304(r13)
c000000000004534: e9 8d 01 38 ld r12,312(r13)
c000000000004538: e9 ad 01 40 ld r13,320(r13)
c00000000000453c: 4c 00 00 24 rfid
c000000000004540: 48 00 00 00 b c000000000004540 <.slb_miss_realmode+0x48>

c000000000004544 <unrecov_slb>:
c000000000004544: 71 8a 40 00 andi. r10,r12,16384
c000000000004548: 7c 2a 0b 78 mr r10,r1
c00000000000454c: 38 21 fd 10 addi r1,r1,-752
c000000000004550: 41 82 00 08 beq- c000000000004558 <unrecov_slb+0x14>
c000000000004554: e8 2d 01 a8 ld r1,424(r13)
c000000000004558: 2c a1 00 00 cmpdi cr1,r1,0
c00000000000455c: 40 84 00 08 bge- cr1,c000000000004564 <unrecov_slb+0x20>
c000000000004560: 48 00 00 10 b c000000000004570 <unrecov_slb+0x2c>
c000000000004564: 38 20 41 00 li r1,16640
c000000000004568: b0 2d 01 c8 sth r1,456(r13)
c00000000000456c: 4b ff fb 18 b c000000000004084 <bad_stack>
c000000000004570: f9 21 01 a0 std r9,416(r1)
c000000000004574: f9 61 01 70 std r11,368(r1)
c000000000004578: f9 81 01 78 std r12,376(r1)
c00000000000457c: f9 41 00 00 std r10,0(r1)
c000000000004580: f8 01 00 70 std r0,112(r1)
c000000000004584: f9 41 00 78 std r10,120(r1)
c000000000004588: 41 82 00 24 beq- c0000000000045ac <unrecov_slb+0x68>
c00000000000458c: 7d 35 4a a6 mfspr r9,309
c000000000004590: 7d 2c 42 e6 mftb r9
c000000000004594: e9 4d 01 e0 ld r10,480(r13)
c000000000004598: f9 2d 01 e0 std r9,480(r13)
c00000000000459c: 7d 4a 48 50 subf r10,r10,r9
c0000000000045a0: e9 2d 01 d0 ld r9,464(r13)
c0000000000045a4: 7d 29 52 14 add r9,r9,r10
c0000000000045a8: f9 2d 01 d0 std r9,464(r13)
c0000000000045ac: f8 41 00 80 std r2,128(r1)
c0000000000045b0: f8 61 00 88 std r3,136(r1)
c0000000000045b4: f8 81 00 90 std r4,144(r1)
c0000000000045b8: f8 a1 00 98 std r5,152(r1)
c0000000000045bc: f8 c1 00 a0 std r6,160(r1)
c0000000000045c0: f8 e1 00 a8 std r7,168(r1)
c0000000000045c4: f9 01 00 b0 std r8,176(r1)
c0000000000045c8: e9 2d 01 20 ld r9,288(r13)
c0000000000045cc: e9 4d 01 28 ld r10,296(r13)
c0000000000045d0: f9 21 00 b8 std r9,184(r1)
c0000000000045d4: f9 41 00 c0 std r10,192(r1)
c0000000000045d8: e9 2d 01 30 ld r9,304(r13)
c0000000000045dc: e9 4d 01 38 ld r10,312(r13)
c0000000000045e0: e9 6d 01 40 ld r11,320(r13)
c0000000000045e4: f9 21 00 c8 std r9,200(r1)
c0000000000045e8: f9 41 00 d0 std r10,208(r1)
c0000000000045ec: f9 61 00 d8 std r11,216(r1)
c0000000000045f0: e8 4d 00 10 ld r2,16(r13)
c0000000000045f4: 7d 28 02 a6 mflr r9
c0000000000045f8: f9 21 01 90 std r9,400(r1)
c0000000000045fc: 7d 49 02 a6 mfctr r10
c000000000004600: f9 41 01 88 std r10,392(r1)

The closet symbol form the system.map file

c000000000004280 T data_access_common
c000000000004400 T instruction_access_common
c0000000000044f8 T .slb_miss_realmode
c000000000004544 t unrecov_slb
c000000000004680 T hardware_interrupt_common

Let me know if you need more information.
--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/