[BUG JFS] fs/jfs/jfs_txnmgr.c:940 assert(mp->nohomeok > 0) and fs/jfs/jfs_metapage.c:752assert(mp->count)

From: Mihai Moldovan
Date: Mon Aug 01 2011 - 09:18:56 EST


Hi,

yesterday while compiling LibreOffice on my amd64 box running Linux
3.0.0, those bugs happened:

[692827.021788] BUG at fs/jfs/jfs_txnmgr.c:940 assert(mp->nohomeok > 0)
[692827.021804] ------------[ cut here ]------------
[692827.021821] kernel BUG at fs/jfs/jfs_txnmgr.c:940!
[692827.021832] invalid opcode: 0000 [#1] SMP
[692827.021846] CPU 0
[692827.021851] Modules linked in: xt_state xt_dscp xt_hashlimit
xt_conntrack xt_DSCP ath9k i915 mac80211 ath9k_common ath9k_hw
drm_kms_helper ath cfg80211 cfbcopyarea e1000e video rfkill backlight
cfbimgblt cfbfillrect
[692827.021934]
[692827.021940] Pid: 697, comm: jfsCommit Not tainted 3.0.0-OSS4.2
#1 /DQ45CB
[692827.021967] RIP: 0010:[<ffffffff81252835>] [<ffffffff81252835>]
txUnlock+0xc2/0x21a
[692827.021991] RSP: 0018:ffff880231901e20 EFLAGS: 00010286
[692827.022003] RAX: 000000000000004e RBX: ffffc900080aef00 RCX:
000000000000ef51
[692827.022021] RDX: 0000000000000000 RSI: 0000000000000046 RDI:
ffffffff81ef0e60
[692827.022038] RBP: ffff880231901e60 R08: 0000000000000046 R09:
00000000fffffffe
[692827.022054] R10: ffff8802b1901d3f R11: 0000000000000000 R12:
ffff8801a8fca380
[692827.022070] R13: ffff88022e05b600 R14: ffffc900081db3e0 R15:
000000005063059c
[692827.022087] FS: 0000000000000000(0000) GS:ffff88023bc00000(0000)
knlGS:0000000000000000
[692827.022105] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[692827.022118] CR2: 00007f98337ce000 CR3: 0000000003f36000 CR4:
00000000000406f0
[692827.022135] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[692827.022152] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[692827.022168] Process jfsCommit (pid: 697, threadinfo
ffff880231900000, task ffff880232746140)
[692827.022185] Stack:
[692827.022191] 2e84880231901e60 ffff88022e05b718 ffffffff816a41f3
ffff880232746140
[692827.022214] ffff880231901e80 ffff88023332b300 cccccccccccccccd
ffffc900080aef00
[692827.022238] ffff880231901ee0 ffffffff81254de7 0000000000010bc0
ffff88022e05b600
[692827.022261] Call Trace:
[692827.022274] [<ffffffff816a41f3>] ? _raw_spin_lock_irqsave+0xd/0x2a
[692827.022290] [<ffffffff81254de7>] jfs_lazycommit+0x13d/0x24e
[692827.022305] [<ffffffff8103c53e>] ? try_to_wake_up+0x1c5/0x1c5
[692827.022318] [<ffffffff81254caa>] ? txCommit+0x33b/0x33b
[692827.022332] [<ffffffff810595da>] kthread+0x7d/0x85
[692827.022346] [<ffffffff816aa614>] kernel_thread_helper+0x4/0x10
[692827.022361] [<ffffffff8105955d>] ? kthread_worker_fn+0x137/0x137
[692827.022376] [<ffffffff816aa610>] ? gs_change+0xb/0xb
[692827.022387] Code: ff 41 8b 44 24 68 85 c0 7f 23 48 c7 c1 df 5a b1 81
ba ac 03 00 00 48 c7 c6 cb 5a b1 81 48 c7 c7 3a 4c b1 81 31 c0 e8 69 fd
44 00 <0f> 0b ff c8 41 89 44 24 68 85 c0 75 0a 49 8b 7c 24 58 e8 9b 68
[692827.022519] RIP [<ffffffff81252835>] txUnlock+0xc2/0x21a
[692827.022533] RSP <ffff880231901e20>
[692827.073901] ---[ end trace b4cb55fb5d49efbd ]---
[693433.340865] ERROR: (device md1): XT_GETPAGE: xtree page corrupt
[693433.366922] ERROR: (device md1): XT_GETPAGE: xtree page corrupt
[693433.366925] ERROR: (device md1): XT_GETPAGE: xtree page corrupt
[693433.366931] ERROR: (device md1): XT_GETPAGE: xtree page corrupt
[693433.366933] ERROR: (device md1): XT_GETPAGE: xtree page corrupt
[693433.366935] ERROR: (device md1): XT_GETPAGE: xtree page corrupt
[693433.366937] ERROR: (device md1): XT_GETPAGE: xtree page corrupt
[693433.366939] ERROR: (device md1): XT_GETPAGE: xtree page corrupt
[693433.366941] ERROR: (device md1): XT_GETPAGE: xtree page corrupt
[693433.366943] ERROR: (device md1): XT_GETPAGE: xtree page corrupt
[693433.366945] BUG at fs/jfs/jfs_metapage.c:752 assert(mp->count)
[693433.366955] ------------[ cut here ]------------
[693433.366970] kernel BUG at fs/jfs/jfs_metapage.c:752!
[693433.366981] invalid opcode: 0000 [#2] SMP
[693433.366993] CPU 2
[693433.366998] Modules linked in: xt_state xt_dscp xt_hashlimit
xt_conntrack xt_DSCP ath9k i915 mac80211 ath9k_common ath9k_hw
drm_kms_helper ath cfg80211 cfbcopyarea e1000e video rfkill backlight
cfbimgblt cfbfillrect
[693433.367069]
[693433.367075] Pid: 10230, comm: shutdown Tainted: G D
3.0.0-OSS4.2 #1 /DQ45CB
[693433.367098] RIP: 0010:[<ffffffff8124ffb6>] [<ffffffff8124ffb6>]
release_metapage+0x5f/0xf7
[693433.367122] RSP: 0018:ffff8801a7a41ae8 EFLAGS: 00010296
[693433.367134] RAX: 0000000000000049 RBX: ffff8801a8fca380 RCX:
0000000000000aa3
[693433.367149] RDX: 0000000000000000 RSI: 0000000000000046 RDI:
ffffffff81ef0e60
[693433.367165] RBP: ffff8801a7a41af8 R08: 0000000000000046 R09:
00000000fffffffe
[693433.367180] R10: ffff880227a41a07 R11: 0000000000000000 R12:
ffffea0007051da8
[693433.367196] R13: ffff88022e901000 R14: 0000000000000001 R15:
000000000000034b
[693433.367212] FS: 00007fcbd4f3f700(0000) GS:ffff88023bd00000(0000)
knlGS:0000000000000000
[693433.367229] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[693433.367242] CR2: 00007fcbd4abb980 CR3: 00000001fbe6d000 CR4:
00000000000406e0
[693433.367258] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[693433.367273] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[693433.367289] Process shutdown (pid: 10230, threadinfo
ffff8801a7a40000, task ffff88023328e140)
[693433.367307] Stack:
[693433.367312] ffff8801a8fca380 0000000000000000 ffff8801a7a41b98
ffffffff8124dcb2
[693433.367334] ffff8801a7a41b78 ffffffff81253dbc ffff880232cdbdc0
0000000200000002
[693433.367356] ffff880231483400 00000001812b07bf ffff8801a7a41cc0
0000402000000000
[693433.367378] Call Trace:
[693433.367386] [<ffffffff8124dcb2>] dtInsert+0x18c/0x19e
[693433.367399] [<ffffffff81253dbc>] ? txLock+0x1bd/0x5ab
[693433.367411] [<ffffffff8123e835>] jfs_create+0x153/0x27f
[693433.367425] [<ffffffff810e120f>] ? do_lookup+0x46/0x28f
[693433.367439] [<ffffffff816a41e4>] ? _raw_spin_lock+0x9/0xb
[693433.367452] [<ffffffff810e9119>] ? __d_lookup+0x91/0xcb
[693433.367465] [<ffffffff810e18e9>] vfs_create+0x6b/0x8d
[693433.367478] [<ffffffff810e1b2e>] do_last+0x223/0x46b
[693433.367491] [<ffffffff810e2ef7>] path_openat+0xc2/0x334
[693433.367503] [<ffffffff810e31a1>] do_filp_open+0x38/0x84
[693433.367517] [<ffffffff81306fbd>] ? strncpy_from_user+0x34/0x37
[693433.367532] [<ffffffff810ec9c0>] ? alloc_fd+0x6f/0x11a
[693433.367545] [<ffffffff810d6f3f>] do_sys_open+0x6f/0x101
[693433.367557] [<ffffffff810d6fec>] sys_open+0x1b/0x1d
[693433.367570] [<ffffffff816a953b>] system_call_fastpath+0x16/0x1b
[693433.367583] Code: 56 de ff 48 8b 43 28 48 85 c0 75 21 48 c7 c1 d6 59
b1 81 ba f0 02 00 00 48 c7 c6 33 59 b1 81 48 c7 c7 3a 4c b1 81 e8 e8 25
45 00 <0f> 0b 48 ff c8 48 89 43 28 48 85 c0 75 75 83 7b 68 00 75 6f 48
[693433.367707] RIP [<ffffffff8124ffb6>] release_metapage+0x5f/0xf7
[693433.367723] RSP <ffff8801a7a41ae8>
[693433.373041] ---[ end trace b4cb55fb5d49efbe ]---

The system was in an unusable state afterwards - new network connections
couldn't be established while already running ones continued to work
fine. Also, neither reboot (segfaulted the first time, won't return the
second time) or init 6 did work and I was forced to do a hard
shutdown/reboot.

I'm pretty sure that the system would have locked up completely after
some time if I hadn't rebooted after the incident, as happened several
times, but I've never been able to get a trace of what happened so far.

As this bug happened at pretty high disk load (whilst compiling), I
guess JFS has a few really nasty bugs.

This is actually not the first bug I encounter with JFS. I've noticed, that:
- performance degrades a lot with files > 50GB for some reason. This
is especially annoying if copying full disk images. It starts up fine
with maximum speed and drops to a few MB/s as the file gets bigger and
bigger.
- on older Kernels I've seen the jfsCommit threads causing a *lot* CPU
load when working with big files. It got worse and worse the bigger the
file became until the system hung completely (even AFTER the file was
transferred successfully and a sync called.) It started with as low as
6% CPU load and increased steadily, even on multiple jfsCommit threads.
I haven't tested this on 3.0.0 yet though, but every older version I ran
had this kind of problem.

All those problems may or may not be linked together. Something seems to
be really wrong.

Best regards,


Mihai

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature