NULL pointer dereference in process_one_work

From: baiyaowei
Date: Fri Nov 24 2017 - 02:33:05 EST


Hi tj and jiangshan,

I built a Ceph storage pool to run some benchmarks on a 3.10 kernel.
Occasionally, when the CPU load is very high, some nodes crash with the
message below.

[292273.612014] BUG: unable to handle kernel NULL pointer dereference at
0000000000000008
[292273.612057] IP: [<ffffffff8109d4b1>] process_one_work+0x31/0x470
[292273.612087] PGD 0
[292273.612099] Oops: 0000 [#1] SMP
[292273.612117] Modules linked in: rbd(OE) bcache(OE) ip_vs xfs
xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_conntrack nf_conntrack ipt_REJECT tun bridge stp llc ebtable_filter
ebtables ip6table_filter ip6_tables iptable_filter bonding
intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul
ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper
cryptd mxm_wmi iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf pcspkr
ipmi_ssif mei_me sg lpc_ich mei sb_edac ipmi_si mfd_core edac_core
ipmi_msghandler shpchp wmi acpi_power_meter nfsd auth_rpcgss nfs_acl
lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif
crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit
drm_kms_helper
[292273.612495] crct10dif_pclmul crct10dif_common ttm crc32c_intel drm
ahci nvme bnx2x libahci i2c_core libata mdio libcrc32c megaraid_sas ptp
pps_core dm_mirror dm_region_hash dm_log dm_mod
[292273.612580] CPU: 16 PID: 353223 Comm: kworker/16:2 Tainted: G
OE ------------ 3.10.0-327.el7.x86_64 #1
[292273.612620] Hardware name: Dell Inc. PowerEdge R730xd/0WCJNT, BIOS
2.4.3 01/17/2017
[292273.612655] task: ffff8801f55e6780 ti: ffff882a199b0000 task.ti:
ffff882a199b0000
[292273.612685] RIP: 0010:[<ffffffff8109d4b1>] [<ffffffff8109d4b1>]
process_one_work+0x31/0x470
[292273.612721] RSP: 0018:ffff882a199b3e28 EFLAGS: 00010046
[292273.612743] RAX: 0000000000000000 RBX: ffff88088b273028 RCX:
ffff882a199b3fd8
[292273.612771] RDX: 0000000000000000 RSI: ffff88088b273028 RDI:
ffff88088b273000
[292273.612799] RBP: ffff882a199b3e60 R08: 0000000000000000 R09:
0000000000000770
[292273.612827] R10: ffff8822a3bb1f80 R11: ffff8822a3bb1f80 R12:
ffff88088b273000
[292273.612855] R13: ffff881fff313fc0 R14: 0000000000000000 R15:
ffff881fff313fc0
[292273.612883] FS: 0000000000000000(0000) GS:ffff881fff300000(0000)
knlGS:0000000000000000
[292273.612914] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[292273.612937] CR2: 00000000000000b8 CR3: 000000000194a000 CR4:
00000000003407e0
[292273.612965] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[292273.612994] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[292273.613021] Stack:
[292273.613031] 00000000ff313fd8 0000000000000000 ffff881fff313fd8
000188088b273030
[292273.613069] ffff8801f55e6780 ffff88088b273000 ffff881fff313fc0
ffff882a199b3ec0
[292273.613108] ffffffff8109e4cc ffff882a199b3fd8 ffff882a199b3fd8
ffff8801f55e6780
[292273.613146] Call Trace:
[292273.613160] [<ffffffff8109e4cc>] worker_thread+0x21c/0x400
[292273.613185] [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[292273.613212] [<ffffffff810a5aef>] kthread+0xcf/0xe0
[292273.613234] [<ffffffff810a5a20>] ?
kthread_create_on_node+0x140/0x140
[292273.613263] [<ffffffff81645858>] ret_from_fork+0x58/0x90
[292273.613287] [<ffffffff810a5a20>] ?
kthread_create_on_node+0x140/0x140
[292273.614303] Code: 48 89 e5 41 57 41 56 45 31 f6 41 55 41 54 49 89 fc
53 48 89 f3 48 83 ec 10 48 8b 06 4c 8b 6f 48 48 89 c2 30 d2 a8 04 4c 0f
45 f2 <49> 8b 46 08 44 8b b8 00 01 00 00 41 c1 ef 05 44 89 f8 83 e0 01
[292273.617971] RIP [<ffffffff8109d4b1>] process_one_work+0x31/0x470
[292273.620011] RSP <ffff882a199b3e28>
[292273.621940] CR2: 0000000000000008

Some crash messages:

crash> sys
KERNEL: /usr/lib/debug/lib/modules/3.10.0-327.el7.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 32
DATE: Wed Oct 18 05:21:14 2017
UPTIME: 3 days, 09:07:25
LOAD AVERAGE: 221.70, 222.22, 224.96
TASKS: 3115
NODENAME: node121
RELEASE: 3.10.0-327.el7.x86_64
VERSION: #1 SMP Thu Nov 19 22:10:57 UTC 2015
MACHINE: x86_64 (2099 Mhz)
MEMORY: 255.9 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at
0000000000000008"
crash> bt
PID: 353223 TASK: ffff8801f55e6780 CPU: 16 COMMAND: "kworker/16:2"
#0 [ffff882a199b3af0] machine_kexec at ffffffff81051beb
#1 [ffff882a199b3b50] crash_kexec at ffffffff810f2542
#2 [ffff882a199b3c20] oops_end at ffffffff8163e1a8
#3 [ffff882a199b3c48] no_context at ffffffff8162e2b8
#4 [ffff882a199b3c98] __bad_area_nosemaphore at ffffffff8162e34e
#5 [ffff882a199b3ce0] bad_area_nosemaphore at ffffffff8162e4b8
#6 [ffff882a199b3cf0] __do_page_fault at ffffffff81640fce
#7 [ffff882a199b3d48] do_page_fault at ffffffff81641113
#8 [ffff882a199b3d70] page_fault at ffffffff8163d408
[exception RIP: process_one_work+49]
RIP: ffffffff8109d4b1 RSP: ffff882a199b3e28 RFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff88088b273028 RCX: ffff882a199b3fd8
RDX: 0000000000000000 RSI: ffff88088b273028 RDI: ffff88088b273000
RBP: ffff882a199b3e60 R8: 0000000000000000 R9: 0000000000000770
R10: ffff8822a3bb1f80 R11: ffff8822a3bb1f80 R12: ffff88088b273000
R13: ffff881fff313fc0 R14: 0000000000000000 R15: ffff881fff313fc0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff882a199b3e68] worker_thread at ffffffff8109e4cc
#10 [ffff882a199b3ec8] kthread at ffffffff810a5aef
#11 [ffff882a199b3f50] ret_from_fork at ffffffff81645858
crash> dis process_one_work
0xffffffff8109d480 <process_one_work>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff8109d485 <process_one_work+5>: push %rbp
0xffffffff8109d486 <process_one_work+6>: mov %rsp,%rbp
0xffffffff8109d489 <process_one_work+9>: push %r15
0xffffffff8109d48b <process_one_work+11>: push %r14
0xffffffff8109d48d <process_one_work+13>: xor %r14d,%r14d
0xffffffff8109d490 <process_one_work+16>: push %r13
0xffffffff8109d492 <process_one_work+18>: push %r12
0xffffffff8109d494 <process_one_work+20>: mov %rdi,%r12
0xffffffff8109d497 <process_one_work+23>: push %rbx
0xffffffff8109d498 <process_one_work+24>: mov %rsi,%rbx
0xffffffff8109d49b <process_one_work+27>: sub $0x10,%rsp
0xffffffff8109d49f <process_one_work+31>: mov (%rsi),%rax
0xffffffff8109d4a2 <process_one_work+34>: mov 0x48(%rdi),%r13
0xffffffff8109d4a6 <process_one_work+38>: mov %rax,%rdx
0xffffffff8109d4a9 <process_one_work+41>: xor %dl,%dl
0xffffffff8109d4ab <process_one_work+43>: test $0x4,%al
0xffffffff8109d4ad <process_one_work+45>: cmovne %rdx,%r14
0xffffffff8109d4b1 <process_one_work+49>: mov 0x8(%r14),%rax
0xffffffff8109d4b5 <process_one_work+53>: mov 0x100(%rax),%r15d
0xffffffff8109d4bc <process_one_work+60>: shr $0x5,%r15d
0xffffffff8109d4c0 <process_one_work+64>: mov %r15d,%eax
0xffffffff8109d4c3 <process_one_work+67>: and $0x1,%eax
0xffffffff8109d4c6 <process_one_work+70>: testb $0x80,0x58(%rdi)
0xffffffff8109d4ca <process_one_work+74>: mov %al,-0x30(%rbp)
0xffffffff8109d4cd <process_one_work+77>: jne 0xffffffff8109d4da <process_one_work+90>
crash> *work_struct ffff88088b273028
struct work_struct {
  data = {
    counter = 0
  },
  entry = {
    next = 0xffff88088b273030,
    prev = 0xffff88088b273030
  },
  func = 0xffff8801f55e6780
}
crash> *worker ffff88088b273000
struct worker {
  {
    entry = {
      next = 0x0,
      prev = 0x0
    },
    hentry = {
      next = 0x0,
      pprev = 0x0
    }
  },
  current_work = 0x0,
  current_func = 0x0,
  current_pwq = 0x0,
  desc_valid = false,
  scheduled = {
    next = 0xffff88088b273030,
    prev = 0xffff88088b273030
  },
  task = 0xffff8801f55e6780,
  pool = 0xffff881fff313fc0,
  last_active = 4586712874,
  flags = 1,
  id = 2,
  desc = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  rescue_wq = 0x0
}
crash>

It looks like the work parameter mistakenly points into the worker
structure itself: work (ffff88088b273028) is worker (ffff88088b273000)
+ 0x28, so the dumped work_struct fields appear to overlay
worker->scheduled and worker->task (note that func above equals the
worker's task pointer). I noticed one similar case before, but it is
not exactly the same:

https://lists.gt.net/linux/kernel/2022989

I wonder how this situation could happen. When process_one_work() is
called, the work being processed should be on worker_pool->worklist,
but that is not the case here:

crash> *worker_pool 0xffff881fff313fc0
struct worker_pool {
  lock = {
    {
      rlock = {
        raw_lock = {
          {
            head_tail = 4142331618,
            tickets = {
              head = 63202,
              tail = 63206
            }
          }
        }
      }
    }
  },
  cpu = 16,
  node = 0,
  id = 32,
  flags = 0,
  worklist = {
    next = 0xffff881fff313fd8,
    prev = 0xffff881fff313fd8
  },
  nr_workers = 3,
  nr_idle = 2,
  idle_list = {
    next = 0xffff880245a18280,
    prev = 0xffff88078d3c5780
  },
...

Maybe we should add an if (get_work_pwq(work)) check in
process_one_work(), or a BUG_ON() like this:

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dde6298..82a92e0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2012,11 +2012,15 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 __releases(&pool->lock)
 __acquires(&pool->lock)
 {
-	struct pool_workqueue *pwq = get_work_pwq(work);
+	struct pool_workqueue *pwq;
 	struct worker_pool *pool = worker->pool;
-	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
+	bool cpu_intensive;
 	int work_color;
 	struct worker *collision;
+
+	BUG_ON(!(pwq = get_work_pwq(work)));
+	cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
+
 #ifdef CONFIG_LOCKDEP

I'd really appreciate any ideas, and please let me know if you need any
more information from the crashed system.

Thanks,
Yaowei Bai