Re: timer code oops when calling mod_delayed_work

From: Jeff Layton
Date: Thu Oct 29 2015 - 13:58:57 EST


On Thu, 29 Oct 2015 10:31:13 -0400
Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:

> Our QA guys hit the following warning and oops today on a relatively
> recent upstream kernel. They were running a fairly busy NFSv3 I/O test
> against this server:
>
> [ 71.074180] NFSD: starting 90-second grace period (net ffffffff81cf3b40)
> [ 221.146802] Ebtables v2.0 unregistered
> [ 318.909835] ------------[ cut here ]------------
> [ 318.909846] WARNING: CPU: 22 PID: 3402 at kernel/workqueue.c:1444 __queue_delayed_work+0xe7/0x190()
> [ 318.909848] Modules linked in: bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support joydev mei_me mei sb_edac shpchp ipmi_si lpc_ich i2c_i801 edac_core tpm_tis wmi ipmi_msghandler tpm acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ast drm_kms_helper ttm crc32c_intel igb drm mlx4_core ptp nvme pps_core dca i2c_algo_bit fjes [last unloaded: iptable_security]
> [ 318.909889] CPU: 22 PID: 3402 Comm: nfsd Not tainted 4.3.0-0.rc7.git0.1.fc24.x86_64 #1
> [ 318.909891] Hardware name: Quanta D51B/D51BP (1G LOM), BIOS S2BP3A05 01/29/2015
> [ 318.909892] 0000000000000000 00000000319846b5 ffff880322d0f440 ffffffff813a48cf
> [ 318.909895] 0000000000000000 ffff880322d0f478 ffffffff8109f122 0000000000000400
> [ 318.909898] ffff88085f48fc00 ffffffffa03623e0 ffff88085f48fc00 000000000001d4c0
> [ 318.909901] Call Trace:
> [ 318.909906] [<ffffffff813a48cf>] dump_stack+0x44/0x55
> [ 318.909909] [<ffffffff8109f122>] warn_slowpath_common+0x82/0xc0
> [ 318.909913] [<ffffffff8109f26a>] warn_slowpath_null+0x1a/0x20
> [ 318.909915] [<ffffffff810b67a7>] __queue_delayed_work+0xe7/0x190
> [ 318.909917] [<ffffffff810b70de>] mod_delayed_work_on+0x5e/0xa0
> [ 318.909931] [<ffffffffa0333394>] prune_cache_entries+0x94/0xb0 [nfsd]
> [ 318.909936] [<ffffffffa03333be>] nfsd_reply_cache_scan+0xe/0x10 [nfsd]
> [ 318.909941] [<ffffffff811b78e2>] shrink_slab.part.42+0x1e2/0x380
> [ 318.909944] [<ffffffff811bbfc6>] shrink_zone+0x2b6/0x2c0
> [ 318.909946] [<ffffffff811bc145>] do_try_to_free_pages+0x175/0x400
> [ 318.909949] [<ffffffff811bc4da>] try_to_free_pages+0x10a/0x150
> [ 318.909954] [<ffffffff811ae5b1>] __alloc_pages_nodemask+0x581/0x990
> [ 318.909957] [<ffffffff811b19f5>] ? account_page_dirtied+0xd5/0x160
> [ 318.909960] [<ffffffff811f7b41>] alloc_pages_current+0x91/0x100
> [ 318.909963] [<ffffffff811a45bb>] __page_cache_alloc+0xab/0xe0
> [ 318.909965] [<ffffffff811a4c94>] pagecache_get_page+0x84/0x1e0
> [ 318.909967] [<ffffffff811a4e16>] grab_cache_page_write_begin+0x26/0x40
> [ 318.909990] [<ffffffffa0221cd4>] xfs_vm_write_begin+0x34/0xe0 [xfs]
> [ 318.910005] [<ffffffffa0221a39>] ? xfs_vm_write_end+0x29/0x80 [xfs]
> [ 318.910007] [<ffffffff811a418e>] generic_perform_write+0xce/0x1c0
> [ 318.910023] [<ffffffffa022efb7>] xfs_file_buffered_aio_write+0x127/0x250 [xfs]
> [ 318.910028] [<ffffffffa032be60>] ? nfserrno+0x60/0x60 [nfsd]
> [ 318.910031] [<ffffffff812035e9>] ? __kmalloc+0x1b9/0x260
> [ 318.910034] [<ffffffff81223617>] ? rw_copy_check_uvector+0x57/0x110
> [ 318.910047] [<ffffffffa022f1d5>] xfs_file_write_iter+0xf5/0x110 [xfs]
> [ 318.910050] [<ffffffff81221e3c>] do_iter_readv_writev+0x6c/0xa0
> [ 318.910053] [<ffffffff812229bf>] do_readv_writev+0x18f/0x230
> [ 318.910065] [<ffffffffa022f0e0>] ? xfs_file_buffered_aio_write+0x250/0x250 [xfs]
> [ 318.910068] [<ffffffff813284bb>] ? security_file_open+0x8b/0x90
> [ 318.910070] [<ffffffff8121ffa4>] ? do_dentry_open+0x264/0x2f0
> [ 318.910084] [<ffffffffa022ced0>] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
> [ 318.910086] [<ffffffff81222ae9>] vfs_writev+0x39/0x50
> [ 318.910092] [<ffffffffa032df4d>] nfsd_vfs_write+0xbd/0x380 [nfsd]
> [ 318.910097] [<ffffffffa0330568>] nfsd_write+0xd8/0xf0 [nfsd]
> [ 318.910103] [<ffffffffa03367fc>] nfsd3_proc_write+0xbc/0x150 [nfsd]
> [ 318.910107] [<ffffffffa0328e68>] nfsd_dispatch+0xb8/0x200 [nfsd]
> [ 318.910128] [<ffffffffa02d39dc>] svc_process_common+0x40c/0x650 [sunrpc]
> [ 318.910137] [<ffffffffa02d4b14>] svc_process+0xf4/0x180 [sunrpc]
> [ 318.910142] [<ffffffffa03288bf>] nfsd+0xef/0x160 [nfsd]
> [ 318.910146] [<ffffffffa03287d0>] ? nfsd_destroy+0x60/0x60 [nfsd]
> [ 318.910149] [<ffffffff810bd6a8>] kthread+0xd8/0xf0
> [ 318.910152] [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
> [ 318.910155] [<ffffffff8177f19f>] ret_from_fork+0x3f/0x70
> [ 318.910158] [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
> [ 318.910160] ---[ end trace db375019f3368852 ]---
> [ 318.910169] ------------[ cut here ]------------
> [ 318.910201] kernel BUG at kernel/time/timer.c:977!
> [ 318.910226] invalid opcode: 0000 [#1] SMP
> [ 318.910249] Modules linked in: bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support joydev mei_me mei sb_edac shpchp ipmi_si lpc_ich i2c_i801 edac_core tpm_tis wmi ipmi_msghandler tpm acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ast drm_kms_helper ttm crc32c_intel igb drm mlx4_core ptp nvme pps_core dca i2c_algo_bit fjes [last unloaded: iptable_security]
> [ 318.910588] CPU: 22 PID: 3402 Comm: nfsd Tainted: G W 4.3.0-0.rc7.git0.1.fc24.x86_64 #1
> [ 318.910629] Hardware name: Quanta D51B/D51BP (1G LOM), BIOS S2BP3A05 01/29/2015
> [ 318.910664] task: ffff880322c49e00 ti: ffff880322d0c000 task.ti: ffff880322d0c000
> [ 318.910698] RIP: 0010:[<ffffffff811076a6>] [<ffffffff811076a6>] add_timer_on+0xc6/0x110
> [ 318.910736] RSP: 0018:ffff880322d0f458 EFLAGS: 00010082
> [ 318.910761] RAX: 0000000000000000 RBX: 0000000000000016 RCX: 0000000100021e85
> [ 318.910793] RDX: ffffffffa03623c0 RSI: 0000000000000016 RDI: ffffffffa03623e0
> [ 318.910826] RBP: ffff880322d0f488 R08: 000000000000000a R09: 00000000000005f2
> [ 318.910856] R10: 000000000000027c R11: 00000000000005f2 R12: ffff88085f48fc00
> [ 318.910888] R13: ffff88085fd8e080 R14: ffff88085f48fc00 R15: 000000000001d4c0
> [ 318.910922] FS: 0000000000000000(0000) GS:ffff88085fd80000(0000) knlGS:0000000000000000
> [ 318.910956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 318.910981] CR2: 00007f9d6ea41000 CR3: 0000000001c0b000 CR4: 00000000001406e0
> [ 318.911012] Stack:
> [ 318.911025] ffff88085f48fc00 0000000000000016 ffff88085f48fc00 ffffffffa03623e0
> [ 318.911069] ffff88085f48fc00 000000000001d4c0 ffff880322d0f4c0 ffffffff810b6749
> [ 318.911111] 000000000001d4c0 ffffffffa03623c0 0000000000000001 ffffffffa03623c0
> [ 318.911153] Call Trace:
> [ 318.911170] [<ffffffff810b6749>] __queue_delayed_work+0x89/0x190
> [ 318.911201] [<ffffffff810b70de>] mod_delayed_work_on+0x5e/0xa0
> [ 318.911236] [<ffffffffa0333394>] prune_cache_entries+0x94/0xb0 [nfsd]
> [ 318.911267] [<ffffffffa03333be>] nfsd_reply_cache_scan+0xe/0x10 [nfsd]
> [ 318.911299] [<ffffffff811b78e2>] shrink_slab.part.42+0x1e2/0x380
> [ 318.911330] [<ffffffff811bbfc6>] shrink_zone+0x2b6/0x2c0
> [ 318.911358] [<ffffffff811bc145>] do_try_to_free_pages+0x175/0x400
> [ 318.911391] [<ffffffff811bc4da>] try_to_free_pages+0x10a/0x150
> [ 318.911422] [<ffffffff811ae5b1>] __alloc_pages_nodemask+0x581/0x990
> [ 318.911454] [<ffffffff811b19f5>] ? account_page_dirtied+0xd5/0x160
> [ 318.911483] [<ffffffff811f7b41>] alloc_pages_current+0x91/0x100
> [ 318.911513] [<ffffffff811a45bb>] __page_cache_alloc+0xab/0xe0
> [ 318.911543] [<ffffffff811a4c94>] pagecache_get_page+0x84/0x1e0
> [ 318.911573] [<ffffffff811a4e16>] grab_cache_page_write_begin+0x26/0x40
> [ 318.911620] [<ffffffffa0221cd4>] xfs_vm_write_begin+0x34/0xe0 [xfs]
> [ 318.911661] [<ffffffffa0221a39>] ? xfs_vm_write_end+0x29/0x80 [xfs]
> [ 318.912748] [<ffffffff811a418e>] generic_perform_write+0xce/0x1c0
> [ 318.913810] [<ffffffffa022efb7>] xfs_file_buffered_aio_write+0x127/0x250 [xfs]
> [ 318.914858] [<ffffffffa032be60>] ? nfserrno+0x60/0x60 [nfsd]
> [ 318.915715] [<ffffffff812035e9>] ? __kmalloc+0x1b9/0x260
> [ 318.916360] [<ffffffff81223617>] ? rw_copy_check_uvector+0x57/0x110
> [ 318.917014] [<ffffffffa022f1d5>] xfs_file_write_iter+0xf5/0x110 [xfs]
> [ 318.917650] [<ffffffff81221e3c>] do_iter_readv_writev+0x6c/0xa0
> [ 318.918312] [<ffffffff812229bf>] do_readv_writev+0x18f/0x230
> [ 318.918954] [<ffffffffa022f0e0>] ? xfs_file_buffered_aio_write+0x250/0x250 [xfs]
> [ 318.919575] [<ffffffff813284bb>] ? security_file_open+0x8b/0x90
> [ 318.920178] [<ffffffff8121ffa4>] ? do_dentry_open+0x264/0x2f0
> [ 318.920771] [<ffffffffa022ced0>] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
> [ 318.921354] [<ffffffff81222ae9>] vfs_writev+0x39/0x50
> [ 318.921961] [<ffffffffa032df4d>] nfsd_vfs_write+0xbd/0x380 [nfsd]
> [ 318.922509] [<ffffffffa0330568>] nfsd_write+0xd8/0xf0 [nfsd]
> [ 318.923045] [<ffffffffa03367fc>] nfsd3_proc_write+0xbc/0x150 [nfsd]
> [ 318.923574] [<ffffffffa0328e68>] nfsd_dispatch+0xb8/0x200 [nfsd]
> [ 318.924111] [<ffffffffa02d39dc>] svc_process_common+0x40c/0x650 [sunrpc]
> [ 318.924633] [<ffffffffa02d4b14>] svc_process+0xf4/0x180 [sunrpc]
> [ 318.925144] [<ffffffffa03288bf>] nfsd+0xef/0x160 [nfsd]
> [ 318.925654] [<ffffffffa03287d0>] ? nfsd_destroy+0x60/0x60 [nfsd]
> [ 318.926163] [<ffffffff810bd6a8>] kthread+0xd8/0xf0
> [ 318.926668] [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
> [ 318.927189] [<ffffffff8177f19f>] ret_from_fork+0x3f/0x70
> [ 318.927685] [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
> [ 318.928210] Code: 4d 85 c0 75 e4 48 89 de 4c 89 ef e8 55 f9 ff ff 48 8b 34 24 4c 89 ef e8 19 72 67 00 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 83 7f 38 00 48 8b 45 08 0f 85 5f ff ff ff 48 89 47 38
> [ 318.929315] RIP [<ffffffff811076a6>] add_timer_on+0xc6/0x110
> [ 318.930019] RSP <ffff880322d0f458>
>
> Basically the shrinker kicked in. After we shrank a bunch of cache
> entries, we tried to mod_delayed_work on the cache cleaner job so that
> it wouldn't run too soon, and that's where the code oopsed.
>
> Is our usage of mod_delayed_work in prune_cache_entries() wrong in some
> way? Or is there some breakage in the workqueue or timer code?
>

Ok, a bit more analysis...

We did get a vmcore from this. Here is the "cache_cleaner" delayed work:

crash> p cache_cleaner
cache_cleaner = $12 = {
  work = {
    data = {
      counter = 0xfffffffe1
    },
    entry = {
      next = 0xffffffffa03623c8 <cache_cleaner+8>,
      prev = 0xffffffffa03623c8 <cache_cleaner+8>
    },
    func = 0xffffffffa03333c0 <cache_cleaner_func>
  },
  timer = {
    entry = {
      next = 0x0,
      pprev = 0xffff88085fd0eaf8
    },
    expires = 0x100021e99,
    function = 0xffffffff810b66a0 <delayed_work_timer_fn>,
    data = 0xffffffffa03623c0,
    flags = 0x200014,
    slack = 0xffffffff,
    start_pid = 0x0,
    start_site = 0x0,
    start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  },
  wq = 0xffff88085f48fc00,
  cpu = 0x14
}

So the PENDING bit is set (lowest bit in data.counter), and
timer->entry.pprev is non-NULL (so timer_pending() is true). I also see
that several nfsd threads are running the shrinker at the same time.

There is one potential problem that I see, but I'd appreciate someone
sanity checking me on this. Here is mod_delayed_work_on:

------------------[snip]------------------------
bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
			 struct delayed_work *dwork, unsigned long delay)
{
	unsigned long flags;
	int ret;

	do {
		ret = try_to_grab_pending(&dwork->work, true, &flags);
	} while (unlikely(ret == -EAGAIN));

	if (likely(ret >= 0)) {
		__queue_delayed_work(cpu, wq, dwork, delay);
		local_irq_restore(flags);
	}

	/* -ENOENT from try_to_grab_pending() becomes %true */
	return ret;
}
------------------[snip]------------------------

...and here is the beginning of try_to_grab_pending:

------------------[snip]------------------------
	/* try to steal the timer if it exists */
	if (is_dwork) {
		struct delayed_work *dwork = to_delayed_work(work);

		/*
		 * dwork->timer is irqsafe. If del_timer() fails, it's
		 * guaranteed that the timer is not queued anywhere and not
		 * running on the local CPU.
		 */
		if (likely(del_timer(&dwork->timer)))
			return 1;
	}

	/* try to claim PENDING the normal way */
	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
		return 0;


...so if del_timer() returns true, we return 1 from
try_to_grab_pending() without actually setting
WORK_STRUCT_PENDING_BIT, and then end up calling
__queue_delayed_work().

That seems wrong to me -- shouldn't we ensure that the bit is set when
returning 1 from try_to_grab_pending(), to guard against concurrent
queue_delayed_work_on() calls?

--
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>