timer code oops when calling mod_delayed_work

From: Jeff Layton
Date: Thu Oct 29 2015 - 10:31:26 EST


Our QA guys hit the following warning and oops today on a relatively
recent upstream kernel. They were running a fairly busy NFSv3 I/O test
against this server:

[ 71.074180] NFSD: starting 90-second grace period (net ffffffff81cf3b40)
[ 221.146802] Ebtables v2.0 unregistered
[ 318.909835] ------------[ cut here ]------------
[ 318.909846] WARNING: CPU: 22 PID: 3402 at kernel/workqueue.c:1444 __queue_delayed_work+0xe7/0x190()
[ 318.909848] Modules linked in: bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support joydev mei_me mei sb_edac shpchp ipmi_si lpc_ich i2c_i801 edac_core tpm_tis wmi ipmi_msghandler tpm acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ast drm_kms_helper ttm crc32c_intel igb drm mlx4_core ptp nvme pps_core dca i2c_algo_bit fjes [last unloaded: iptable_security]
[ 318.909889] CPU: 22 PID: 3402 Comm: nfsd Not tainted 4.3.0-0.rc7.git0.1.fc24.x86_64 #1
[ 318.909891] Hardware name: Quanta D51B/D51BP (1G LOM), BIOS S2BP3A05 01/29/2015
[ 318.909892] 0000000000000000 00000000319846b5 ffff880322d0f440 ffffffff813a48cf
[ 318.909895] 0000000000000000 ffff880322d0f478 ffffffff8109f122 0000000000000400
[ 318.909898] ffff88085f48fc00 ffffffffa03623e0 ffff88085f48fc00 000000000001d4c0
[ 318.909901] Call Trace:
[ 318.909906] [<ffffffff813a48cf>] dump_stack+0x44/0x55
[ 318.909909] [<ffffffff8109f122>] warn_slowpath_common+0x82/0xc0
[ 318.909913] [<ffffffff8109f26a>] warn_slowpath_null+0x1a/0x20
[ 318.909915] [<ffffffff810b67a7>] __queue_delayed_work+0xe7/0x190
[ 318.909917] [<ffffffff810b70de>] mod_delayed_work_on+0x5e/0xa0
[ 318.909931] [<ffffffffa0333394>] prune_cache_entries+0x94/0xb0 [nfsd]
[ 318.909936] [<ffffffffa03333be>] nfsd_reply_cache_scan+0xe/0x10 [nfsd]
[ 318.909941] [<ffffffff811b78e2>] shrink_slab.part.42+0x1e2/0x380
[ 318.909944] [<ffffffff811bbfc6>] shrink_zone+0x2b6/0x2c0
[ 318.909946] [<ffffffff811bc145>] do_try_to_free_pages+0x175/0x400
[ 318.909949] [<ffffffff811bc4da>] try_to_free_pages+0x10a/0x150
[ 318.909954] [<ffffffff811ae5b1>] __alloc_pages_nodemask+0x581/0x990
[ 318.909957] [<ffffffff811b19f5>] ? account_page_dirtied+0xd5/0x160
[ 318.909960] [<ffffffff811f7b41>] alloc_pages_current+0x91/0x100
[ 318.909963] [<ffffffff811a45bb>] __page_cache_alloc+0xab/0xe0
[ 318.909965] [<ffffffff811a4c94>] pagecache_get_page+0x84/0x1e0
[ 318.909967] [<ffffffff811a4e16>] grab_cache_page_write_begin+0x26/0x40
[ 318.909990] [<ffffffffa0221cd4>] xfs_vm_write_begin+0x34/0xe0 [xfs]
[ 318.910005] [<ffffffffa0221a39>] ? xfs_vm_write_end+0x29/0x80 [xfs]
[ 318.910007] [<ffffffff811a418e>] generic_perform_write+0xce/0x1c0
[ 318.910023] [<ffffffffa022efb7>] xfs_file_buffered_aio_write+0x127/0x250 [xfs]
[ 318.910028] [<ffffffffa032be60>] ? nfserrno+0x60/0x60 [nfsd]
[ 318.910031] [<ffffffff812035e9>] ? __kmalloc+0x1b9/0x260
[ 318.910034] [<ffffffff81223617>] ? rw_copy_check_uvector+0x57/0x110
[ 318.910047] [<ffffffffa022f1d5>] xfs_file_write_iter+0xf5/0x110 [xfs]
[ 318.910050] [<ffffffff81221e3c>] do_iter_readv_writev+0x6c/0xa0
[ 318.910053] [<ffffffff812229bf>] do_readv_writev+0x18f/0x230
[ 318.910065] [<ffffffffa022f0e0>] ? xfs_file_buffered_aio_write+0x250/0x250 [xfs]
[ 318.910068] [<ffffffff813284bb>] ? security_file_open+0x8b/0x90
[ 318.910070] [<ffffffff8121ffa4>] ? do_dentry_open+0x264/0x2f0
[ 318.910084] [<ffffffffa022ced0>] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
[ 318.910086] [<ffffffff81222ae9>] vfs_writev+0x39/0x50
[ 318.910092] [<ffffffffa032df4d>] nfsd_vfs_write+0xbd/0x380 [nfsd]
[ 318.910097] [<ffffffffa0330568>] nfsd_write+0xd8/0xf0 [nfsd]
[ 318.910103] [<ffffffffa03367fc>] nfsd3_proc_write+0xbc/0x150 [nfsd]
[ 318.910107] [<ffffffffa0328e68>] nfsd_dispatch+0xb8/0x200 [nfsd]
[ 318.910128] [<ffffffffa02d39dc>] svc_process_common+0x40c/0x650 [sunrpc]
[ 318.910137] [<ffffffffa02d4b14>] svc_process+0xf4/0x180 [sunrpc]
[ 318.910142] [<ffffffffa03288bf>] nfsd+0xef/0x160 [nfsd]
[ 318.910146] [<ffffffffa03287d0>] ? nfsd_destroy+0x60/0x60 [nfsd]
[ 318.910149] [<ffffffff810bd6a8>] kthread+0xd8/0xf0
[ 318.910152] [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
[ 318.910155] [<ffffffff8177f19f>] ret_from_fork+0x3f/0x70
[ 318.910158] [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
[ 318.910160] ---[ end trace db375019f3368852 ]---
[ 318.910169] ------------[ cut here ]------------
[ 318.910201] kernel BUG at kernel/time/timer.c:977!
[ 318.910226] invalid opcode: 0000 [#1] SMP
[ 318.910249] Modules linked in: bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support joydev mei_me mei sb_edac shpchp ipmi_si lpc_ich i2c_i801 edac_core tpm_tis wmi ipmi_msghandler tpm acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c mlx4_en vxlan ip6_udp_tunnel udp_tunnel ast drm_kms_helper ttm crc32c_intel igb drm mlx4_core ptp nvme pps_core dca i2c_algo_bit fjes [last unloaded: iptable_security]
[ 318.910588] CPU: 22 PID: 3402 Comm: nfsd Tainted: G W 4.3.0-0.rc7.git0.1.fc24.x86_64 #1
[ 318.910629] Hardware name: Quanta D51B/D51BP (1G LOM), BIOS S2BP3A05 01/29/2015
[ 318.910664] task: ffff880322c49e00 ti: ffff880322d0c000 task.ti: ffff880322d0c000
[ 318.910698] RIP: 0010:[<ffffffff811076a6>] [<ffffffff811076a6>] add_timer_on+0xc6/0x110
[ 318.910736] RSP: 0018:ffff880322d0f458 EFLAGS: 00010082
[ 318.910761] RAX: 0000000000000000 RBX: 0000000000000016 RCX: 0000000100021e85
[ 318.910793] RDX: ffffffffa03623c0 RSI: 0000000000000016 RDI: ffffffffa03623e0
[ 318.910826] RBP: ffff880322d0f488 R08: 000000000000000a R09: 00000000000005f2
[ 318.910856] R10: 000000000000027c R11: 00000000000005f2 R12: ffff88085f48fc00
[ 318.910888] R13: ffff88085fd8e080 R14: ffff88085f48fc00 R15: 000000000001d4c0
[ 318.910922] FS: 0000000000000000(0000) GS:ffff88085fd80000(0000) knlGS:0000000000000000
[ 318.910956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 318.910981] CR2: 00007f9d6ea41000 CR3: 0000000001c0b000 CR4: 00000000001406e0
[ 318.911012] Stack:
[ 318.911025] ffff88085f48fc00 0000000000000016 ffff88085f48fc00 ffffffffa03623e0
[ 318.911069] ffff88085f48fc00 000000000001d4c0 ffff880322d0f4c0 ffffffff810b6749
[ 318.911111] 000000000001d4c0 ffffffffa03623c0 0000000000000001 ffffffffa03623c0
[ 318.911153] Call Trace:
[ 318.911170] [<ffffffff810b6749>] __queue_delayed_work+0x89/0x190
[ 318.911201] [<ffffffff810b70de>] mod_delayed_work_on+0x5e/0xa0
[ 318.911236] [<ffffffffa0333394>] prune_cache_entries+0x94/0xb0 [nfsd]
[ 318.911267] [<ffffffffa03333be>] nfsd_reply_cache_scan+0xe/0x10 [nfsd]
[ 318.911299] [<ffffffff811b78e2>] shrink_slab.part.42+0x1e2/0x380
[ 318.911330] [<ffffffff811bbfc6>] shrink_zone+0x2b6/0x2c0
[ 318.911358] [<ffffffff811bc145>] do_try_to_free_pages+0x175/0x400
[ 318.911391] [<ffffffff811bc4da>] try_to_free_pages+0x10a/0x150
[ 318.911422] [<ffffffff811ae5b1>] __alloc_pages_nodemask+0x581/0x990
[ 318.911454] [<ffffffff811b19f5>] ? account_page_dirtied+0xd5/0x160
[ 318.911483] [<ffffffff811f7b41>] alloc_pages_current+0x91/0x100
[ 318.911513] [<ffffffff811a45bb>] __page_cache_alloc+0xab/0xe0
[ 318.911543] [<ffffffff811a4c94>] pagecache_get_page+0x84/0x1e0
[ 318.911573] [<ffffffff811a4e16>] grab_cache_page_write_begin+0x26/0x40
[ 318.911620] [<ffffffffa0221cd4>] xfs_vm_write_begin+0x34/0xe0 [xfs]
[ 318.911661] [<ffffffffa0221a39>] ? xfs_vm_write_end+0x29/0x80 [xfs]
[ 318.912748] [<ffffffff811a418e>] generic_perform_write+0xce/0x1c0
[ 318.913810] [<ffffffffa022efb7>] xfs_file_buffered_aio_write+0x127/0x250 [xfs]
[ 318.914858] [<ffffffffa032be60>] ? nfserrno+0x60/0x60 [nfsd]
[ 318.915715] [<ffffffff812035e9>] ? __kmalloc+0x1b9/0x260
[ 318.916360] [<ffffffff81223617>] ? rw_copy_check_uvector+0x57/0x110
[ 318.917014] [<ffffffffa022f1d5>] xfs_file_write_iter+0xf5/0x110 [xfs]
[ 318.917650] [<ffffffff81221e3c>] do_iter_readv_writev+0x6c/0xa0
[ 318.918312] [<ffffffff812229bf>] do_readv_writev+0x18f/0x230
[ 318.918954] [<ffffffffa022f0e0>] ? xfs_file_buffered_aio_write+0x250/0x250 [xfs]
[ 318.919575] [<ffffffff813284bb>] ? security_file_open+0x8b/0x90
[ 318.920178] [<ffffffff8121ffa4>] ? do_dentry_open+0x264/0x2f0
[ 318.920771] [<ffffffffa022ced0>] ? xfs_extent_busy_ag_cmp+0x20/0x20 [xfs]
[ 318.921354] [<ffffffff81222ae9>] vfs_writev+0x39/0x50
[ 318.921961] [<ffffffffa032df4d>] nfsd_vfs_write+0xbd/0x380 [nfsd]
[ 318.922509] [<ffffffffa0330568>] nfsd_write+0xd8/0xf0 [nfsd]
[ 318.923045] [<ffffffffa03367fc>] nfsd3_proc_write+0xbc/0x150 [nfsd]
[ 318.923574] [<ffffffffa0328e68>] nfsd_dispatch+0xb8/0x200 [nfsd]
[ 318.924111] [<ffffffffa02d39dc>] svc_process_common+0x40c/0x650 [sunrpc]
[ 318.924633] [<ffffffffa02d4b14>] svc_process+0xf4/0x180 [sunrpc]
[ 318.925144] [<ffffffffa03288bf>] nfsd+0xef/0x160 [nfsd]
[ 318.925654] [<ffffffffa03287d0>] ? nfsd_destroy+0x60/0x60 [nfsd]
[ 318.926163] [<ffffffff810bd6a8>] kthread+0xd8/0xf0
[ 318.926668] [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
[ 318.927189] [<ffffffff8177f19f>] ret_from_fork+0x3f/0x70
[ 318.927685] [<ffffffff810bd5d0>] ? kthread_worker_fn+0x160/0x160
[ 318.928210] Code: 4d 85 c0 75 e4 48 89 de 4c 89 ef e8 55 f9 ff ff 48 8b 34 24 4c 89 ef e8 19 72 67 00 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 83 7f 38 00 48 8b 45 08 0f 85 5f ff ff ff 48 89 47 38
[ 318.929315] RIP [<ffffffff811076a6>] add_timer_on+0xc6/0x110
[ 318.930019] RSP <ffff880322d0f458>

Basically the shrinker kicked in. After we shrank a bunch of cache
entries, we tried to mod_delayed_work on the cache cleaner job so that
it wouldn't run too soon, and that's where the code oopsed.

Is our usage of mod_delayed_work in prune_cache_entries() wrong in some
way? Or is there some breakage in the workqueue or timer code?

--
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/