Re: Kswapd in 3.2.0-rc5 is a CPU hog

From: nowhere
Date: Wed Dec 21 2011 - 05:52:52 EST


On Wed, 21/12/2011 at 11:24 +0100, Michal Hocko wrote:
> [Let's cc also some fs and xfs people]
>
> On Wed 21-12-11 14:15:21, nowhere wrote:
> > On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> > > [Let's CC linux-mm]
> > >
> > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > Hello,
> > > >
> > > > I'm using 3.2-rc5 on a machine which at the moment does almost nothing
> > > > except file system operations and network I/O (i.e. a file server). And
> > > > there is a problem with kswapd.
> > >
> > > What kind of filesystem do you use?
> >
> > Well, that is XFS.
> > I have a large volume with ~200000 files, and a periodic job which checks
> > all files' timestamps once every 30 minutes and takes action if a
> > timestamp has changed.
>
> Is this the first time you are seeing this? I am not familiar with xfs at
> all, but with that number of files it sounds like the dcache shrinker might
> be really busy...

This behavior, yes. On 3.0, kswapd sometimes hung in D-state, pushing the
load average above 1; that's how I noticed it.

> > > > I'm playing with dd:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > >
> > > > I.e. I'm filling page cache.
> > > >
> > > > So when the machine has just been rebooted, kswapd is almost idle during
> > > > this operation, just 5-8 percent according to top.
> > > >
> > > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > > for kswapd:
> > > >
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root R 70 0.0 22:09.60 0 kswapd0
> > > > 17717 nowhere D 27 0.2 0:01.81 10m dd
> > > >
> > > > In fact, kswapd's CPU usage on this operation steadily increases over
> > > > time.
> > > >
> > > > Also read performance degrades over time. After reboot:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > >
> > > > After ~5 days uptime:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > >
> > > > Whereas raw disk sequential read performance stays the same:
> > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > >
> > > > Also, after dropping caches the situation improves somewhat, but not to
> > > > the state of a freshly restarted system:
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root S 39 0.0 23:31.17 0 kswapd0
> > > > 19829 nowhere D 24 0.2 0:02.72 7764 dd
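(For the record, "dropping caches" above refers to the standard drop_caches
sysctl; a minimal sketch, needs root:)

```shell
# Flush dirty data first, then free the page cache plus reclaimable
# slab objects (dentries and inodes). Requires root.
sync
echo 3 > /proc/sys/vm/drop_caches   # 1 = pagecache, 2 = slab, 3 = both
```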
> > > >
> > > > perf shows:
> > > >
> > > > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > > > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > > > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > > > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > > > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > > > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > > > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > > > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > > > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > > > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > > > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > > > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > > > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
> > >
> > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > spin lock contention.
> >
> > This is slabinfo, sorted by num objects:
> >
> > xfs_inode 192941 193205 960 17 4 : tunables 0 0 0 : slabdata 11365 11365 0
> > dentry 118818 118818 192 21 1 : tunables 0 0 0 : slabdata 5658 5658 0
> > kmalloc-256 107920 107920 256 16 1 : tunables 0 0 0 : slabdata 6745 6745 0
> > kmalloc-64 59912 102656 64 64 1 : tunables 0 0 0 : slabdata 1604 1604 0
> > radix_tree_node 30618 33474 568 14 2 : tunables 0 0 0 : slabdata 2391 2391 0
> > kmalloc-96 27092 41202 96 42 1 : tunables 0 0 0 : slabdata 981 981 0
> > buffer_head 24892 63843 104 39 1 : tunables 0 0 0 : slabdata 1637 1637 0
> > kmalloc-192 23332 34503 192 21 1 : tunables 0 0 0 : slabdata 1643 1643 0
> > sysfs_dir_cache 17444 17444 144 28 1 : tunables 0 0 0 : slabdata 623 623 0
> > arp_cache 12863 14796 320 12 1 : tunables 0 0 0 : slabdata 1233 1233 0
> > kmalloc-512 11051 11440 512 16 2 : tunables 0 0 0 : slabdata 715 715 0
> > kmalloc-128 10611 13152 128 32 1 : tunables 0 0 0 : slabdata 411 411 0
> > ext4_inode_cache 9660 18018 880 18 4 : tunables 0 0 0 : slabdata 1001 1001 0
> > kmalloc-8 8704 8704 8 512 1 : tunables 0 0 0 : slabdata 17 17 0
> > ext4_io_page 6912 6912 16 256 1 : tunables 0 0 0 : slabdata 27 27 0
> > anon_vma_chain 6701 10880 48 85 1 : tunables 0 0 0 : slabdata 128 128 0
> > Acpi-Namespace 6611 8058 40 102 1 : tunables 0 0 0 : slabdata 79 79 0
> > fsnotify_event_holder 6290 6970 24 170 1 : tunables 0 0 0 : slabdata 41 41 0
> > kmalloc-1024 5813 5888 1024 16 4 : tunables 0 0 0 : slabdata 368 368 0
> > vm_area_struct 5664 5664 168 24 1 : tunables 0 0 0 : slabdata 236 236 0
> > reiser_inode_cache 3992 5198 704 23 4 : tunables 0 0 0 : slabdata 226 226 0
> > Acpi-ParseExt 3808 3808 72 56 1 : tunables 0 0 0 : slabdata 68 68 0
> > kmalloc-2048 3587 3888 2048 16 8 : tunables 0 0 0 : slabdata 243 243 0
> > proc_inode_cache 3498 3510 624 13 2 : tunables 0 0 0 : slabdata 270 270 0
> > anon_vma 3380 3640 72 56 1 : tunables 0 0 0 : slabdata 65 65 0
> > kmalloc-16 3072 3072 16 256 1 : tunables 0 0 0 : slabdata 12 12 0
> > inode_cache 3024 3024 560 14 2 : tunables 0 0 0 : slabdata 216 216 0
> > ext4_allocation_context 3000 3000 136 30 1 : tunables 0 0 0 : slabdata 100 100 0
> > nf_conntrack_ffffffff81776d40 2910 3549 312 13 1 : tunables 0 0 0 : slabdata 273 273 0
> > kmalloc-4096 2792 3136 4096 8 8 : tunables 0 0 0 : slabdata 392 392 0
> > ext4_free_data 2701 2701 56 73 1 : tunables 0 0 0 : slabdata 37 37 0
> > pid_namespace 2130 2130 2112 15 8 : tunables 0 0 0 : slabdata 142 142 0
> > mqueue_inode_cache 2124 2124 896 18 4 : tunables 0 0 0 : slabdata 118 118 0
> > jbd2_revoke_record 1664 1664 32 128 1 : tunables 0 0 0 : slabdata 13 13 0
> > kmalloc-32 1434 3072 32 128 1 : tunables 0 0 0 : slabdata 24 24 0
> > shmem_inode_cache 1400 1521 624 13 2 : tunables 0 0 0 : slabdata 117 117 0
> > xfs_ili 1139 1260 216 18 1 : tunables 0 0 0 : slabdata 70 70 0
> > nfsd4_stateids 1131 1496 120 34 1 : tunables 0 0 0 : slabdata 44 44 0
> > idr_layer_cache 900 900 544 15 2 : tunables 0 0 0 : slabdata 60 60 0
> > jbd2_journal_head 742 1008 112 36 1 : tunables 0 0 0 : slabdata 28 28 0
> > fsnotify_event 648 648 112 36 1 : tunables 0 0 0 : slabdata 18 18 0
> > sock_inode_cache 577 624 640 12 2 : tunables 0 0 0 : slabdata 52 52 0
> > tw_sock_TCP 504 504 192 21 1 : tunables 0 0 0 : slabdata 24 24 0
> > TCP 356 414 1728 18 8 : tunables 0 0 0 : slabdata 23 23 0
> > RAW 342 342 832 19 4 : tunables 0 0 0 : slabdata 18 18 0
> > jbd2_journal_handle 340 340 24 170 1 : tunables 0 0 0 : slabdata 2 2 0
> > blkdev_requests 322 322 344 23 2 : tunables 0 0 0 : slabdata 14 14 0
> > task_struct 293 357 1504 21 8 : tunables 0 0 0 : slabdata 17 17 0
> > UDP 285 285 832 19 4 : tunables 0 0 0 : slabdata 15 15 0
> > files_cache 276 276 704 23 4 : tunables 0 0 0 : slabdata 12 12 0
> > nfsd4_openowners 220 320 392 20 2 : tunables 0 0 0 : slabdata 16 16 0
> > mm_struct 216 216 896 18 4 : tunables 0 0 0 : slabdata 12 12 0
> > sighand_cache 199 225 2112 15 8 : tunables 0 0 0 : slabdata 15 15 0
> > nfsd4_delegations 198 198 368 22 2 : tunables 0 0 0 : slabdata 9 9 0
> > kmem_cache_node 192 192 64 64 1 : tunables 0 0 0 : slabdata 3 3 0
> > xfs_buf_item 162 162 224 18 1 : tunables 0 0 0 : slabdata 9 9 0
> > ip_fib_trie 146 146 56 73 1 : tunables 0 0 0 : slabdata 2 2 0
> > ext4_io_end 140 154 1128 14 4 : tunables 0 0 0 : slabdata 11 11 0
> > dnotify_mark 120 120 136 30 1 : tunables 0 0 0 : slabdata 4 4 0
> > TCPv6 104 119 1856 17 8 : tunables 0 0 0 : slabdata 7 7 0
> > cfq_queue 102 102 232 17 1 : tunables 0 0 0 : slabdata 6 6 0
> > Acpi-State 102 102 80 51 1 : tunables 0 0 0 : slabdata 2 2 0
> > sigqueue 100 100 160 25 1 : tunables 0 0 0 : slabdata 4 4 0
> > xfs_efd_item 80 80 400 20 2 : tunables 0 0 0 : slabdata 4 4 0
> > tw_sock_TCPv6 64 144 256 16 1 : tunables 0 0 0 : slabdata 9 9 0
> > bdev_cache 57 57 832 19 4 : tunables 0 0 0 : slabdata 3 3 0
> > blkdev_queue 54 54 1744 18 8 : tunables 0 0 0 : slabdata 3 3 0
> > net_namespace 52 52 2432 13 8 : tunables 0 0 0 : slabdata 4 4 0
> > kmalloc-8192 52 52 8192 4 8 : tunables 0 0 0 : slabdata 13 13 0
> > kmem_cache 42 42 192 21 1 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_log_ticket 40 40 200 20 1 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_btree_cur 38 38 208 19 1 : tunables 0 0 0 : slabdata 2 2 0
> > rpc_inode_cache 38 38 832 19 4 : tunables 0 0 0 : slabdata 2 2 0
> > nf_conntrack_expect 34 34 240 17 1 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_da_state 32 32 488 16 2 : tunables 0 0 0 : slabdata 2 2 0
> > UDPv6 32 32 1024 16 4 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_trans 28 28 280 14 1 : tunables 0 0 0 : slabdata 2 2 0
> > taskstats 24 24 328 12 1 : tunables 0 0 0 : slabdata 2 2 0
> > dio 24 24 640 12 2 : tunables 0 0 0 : slabdata 2 2 0
> > posix_timers_cache 23 23 176 23 1 : tunables 0 0 0 : slabdata 1 1 0
> > hugetlbfs_inode_cache 14 14 560 14 2 : tunables 0 0 0 : slabdata 1 1 0
> > xfrm_dst_cache 0 0 384 21 2 : tunables 0 0 0 : slabdata 0 0 0
> > user_namespace 0 0 1072 15 4 : tunables 0 0 0 : slabdata 0 0 0
> > UDPLITEv6 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
> > UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
> > kcopyd_job 0 0 3240 10 8 : tunables 0 0 0 : slabdata 0 0 0
> > flow_cache 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
> > ext2_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> > ext2_inode_cache 0 0 752 21 4 : tunables 0 0 0 : slabdata 0 0 0
> > dquot 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
> > dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-512 0 0 512 16 2 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-256 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-192 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-1024 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
> > bsg_cmd 0 0 312 13 1 : tunables 0 0 0 : slabdata 0 0 0
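(A listing like the one above can be reproduced by sorting /proc/slabinfo on
its second column, the active-object count. A minimal sketch; it is
demonstrated on a small inlined sample because reading /proc/slabinfo
normally requires root:)

```shell
# /proc/slabinfo lines look like: name active_objs num_objs objsize ...
# after two header lines. On a live system, sorted descending on column 2:
#   tail -n +3 /proc/slabinfo | sort -k2,2nr | head
# Demonstrated here on a small sample:
sample='dentry 118818 118818 192
xfs_inode 192941 193205 960
buffer_head 24892 63843 104'
printf '%s\n' "$sample" | sort -k2,2nr | head -n 1
# prints the busiest cache: xfs_inode 192941 193205 960
```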
> >
> > > Could you also take a few snapshots of /proc/420/stack to see what kswapd
> > > is doing?
> >
> > Uhm, there is no such entry in proc. Guess I need to enable some kernel
> > option and recompile?
>
> Yes, you need CONFIG_STACKTRACE. But you can get similar information
> with sysrq+t.

Ok, here it is (4 outputs taken at random times during the dd run):
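(For reference, a task dump equivalent to sysrq+t can be triggered from a
shell roughly like this; it needs root, CONFIG_MAGIC_SYSRQ, and the standard
procfs mount:)

```shell
# Ask the kernel to dump all task states and stacks into the kernel log
# (same as pressing sysrq+t on the console), then pull kswapd0's entry
# back out of the log. Requires root.
echo t > /proc/sysrq-trigger
dmesg | grep -A 20 'kswapd0'
```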

<6>kswapd0 R running task 0 420 2 0x00000000
<4> 0000000000000001 0000000000000001 ffff88011b3bdc80 0000000000000000
<4> 0000000000000000 0000000000000001 ffff88011b3bde70 ffff8800d0ab1468
<4> ffffffffffffff96 ffffffff81059391 0000000000000010 0000000000000246
<4>Call Trace:
<4> [<ffffffff81059391>] ? down_read_trylock+0x11/0x20
<4> [<ffffffff810da7ad>] ? grab_super_passive+0x1d/0xb0
<4> [<ffffffff810d9ced>] ? drop_super+0xd/0x20
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15c4>] ? shrink_slab+0x74/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb

<6>kswapd0 R running task 0 420 2 0x00000000
<4> ffff88011aeff630 000000000000003f ffffffff811c9601 ffff880118bb7cc0
<4> 0000000000000040 0000000000001443 ffff880117cc4000 0000000000000001
<4> 0000000000000000 ffff88011b3bde70 0000000000000000 ffffffffa00f85bf
<4>Call Trace:
<4> [<ffffffff811c9601>] ? radix_tree_gang_lookup_tag+0x81/0xf0
<4> [<ffffffffa00f85bf>] ? xfs_perag_get_tag+0x1f/0x40 [xfs]
<4> [<ffffffff810da7ad>] ? grab_super_passive+0x1d/0xb0
<4> [<ffffffff810d9c48>] ? put_super+0x18/0x20
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15e0>] ? shrink_slab+0x90/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb

<6>kswapd0 R running task 0 420 2 0x00000000
<4> ffff88011aeff630 000000000000003f ffffffff811c9601 ffff88011b0646c0
<4> 0000000000000040 0000000000000d3e ffff880117cc4000 0000000000000005
<4> 0000000000000000 ffff88011b3bde70 0000000000000000 ffffffffa00f85bf
<4>Call Trace:
<4> [<ffffffff811c9601>] ? radix_tree_gang_lookup_tag+0x81/0xf0
<4> [<ffffffffa00f85bf>] ? xfs_perag_get_tag+0x1f/0x40 [xfs]
<4> [<ffffffff810da7d5>] ? grab_super_passive+0x45/0xb0
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15e0>] ? shrink_slab+0x90/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb

<6>kswapd0 R running task 0 420 2 0x00000000
<4> 000000000003337a 000000000001747d 0000000000000000 0000000000000004
<4> ffffffff816492c0 0000000000000000 ffff880117cc4000 0000000000000000
<4> 0000000000000000 ffff88011b3bde70 0000000000000000 ffffffffa00f85bf
<4>Call Trace:
<4> [<ffffffffa00f85bf>] ? xfs_perag_get_tag+0x1f/0x40 [xfs]
<4> [<ffffffff810da7d5>] ? grab_super_passive+0x45/0xb0
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15e0>] ? shrink_slab+0x90/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb


> > This will reset uptime and mask the problem for another 5-10 days...
>
> Then don't do that ;)
>
> > > > P.S.: The message above was written a couple of days ago. Now I'm at 10
> > > > days of uptime, and this is the result as of today:
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root R 93 0.0 110:48.48 0 kswapd0
> > > > 30085 nowhere D 42 0.2 0:04.36 10m dd
> > > >
> > > > PPS: Please CC me.
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/