Re: [PATCH] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged

From: Vijay Balakrishna
Date: Wed Sep 16 2020 - 14:40:50 EST




On 9/15/2020 11:53 PM, Michal Hocko wrote:
On Tue 15-09-20 08:48:08, Vijay Balakrishna wrote:


On 9/15/2020 1:18 AM, Michal Hocko wrote:
On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:


On 9/14/2020 7:33 AM, Michal Hocko wrote:
On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
When memory is hotplug added or removed, min_free_kbytes must be
recalculated based on what is expected by khugepaged. Currently,
after hotplug, min_free_kbytes is set to the lower default, and the higher
default set when THP is enabled is lost. This leaves the system with small
min_free_kbytes which isn't suitable for systems especially with network
intensive loads. Typical failure symptoms include HW WATCHDOG reset,
soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
kills.

Care to explain some more please? The whole point of increasing
min_free_kbytes for THP is to get a larger free memory with a hope that
huge pages will be more likely to appear. While this might help
other users that need high order pages, it is definitely not the
primary reason behind it. Could you provide an example with some more
data?

Thanks Michal. I haven't looked into THP as part of my investigation, so I
cannot comment.

In our use case we hot remove ~2GB of 8GB total (on our SoC) during
normal reboot/shutdown. This memory is hot added as movable type by a
late systemd service during start-of-day.

In our stress test we first ran into HW WATCHDOG recovery. After enabling
the kernel watchdog we started seeing soft lockup and hung task notices.
Failure symptoms varied: stack traces of hung tasks sometimes showed them
trying to allocate GFP_ATOMIC memory or looping in do_notify_resume, and we
also saw NETDEVICE WATCHDOG timeouts, OOM process kills, etc. During the
investigation we reran the stress test without the hotplug use case.
Surprisingly, this run didn't encounter the said problems. This led to
comparing what differed between the two runs. While looking at various
globals and studying the hotplug code, I uncovered the issue of failing to
restore min_free_kbytes. In particular, on our 8GB SoC min_free_kbytes went
down to 8703 from 22528 after hotplug add.
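
For reference, the two values above can be approximated with the sketch
below. It assumes the formulas as they appear to be in 5.4's
mm/page_alloc.c (init_per_zone_wmark_min) and mm/khugepaged.c
(set_recommended_min_free_kbytes), a 4kB page size, a 512-page pageblock,
and a single populated zone at the time khugepaged computes its value. The
Python function names are illustrative only, and the lowmem figure is
borrowed from the "managed" size of the Normal zone in the OOM splat, which
is an assumption rather than the exact input the kernel used.

```python
import math


def default_min_free_kbytes(lowmem_kbytes, floor=128, ceiling=65536):
    """Default auto-tuned value: int_sqrt(lowmem_kbytes * 16), clamped.

    The 128..65536 clamp is what 5.4 is assumed to use here.
    """
    value = math.isqrt(lowmem_kbytes * 16)
    return max(floor, min(value, ceiling))


def khugepaged_recommended_kbytes(nr_zones, lowmem_pages,
                                  pageblock_nr_pages=512,
                                  migrate_pcptypes=3,
                                  page_kbytes=4):
    """Value khugepaged asks for when THP is enabled (sketch)."""
    # Two free pageblocks per zone to help fragmentation avoidance.
    recommended = pageblock_nr_pages * nr_zones * 2
    # Plus room for fallbacks between the per-cpu migrate types.
    recommended += (pageblock_nr_pages * nr_zones
                    * migrate_pcptypes * migrate_pcptypes)
    # Never reserve more than 5% of lowmem.
    recommended = min(recommended, lowmem_pages // 20)
    return recommended * page_kbytes


# Assumption: Normal zone "managed" size from the splat, in kB.
lowmem_kbytes = 4750196
print(default_min_free_kbytes(lowmem_kbytes))
print(khugepaged_recommended_kbytes(1, lowmem_kbytes // 4))  # 22528
```

With one zone the khugepaged value comes out at 22528, matching the
pre-hotplug value reported above. The default lands in the 8700 range,
close to the 8703 seen after hot add; the kernel derives lowmem from
nr_free_buffer_pages(), which is somewhat below the managed size, so an
exact match is not expected from this approximation.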

Did you try to increase min_free_kbytes manually after hot remove? Btw.

No, in our use case memory hot remove is done during shutdown.

I do not follow. If you are hotremoving during shutdown then how come
the value of min_free_kbytes matter at all?

We hot add the memory during boot (it is the same memory that was hot removed during shutdown); the removed memory is treated as persistent.


I would consider oom killer invocation due to min_free_kbytes really
weird behavior. If anything the higher value would cause more memory
reclaim and potentially oom rather than smaller one.

Yes, we wondered about it too. One panic stack trace (after many OOM kills):

[330321.174240] Out of memory and no killable processes...
[330321.179658] Kernel panic - not syncing: System is deadlocked on memory
[330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G O 5.4.51-xxx #1
[330321.196900] Hardware name: Overlake (DT)
[330321.201038] Call trace:
[330321.203660] dump_backtrace+0x0/0x1d0
[330321.207533] show_stack+0x20/0x2c
[330321.211048] dump_stack+0xe8/0x150
[330321.214656] panic+0x18c/0x3b4
[330321.217901] out_of_memory+0x4c0/0x6e4
[330321.221863] __alloc_pages_nodemask+0xbdc/0x1c90
[330321.226722] alloc_pages_current+0x21c/0x2b0
[330321.231220] alloc_slab_page+0x1e0/0x7d8
[330321.235361] new_slab+0x2e8/0x2f8
[330321.238874] ___slab_alloc+0x45c/0x59c
[330321.242835] kmem_cache_alloc+0x2d4/0x360
[330321.247065] getname_flags+0x6c/0x2a8
[330321.250938] user_path_at_empty+0x3c/0x68
[330321.255168] do_readlinkat+0x7c/0x17c
[330321.259039] __arm64_sys_readlinkat+0x5c/0x70
[330321.263627] el0_svc_handler+0x1b8/0x32c
[330321.267767] el0_svc+0x10/0x14
[330321.271026] SMP: stopping secondary CPUs
[330321.275382] Starting crashdump kernel...
[330321.279526] Bye!

Do you have the full oom splat? The fact that previous oom killer
invocations haven't helped and all the eligible tasks have been killed
and you still hit the oom would suggest there is a lot of memory
allocated without a direct relation to tasks. I fail to see how
min_free_kbytes would be related.

OOM splat below. I see we had kmem leak detection turned on here. We haven't run the stress test with kmem leak detection since uncovering the low min_free_kbytes. During the investigation we wanted to make sure there were no kmem leaks, and no significant leaks were detected.

[330319.234959] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/dbus-broker.service,task=dbus-broker,pid=541,uid=999
[330319.251380] Out of memory: Killed process 541 (dbus-broker) total-vm:4400kB, anon-rss:892kB, file-rss:380kB, shmem-rss:0kB, UID:999 pgtables:44kB oom_score_adj:-900
[330319.267587] oom_reaper: reaped process 541 (dbus-broker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[330319.766059] systemd invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=1, oom_score_adj=0
[330319.776060] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G O 5.4.51-xxx #1
[330319.790612] Call trace:
[330319.793240] dump_backtrace+0x0/0x1d0
[330319.797112] show_stack+0x20/0x2c
[330319.800628] dump_stack+0xe8/0x150
[330319.804234] dump_header+0x80/0x494
[330319.807925] out_of_memory+0x480/0x6e4
[330319.811887] __alloc_pages_nodemask+0xbdc/0x1c90
[330319.816745] alloc_pages_current+0x21c/0x2b0
[330319.821244] alloc_slab_page+0x1e0/0x7d8
[330319.825383] new_slab+0x2e8/0x2f8
[330319.828895] ___slab_alloc+0x45c/0x59c
[330319.832854] kmem_cache_alloc+0x2d4/0x360
[330319.837086] getname_flags+0x6c/0x2a8
[330319.840958] user_path_at_empty+0x3c/0x68
[330319.845188] do_readlinkat+0x7c/0x17c
[330319.849059] __arm64_sys_readlinkat+0x5c/0x70
[330319.853648] el0_svc_handler+0x1b8/0x32c
[330319.857788] el0_svc+0x10/0x14
[330319.861064] Mem-Info:
[330319.863519] active_anon:60744 inactive_anon:109226 isolated_anon:0
active_file:6418 inactive_file:3869 isolated_file:2
unevictable:0 dirty:8 writeback:1 unstable:0
slab_reclaimable:34660 slab_unreclaimable:795718
mapped:1256 shmem:165765 pagetables:689 bounce:0
free:340962 free_pcp:4672 free_cma:0
[330319.898873] Node 0 active_anon:242976kB inactive_anon:436904kB active_file:25672kB inactive_file:15476kB unevictable:0kB isolated(anon):0kB isolated(file):8kB mapped:5024kB dirty:32kB writeback:4kB shmem:663060kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 73728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[330319.928124] Node 0 Normal free:12652kB min:14344kB low:19092kB high:23840kB active_anon:55340kB inactive_anon:60276kB active_file:60kB inactive_file:128kB unevictable:0kB writepending:4kB present:6220656kB managed:4750196kB mlocked:0kB kernel_stack:9568kB pagetables:2756kB bounce:0kB free_pcp:10056kB local_pcp:1376kB free_cma:0kB
[330319.958376] lowmem_reserve[]: 0 15360 15360
[330319.962814] Node 0 Movable free:1351196kB min:2544kB low:4508kB high:6472kB active_anon:188352kB inactive_anon:376856kB active_file:26120kB inactive_file:15308kB unevictable:0kB writepending:32kB present:1966080kB managed:1966080kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:8632kB local_pcp:1336kB free_cma:0kB
[330319.993157] lowmem_reserve[]: 0 0 0
[330319.996879] Node 0 Normal: 3138*4kB (UME) 38*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12856kB
[330320.009592] Node 0 Movable: 16382*4kB (M) 2980*8kB (M) 311*16kB (M) 77*32kB (M) 28*64kB (M) 5*128kB (M) 6*256kB (M) 1*512kB (M) 1*1024kB (M) 120*2048kB (M) 245*4096kB (M) = 1351592kB
[330320.026541] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[330320.035631] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
[330320.044543] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[330320.053363] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
[330320.062004] 176215 total pagecache pages
[330320.066165] 2046684 pages RAM
[330320.069339] 0 pages HighMem/MovableOnly
[330320.073410] 367615 pages reserved
[330320.076943] 0 pages hwpoisoned
[330320.080190] Unreclaimable slab info:
[330320.083991] Name Used Total
[330320.090244] bio-3 560KB 672KB
[330320.095747] bio-2 757KB 832KB
[330320.101254] nf-frags 7KB 15KB
[330320.106763] fib6_nodes 16KB 16KB
[330320.112270] ip6_dst_cache 70KB 70KB
[330320.117779] RAWv6 154KB 154KB
[330320.123296] UDPv6 246KB 246KB
[330320.128806] TCPv6 146KB 146KB
[330320.134315] nf_conntrack 84KB 94KB
[330320.139823] io 66KB 80KB
[330320.145332] sd_ext_cdb 3KB 3KB
[330320.150838] virtio_scsi_cmd 16KB 16KB
[330320.156345] sgpool-128 29KB 29KB
[330320.161852] sgpool-64 31KB 31KB
[330320.167359] sgpool-32 31KB 31KB
[330320.172867] sgpool-16 15KB 15KB
[330320.178374] sgpool-8 7KB 7KB
[330320.183880] mqueue_inode_cache 31KB 31KB
[330320.189479] jbd2_inode 35KB 43KB
[330320.194989] ext4_system_zone 21KB 55KB
[330320.200497] ext4_bio_post_read_ctx 15KB 15KB
[330320.206453] kioctx 255KB 255KB
[330320.211960] aio_kiocb 54KB 77KB
[330320.217477] dio 196KB 323KB
[330320.222985] bio-1 7KB 7KB
[330320.228499] UNIX 308KB 369KB
[330320.234009] tcp_bind_bucket 20KB 20KB
[330320.239518] ip_fib_trie 16KB 16KB
[330320.245024] ip_fib_alias 15KB 15KB
[330320.250537] ip_dst_cache 64KB 64KB
[330320.256047] RAW 158KB 158KB
[330320.261553] UDP 247KB 247KB
[330320.267060] tw_sock_TCP 15KB 15KB
[330320.272566] request_sock_TCP 30KB 30KB
[330320.278073] TCP 278KB 278KB
[330320.283580] hugetlbfs_inode_cache 62KB 62KB
[330320.289446] eventpoll_pwq 31KB 31KB
[330320.294953] eventpoll_epi 31KB 31KB
[330320.300460] inotify_inode_mark 31KB 31KB
[330320.306061] request_queue 467KB 499KB
[330320.311568] blkdev_ioc 61KB 61KB
[330320.317087] bio-0 259KB 600KB
[330320.322603] biovec-max 3060KB 3718KB
[330320.328112] biovec-64 252KB 315KB
[330320.333619] biovec-16 62KB 78KB
[330320.339127] khugepaged_mm_slot 31KB 35KB
[330320.344723] user_namespace 122KB 122KB
[330320.350230] uid_cache 64KB 64KB
[330320.355738] iommu_iova 1076KB 1076KB
[330320.361245] dmaengine-unmap-2 4KB 4KB
[330320.366754] skbuff_fclone_cache 128KB 160KB
[330320.374394] skbuff_head_cache 79402KB 106265KB
[330320.379908] file_lock_cache 62KB 92KB
[330320.385416] file_lock_ctx 42KB 47KB
[330320.390924] fsnotify_mark_connector 46KB 51KB
[330320.396969] net_namespace 64KB 64KB
[330320.402477] task_delay_info 66KB 78KB
[330320.407984] taskstats 63KB 63KB
[330320.413491] proc_dir_entry 279KB 289KB
[330320.418998] pde_opener 31KB 31KB
[330320.424507] seq_file 63KB 94KB
[330320.430014] sigqueue 55KB 55KB
[330320.435525] shmem_inode_cache 1086KB 1221KB
[330320.441036] kernfs_iattrs_cache 782KB 826KB
[330320.446746] kernfs_node_cache 7943KB 8221KB
[330320.452306] mnt_cache 1579KB 2756KB
[330320.457813] filp 265KB 265KB
[330320.463321] names_cache 21543KB 21543KB
[330320.468833] hashtab_node 115KB 131KB
[330320.474341] ebitmap_node 626KB 641KB
[330320.479849] avtab_node 1047KB 1063KB
[330320.485361] avc_node 118KB 118KB
[330320.490868] iint_cache 23KB 23KB
[330320.496376] lsm_inode_cache 8578KB 8578KB
[330320.501902] lsm_file_cache 147KB 552KB
[330320.507416] key_jar 157KB 157KB
[330320.512927] nsproxy 31KB 31KB
[330320.518436] vm_area_struct 89KB 127KB
[330320.523944] mm_struct 252KB 252KB
[330320.529451] fs_cache 64KB 64KB
[330320.534959] files_cache 255KB 255KB
[330320.540468] signal_cache 507KB 569KB
[330320.545979] sighand_cache 633KB 841KB
[330320.551488] task_struct 1721KB 1940KB
[330320.556997] cred_jar 119KB 136KB
[330320.562504] anon_vma_chain 35KB 47KB
[330320.568011] anon_vma 73KB 95KB
[330320.573520] pid 101KB 120KB
[330320.579029] numa_policy 3KB 3KB
[330320.584536] trace_event_file 262KB 262KB
[330320.590041] ftrace_event_field 184KB 184KB
[330320.595638] pool_workqueue 128KB 128KB
[330320.601146] task_group 64KB 64KB
[330320.606652] vmap_area 77KB 78KB
[330320.612159] page->ptl 517KB 517KB
[330320.617665] kmemleak_scan_area 47KB 47KB
[330320.623262] kmemleak_object 2449340KB 2449340KB
[330320.628773] kmalloc-8k 4848KB 4928KB
[330320.634423] kmalloc-4k 48944KB 61856KB
[330320.639946] kmalloc-2k 11768KB 12480KB
[330320.645453] kmalloc-1k 10752KB 10752KB
[330320.651049] kmalloc-512 87024KB 94124KB
[330320.656561] kmalloc-256 2433KB 2528KB
[330320.662359] kmalloc-128 24071KB 29104KB
[330320.667869] kmem_cache_node 867KB 896KB
[330320.673377] kmem_cache 2162KB 2171KB
[330320.678881] Tasks state (memory values in pages):
[330320.683848] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[330320.692867] [ 238] 0 238 3446 1394 61440 0 -1000 systemd-udevd
[330320.702174] [ 20024] 62412 20008 313895 0 176128 0 0 test_ntttcp
[330320.711281] [ 20891] 64250 20885 67810 0 49152 0 0 test_cpuloadgen
[330320.720743] [ 20704] 62689 20704 157371 23 90112 0 0 fio
[330320.729119] [ 20708] 62689 20708 157372 5 90112 0 0 fio
[330320.737496] [ 20709] 62689 20709 157373 6 90112 0 0 fio
[330320.745874] [ 20710] 62689 20710 157374 6 90112 0 0 fio
[330320.754267] [ 4698] 0 4698 2231 0 57344 0 0 umount
[330320.762919] [ 4699] 0 4699 2231 0 57344 0 0 umount
[330320.771565] [ 4700] 0 4700 2231 0 53248 0 0 umount
[330320.780212] [ 4701] 0 4701 2231 0 57344 0 0 umount
[330320.788858] [ 4705] 0 4705 2231 0 53248 0 0 umount
[330320.797505] [ 4706] 0 4706 2231 0 57344 0 0 umount
[330320.806152] [ 4707] 0 4707 2231 0 61440 0 0 umount
[330320.814798] [ 4709] 0 4709 2265 0 57344 0 0 veritysetup
[330320.823891] [ 4715] 0 4715 2231 0 61440 0 0 umount
[330320.832537] [ 4716] 0 4716 2231 0 57344 0 0 umount
[330320.841185] [ 4717] 0 4717 2231 0 57344 0 0 umount
[330320.849832] [ 4721] 0 4721 2231 0 57344 0 0 umount
[330320.858478] [ 4722] 0 4722 2231 0 57344 0 0 umount
[330320.867124] [ 4723] 0 4723 2231 0 57344 0 0 umount
[330320.875770] [ 4728] 0 4728 2231 0 61440 0 0 umount
[330320.884423] [ 4729] 0 4729 2231 0 57344 0 0 umount
[330320.893075] [ 4730] 0 4730 2231 0 57344 0 0 umount
[330320.901722] [ 4731] 0 4731 2231 0 57344 0 0 umount
[330320.910369] [ 4732] 0 4732 2231 0 61440 0 0 umount
[330320.919016] [ 4733] 0 4733 2231 0 57344 0 0 umount
[330320.927662] [ 4735] 0 4735 2231 0 61440 0 0 umount
[330320.936307] [ 4736] 0 4736 2231 0 61440 0 0 umount
[330320.944953] [ 4737] 0 4737 2231 0 57344 0 0 umount
[330320.953599] [ 4738] 0 4738 2231 0 61440 0 0 umount
[330320.962245] [ 4739] 0 4739 2231 0 57344 0 0 umount
[330320.970891] [ 4740] 0 4740 2231 0 53248 0 0 umount
[330320.979536] [ 4744] 0 4744 2231 0 61440 0 0 umount
[330320.988187] [ 4746] 0 4746 2231 0 57344 0 0 umount
[330320.996832] [ 4747] 0 4747 2231 0 61440 0 0 umount
[330321.005479] [ 4757] 0 4757 2265 0 53248 0 0 veritysetup
[330321.014573] [ 4758] 0 4758 2231 0 57344 0 0 umount
[330321.023225] [ 4759] 0 4759 2231 0 57344 0 0 umount
[330321.031872] [ 4760] 0 4760 2231 0 61440 0 0 umount
[330321.040519] [ 5922] 0 5922 3012 0 61440 0 0 systemd-user-ru
[330321.049972] [ 6557] 0 6557 2231 0 61440 0 0 umount
[330321.058618] [ 6558] 0 6558 2231 0 61440 0 0 umount
[330321.067264] [ 6563] 0 6563 2231 0 57344 0 0 umount
[330321.075910] [ 6567] 0 6567 2231 0 57344 0 0 umount
[330321.084556] [ 6569] 0 6569 2231 0 53248 0 0 umount
[330321.093194] [ 6570] 0 6570 2231 0 65536 0 0 umount
[330321.101840] [ 6575] 0 6575 2231 0 57344 0 0 umount
[330321.110485] [ 6578] 0 6578 2231 0 61440 0 0 umount
[330321.119132] [ 6579] 0 6579 2231 0 57344 0 0 umount
[330321.127778] [ 6580] 0 6580 2231 0 61440 0 0 umount
[330321.136425] [ 7215] 0 7215 5087 0 69632 0 0 systemd-journal
[330321.145879] [ 8410] 0 8410 5087 0 65536 0 0 systemd-journal
[330321.155336] [ 9603] 0 9603 5087 0 73728 0 0 systemd-journal
[330321.164790] [ 10366] 0 10366 3012 0 61440 0 0 systemd-user-ru
[330321.174240] Out of memory and no killable processes...
[330321.179658] Kernel panic - not syncing: System is deadlocked on memory
[330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G O 5.4.51-xxx #1
[330321.201038] Call trace:
[330321.203660] dump_backtrace+0x0/0x1d0
[330321.207533] show_stack+0x20/0x2c
[330321.211048] dump_stack+0xe8/0x150
[330321.214656] panic+0x18c/0x3b4
[330321.217901] out_of_memory+0x4c0/0x6e4
[330321.221863] __alloc_pages_nodemask+0xbdc/0x1c90
[330321.226722] alloc_pages_current+0x21c/0x2b0
[330321.231220] alloc_slab_page+0x1e0/0x7d8
[330321.235361] new_slab+0x2e8/0x2f8
[330321.238874] ___slab_alloc+0x45c/0x59c
[330321.242835] kmem_cache_alloc+0x2d4/0x360
[330321.247065] getname_flags+0x6c/0x2a8
[330321.250938] user_path_at_empty+0x3c/0x68
[330321.255168] do_readlinkat+0x7c/0x17c
[330321.259039] __arm64_sys_readlinkat+0x5c/0x70
[330321.263627] el0_svc_handler+0x1b8/0x32c
[330321.267767] el0_svc+0x10/0x14
[330321.271026] SMP: stopping secondary CPUs
[330321.275382] Starting crashdump kernel...
[330321.279526] Bye!
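
As a cross-check of the splat above, the per-zone figures can be recomputed
from the log lines themselves. The sketch below is only an illustration:
the helper names are made up, the zone line is abbreviated to the fields it
uses, and the page-to-kB conversion assumes 4kB pages (consistent with
active_anon:60744 pages being reported as 242976kB).

```python
import re


def buddy_total_kbytes(line):
    """Sum the 'count*sizekB' terms of a per-zone free-list line."""
    terms = re.findall(r"(\d+)\*(\d+)kB", line)
    return sum(int(count) * int(size) for count, size in terms)


def below_min_watermark(line):
    """True if the zone's free memory is under its min watermark."""
    free = int(re.search(r"\bfree:(\d+)kB", line).group(1))
    wmark_min = int(re.search(r"\bmin:(\d+)kB", line).group(1))
    return free < wmark_min


# Lines copied from the splat (zone line abbreviated).
normal_zone = "Node 0 Normal free:12652kB min:14344kB low:19092kB high:23840kB"
normal_buddy = ("Node 0 Normal: 3138*4kB (UME) 38*8kB (UM) 0*16kB 0*32kB "
                "0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB")

print(buddy_total_kbytes(normal_buddy))   # 12856
print(below_min_watermark(normal_zone))   # True

# Share of unreclaimable slab held by kmemleak_object (4kB pages assumed).
slab_unreclaimable_kbytes = 795718 * 4
kmemleak_object_kbytes = 2449340
print(kmemleak_object_kbytes / slab_unreclaimable_kbytes)
```

The Normal zone is below its min watermark with nothing larger than order-1
free, and kmemleak_object accounts for roughly three quarters of the
unreclaimable slab in this particular run.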


Then, while searching, I came across the documented warning below. In the
above instance, the panic after OOM kills happened after 3+ days of stress
run (a mixture of ttcp, cpuloadgen and fio).

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-configuration_tools-configuring_system_memory_capacity

Warning

Extreme values can damage your system. Setting min_free_kbytes to an
extremely low value prevents the system from reclaiming memory, which can
result in system hangs and OOM-killing processes. However, setting
min_free_kbytes too high (for example, to 5–10% of total system memory)
causes the system to enter an out-of-memory state immediately, resulting in
the system spending too much time reclaiming memory.
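
To put the two values from this thread against the 5-10% figure quoted in
the warning, a small sketch follows. The helper name is made up, and total
memory is taken from the "pages RAM" line of the splat assuming 4kB pages.

```python
def min_free_percent(min_free_kbytes, total_kbytes):
    """min_free_kbytes expressed as a percentage of total memory."""
    return 100.0 * min_free_kbytes / total_kbytes


# Assumption: "2046684 pages RAM" from the splat, 4kB per page.
total_kbytes = 2046684 * 4

for value in (8703, 22528):
    percent = min_free_percent(value, total_kbytes)
    print(f"{value} kB = {percent:.2f}% of RAM")
```

Both values are well under 1% of RAM, so neither is near the "too high"
range the warning describes.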

The auto tuned value should never reach such a low value to cause
problems.

The auto tuned value is incorrect after a memory hotplug operation; in our use case memory hot add occurs very early during boot.

Thanks,
Vijay