Re: [patch 0/2] mm: too_many_isolated can stall due to out of sync VM counters

From: Michal Hocko
Date: Tue Nov 14 2023 - 03:20:17 EST


On Mon 13-11-23 20:34:20, Marcelo Tosatti wrote:
> A customer reported seeing processes hung at too_many_isolated,
> while analysis indicated that the problem occurred due to out
> of sync per-CPU stats (see below).
>
> Fix is to use node_page_state_snapshot to avoid the out of stale values.
>
> 2136 static unsigned long
> 2137 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> 2138 struct scan_control *sc, enum lru_list lru)
> 2139 {
> :
> 2145 bool file = is_file_lru(lru);
> :
> 2147 struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> :
> 2150 while (unlikely(too_many_isolated(pgdat, file, sc))) {
> 2151 if (stalled)
> 2152 return 0;
> 2153
> 2154 /* wait a bit for the reclaimer. */
> 2155 msleep(100); <--- some processes were sleeping here, with pending SIGKILL.
> 2156 stalled = true;
> 2157
> 2158 /* We are about to die and free our memory. Return now. */
> 2159 if (fatal_signal_pending(current))
> 2160 return SWAP_CLUSTER_MAX;
> 2161 }
>
> msleep() must be called only when there are too many isolated pages:

What do you mean here?

> 2019 static int too_many_isolated(struct pglist_data *pgdat, int file,
> 2020 struct scan_control *sc)
> 2021 {
> :
> 2030 if (file) {
> 2031 inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> 2032 isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
> 2033 } else {
> :
> 2046 return isolated > inactive;
>
> The return value was true since:
>
> crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE]
> $8 = {
> counter = 1
> }
> crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE]
> $9 = {
> counter = 2
>
> while per_cpu stats had:
>
> crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats
> $85 = (struct per_cpu_nodestat *) 0xffff8000118832e0
> crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42]
> $86 = 0xffff00917fcc32e0
> crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> $87 = -1 '\377'
>
> crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44]
> $89 = 0xffff00917fe032e0
> crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> $91 = -1 '\377'

This doesn't really tell much. How much out of sync they really are
cumulatively over all cpus?

> It seems that processes were trapped in direct reclaim/compaction loop
> because these nodes had few free pages lower than watermark min.
>
> crash> kmem -z | grep -A 3 Normal
> :
> NODE: 4 ZONE: 1 ADDR: ffff00817fffec40 NAME: "Normal"
> SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264
> VM_STAT:
> NR_FREE_PAGES: 68
> --
> NODE: 5 ZONE: 1 ADDR: ffff00897fffec40 NAME: "Normal"
> SIZE: 118784 MIN/LOW/HIGH: 82/200/318
> VM_STAT:
> NR_FREE_PAGES: 45
> --
> NODE: 6 ZONE: 1 ADDR: ffff00917fffec40 NAME: "Normal"
> SIZE: 118784 MIN/LOW/HIGH: 82/200/318
> VM_STAT:
> NR_FREE_PAGES: 53
> --
> NODE: 7 ZONE: 1 ADDR: ffff00997fbbec40 NAME: "Normal"
> SIZE: 118784 MIN/LOW/HIGH: 82/200/318
> VM_STAT:
> NR_FREE_PAGES: 52

How have you concluded that too_many_isolated is at root of this issue.
With a very low NR_FREE_PAGES and many contending allocation the system
could be easily stuck in reclaim. What are other reclaim
characteristics? Is the direct reclaim successful?

--
Michal Hocko
SUSE Labs