Re: [PATCH] mm, vmscan: add cond_resched into shrink_node_memcg

From: Donald Buczek
Date: Fri Dec 09 2016 - 05:54:15 EST


On 12/02/16 10:58, Michal Hocko wrote:
From: Michal Hocko <mhocko@xxxxxxxx>

Boris Zhmurov has reported RCU stalls during the kswapd reclaim:
17511.573645] INFO: rcu_sched detected stalls on CPUs/tasks:
[17511.573699] 23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
[17511.573740] (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
[17511.573776] Task dump for CPU 23:
[17511.573777] kswapd1 R running task 0 148 2 0x00000008
[17511.573781] 0000000000000000 ffff8efe5f491400 ffff8efe44523e68 ffff8f16a7f49000
[17511.573782] 0000000000000000 ffffffffafb67482 0000000000000000 0000000000000000
[17511.573784] 0000000000000000 0000000000000000 ffff8efe44523e58 00000000016dbbee
[17511.573786] Call Trace:
[17511.573796] [<ffffffffafb67482>] ? shrink_node+0xd2/0x2f0
[17511.573798] [<ffffffffafb683ab>] ? kswapd+0x2cb/0x6a0
[17511.573800] [<ffffffffafb680e0>] ? mem_cgroup_shrink_node+0x160/0x160
[17511.573806] [<ffffffffafa8b63d>] ? kthread+0xbd/0xe0
[17511.573810] [<ffffffffafa2967a>] ? __switch_to+0x1fa/0x5c0
[17511.573813] [<ffffffffaff9095f>] ? ret_from_fork+0x1f/0x40
[17511.573815] [<ffffffffafa8b580>] ? kthread_create_on_node+0x180/0x180

a closer code inspection has shown that we might indeed miss all the
scheduling points in the reclaim path if no pages can be isolated from
the LRU list. This is a pathological case but other reports from Donald
Buczek have shown that we might indeed hit such a path:
clusterd-989 [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
kswapd1-86 [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
[...]
kswapd1-86 [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1

this is minute long snapshot which didn't take a single page from the
LRU. It is not entirely clear why only 1303 pages have been scanned
during that time (maybe there was a heavy IRQ activity interfering).

In any case it looks like we can really hit long periods without
scheduling on non preemptive kernels so an explicit cond_resched() in
shrink_node_memcg which is independent on the reclaim operation is due.

Reported-and-tested-by: Boris Zhmurov <bb@xxxxxxxxxxxxxx>
Reported-by: Donald Buczek <buczek@xxxxxxxxxxxxx>
Reported-by: "Christopher S. Aker" <caker@xxxxxxxxxxxx>
Reported-by: Paul Menzel <pmenzel@xxxxxxxxxxxxx>
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---

Hi,
there were multiple reportes of the similar RCU stalls. Only Boris has
confirmed that this patch helps in his workload. Others might see a
slightly different issue and that should be investigated if it is the
case. As pointed out by Paul [1] cond_resched might be not sufficient
to silence RCU stalls because that would require a real scheduling.
This is a separate problem, though, and Paul is working with Peter [2]
to resolve it.

Anyway, I believe that this patch should be a good start because it
really seems that nr_taken=0 during the LRU isolation can be triggered
in the real life. All reporters are agreeing to start seeing this issue
when moving on to 4.8 kernel which might be just a coincidence or a
different behavior of some subsystem. Well, MM has moved from zone to
node reclaim but I couldn't have found any direct relation to that
change.

[1] http://lkml.kernel.org/r/20161130142955.GS3924@xxxxxxxxxxxxxxxxxx
[2] http://lkml.kernel.org/r/20161201124024.GB3924@xxxxxxxxxxxxxxxxxx

mm/vmscan.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c05f00042430..c4abf08861d2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2362,6 +2362,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
}
}
+ cond_resched();
+
if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
continue;

Our two backup servers which had rcu stall warnings since 4.8 are running with this patch on top of v4.8.12 for 3 1/2 days now and didn't log any rcu stalls since then. So this patch might be fixing it for our environment, too.

The previous times between boots and first occurrences of rcu stall warnings were:

Server A ("void"): 1d14h 5h 1d4h 2d2h 21h 3d21h
Server B ("null"): 3d12h 2d3h 5d4h 4h 12h

(Yes, this contradicts a previous mail from me, where I wrongly stated "37,0.2,1,2,0.8 hours" for the first server, because I messed up the units. Its "37 hours, 0.2 days, 1 day, 2 days, 0.8 days" which fits the first 5 numbers in the above list. Sorry.)

We should wait a few days longer for a better p-value but there is reason for hope.

Donald

--
Donald Buczek
buczek@xxxxxxxxxxxxx
Tel: +49 30 8413 1433