Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

From: Michal Hocko
Date: Wed May 25 2016 - 08:10:22 EST


On Wed 25-05-16 00:43:41, Oleg Nesterov wrote:
> On 05/24, Michal Hocko wrote:
> >
> > On Mon 23-05-16 17:14:19, Oleg Nesterov wrote:
> > > On 05/23, Michal Hocko wrote:
> > [...]
> > > > Could you add some tracing and see what are the numbers
> > > > above?
> > >
> > > with the patch below I can press Ctrl-C when it hangs; this breaks the
> > > endless loop, and the output looks like:
> > >
> > > vmscan: ZONE=ffffffff8189f180 0 scanned=0 pages=6
> > > vmscan: ZONE=ffffffff8189eb00 0 scanned=1 pages=0
> > > ...
> > > vmscan: ZONE=ffffffff8189eb00 0 scanned=2 pages=1
> > > vmscan: ZONE=ffffffff8189f180 0 scanned=4 pages=6
> > > ...
> > > vmscan: ZONE=ffffffff8189f180 0 scanned=4 pages=6
> > > vmscan: ZONE=ffffffff8189f180 0 scanned=4 pages=6
> > >
> > > the numbers are always small.
> >
> > Small, but scanned is non-zero and constant, which means it either gets
> > reset repeatedly (something gets freed) or we have stopped scanning. Which
> > pattern do you see? I assume the swap space is full at that point (could
> > you add get_nr_swap_pages() to the output?).
>
> no, I tested this without SWAP,

I have tried that as well, but it just hit the OOM killer almost immediately.

> > I am trying to reproduce but your test case always hits the oom killer:
>
> Did you try to run it in a loop? Usually it takes a while before the system
> hangs.

OK, I wasn't patient enough... Now I can reproduce.

[...]
> So. I spent almost the whole day trying to understand what's going on, and
> of course I failed.
>
> But. It _seems to me_ that the kernel "leaks" some pages on the
> LRU_INACTIVE_FILE list because inactive_file_is_low() returns the wrong
> value. And do not even ask me why I think so; it's unlikely I will be able
> to explain ;) As a reminder, I had never tried to read vmscan.c before.
>
> But. If I change lruvec_lru_size()
>
> - return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> + return zone_page_state_snapshot(lruvec_zone(lruvec), NR_LRU_BASE + lru);
>
> the problem goes away too.

This is a bit surprising, but my testing shows that the result shouldn't
make much difference. I can see some discrepancies between the lruvec size
and zone_reclaimable_pages(), but they are too small to actually matter.
What I see instead is that pages keep getting freed along the way, and each
free resets the NR_PAGES_SCANNED counter. And the number of scanned pages
being cleared there was really large; see the attached serial log. So this
smells like the livelock issue I was talking about earlier.
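
For context, the only difference between the two accessors Oleg swapped is
the per-CPU drift: zone_page_state() reads just the global counter, while
zone_page_state_snapshot() also folds in the per-CPU deltas that haven't
been flushed yet, so the two can differ by up to roughly stat_threshold *
num_online_cpus() pages. Paraphrasing include/linux/vmstat.h (not a
verbatim copy):

static inline unsigned long zone_page_state(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	if (x < 0)
		x = 0;
#endif
	return x;
}

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	/* fold in the per-CPU deltas that haven't hit vm_stat yet */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
#endif
	return x;
}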

Maybe your change to lruvec_lru_size() (or to the threshold size) just
changes the timing so that other operations are less likely to hit the race
with the free path. Or maybe my debugging output is interfering with the
load. Dunno, but it looks to me like an inherent problem of OOM detection
based on NR_PAGES_SCANNED.
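
To spell out the loop I mean: zone_reclaimable() keeps returning true as
long as NR_PAGES_SCANNED stays below six times the zone's reclaimable
pages, while the freeing path clears NR_PAGES_SCANNED whenever anything is
freed. Paraphrasing mm/vmscan.c:

/* the zone counts as reclaimable until we have scanned six times its
 * reclaimable pages without making progress */
bool zone_reclaimable(struct zone *zone)
{
	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
		zone_reclaimable_pages(zone) * 6;
}

So a small but steady trickle of frees, each one resetting NR_PAGES_SCANNED
in free_pcppages_bulk()/free_one_page() as instrumented below, is enough to
keep the counter from ever reaching the 6x threshold, and
__alloc_pages_slowpath() retries forever instead of declaring OOM.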

Here is my debugging patch:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c1069efcc4d7..ac554b633f97 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -834,8 +834,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,

spin_lock(&zone->lock);
nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
- if (nr_scanned)
+ if (nr_scanned) {
+ pr_info("XXX: %s zone:%s clearing %lu pages\n", __FUNCTION__, zone->name, nr_scanned);
__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+ }

while (to_free) {
struct page *page;
@@ -888,8 +890,10 @@ static void free_one_page(struct zone *zone,
unsigned long nr_scanned;
spin_lock(&zone->lock);
nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
- if (nr_scanned)
+ if (nr_scanned) {
+ pr_info("XXX: %s zone:%s clearing %lu pages\n", __FUNCTION__, zone->name, nr_scanned);
__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+ }

if (unlikely(has_isolate_pageblock(zone) ||
is_migrate_isolate(migratetype))) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 142cb61f4822..499a361fe5ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1606,6 +1606,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);

if (global_reclaim(sc)) {
+ if (!nr_scanned)
+ pr_info("XXXX: %s nothing scanned\n", __FUNCTION__);
__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
if (current_is_kswapd())
__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
@@ -1794,8 +1796,11 @@ static void shrink_active_list(unsigned long nr_to_scan,

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, isolate_mode, lru);
- if (global_reclaim(sc))
+ if (global_reclaim(sc)) {
+ if (!nr_scanned)
+ pr_info("XXXX: %s nothing scanned\n", __FUNCTION__);
__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
+ }

reclaim_stat->recent_scanned[file] += nr_taken;

@@ -2615,8 +2620,15 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
reclaimable = true;

if (global_reclaim(sc) &&
- !reclaimable && zone_reclaimable(zone))
+ !reclaimable && zone_reclaimable(zone)) {
+ unsigned long file_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE) + zone_page_state(zone, NR_ACTIVE_FILE);
+ pr_info("XXX: %s zone:%s scanned:%lu scanned_snapshot:%lu reclaimable:%lu file_active:%lu file_inactive:%lu isolated:%lu\n",
+ __FUNCTION__, zone->name,
+ zone_page_state(zone, NR_PAGES_SCANNED), zone_page_state_snapshot(zone, NR_PAGES_SCANNED),
+ zone_reclaimable_pages(zone),
+ zone_page_state(zone, NR_ACTIVE_FILE), zone_page_state(zone, NR_INACTIVE_FILE), zone_page_state(zone, NR_ISOLATED_FILE));
reclaimable = true;
+ }
}

/*
--
Michal Hocko
SUSE Labs

Attachment: oom-test.qcow_serial.log.gz
Description: application/gzip