Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2

From: Mel Gorman
Date: Tue Feb 23 2016 - 16:59:08 EST


On Tue, Feb 23, 2016 at 12:59:15PM -0800, Johannes Weiner wrote:
> On Tue, Feb 23, 2016 at 08:19:32PM +0000, Mel Gorman wrote:
> > On Tue, Feb 23, 2016 at 12:04:16PM -0800, Johannes Weiner wrote:
> > > On Tue, Feb 23, 2016 at 03:04:23PM +0000, Mel Gorman wrote:
> > > > In many benchmarks, there is an obvious difference in the number of
> > > > allocations from each zone as the fair zone allocation policy is removed
> > > > towards the end of the series. For example, these are the allocation stats
> > > > when running blogbench that showed no difference in headline performance
> > > >
> > > >                 mmotm-20160209  nodelru-v2
> > > > DMA allocs                   0           0
> > > > DMA32 allocs           7218763      608067
> > > > Normal allocs         12701806    18821286
> > > > Movable allocs               0           0
> > >
> > > According to the mmotm numbers, your DMA32 zone is over a third of
> > > available memory, yet in the nodelru-v2 kernel it sees only 3% of the
> > > allocations.
> >
> > In this case yes but blogbench is not scaled to memory size and is not
> > reclaim intensive. If you look, you'll see the total number of overall
> > allocations is very similar. During that test, there is a small amount of
> > kswapd scan activity (but not reclaim which is odd) at the start of the
> > test for nodelru but that's about it.
>
> Yes, if fairness enforcement is now done by reclaim, then workloads
> without reclaim will show skewed placement as the Normal zone is again
> filled up first before moving on to the next zone.
>
> That is fine. But what about the balance in reclaiming workloads?
>

That is the key question -- whether node LRU reclaim renders it
unnecessary.

> > > That's an insanely high level of aging inversion, where
> > > the lifetime of a cache entry is again highly dependent on placement.
> > >
> >
> > The aging is now independent of what zone the page was allocated from because
> > it's node-based LRU reclaim. That may mean that the occupancy of individual
> > zones is now different but it should only matter if there is a large number
> > of address-limited requests.
>
> The problem is that kswapd will stay awake and continuously draw
> subsequent allocations into a single zone, thus utilizing only a
> fraction of available memory.

Not quite. Look at prepare_kswapd_sleep() in the full series and it has this


	for (i = 0; i <= classzone_idx; i++) {
		struct zone *zone = pgdat->node_zones + i;

		if (!populated_zone(zone))
			continue;

		if (zone_balanced(zone, order, 0, classzone_idx))
			return true;
	}

and balance_pgdat has this

	/* Only reclaim if there are no eligible zones */
	for (i = classzone_idx; i >= 0; i--) {
		zone = pgdat->node_zones + i;
		if (!populated_zone(zone))
			continue;

		if (!zone_balanced(zone, order, 0, classzone_idx)) {
			classzone_idx = i;
			break;
		}
	}

kswapd only stays awake until *one* balanced zone is available. That is
a key difference from the existing kswapd, which balances all zones.
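
To make the difference concrete, here is a rough user-space model of the
two sleep conditions. It is only a sketch, not code from the series:
struct model_zone, model_zone_balanced() and both model_*_can_sleep()
helpers are made-up names, and the watermark check is reduced to a single
comparison, whereas the real zone_balanced() also takes order and
classzone_idx into account.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct zone; not the kernel type. */
struct model_zone {
	bool populated;			/* models populated_zone() */
	unsigned long free_pages;
	unsigned long high_wmark;	/* assumed single watermark */
};

/* Simplified assumption: balanced means free pages reach the watermark. */
bool model_zone_balanced(const struct model_zone *zone)
{
	return zone->free_pages >= zone->high_wmark;
}

/* Node-LRU model: one balanced eligible zone is enough to sleep. */
bool model_node_can_sleep(const struct model_zone *zones, int classzone_idx)
{
	for (int i = 0; i <= classzone_idx; i++) {
		if (!zones[i].populated)
			continue;
		if (model_zone_balanced(&zones[i]))
			return true;
	}
	return false;
}

/* Model of the existing behaviour: every populated zone must be balanced. */
bool model_all_zones_can_sleep(const struct model_zone *zones, int classzone_idx)
{
	for (int i = 0; i <= classzone_idx; i++) {
		if (!zones[i].populated)
			continue;
		if (!model_zone_balanced(&zones[i]))
			return false;
	}
	return true;
}
```

With three populated zones where only the middle one is above its
watermark, model_node_can_sleep() lets kswapd sleep for a classzone_idx
of 2 while model_all_zones_can_sleep() keeps it awake.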

> DMA32-limited kswapd wakeups can
> reclaim cache in DMA32 continuously if the allocator continuously
> places new cache pages in that zone. It looks like that is what
> happened in the stutter benchmark.
>

There may be corner cases where we artificially wake kswapd at DMA32
instead of a higher zone. If that happens, it should be addressed so
that only GFP_DMA32 wakes and reclaims that zone.

> Sure, it doesn't matter in that benchmark, because the pages are used
> only once. But if it had an actual cache workingset bigger than DMA32
> but smaller than DMA32+Normal, it would be thrashing unnecessarily.
>
> If kswapd were truly balancing the pages in a node equally, regardless
> of zone placement, then in the long run we should see zone allocations
> converge to a share that is in proportion to each zone's size. As far
> as I can see, that is not quite happening yet.
>

Not quite either. The order in which kswapd reclaims is related to the
age of all pages in the node. Early in the lifetime of the system, that
may be ZONE_NORMAL until the other zones are populated. Ultimately
the balance of zones will be related to the age of the pages.

> > > The fact that this doesn't make a performance difference in the
> > > specific benchmarks you ran only proves just that: these specific
> > > benchmarks don't care. IMO, benchmarking is not enough here. If this
> > > is truly supposed to be unproblematic, then I think we need a reasoned
> > > explanation. I can't imagine how it possibly could be, though.
> > >
> >
> > The basic explanation is that reclaim is on a per-node basis and we
> > no longer balance all zones, just one that is necessary to satisfy the
> > original request that woke up kswapd.
> >
> > > If reclaim can't guarantee a balanced zone utilization then the
> > > allocator has to keep doing it. :(
> >
> > That's the key issue - the main reason balanced zone utilisation is
> > necessary is because we reclaim on a per-zone basis and we must avoid
> > page aging anomalies. If we balance such that one eligible zone is above
> > the watermark then it's less of a concern.
>
> Yes, but only if there can't be extended reclaim stretches that prefer
> the pages of a single zone. Yet it looks like this is still possible.
>

And that is a problem if a workload is dominated by allocations
requiring the lower zones. If that is the common case then it's a bust
and fair zone allocation policy is still required. That removes one
motivation from the series as it leaves some fatness in the page
allocator paths.

> I wonder if that were fixed by dropping patch 7/27?

Potentially yes although it would be preferred to avoid unnecessarily
waking kswapd for a lower zone. That could be enforced by modifying
wake_all_kswapd() to always wake based on the highest available zone in
a pgdat that is below the zone required by the allocation request.
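
As a rough illustration of the zone selection, and not a patch, the
choice could look something like the user-space sketch below.
model_wake_zone_idx() and the populated[] array are hypothetical, and it
assumes "below" means at or below the requested zone index.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: pick the zone index kswapd would be woken for.
 * populated[i] models populated_zone() for zone i of one pgdat.
 * Returns the highest populated index at or below requested_idx,
 * or -1 if the node has no populated zone in that range.
 */
int model_wake_zone_idx(const bool *populated, int requested_idx)
{
	for (int i = requested_idx; i >= 0; i--) {
		if (populated[i])
			return i;
	}
	return -1;
}
```

For a node that only has DMA and DMA32 populated, a request that allows
Normal (index 2) would wake kswapd for index 1. On a node where Normal
is populated, the same request wakes kswapd for index 2.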

> Potentially it
> would need a bit more work than that. I.e. could we make kswapd
> balance only for the highest classzone in the system, and thus make
> address-limited allocations fend for themselves in direct reclaim?
>

That would be a side-effect of modifying wake_all_kswapd. Would shoving
that in alleviate your concerns?

--
Mel Gorman
SUSE Labs