Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2

From: Johannes Weiner
Date: Tue Feb 23 2016 - 15:59:29 EST


On Tue, Feb 23, 2016 at 08:19:32PM +0000, Mel Gorman wrote:
> On Tue, Feb 23, 2016 at 12:04:16PM -0800, Johannes Weiner wrote:
> > On Tue, Feb 23, 2016 at 03:04:23PM +0000, Mel Gorman wrote:
> > > In many benchmarks, there is an obvious difference in the number of
> > > allocations from each zone as the fair zone allocation policy is removed
> > > towards the end of the series. For example, this is the allocation stats
> > > when running blogbench that showed no difference in headline performance
> > >
> > >                 mmotm-20160209     nodelru-v2
> > > DMA allocs                   0              0
> > > DMA32 allocs           7218763         608067
> > > Normal allocs         12701806       18821286
> > > Movable allocs               0              0
> >
> > According to the mmotm numbers, your DMA32 zone is over a third of
> > available memory, yet in the nodelru-v2 kernel it sees only 3% of the
> > allocations.
>
> In this case yes, but blogbench is not scaled to memory size and is not
> reclaim intensive. If you look, you'll see the overall number of
> allocations is very similar. During that test, there is a small amount of
> kswapd scan activity (but no reclaim, which is odd) at the start of the
> test for nodelru, but that's about it.

Yes, if fairness enforcement is now done by reclaim, then workloads
without reclaim will show skewed placement, as the Normal zone again
fills up first before allocations spill over into the next zone.

That is fine. But what about the balance in reclaiming workloads?

> > That's an insanely high level of aging inversion, where
> > the lifetime of a cache entry is again highly dependent on placement.
> >
>
> The aging is now independent of what zone the page was allocated from
> because it's node-based LRU reclaim. That may mean that the occupancy of
> individual zones is now different, but it should only matter if there is
> a large number of address-limited requests.

The problem is that kswapd will stay awake and continuously draw
subsequent allocations into a single zone, thus utilizing only a
fraction of available memory. DMA32-limited kswapd wakeups can keep
reclaiming cache in DMA32 indefinitely if the allocator keeps placing
new cache pages in that zone. It looks like that is what happened in
the stutter benchmark.

Sure, it doesn't matter in that benchmark, because the pages are used
only once. But if the workload had an actual cache workingset bigger
than DMA32 but smaller than DMA32+Normal, it would be thrashing
unnecessarily.
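
To illustrate the thrashing point, here is a minimal userspace sketch
(not kernel code; the page counts are made-up numbers chosen purely for
illustration) of a strict LRU under a cyclic access pattern: once
reclaim is effectively confined to a capacity smaller than the working
set, every access misses, whereas the larger capacity leaves only the
cold misses.

#include <stdio.h>
#include <stdlib.h>

/* Strict LRU over a cyclic access pattern; capacity stands for the
 * memory reclaim is effectively confined to (e.g. DMA32 alone vs.
 * DMA32+Normal). */
static double miss_rate(long capacity, long working_set, long accesses)
{
	long *last_use = malloc(working_set * sizeof(*last_use));
	char *resident = calloc(working_set, 1);
	long residents = 0, misses = 0, t;

	for (t = 0; t < working_set; t++)
		last_use[t] = -1;

	for (t = 0; t < accesses; t++) {
		long p = t % working_set;	/* cyclic access pattern */

		if (!resident[p]) {
			misses++;
			if (residents == capacity) {
				long victim = -1, i;

				/* evict the least recently used page */
				for (i = 0; i < working_set; i++)
					if (resident[i] && (victim < 0 ||
					    last_use[i] < last_use[victim]))
						victim = i;
				resident[victim] = 0;
				residents--;
			}
			resident[p] = 1;
			residents++;
		}
		last_use[p] = t;
	}

	free(last_use);
	free(resident);
	return (double)misses / accesses;
}

int main(void)
{
	long working_set = 1500;	/* pages in the cache working set */

	/* reclaim confined to a "DMA32-sized" LRU: cyclic reuse always misses */
	printf("capacity 1000: miss rate %.3f\n",
	       miss_rate(1000, working_set, 100000));
	/* the whole "node" holds the working set: only cold misses remain */
	printf("capacity 2000: miss rate %.3f\n",
	       miss_rate(2000, working_set, 100000));
	return 0;
}

This should print a miss rate of about 1.000 for the confined case and
about 0.015 for the full-node case.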

If kswapd were truly balancing the pages in a node equally, regardless
of zone placement, then in the long run we should see zone allocations
converge to shares proportional to each zone's size. As far as I can
see, that is not quite happening yet.
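
To put rough numbers on it with the blogbench stats above: nodelru-v2
did 608067 DMA32 allocations out of roughly 19.4 million total, i.e.
about 3%, while a size-proportional share for a DMA32 zone that makes
up over a third of memory would be north of 33%.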

> > The fact that this doesn't make a performance difference in the
> > specific benchmarks you ran only proves just that: these specific
> > benchmarks don't care. IMO, benchmarking is not enough here. If this
> > is truly supposed to be unproblematic, then I think we need a reasoned
> > explanation. I can't imagine how it possibly could be, though.
> >
>
> The basic explanation is that reclaim is on a per-node basis and we
> no longer balance all zones, just the one necessary to satisfy the
> original request that woke up kswapd.
>
> > If reclaim can't guarantee a balanced zone utilization then the
> > allocator has to keep doing it. :(
>
> That's the key issue - the main reason balanced zone utilisation is
> necessary is that we reclaim on a per-zone basis and we must avoid
> page aging anomalies. If we balance such that one eligible zone is
> above the watermark then it's less of a concern.

Yes, but only if there can't be extended reclaim stretches that prefer
the pages of a single zone. Yet it looks like this is still possible.

I wonder if that would be fixed by dropping patch 7/27? It might need
a bit more work than that: could we make kswapd balance only for the
highest classzone in the system, and thus make address-limited
allocations fend for themselves in direct reclaim?
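
As a purely illustrative sketch of that policy (nothing below
corresponds to real kernel interfaces; the zone list and helper are
made up), the decision would look roughly like this:

#include <stdio.h>

/* Toy model: only requests whose classzone is the highest zone in the
 * node wake kswapd; address-limited requests do their own direct
 * reclaim. */
enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, NR_ZONES };

static const char *zone_name[NR_ZONES] = { "DMA", "DMA32", "Normal" };

static void handle_allocation_pressure(int classzone_idx)
{
	if (classzone_idx == NR_ZONES - 1) {
		/* unrestricted request: kswapd balances the whole node */
		printf("classzone %-6s -> wake kswapd\n",
		       zone_name[classzone_idx]);
	} else {
		/* address-limited request: reclaim the eligible zones
		 * directly instead of keeping kswapd awake on a lower
		 * zone and skewing placement toward it */
		printf("classzone %-6s -> direct reclaim\n",
		       zone_name[classzone_idx]);
	}
}

int main(void)
{
	handle_allocation_pressure(ZONE_NORMAL);
	handle_allocation_pressure(ZONE_DMA32);
	return 0;
}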

This way, we would avoid that pathological interaction between kswapd
and the allocator, and kswapd would be guaranteed to balance fairly.