Re: [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy

From: Johannes Weiner
Date: Wed Dec 11 2013 - 20:09:20 EST


On Wed, Dec 11, 2013 at 10:47:19PM +0000, Mel Gorman wrote:
> On Wed, Dec 11, 2013 at 01:09:16PM -0500, Johannes Weiner wrote:
> > Dave Hansen noted a regression in a microbenchmark that loops around
> > open() and close() on an 8-node NUMA machine and bisected it down to
> > 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy"). That
> > change forces the slab allocations of the file descriptor to spread
> > out to all 8 nodes, causing remote references in the page allocator
> > and slab.
> >
>
> The original patch was primarily concerned with the fair aging of LRU pages
> of zones within a node. This patch uses GFP_MOVABLE_MASK, which includes
> __GFP_RECLAIMABLE, meaning any slab created with SLAB_RECLAIM_ACCOUNT still
> gets the round-robin treatment. Those pages have a different lifecycle to
> LRU pages, and the shrinkers are only node aware, not zone aware.
> While I get that this patch probably helps this specific benchmark, was the
> use of GFP_MOVABLE_MASK intentional, or did you mean to use __GFP_MOVABLE?

It was intentional to spread SLAB_RECLAIM_ACCOUNT pages across all
allowed nodes evenly for the same aging fairness reason.
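
For reference, GFP_MOVABLE_MASK covers both mobility hint bits, which
is why reclaimable slab is included (from include/linux/gfp.h):

        /*
         * Testing gfp_mask against GFP_MOVABLE_MASK matches both
         * __GFP_RECLAIMABLE allocations (e.g. slab caches created
         * with SLAB_RECLAIM_ACCOUNT) and __GFP_MOVABLE LRU pages.
         */
        #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)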

> Looking at the original patch again, I think I made a major mistake when
> reviewing it. Consider the effect of the following on NUMA machines:
>
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
>                                         high_zoneidx, nodemask) {
>         ....
>         if (alloc_flags & ALLOC_WMARK_LOW) {
>                 if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
>                         continue;
>                 if (zone_reclaim_mode &&
>                     !zone_local(preferred_zone, zone))
>                         continue;
>         }
>
> Enabling zone_reclaim_mode sucks badly for workloads that are not partitioned
> to fit within NUMA nodes. Consequently, I expect that in the common case it's
> disabled, either by default due to small NUMA distances or manually.
>
> However, the effect of that block is that we allocate NR_ALLOC_BATCH pages
> from local zones and then fall back to batch-allocating remote nodes! I bet
> the numa_hit stats in /proc/vmstat have sucked recently. The original
> problem was that the page allocator would try allocating from the
> highest zone while kswapd reclaimed from it, causing LRU-aging problems.
> The problem is not the same between nodes. How do you feel about dropping
> the zone_reclaim_mode check above and only round-robining in batches between
> zones on the local node?
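
If I read that right, the fast path would boil down to something like
this (just a sketch): batches still round-robin, but only between
zones of the local node, and remote zones are never touched below the
low watermark:

        if (alloc_flags & ALLOC_WMARK_LOW) {
                /* Batch round-robin, but only among local zones. */
                if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
                        continue;
                if (!zone_local(preferred_zone, zone))
                        continue;
        }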

It might not be the same problem for anon, but it is for cache. The
page allocator will fill all the nodes in the system before waking up
the kswapds. It will utilize all nodes, just not evenly.

I know that on the node level staying local is often preferable to
full memory utilization, but I was under the impression that
zone_reclaim_mode is there to express exactly this preference.

My patch certainly makes this preference more aggressive in the sense
that there is no gray zone anymore, no trying to stay local first:
a block of memory is either not used at all, or used to the same
extent as any other block of the same size; that's the requirement
for fair aging.

That being said, the fairness concerns are primarily about file pages.
Should we exclude anon and slab pages entirely? I'd still account for
them in the batches but only apply placement rules to page cache.
That should still leave us with roughly equal cache aging speeds in
all zones and nodes.
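
A sketch of what I have in mind, assuming __GFP_WRITE is an acceptable
proxy for page cache here (it's only passed for cache allocations that
may be dirtied, so it's an approximation):

        if (alloc_flags & ALLOC_WMARK_LOW) {
                /*
                 * Apply the round-robin placement only to page
                 * cache; anon and slab allocate from the preferred
                 * zone but still consume the batch on allocation.
                 */
                if ((gfp_mask & __GFP_WRITE) &&
                    zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
                        continue;
                if (zone_reclaim_mode &&
                    !zone_local(preferred_zone, zone))
                        continue;
        }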