Re: [PATCH v4 1/2] mm/page_alloc: use ac->high_zoneidx for classzone_idx

From: Baoquan He
Date: Mon Mar 23 2020 - 03:08:33 EST


On 03/23/20 at 01:49pm, js1304@xxxxxxxxx wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
>
> Currently, we use classzone_idx to calculate lowmem reserve proetection
> for an allocation request. This classzone_idx causes a problem
> on NUMA systems when the lowmem reserve protection exists for some zones
> on a node that do not exist on other nodes.
>
> Before further explanation, I should first clarify how to compute
> the classzone_idx and the high_zoneidx.
>
> - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
> represents the index of the highest zone the allocation can use
> - classzone_idx was supposed to be the index of the highest zone on
> the local node that the allocation can use, that is actually available
> in the system
>
> Think about following example. Node 0 has 4 populated zone,
> DMA/DMA32/NORMAL/MOVABLE. Node 1 has 1 populated zone, NORMAL. Some zones,
> such as MOVABLE, doesn't exist on node 1 and this makes following
> difference.
>
> Assume that there is an allocation request whose gfp_zone(gfp_mask) is
> the zone, MOVABLE. Then, it's high_zoneidx is 3. If this allocation is
> initiated on node 0, it's classzone_idx is 3 since actually
> available/usable zone on local (node 0) is MOVABLE. If this allocation
> is initiated on node 1, it's classzone_idx is 2 since actually
> available/usable zone on local (node 1) is NORMAL.
>
> You can see that classzone_idx of the allocation request are different
> according to their starting node, even if their high_zoneidx is the same.
>
> Think more about these two allocation requests. If they are processed
> on local, there is no problem. However, if allocation is initiated
> on node 1 are processed on remote, in this example, at the NORMAL zone
> on node 0, due to memory shortage, problem occurs. Their different
> classzone_idx leads to different lowmem reserve and then different
> min watermark. See the following example.
>
> root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
> Node 0, zone DMA
> per-node stats
> ...
> pages free 3965
> min 5
> low 8
> high 11
> spanned 4095
> present 3998
> managed 3977
> protection: (0, 2961, 4928, 5440)
> ...
> Node 0, zone DMA32
> pages free 757955
> min 1129
> low 1887
> high 2645
> spanned 1044480
> present 782303
> managed 758116
> protection: (0, 0, 1967, 2479)
> ...
> Node 0, zone Normal
> pages free 459806
> min 750
> low 1253
> high 1756
> spanned 524288
> present 524288
> managed 503620
> protection: (0, 0, 0, 4096)
> ...
> Node 0, zone Movable
> pages free 130759
> min 195
> low 326
> high 457
> spanned 1966079
> present 131072
> managed 131072
> protection: (0, 0, 0, 0)
> ...
> Node 1, zone DMA
> pages free 0
> min 0
> low 0
> high 0
> spanned 0
> present 0
> managed 0
> protection: (0, 0, 1006, 1006)
> Node 1, zone DMA32
> pages free 0
> min 0
> low 0
> high 0
> spanned 0
> present 0
> managed 0
> protection: (0, 0, 1006, 1006)
> Node 1, zone Normal
> per-node stats
> ...
> pages free 233277
> min 383
> low 640
> high 897
> spanned 262144
> present 262144
> managed 257744
> protection: (0, 0, 0, 0)
> ...
> Node 1, zone Movable
> pages free 0
> min 0
> low 0
> high 0
> spanned 262144
> present 0
> managed 0
> protection: (0, 0, 0, 0)
>
> - static min watermark for the NORMAL zone on node 0 is 750.
> - lowmem reserve for the request with classzone idx 3 at the NORMAL
> on node 0 is 4096.
> - lowmem reserve for the request with classzone idx 2 at the NORMAL
> on node 0 is 0.
>
> So, overall min watermark is:
> allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846
> allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750
>
> allocation initiated on node 1 will have some precedence than allocation
> initiated on node 0 because min watermark of the former allocation is
> lower than the other. So, allocation initiated on node 1 could succeed
> on node 0 when allocation initiated on node 0 could not, and, this could
> cause too many numa_miss allocation. Then, performance could be
> downgraded.
>
> Recently, there was a regression report about this problem on CMA patches
> since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
> that problem is disappeared with this fix that uses high_zoneidx
> for classzone_idx.
>
> http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
>
> Using high_zoneidx for classzone_idx is more consistent way than previous
> approach because system's memory layout doesn't affect anything to it.
> With this patch, both classzone_idx on above example will be 3 so will
> have the same min watermark.
>
> allocation initiated on node 0: 750 + 4096 = 4846
> allocation initiated on node 1: 750 + 4096 = 4846
>
> One could wonder if there is a side effect that allocation initiated on
> node 1 will use higher bar when allocation is handled on local since
> classzone_idx could be higher than before. It will not happen because
> the zone without managed page doesn't contributes lowmem_reserve at all.
>
> Reported-by: Ye Xiaolong <xiaolong.ye@xxxxxxxxx>
> Tested-by: Ye Xiaolong <xiaolong.ye@xxxxxxxxx>
> Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> ---
> mm/internal.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c39c895..aebaa33 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -119,7 +119,7 @@ struct alloc_context {
> bool spread_dirty_pages;
> };
>
> -#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
> +#define ac_classzone_idx(ac) (ac->high_zoneidx)

Reviewed-by: Baoquan He <bhe@xxxxxxxxxx>