RE: [PATCH v3] zone_reclaim is always 0 by default

From: Zhang, Yanmin
Date: Wed May 20 2009 - 23:28:59 EST


>>-----Original Message-----
>>From: KOSAKI Motohiro [mailto:kosaki.motohiro@xxxxxxxxxxxxxx]
>>Sent: May 21, 2009 10:47
>>To: LKML; linux-mm; Andrew Morton; Rik van Riel; Christoph Lameter; Robin Holt;
>>Zhang, Yanmin; Wu, Fengguang
>>Cc: kosaki.motohiro@xxxxxxxxxxxxxx
>>Subject: [PATCH v3] zone_reclaim is always 0 by default
>>
>>
>>Subject: [PATCH v3] zone_reclaim is always 0 by default
>>
>>The current Linux policy enables zone_reclaim_mode by default if the machine
>>has a large remote node distance, because until recently we could assume that
>>a large distance meant a large server.
>>
>>Unfortunately, recent x86 CPUs (e.g. Core i7, Opteron) have a P2P transport
>>memory controller, i.e. software sees them as NUMA.
>>Some Core i7 machines have a large remote node distance.
>>
>>Yanmin reported that zone_reclaim_mode=1 causes a large Apache regression.
>>
>> One Nehalem machine has 12GB memory,
>> but there is always 2GB free although applications access lots of files.
>> Eventually we located the root cause as zone_reclaim_mode=1.
>>
>>Actually, zone_reclaim_mode=1 means "I dislike remote node allocation more
>>than disk access". It improves performance for HPC workloads, but it degrades
>>performance for desktops, file servers and web servers.
>>
>>In general, workload-dependent configuration shouldn't be put into the default
>>settings.
>>Plus, the desktop and file/web server ecosystem is much larger than HPC's.
>>
>>Thus, zone_reclaim_mode == 0 is a better default.
[YM] Thanks. I ran a series of tests on 2 Nehalem machines with
zone_reclaim_mode=0 (the default is 1 on both machines). I didn't find any
regression with non-disk-I/O (mostly CPU-bound) benchmarks. Disk I/O benchmarks
benefit a little from zone_reclaim_mode=0. Because I start the fio benchmark with
numactl --interleave=all, the fio improvement is not as big as before.

One thing I need to mention is that my non-disk-I/O tests might not be good
examples for this patch, because every node has far more memory than the tests
need. Only some disk I/O benchmarks need a lot of page cache memory, so they are
the ones that can benefit from zone_reclaim_mode=0.
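
For anyone repeating these tests, it may help to confirm the running value
before each run. Below is a minimal C sketch, assuming the setting is exposed at
/proc/sys/vm/zone_reclaim_mode. The names parse_zone_reclaim_mode() and
read_zone_reclaim_mode() are made up for illustration and are not existing APIs.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse the integer that the kernel exposes in
 * /proc/sys/vm/zone_reclaim_mode from an already opened stream.
 * Returns the value, or -1 if no integer could be read.
 */
int parse_zone_reclaim_mode(FILE *f)
{
    int mode;

    assert(f != NULL);
    if (fscanf(f, "%d", &mode) != 1)
        return -1;
    return mode;
}

/* Returns the current mode, or -1 if the file is missing or unreadable. */
int read_zone_reclaim_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
    int mode;

    if (!f)
        return -1;
    mode = parse_zone_reclaim_mode(f);
    fclose(f);
    return mode;
}
```

A benchmark wrapper could call read_zone_reclaim_mode() at startup and log the
result next to the benchmark numbers, so that runs with 0 and 1 are not mixed up.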


>>
>>
>>Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
>>Cc: Christoph Lameter <cl@xxxxxxxxxxxxxxxxxxxx>
>>Cc: Rik van Riel <riel@xxxxxxxxxx>
>>Cc: Robin Holt <holt@xxxxxxx>
>>Tested-by: "Zhang, Yanmin" <yanmin.zhang@xxxxxxxxx>
>>Acked-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
>>---
>> arch/ia64/include/asm/topology.h | 5 -----
>> include/linux/topology.h | 9 +--------
>> mm/page_alloc.c | 7 -------
>> 3 files changed, 1 insertion(+), 20 deletions(-)
>>
>>Index: b/mm/page_alloc.c
>>===================================================================
>>--- a/mm/page_alloc.c
>>+++ b/mm/page_alloc.c
>>@@ -2494,13 +2494,6 @@ static void build_zonelists(pg_data_t *p
>> int distance = node_distance(local_node, node);
>>
>> /*
>>- * If another node is sufficiently far away then it is better
>>- * to reclaim pages in a zone before going off node.
>>- */
>>- if (distance > RECLAIM_DISTANCE)
>>- zone_reclaim_mode = 1;
>>-
>>- /*
>> * We don't want to pressure a particular node.
>> * So adding penalty to the first node in same
>> * distance group to make it round-robin.
>>Index: b/arch/ia64/include/asm/topology.h
>>===================================================================
>>--- a/arch/ia64/include/asm/topology.h
>>+++ b/arch/ia64/include/asm/topology.h
>>@@ -21,11 +21,6 @@
>> #define PENALTY_FOR_NODE_WITH_CPUS 255
>>
>> /*
>>- * Distance above which we begin to use zone reclaim
>>- */
>>-#define RECLAIM_DISTANCE 15
>>-
>>-/*
>> * Returns the number of the node containing CPU 'cpu'
>> */
>> #define cpu_to_node(cpu) (int)(cpu_to_node_map[cpu])
>>Index: b/include/linux/topology.h
>>===================================================================
>>--- a/include/linux/topology.h
>>+++ b/include/linux/topology.h
>>@@ -53,14 +53,7 @@ int arch_update_cpu_topology(void);
>> #ifndef node_distance
>> #define node_distance(from,to) ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
>> #endif
>>-#ifndef RECLAIM_DISTANCE
>>-/*
>>- * If the distance between nodes in a system is larger than RECLAIM_DISTANCE
>>- * (in whatever arch specific measurement units returned by node_distance())
>>- * then switch on zone reclaim on boot.
>>- */
>>-#define RECLAIM_DISTANCE 20
>>-#endif
>>+
>> #ifndef PENALTY_FOR_NODE_WITH_CPUS
>> #define PENALTY_FOR_NODE_WITH_CPUS (1)
>> #endif
>>
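
To see which machines the removed check affected, the old boot-time decision can
be modelled in user space. This is only a sketch under assumptions: the helper
name old_default_zone_reclaim_mode() and the flat distance table are
illustrative and not kernel code. The two thresholds are the RECLAIM_DISTANCE
values deleted by the patch (20 generic, 15 on ia64).

```c
#include <assert.h>
#include <stddef.h>

/* Thresholds taken from the lines removed by the patch. */
#define RECLAIM_DISTANCE_GENERIC 20
#define RECLAIM_DISTANCE_IA64    15

/*
 * Hypothetical helper (not a kernel function): returns 1 if the removed
 * "distance > RECLAIM_DISTANCE" check would have switched zone_reclaim_mode
 * on for this node-distance table, 0 otherwise.
 *
 * dist is a row-major nr_nodes x nr_nodes table holding the values that
 * node_distance(i, j) would report.
 */
int old_default_zone_reclaim_mode(const int *dist, size_t nr_nodes,
                                  int reclaim_distance)
{
    assert(dist != NULL && nr_nodes > 0);

    for (size_t i = 0; i < nr_nodes; i++)
        for (size_t j = 0; j < nr_nodes; j++)
            if (dist[i * nr_nodes + j] > reclaim_distance)
                return 1;   /* one far node was enough to enable it */
    return 0;
}
```

With a two-node table of { 10, 21, 21, 10 } and the generic threshold, the
helper returns 1, which matches the behaviour described for machines that
report a remote distance above 20. After the patch the default stays 0
regardless of the table.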
