Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

From: kemi
Date: Tue Nov 28 2017 - 03:35:08 EST




On 2017å11æ28æ 16:09, Vlastimil Babka wrote:
> On 11/28/2017 07:00 AM, Kemi Wang wrote:
>> The existed implementation of NUMA counters is per logical CPU along with
>> zone->vm_numa_stat[] separated by zone, plus a global numa counter array
>> vm_numa_stat[]. However, unlike the other vmstat counters, numa stats don't
>> effect system's decision and are only read from /proc and /sys, it is a
>> slow path operation and likely tolerate higher overhead. Additionally,
>> usually nodes only have a single zone, except for node 0. And there isn't
>> really any use where you need these hits counts separated by zone.
>>
>> Therefore, we can migrate the implementation of numa stats from per-zone to
>> per-node, and get rid of these global numa counters. It's good enough to
>> keep everything in a per cpu ptr of type u64, and sum them up when need, as
>> suggested by Andi Kleen. That's helpful for code cleanup and enhancement
>> (e.g. save more than 130+ lines code).
>
> OK.
>
>> With this patch, we can see 1.8%(335->329) drop of CPU cycles for single
>> page allocation and deallocation concurrently with 112 threads tested on a
>> 2-sockets skylake platform using Jesper's page_bench03 benchmark.
>
> To be fair, one can now avoid the overhead completely since 4518085e127d
> ("mm, sysctl: make NUMA stats configurable"). But if we can still
> optimize it, sure.
>

Yes, I did that several months ago. And both Dave Hansen and me thought that
auto tuning should be better because people probably do not touch this interface,
but Michal had some concerns about that.

This patch aims to cleanup up the code for numa stats with a little performance
improvement.

>> Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> Also, it does not cause obvious latency increase when read /proc and /sys
>> on a 2-sockets skylake platform. Latency shown by time command:
>> base head
>> /proc/vmstat sys 0m0.001s sys 0m0.001s
>>
>> /sys/devices/system/ sys 0m0.001s sys 0m0.000s
>> node/node*/numastat
>
> Well, here I have to point out that the coarse "time" command resolution
> here means the comparison of a single read cannot be compared. You would
> have to e.g. time a loop with enough iterations (which would then be all
> cache-hot, but better than nothing I guess).
>

It indeed is a coarse comparison to show that it does not cause obvious
overhead in a slow path.

All right, I will do that to get a more accurate value.

>> We would not worry it much as it is a slow path and will not be read
>> frequently.
>>
>> Suggested-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
>> Signed-off-by: Kemi Wang <kemi.wang@xxxxxxxxx>
>
> ...
>
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index 1779c98..7383d66 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -118,36 +118,8 @@ static inline void vm_events_fold_cpu(int cpu)
>> * Zone and node-based page accounting with per cpu differentials.
>> */
>> extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
>> -extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>> extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
>> -
>> -#ifdef CONFIG_NUMA
>> -static inline void zone_numa_state_add(long x, struct zone *zone,
>> - enum numa_stat_item item)
>> -{
>> - atomic_long_add(x, &zone->vm_numa_stat[item]);
>> - atomic_long_add(x, &vm_numa_stat[item]);
>> -}
>> -
>> -static inline unsigned long global_numa_state(enum numa_stat_item item)
>> -{
>> - long x = atomic_long_read(&vm_numa_stat[item]);
>> -
>> - return x;
>> -}
>> -
>> -static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>> - enum numa_stat_item item)
>> -{
>> - long x = atomic_long_read(&zone->vm_numa_stat[item]);
>> - int cpu;
>> -
>> - for_each_online_cpu(cpu)
>> - x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];
>> -
>> - return x;
>> -}
>> -#endif /* CONFIG_NUMA */
>> +extern u64 __percpu *vm_numa_stat;
>>
>> static inline void zone_page_state_add(long x, struct zone *zone,
>> enum zone_stat_item item)
>> @@ -234,10 +206,39 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
>>
>>
>> #ifdef CONFIG_NUMA
>> +static inline unsigned long zone_numa_state_snapshot(struct zone *zone,
>> + enum numa_stat_item item)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline unsigned long node_numa_state_snapshot(int node,
>> + enum numa_stat_item item)
>> +{
>> + unsigned long x = 0;
>> + int cpu;
>> +
>> + for_each_possible_cpu(cpu)
>
> I'm worried about the "for_each_possible..." approach here and elsewhere
> in the patch as it can be rather excessive compared to the online number
> of cpus (we've seen BIOSes report large numbers of possible CPU's). IIRC
> the general approach with vmstat is to query just online cpu's / nodes,
> and if they go offline, transfer their accumulated stats to some other
> "victim"?
>

It's a trade off I think. "for_each_possible_cpu()" can avoid to fold local cpu
stats to a global counter (actually, the first available cpu in this patch) when
cpu is offline/dead.

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
>