[PATCH V2 0/3] sched/numa: Enhance vma scanning

From: Raghavendra K T
Date: Wed Feb 01 2023 - 03:03:40 EST


This patchset implements one of the enhancements to NUMA VMA scanning
suggested by Mel. It is a continuation of [2]. Although I have dropped
the RFC tag, I do think some parts need more feedback and refinement.

The existing mechanism derives the scan period from per-thread stats.
Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats at the
per-process level to capture application behaviour better.

During the course of that discussion, Mel proposed several ideas to
enhance current NUMA balancing. One of the suggestions was the following:

Track what threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately what
threads access a VMA. Skip VMAs that did not trap a fault. This would
be approximate because of PID collisions but would reduce scanning of
areas the thread is not interested in. The above suggestion intends not
to penalize threads that have no interest in the VMA, thus reducing
scanning overhead.

About Patchset:
Patch1:
1) VMA scan delay logic added (Mel): during the initial phase of a VMA,
scanning is delayed by sysctl_numa_balancing_scan_delay.

2) A new status structure (vma_numab) is added so as not to grow
the vm_area_struct in the !NUMA_BALANCING case.
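
To illustrate points 1) and 2), here is a rough userspace sketch. It is
not the code from the patch: the names vma_numab_sketch, vma_sketch,
vma_numab_init and vma_scan_allowed are made up for this example, time is
modelled as a plain millisecond counter instead of jiffies, and counter
wraparound is ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-VMA NUMA balancing state. Keeping it out of line is
 * one assumed way to avoid growing the VMA when the feature is off. */
struct vma_numab_sketch {
	unsigned long next_scan_ms;	/* earliest time scanning may start */
};

/* Stand-in for the VMA; only the field needed for the sketch. */
struct vma_sketch {
	struct vma_numab_sketch *numab;	/* NULL until state is attached */
};

/* Attach the state and arm the scan delay for this VMA. */
static void vma_numab_init(struct vma_sketch *vma,
			   struct vma_numab_sketch *state,
			   unsigned long now_ms,
			   unsigned long scan_delay_ms)
{
	state->next_scan_ms = now_ms + scan_delay_ms;
	vma->numab = state;
}

/* Scanning is skipped until the per-VMA delay has elapsed.
 * Treating "no state yet" as "do not scan" is an assumption. */
static bool vma_scan_allowed(const struct vma_sketch *vma,
			     unsigned long now_ms)
{
	if (!vma->numab)
		return false;
	return now_ms >= vma->numab->next_scan_ms;
}
```

With a delay of 1000 ms armed at t=100, a scan attempt at t=500 is
skipped and one at t=1100 goes ahead.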

Patch2:
3) The last 6 bits of the PID are used as an index to remember which
PIDs accessed the VMA in the fault path. This is then used to restrict
scanning of the VMA in the scan path.

Please note that scanning is unconditionally allowed the first two times
(using numa_scan_seq). This may need to change, since numa_scan_seq is
per mm.
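
The idea in Patch2 can be sketched roughly as below. This is an
illustration only, not the patch itself: vma_access_sketch, pid_index,
vma_record_access and vma_should_scan are hypothetical names, and the
"scan_seq < 2" test is my assumption of how the first two scans could be
let through.

```c
#include <stdbool.h>
#include <stdint.h>

#define PID_INDEX_BITS	6			/* "last 6 bits of the PID" */
#define PID_INDEX_MASK	((1U << PID_INDEX_BITS) - 1)

/* Hypothetical per-VMA record of which PID slots faulted on it. */
struct vma_access_sketch {
	uint64_t access_pids;	/* one bit per 6-bit PID index (0..63) */
};

/* Map a PID to its slot; different PIDs may collide on one slot. */
static unsigned int pid_index(int pid)
{
	return (unsigned int)pid & PID_INDEX_MASK;
}

/* Fault path: remember that this PID touched the VMA. */
static void vma_record_access(struct vma_access_sketch *vma, int pid)
{
	vma->access_pids |= UINT64_C(1) << pid_index(pid);
}

/* Scan path: the first two scan passes are always allowed; after that,
 * only scan a VMA if this PID (or a colliding one) has faulted on it. */
static bool vma_should_scan(const struct vma_access_sketch *vma,
			    int pid, int scan_seq)
{
	if (scan_seq < 2)
		return true;
	return (vma->access_pids & (UINT64_C(1) << pid_index(pid))) != 0;
}
```

Because only 6 bits are kept, PIDs 1234 and 1298 share slot 18, so a
fault by one also allows scanning by the other. That is the approximation
mentioned in the suggestion above.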

Patch3:
4) Introduce basic clearing of accessed PIDs. As of now this is done at
every 4 * sysctl_numa_balancing_scan_delay interval.

This logic may need more experimentation/refinement.
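
A rough sketch of the periodic reset, under the same caveats: the names
vma_reset_sketch and vma_maybe_reset_access are hypothetical, time is a
millisecond counter rather than jiffies, and wraparound is ignored. The
factor 4 is the one stated above.

```c
#include <stdint.h>

#define RESET_INTERVAL_FACTOR	4	/* 4 * scan delay, as described */

/* Hypothetical per-VMA state: access bits plus the next reset time. */
struct vma_reset_sketch {
	uint64_t access_pids;		/* PID slots seen since last reset */
	unsigned long next_reset_ms;	/* when the bits are cleared next */
};

/* Clear the accessing-PID bits once the reset interval has passed,
 * then schedule the following reset. */
static void vma_maybe_reset_access(struct vma_reset_sketch *vma,
				   unsigned long now_ms,
				   unsigned long scan_delay_ms)
{
	if (now_ms < vma->next_reset_ms)
		return;

	vma->access_pids = 0;
	vma->next_reset_ms = now_ms + RESET_INTERVAL_FACTOR * scan_delay_ms;
}
```

One consequence of a hard reset like this is that a thread still using
the VMA has to fault on it again before its scans resume, which is part
of why the clearing logic may need refinement.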

Things to ponder over (and Future TODO):
==========================================
- Improvements to the logic for clearing accessing PIDs (discussed in
detail in patch3 itself)

- The scan period is not changed in this patchset, so we still see
frequent attempts to scan. Relaxing the scan period dynamically could
improve results further.

Result Summary:
================
The results were obtained by running mmtests with the config below:
config-numa-autonumabench

There is a significant reduction in AutoNUMA cost in the benchmark
runs, but some of the results need improvement. I hope that working on
the potential changes mentioned in patch3, and hopefully tuning
numa_scan_period depending on current scanning efficiency, would help.
I will be working on that.

SUT:
2 socket AMD Milan System
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2

256GB memory per socket amounting to 512GB in total
NPS1 NUMA configuration where each socket is a NUMA node

autonumabench
                                      6.1.0                6.1.0
BAmean-99 syst-NUMA01                195.84 (   0.00%)     17.79 (  90.91%)
BAmean-99 syst-NUMA01_THREADLOCAL      0.19 (   0.00%)      0.19 (   2.56%)
BAmean-99 syst-NUMA02                  0.85 (   0.00%)      0.85 (   0.00%)
BAmean-99 syst-NUMA02_SMT              0.62 (   0.00%)      0.65 (  -4.30%)
BAmean-99 elsp-NUMA01                254.95 (   0.00%)    322.69 ( -26.57%)
BAmean-99 elsp-NUMA01_THREADLOCAL      1.04 (   0.00%)      1.05 (  -1.29%)
BAmean-99 elsp-NUMA02                  3.08 (   0.00%)      3.29 (  -6.94%)
BAmean-99 elsp-NUMA02_SMT              3.49 (   0.00%)      3.43 (   1.91%)

                                           6.1.0          6.1.0
Ops NUMA alloc hit                   59210941.00    50772531.00
Ops NUMA alloc miss                         0.00           0.00
Ops NUMA interleave hit                     0.00           0.00
Ops NUMA alloc local                 59200395.00    50771359.00
Ops NUMA base-page range updates     90670863.00       10952.00
Ops NUMA PTE updates                 90670863.00       10952.00
Ops NUMA PMD updates                        0.00           0.00
Ops NUMA hint faults                 92069634.00        9501.00
Ops NUMA hint local faults %         69966984.00        9213.00
Ops NUMA hint local percent                75.99          96.97
Ops NUMA pages migrated               8424631.00         287.00
Ops AutoNUMA cost                      461142.93          47.59

[1] sched/numa: Process Adaptive autoNUMA
Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@xxxxxxx/T/
[2] RFC V1:
Link: https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@xxxxxxx/

Changes since RFC V1:
- Include Mel's vma scan delay patch
- Change the accessing pid store logic (Thanks Mel)
- Fencing structure / code to NUMA_BALANCING (David, Mel)
- Adding clearing access PID logic (Mel)
- Descriptive change log (Mike Rapoport)

Mel Gorman (1):
  sched/numa: Apply the scan delay to every vma instead of tasks

Raghavendra K T (2):
  sched/numa: Enhance vma scanning logic
  sched/numa: Reset the accessing PID information periodically

 include/linux/mm.h       | 23 +++++++++++++++++++
 include/linux/mm_types.h |  9 ++++++++
 kernel/fork.c            |  2 ++
 kernel/sched/fair.c      | 49 ++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c         |  1 +
 mm/memory.c              |  1 +
 6 files changed, 85 insertions(+)

--
2.34.1