[PATCH V2 3/3] sched/numa: Reset the accessing PID information periodically

From: Raghavendra K T
Date: Wed Feb 01 2023 - 03:04:03 EST


This helps ensure that only tasks (PIDs) that accessed a VMA recently
scan it.

Current implementation:
Reset the accessing PIDs every (4 * sysctl_numa_balancing_scan_delay)
after the initial scan delay period expires. The reset logic is
implemented in the scan path.

Suggested-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Signed-off-by: Raghavendra K T <raghavendra.kt@xxxxxxx>
---
Some potential ideas for clearing the accessing PIDs:

1) Use a flag to indicate the phase in the life cycle of the VMA and tie it to a timestamp (reuse next_scan or similar)

VMA life cycle

t1       t2       t3       t4       t5       t6
|<- DS ->|<- US ->|<- CS ->|<- US ->|<- CS ->|
Flags indicate whether the VMA is in the DS, US, or CS phase.

DS (delay scan): Initial phase where scanning is avoided for a new VMA
US (unconditional scan): Brief period where scanning is allowed irrespective of which task faulted on the VMA
CS (conditional scan): Longer conditional scanning phase where a task may scan only VMAs of interest to it
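
As a rough illustration of the phase idea above, here is a minimal user-space
sketch in C. It derives the phase from elapsed time instead of stored flags,
and the names (vma_phase, vma_phase_at) and the three duration parameters are
hypothetical; the notes above do not specify how long each phase would last.

```c
/* Hypothetical model of the DS/US/CS life cycle; not kernel code. */
enum vma_phase {
	VMA_PHASE_DS,	/* delay scan */
	VMA_PHASE_US,	/* unconditional scan */
	VMA_PHASE_CS,	/* conditional scan */
};

/*
 * Return the phase of a VMA that was created 'elapsed' time units ago.
 * After the initial delay (ds_len), US and CS phases alternate.
 * Assumption: us_len + cs_len is non-zero.
 */
static enum vma_phase vma_phase_at(unsigned long elapsed, unsigned long ds_len,
				   unsigned long us_len, unsigned long cs_len)
{
	unsigned long offset;

	if (elapsed < ds_len)
		return VMA_PHASE_DS;

	/* Position inside the current US+CS cycle */
	offset = (elapsed - ds_len) % (us_len + cs_len);

	return offset < us_len ? VMA_PHASE_US : VMA_PHASE_CS;
}
```

A scanner following this idea would skip the VMA in DS, scan it for any task
in US, and fall back to the vma_is_accessed() style check in CS.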


2) Maintain a duplicate list of accessing PIDs to keep a history of accesses, switch/reset the lists periodically, and OR them during iteration

Two lists of PIDs are maintained. At regular intervals the old list is reset and the current list becomes the old list.
At any point in time, the set of PIDs accessing the VMA is determined by ORing list1 and list2.

accessing_pids_list1 <- current list
accessing_pids_list2 <- old list
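
The two-list scheme could be sketched in user-space C as below. This is only
an illustration of the idea, not part of this patch: struct pid_history and
its helpers are hypothetical names, each "list" is modelled as a bitmask in
the same way as accessing_pids, and pid_bit is assumed to be smaller than the
number of bits in unsigned long.

```c
#include <stdbool.h>

/* Hypothetical user-space model; not the kernel's struct vma_numab. */
struct pid_history {
	unsigned long cur;	/* accessing_pids_list1: current window */
	unsigned long old;	/* accessing_pids_list2: previous window */
};

/* Record an access by the task hashed to pid_bit in the current window. */
static void pid_history_record(struct pid_history *h, unsigned int pid_bit)
{
	h->cur |= 1UL << pid_bit;
}

/* A task counts as accessing if it shows up in either window. */
static bool pid_history_accessed(const struct pid_history *h,
				 unsigned int pid_bit)
{
	return ((h->cur | h->old) & (1UL << pid_bit)) != 0;
}

/* At each interval: drop the old window, current becomes old. */
static void pid_history_rotate(struct pid_history *h)
{
	h->old = h->cur;
	h->cur = 0;
}
```

With this layout an access survives one rotation and is forgotten after two,
so a task is remembered for between one and two intervals.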

3) Also maintain a per-VMA numa_seq
Currently numa_seq (how many times the entire set of VMAs has been scanned) is maintained at the mm level.
Having a per-VMA numa_seq (roughly, how many times the current VMA has been considered for scanning) may be helpful
in some contexts (e.g., deciding whether to allow unconditional scanning of a newly created VMA).

4) Reset accessing PIDs at regular intervals (current implementation)

t1       t2       t3       t4       t5       t6
|<- DS ->|<- CS ->|<- CS ->|<- CS ->|<- CS ->|

The current implementation resets the accessing PIDs every 4 * scan_delay after the initial scan delay
expires. The reset logic is implemented in the scan path.
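
The timing of approach 4 can be modelled in user space roughly as follows.
This is a simplified sketch, not the kernel code: struct vma_model,
tick_after() and the helper names are hypothetical, time is a plain tick
counter in place of jiffies, and the mm->numa_scan_seq and vma_is_accessed()
checks from the scan path are left out.

```c
#include <stdbool.h>

#define RESET_PERIOD_MULT 4UL	/* mirrors the 4 in VMA_PID_RESET_PERIOD */

/* Hypothetical stand-in for struct vma_numab */
struct vma_model {
	unsigned long next_scan;
	unsigned long next_pid_reset;
	unsigned long accessing_pids;
};

/* Wrap-safe "a is after b", the same idea as the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* First reset is due 4 scan delays after scanning may start. */
static void vma_model_init(struct vma_model *v, unsigned long now,
			   unsigned long scan_delay)
{
	v->next_scan = now + scan_delay;
	v->next_pid_reset = v->next_scan + RESET_PERIOD_MULT * scan_delay;
	v->accessing_pids = 0;
}

/* Clear the PID mask once the deadline has passed; returns true on reset. */
static bool vma_model_maybe_reset(struct vma_model *v, unsigned long now,
				  unsigned long scan_delay)
{
	if (!tick_after(now, v->next_pid_reset))
		return false;

	v->next_pid_reset += RESET_PERIOD_MULT * scan_delay;
	v->accessing_pids = 0;
	return true;
}
```

For example, with scan_delay = 1000 ticks and a VMA first seen at tick 0, the
first reset becomes due after tick 5000 and the next one after tick 9000.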

include/linux/mm_types.h | 1 +
kernel/sched/fair.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 980a6a4308b6..08a007744ea1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -437,6 +437,7 @@ struct anon_vma_name {

struct vma_numab {
unsigned long next_scan;
+ unsigned long next_pid_reset;
unsigned long accessing_pids;
};

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3505ae57c07c..14db6d8a5090 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2928,6 +2928,8 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
return vma->numab->accessing_pids & (1UL << active_pid_bit);
}

+#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
+
/*
* The expensive part of numa migration is done from task_work context.
* Triggered from task_tick_numa().
@@ -3035,6 +3037,10 @@ static void task_numa_work(struct callback_head *work)

vma->numab->next_scan = now +
msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+
+ /* First PID reset is due 4 scan delays after scanning starts */
+ vma->numab->next_pid_reset = vma->numab->next_scan +
+ msecs_to_jiffies(VMA_PID_RESET_PERIOD);
}

/*
@@ -3047,6 +3053,17 @@ static void task_numa_work(struct callback_head *work)
if (!vma_is_accessed(vma))
continue;

+ /*
+ * Reset accessing PIDs regularly for old VMAs. Do it after checking the
+ * VMA for recent access so PID info is not cleared before it is used.
+ */
+ if (mm->numa_scan_seq &&
+ time_after(jiffies, vma->numab->next_pid_reset)) {
+ vma->numab->next_pid_reset +=
+ msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+ vma->numab->accessing_pids = 0;
+ }
+
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
--
2.34.1