Re: [RFC PATCH V3 1/1] sched/numa: Fix disjoint set vma scan regression

From: Raghavendra K T
Date: Fri Aug 11 2023 - 09:35:39 EST


On 7/24/2023 1:11 PM, Raghavendra K T wrote:
On 7/21/2023 8:48 PM, Mel Gorman wrote:
On Wed, May 31, 2023 at 09:55:26AM +0530, Raghavendra K T wrote:
  With the numa scan enhancements [1], only the threads which had previously
accessed the vma are allowed to scan.


I know it's very late to be reading this properly for the first time. It's a
brief review as it's late but I'm not 100% convinced by the patch as-is.


Thank you Mel for the review. Expert opinion / intuition from you and
PeterZ helps a lot in finally solving the problem in a better way.
[...]
---
  include/linux/mm_types.h |  1 +
  kernel/sched/fair.c      | 31 ++++++++++++++++++++++++-------
  2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..992e460a713e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
      unsigned long next_scan;
      unsigned long next_pid_reset;
      unsigned long access_pids[2];
+    unsigned int scan_counter;
  };

Vague name as it's not counting and gets reset. It might have been clearer
to name it something like scan_selective with an initial value related to
the VMA size that decrements. When it hits 0, the scan is forced *once*
for the unlucky task. The suggested name is also bad, I'm terrible at
naming but "counter" gives no hints either.


Agree that naming is bad; will use scan_selective, or try something else if
I get a better idea on naming.
So you are suggesting to try the other way, where we force the scan once we
hit zero. Looks fair to me. Will try the idea and see how it goes.


Hello Mel,
Sorry for going slow from my side. I was experimenting further with
combination patches; I will be posting the set of patches that I mention
further down here.

Just want us to be on the same page, so I am posting the full description
of the problem, details of the solution, and what it looks like (code), etc.

Also sorry for the long description.

Problem statement (Disjoint VMA set):
====================
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
vma scanning is allowed only if
1) the task had accessed the vma, or
2) we are in the early phase of the scan, where mm->numa_scan_seq is less than 2.
While this works most of the time to reduce scanning overhead, there are
concerning corner case(s) associated with it.
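
For reference, here is a minimal sketch of that gating check, roughly as it
looks after commit fc137c0ddab2 (simplified from upstream, so treat the
exact code as an approximation):

static bool vma_is_accessed(struct vm_area_struct *vma)
{
	unsigned long pids;

	/*
	 * Early phase: scan unconditionally during the first two full passes
	 * so that every vma gets PROT_NONE faults installed at least once.
	 */
	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
		return true;

	/* Otherwise scan only if this task appears in the vma's access history */
	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
}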

Let's look at these corner cases with the below example of tasks and their access pattern.
Set 1 tasks accessing vma_x (group of vmas)
Set 2 tasks accessing vma_y (group of vmas)

         Set1                       Set2
 --------------------      ----------------------
 | task_1..task_n/2 |      | task_n/2+1..task_n |
 --------------------      ----------------------
           |                          |
           V                          V
 --------------------      ----------------------
 |       vma_x      |      |        vma_y       |
 --------------------      ----------------------

 Figure (1): Tasks accessing disjoint vmas

Corner cases:
(a) Out of N tasks, not all of them get a fair opportunity to scan (PeterZ).
Suppose Set1 tasks get more opportunities to scan in the above example;
then vma_x gets scanned more times than vma_y.

Some experiments were also done here:
Link: https://lore.kernel.org/lkml/c730dee0-a711-8a8e-3eb1-1bfdd21e6add@xxxxxxx/

(b) Sizes of vmas can differ.
Suppose the size of vma_y is far greater than that of vma_x; then a bigger
portion of vma_y can potentially be left unscanned, since scanning is bounded
by scan_size.

(c) Highly active threads trap a few vmas frequently, while some vmas that
are not accessed for a long time can potentially be starved of scanning
indefinitely (Mel), because the task access history gets reset periodically.
(Not sure if this is a big issue when the vma is not important, but on the
other hand we would be left with no hints for migration later if necessary.)

(d) Allocation of memory in some specific manner (Mel).

(e) vmas that are created after two full scans of mm (mm->numa_scan_seq > 2)
will never get scanned. (Though I have not seen this occurring, except in
some rare cases.)

On top of this, a combination of some of the above (e.g., (a) and (b)) can
potentially amplify the side effects of numa scanning overall.

High level ideas to address the issue:
===================================
Idea 1) Depending on vma_size, populate a per-vma scan_selective value;
decrement it, and when it hits zero do a force scan (Mel).

This is how the vma scan phases look after implementation:

|<--p1--->|<-----p2----->|<-----p2----->|...

p1: new vma; initial phase, do not scan till scan_delay

p2: allow scanning if the task has accessed the vma or vma_scan_selective has hit zero

repeat p2

pros/cons:
+ : rate limiting is built into the approach
+/-: scanning continues forever
- : changes in vma size are only taken into account after a force scan

Idea 2) Have a per-vma vma_scan_seq. Allow unconditional scanning till
vma_scan_seq reaches a value proportional (or equal) to vma_size/scan_size.

This is how the vma scan phases look after implementation:

|<--p1--->|<-----p2----->|<-----p3----->|<-----p4----->...||<-----p2----->|<-----p3----->|<-----p4----->...||
                                                        RESET                                            RESET

p1: new vma; initial phase, do not scan till scan_delay

p2: allow scanning if the task has accessed the vma, or unconditionally until
vma_scan_seq reaches f1(vma_size)/scan_size, e.g., the number of scans needed
to cover the full (or half of the) vma

p3: allow scanning if the task has accessed the vma, or unconditionally until
vma_scan_seq reaches f2(vma_size)/scan_size, in a rate-limited manner (optional)

p4: allow scanning iff the task has accessed the vma

Reset after p4 (f(scan_delay) ??)

and repeat p2, p3, p4

+ : rate limiting needs to be taken care of separately, if needed
+/-: scanning continues only if the RESET of vma_scan_seq is implemented
+ : changes in vma size are taken into account on every scan
- :

Additional optimizations (unconditionally allow scanning if the vma is hot,
i.e., either it was accessed very recently or the number of tasks accessing
the vma is very high):

Idea 3) Take the bitmap_weight of the vma's pid access history (suggested by Bharata):
(vma->numab_state->access_pids[1] | vma->numab_state->access_pids[0])
If the number of tasks accessing the vma is > THRESHOLD (say 3),
unconditionally allow scanning.

Idea 4) Take the bitmap_weight of the latest access_pids value
(vma->numab_state->access_pids[1]).
If the number of tasks accessing the vma is >= 1, i.e., it was recently
accessed, unconditionally allow scanning.

Other ideas still to explore:
5) Can we maintain more pid access history (vs. the last two windows
maintained now)? The whole idea, with code, has already been given by PeterZ;
a rough sketch of the direction is below.
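
A minimal sketch of what deeper history could look like, assuming a
hypothetical NR_ACCESS_PID_HIST depth and a rolling window index (these
names are illustrative only; this is not PeterZ's exact code):

#define NR_ACCESS_PID_HIST 4	/* hypothetical depth; upstream keeps 2 */

struct vma_numab_state_sketch {
	unsigned long access_pids[NR_ACCESS_PID_HIST];
	unsigned int pid_hist_idx;	/* window currently being filled */
};

/* Record the current task in the newest window */
static void record_access_pid(struct vma_numab_state_sketch *ns, unsigned int pid)
{
	__set_bit(hash_32(pid, ilog2(BITS_PER_LONG)),
		  &ns->access_pids[ns->pid_hist_idx]);
}

/* On each reset period, rotate to the next window instead of dropping history */
static void rotate_access_pid_hist(struct vma_numab_state_sketch *ns)
{
	ns->pid_hist_idx = (ns->pid_hist_idx + 1) % NR_ACCESS_PID_HIST;
	ns->access_pids[ns->pid_hist_idx] = 0;
}

/* A task counts as having accessed the vma if it appears in any window */
static bool task_in_access_hist(struct vma_numab_state_sketch *ns, unsigned int pid)
{
	unsigned long pids = 0;
	int i;

	for (i = 0; i < NR_ACCESS_PID_HIST; i++)
		pids |= ns->access_pids[i];

	return test_bit(hash_32(pid, ilog2(BITS_PER_LONG)), &pids);
}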

Some of the observations from my experiments:
-----------------------------------
- It is very critical to allow scanning after the initial scan_delay (as
mentioned by Mel), so as to capture early access patterns.
- It is also important to limit (or rate limit) unconditional scans to avoid
excess scanning overhead.

Example code for idea1:
=======================
static inline unsigned int vma_scan_ratelimit(struct vm_area_struct *vma)
{
	unsigned int vma_size, ratelimit = 0;

	/* vma size in bytes */
	vma_size = (vma->vm_end - vma->vm_start);

	if (vma_size)
		ratelimit = (1 << 15) / vma_size;

	/*
	 * e.g., for a 4KB vma the force scan rate is 1 in 9, for a 32KB vma
	 * it is 1 in 2, and for >= 64KB every attempt qualifies (1 in 1)
	 */
	return 1 + ratelimit;
}

static bool task_disjoint_vma_select(struct vm_area_struct *vma)
{
	if (vma->numab_state->vma_scan_selective > 0) {
		vma->numab_state->vma_scan_selective--;
		return false;
	}

	/* Counter hit zero: re-arm it and force the scan once */
	vma->numab_state->vma_scan_selective = vma_scan_ratelimit(vma);
	return true;
}
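
For context, a hedged sketch of how this could hook into the per-vma gating
in task_numa_work() (the call site below is simplified and assumed, not taken
from the posted patch):

	/* Inside the vma iteration of task_numa_work() */
	if (!vma_is_accessed(vma) && !task_disjoint_vma_select(vma))
		continue;	/* neither access history nor the force counter selected it */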

Example code for idea 2:
=======================
static inline unsigned int disjoint_vma_scan_threshold(struct vm_area_struct *vma)
{
	unsigned int vma_size, scan_size, threshold;

	scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);

	/* vma size in MB */
	vma_size = (vma->vm_end - vma->vm_start) >> 20;

	/* Total scans needed to cover the VMA. Add a minimum of one */
	threshold = 1 + (vma_size / scan_size);

	return threshold;
}

static bool task_disjoint_vma_select(struct vm_area_struct *vma)
{
	return (vma->numab_state->vma_scan_seq <= disjoint_vma_scan_threshold(vma));
}
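
One piece not shown above is where vma_scan_seq advances; a minimal sketch,
assuming it is bumped wherever the vma actually gets scanned in
task_numa_work() (the placement is an assumption, not settled code):

	/* After a range of this vma has been scanned in task_numa_work() */
	vma->numab_state->vma_scan_seq++;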

Example code for idea 3,4 combined:
================================

#define HOT_VMA_NR_TASK_THRESH 3

static bool vma_is_accessed(struct vm_area_struct *vma)
{
	unsigned long pids, recent_hist;
	unsigned int hashval;

	recent_hist = READ_ONCE(vma->numab_state->access_pids[1]);
	pids = vma->numab_state->access_pids[0] | recent_hist;

	/*
	 * Check if the vma was recently accessed OR
	 * the vma is being accessed by more than 2 tasks
	 */
	if (recent_hist ||
	    bitmap_weight(&pids, BITS_PER_LONG) >= HOT_VMA_NR_TASK_THRESH)
		return true;

	hashval = hash_32(current->pid, ilog2(BITS_PER_LONG));

	return test_bit(hashval, &pids);
}

Additionally, to implement the vma_scan_seq reset I have used something like:

#define VMA_SCANSEQ_RESET_PERIOD (4 * VMA_PID_RESET_PERIOD)
/* this is to reset the per-vma scan_seq */
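
A minimal sketch of that reset, assuming the same jiffies-based pattern as
the existing access_pids reset in task_numa_work(); next_scan_seq_reset is a
hypothetical field, not part of the posted patch:

	/* Periodically forget per-vma scan progress so scanning can restart */
	if (mm->numa_scan_seq &&
	    time_after(jiffies, vma->numab_state->next_scan_seq_reset)) {
		vma->numab_state->next_scan_seq_reset = jiffies +
			msecs_to_jiffies(VMA_SCANSEQ_RESET_PERIOD);
		vma->numab_state->vma_scan_seq = 0;
	}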

I will be posting a combination of idea 1 with ideas 3 and 4 by next week,
mostly as a new enhancement series.

(Or ideas 2, 3 and 4, in case you have any positive thoughts about idea 2.)

Initial results are looking good for both approaches; I am taking more time
to experiment further and fine-tune.

Thanks
- Raghu