Re: [RFC PATCH V3 1/1] sched/numa: Fix disjoint set vma scan regression

From: Raghavendra K T
Date: Mon Jul 24 2023 - 03:42:19 EST


On 7/21/2023 8:48 PM, Mel Gorman wrote:
On Wed, May 31, 2023 at 09:55:26AM +0530, Raghavendra K T wrote:
With the numa scan enhancements [1], only the threads which had previously
accessed vma are allowed to scan.


I know it's very late to be reading this properly for the first time. It's a
brief review as it's late but I'm not 100% convinced by the patch as-is.


Thank you Mel for the review. Expert opinion / intuition from you,
PeterZ helped, helps alot to improve to solve the problem in better way finally.

To start off -- reference the specific commit e.g. Since commit fc137c0ddab2
("sched/numa: enhance vma scanning logic")...


Okay.

While this had improved significant system time overhead, there were corner
cases, which genuinely need some relaxation. For e.g.,

1) Concern raised by PeterZ, where if there are N partition sets of vmas
belonging to tasks, then unfairness in allowing these threads to scan could
potentially amplify the side effect of some of the vmas being left
unscanned.


This needs a much better description as there is too much guesswork
required to figure out what problem is being solved.

Agree I needed to be more descriptive.

You may be referring
to an issue where a VMA not accessed for a long time may be skipped for
scanning indefinitely.
I feel this is mostly a favorable case because we don't incur
overhead of scanning unneeded VMA. so not this, but below ones

You could also be referring to an issue where a
highly active thread always traps the NUMA hinting fault first and other
tasks never pass the vma_is_accessed test. It also could be the case that
due to a malloc implementation behaviour that the thread using a particular
VMA changes and it's not detected. It's simply not clear how the VMAs are
partitioned or what exactly is unfair. I certainly can guess a number of
reasons why a patch like this is necessary but in a years time, it'll be
hard to be certain about the original intent.

Agree, to summarize
here there are multiple layers of issues

1) PeterZ pointed this

(Thread1/Set1)---->access(vma1) <== gets 65% opportunity

(Thread2/Set2)---->access(vma2) <== gets 35% opprtunity

Suppose Thread1 (or threads from Set1) are actively getting more scanning opportunity, we may miss or scan less of vma2.

I did some experiment and found there was indeed a bit of unfairness in
the way set1/set2 gets opportunity to scan.

Link: https://lore.kernel.org/lkml/c730dee0-a711-8a8e-3eb1-1bfdd21e6add@xxxxxxx/

2) Suppose for the above case if we have
vma1, vma2 represents sum of vmas accessed by set1 and set2 respectively in disjoint way.

and vma1 size is far less than vma2 size, effect of unfairness in scanning gets amplified. e.g.,

(Thread1/Set1)---->access(vma1 size = 1MB) <== actively scanning

(Thread2/Set2)---->access(vma2 size = 10MB)


3) malloc behavior you pointed.

I was trying to minimize the side effect of unfairness in scanning
overall by introducing a bit of unconditional scans.

(But the unfairness amongst the threads, which thread (or set of thread) gets opportunity to scan remains)

I may be missing some more cases here


2) Below reports of LKP numa01 benchmark regression.

Currently this is handled by allowing first two scanning unconditional
as indicated by mm->numa_scan_seq. This is imprecise since for some
benchmark vma scanning might itself start at numa_scan_seq > 2.


Well, it's also not useful in many cases. There is nothing special about
the first two scans other than they happen early in the lifetime of the
process. A major change in phase behaviour that occurs after 2 scans
could miss the new access pattern.

Fully agree. That is why that condition is removed in the patch now.
What I also saw was some times when new vma gets created
mm->numa_scan_seq could be already more than 2.
(To handle that correctly I would have to have per vma start_numa_seq
kind of things etc which was messy and more space consuming)

Note:
Somewhat related, In my experiments what I saw is normal short
running benchmarks used to finish by numa_scan_seq < 16 whereas some benchmark like NAS used to go till 100-150. (to set context for any long running benchmark optimization in future)


Solution:
Allow unconditional scanning of vmas of tasks depending on vma size. This
is achieved by maintaining a per vma scan counter, where

f(allowed_to_scan) = f(scan_counter < vma_size / scan_size)


This is a vague description as well and does not mention that the
scan_counter resets and the code does not appear to match the equation.
It's also a bit arbitrary that unconditional scanning occurs for every task
after the scan_count passes a threshold. The timing of when this occurs may
vary and all threads sharing the address space may conduct the scanning
which may be overkill. It should only be necessary for at least one task
to unconditionally mark the VMA for hinting, no? I'm generally a bit wary
that the time it takes to detect a VMA needs unconditonal scanning depends
on the size as due to VMA-merging,

I do agree that changing of vma size due to merge/split can affect
scanning. But I was assuming that it is Okay because this calculation is
done every time we hit task_numa_work which may be wrong (and imprecise).

it may be non-deterministic when a VMA
gets scanned and there may be run-to-run variance to the timing of threads
allocating address space >

Agree

Result:
numa01_THREAD_ALLOC result on 6.4.0-rc2 (that has numascan enhancement)
base-numascan base base+fix
real 1m1.507s 1m23.259s 1m2.632s
user 213m51.336s 251m46.363s 220m35.528s
sys 3m3.397s 0m12.492s 2m41.393s

numa_hit 5615517 4560123 4963875
numa_local 5615505 4560024 4963700
numa_other 12 99 175
numa_pte_updates 1822797 493 1559111
numa_hint_faults 1307113 523 1469031
numa_hint_faults_local 612617 488 884829
numa_pages_migrated 694370 35 584202


It's not clear what these kernels are. Is base-numascan a 6.4-rc2 kernel
with at least commit fc137c0ddab2 ("sched/numa: enhance vma scanning
logic") reverted?

This is correct. To compare the system time overhead before commit
fc137c0ddab2 ("sched/numa: enhance vma scanning logic")

The numa_[hint|local|other] stats are not helpful. They are updated on
the allocation side and not related to actual behaviour.

will trim this next time.


As I'm unsure what the kernels are, I'm not sure how interesting it is
that the pte update stats and number of faults trapped are very different
for base to base+fix.

Number of faults trapped are excessively refined for base (6.4.0-rc2 vanilla) for this particular case, because of malloc behaviour you pointed above.

Sure enough, the overall runtime is lower so it's
*probably* good but not sure as it may be also the case that disabling NUMA
balancing would be faster again given the nature of this particular test :P

The testing is a little thin and I'd worry slightly that this patch is
very specific to this particular workload.

I do agree that it is mostly trying to address LKP numa01_THREAD_ALLOC
case.
But regarding the testing though I reported only above result,
I do run
1) kernbench, autonuma from mmtest,
2) some microbenchmarks and also NAS/hashjoin etc long running
benchmarks to ensure there are no surprises.


Summary: Regression in base is recovered by allowing scanning as required.

[1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@xxxxxxx/T/#t

Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
regression.

Fixes may be a bit overkill as this patch is more of an enhancement than
something that justifies a backport to -stable but I don't feel strongly
about it.

Sure.


Reported-by: Aithal Srikanth <sraithal@xxxxxxx>
Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
Closes: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@xxxxxxx/T/
Signed-off-by: Raghavendra K T <raghavendra.kt@xxxxxxx>
---
include/linux/mm_types.h | 1 +
kernel/sched/fair.c | 31 ++++++++++++++++++++++++-------
2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..992e460a713e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
unsigned long next_scan;
unsigned long next_pid_reset;
unsigned long access_pids[2];
+ unsigned int scan_counter;
};

Vague name as it's not counting and gets reset. It might have been clearer
to name is something like scan_selective with an initial value related to
the VMA size that decrements. When it hits 0, the scan is forced *once*
for the unlucky task. The suggested name is also bad, I'm terrible at
naming but "counter" gives no hints either.


Agree that naming is bad, will use scan_selective / or try if I get more
idea on naming.
So you are suggesting to try other-way where we force scan once we hit
zero. Looks fair to me. will try the idea and see how it goes.

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..4e71fb58085b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2931,17 +2931,30 @@ static void reset_ptenuma_scan(struct task_struct *p)
static bool vma_is_accessed(struct vm_area_struct *vma)
{
unsigned long pids;
+ unsigned int vma_size;
+ unsigned int scan_threshold;
+ unsigned int scan_size;
+
+ pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+
+ if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+ return true;
+
+ scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
+ /* vma size in MB */
+ vma_size = (vma->vm_end - vma->vm_start) >> 20;
+
+ /* Total scans needed to cover VMA */
+ scan_threshold = vma_size / scan_size;
+
/*
- * Allow unconditional access first two times, so that all the (pages)
- * of VMAs get prot_none fault introduced irrespective of accesses.
+ * Allow the scanning of half of disjoint set's VMA to induce
+ * prot_none fault irrespective of accesses.
* This is also done to avoid any side effect of task scanning
* amplifying the unfairness of disjoint set of VMAs' access.
*/

Much better description is required. It's not stated how the VMAs are
disjoint. For example, they all belong to the same address space so from that
perspective they are not disjoint. Clarify that it's related to accesses
at the very least or be clear on how a set of VMAs can be disjoint. Also,
it's not half the total VMAs or some other critieria either, it's related
to scanning events. There is a lot of guesswork involved here.

Sure. will add more details.
ALso I was not clear here where does that half came from. Idea was
allowing unconditional scanning = vma_size for a unrelated task that has
not accessed vma is too much.. since we are letting unconditional scans let it be reduced to atleast half of that. I will experiment a bit here
to get optimal f(vma_size/scan_size)


- if (READ_ONCE(current->mm->numa_scan_seq) < 2)
- return true;
-
- pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
- return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
+ scan_threshold = 1 + (scan_threshold >> 1);
+ return (vma->numab_state->scan_counter < scan_threshold);
}
#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
@@ -3058,6 +3071,8 @@ static void task_numa_work(struct callback_head *work)
/* Reset happens after 4 times scan delay of scan start */
vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+ vma->numab_state->scan_counter = 0;
}
/*
@@ -3084,6 +3099,8 @@ static void task_numa_work(struct callback_head *work)
vma->numab_state->access_pids[1] = 0;
}
+ vma->numab_state->scan_counter++;
+
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);

This has a potential timing issue as only tasks that accessed the VMA
update the counter.

I think there is a confusion because of reverted logic. we do allow
unconditionally to let the scan happen initially till scan_counter reaches scan_threshold. So even though task did not access vma, we get
to scan initially.

So idea is update scan_counter only if we are able to successfully come till here because we are sure that there is not condition that returns back after this without scanning.

A task could create a VMA, access is heavily in an
init phase and then go completely idle waiting on other tasks to complete
(e.g. parallelised load that uses a single thread to setup the data and then
creates worker threads to process that data). As there is no guarantee other
tasks would trap a fault, the access bits would never be set, the counter
never increments and the VMA is ignored forever. It's late on a Friday so
maybe I'm wrong but it looks like this could have weird corner cases.

The counter should at least increase if the VMA could have been scanned
except the access bits were not set with the caveat that it may incur
excessive cache misses.


Explained above

Thanks a lot,
Raghu