Re: [PATCH V2 2/3] sched/numa: Enhance vma scanning logic

From: Raghavendra K T
Date: Tue Feb 07 2023 - 01:42:41 EST


On 2/4/2023 11:44 PM, Raghavendra K T wrote:
On 2/3/2023 4:45 PM, Peter Zijlstra wrote:
On Wed, Feb 01, 2023 at 01:32:21PM +0530, Raghavendra K T wrote:
[...]

+        if (!vma_is_accessed(vma))
+            continue;
+
          do {
              start = max(start, vma->vm_start);
              end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);


This feels wrong, specifically we track numa_scan_offset per mm, now, if
we divide the threads into two dis-joint groups each only using their
own set of vmas (in fact quite common for workloads with proper data
partitioning) it is possible to consistently sample one set of threads
and thus not scan the other set of vmas.

It seems somewhat unlikely, but not impossible to create significant
unfairness.


Agreed. But that is the reason why we want to allow the first few
scans unconditionally. Or am I missing something?


Thinking further, maybe we can break the case of two disjoint sets of
threads down into the following aspects:

1) Unfairness because of the way in which threads get the opportunity
to scan.

2) The disjoint sets of vmas in the partition could be of different sizes.

3) The disjoint sets of vmas could be associated with different numbers of
threads.

Each of the above can potentially help, or make some threads do the
heavy lifting.

But (2) and (3) are what I think we are trying to be okay with, by
making sure tasks mostly do not scan others' vmas.

(1) could be a real issue (though I know there are many places where we
have corrected it by introducing an offset in p->numa_next_scan).
I will take it as a TODO to check how the distribution looks in practice
now, and post the results.
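
To make concern (1) concrete, here is a rough user-space toy model, not
kernel code. Everything in it is hypothetical: the toy_* names, the fixed
count of 8 vmas, the interleaved ownership, and the assumption that exactly
one thread wins the right to scan in each round. It only mimics the shape
of the logic: a shared per-mm scan offset plus a vma_is_accessed()-style
skip.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_VMAS 8

/* Hypothetical toy VMA: used by exactly one thread group (0 or 1). */
struct toy_vma {
	int owner_group;
	int scans;
};

enum toy_policy {
	TOY_GROUP0_ALWAYS_WINS,	/* worst case: one group gets every scan */
	TOY_ALTERNATE,		/* the two groups take turns scanning */
};

struct toy_result {
	int scans[2];	/* total scans of the VMAs owned by each group */
};

/* Stand-in for the patch's vma_is_accessed() check (an assumption). */
static bool toy_vma_is_accessed(const struct toy_vma *vma, int group)
{
	return vma->owner_group == group;
}

/*
 * One scan pass by a thread of @group. It resumes from the shared per-mm
 * @offset, skips VMAs its group does not use, and stops once @budget VMAs
 * were scanned or after one full walk over all VMAs.
 */
static void toy_scan(struct toy_vma *vmas, int *offset, int group, int budget)
{
	for (int i = 0; i < TOY_NR_VMAS && budget > 0; i++) {
		struct toy_vma *vma = &vmas[*offset];

		*offset = (*offset + 1) % TOY_NR_VMAS;
		if (!toy_vma_is_accessed(vma, group))
			continue;
		vma->scans++;
		budget--;
	}
}

/* Run @rounds scan passes and report how often each group's VMAs were scanned. */
struct toy_result toy_run(enum toy_policy policy, int rounds, int budget)
{
	struct toy_vma vmas[TOY_NR_VMAS];
	struct toy_result res = { { 0, 0 } };
	int offset = 0;

	assert(rounds >= 0 && budget > 0);

	/* Interleave ownership: even VMAs -> group 0, odd VMAs -> group 1. */
	for (int i = 0; i < TOY_NR_VMAS; i++) {
		vmas[i].owner_group = i % 2;
		vmas[i].scans = 0;
	}

	for (int r = 0; r < rounds; r++) {
		int group = (policy == TOY_ALTERNATE) ? r % 2 : 0;

		toy_scan(vmas, &offset, group, budget);
	}

	for (int i = 0; i < TOY_NR_VMAS; i++)
		res.scans[vmas[i].owner_group] += vmas[i].scans;

	return res;
}
```

With toy_run(TOY_GROUP0_ALWAYS_WINS, 16, 2), the vmas of group 1 are never
scanned (scans[1] stays 0), while toy_run(TOY_ALTERNATE, 16, 2) splits the
scans evenly between the two groups. Real behaviour depends on how threads
actually win the scan opportunity, which is exactly the distribution that
still needs to be measured.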