Re: [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning

From: Mel Gorman
Date: Tue Sep 19 2023 - 12:22:23 EST


On Tue, Sep 19, 2023 at 11:28:30AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 29, 2023 at 11:36:08AM +0530, Raghavendra K T wrote:
>
> > Peter Zijlstra (1):
> > sched/numa: Increase tasks' access history
> >
> > Raghavendra K T (5):
> > sched/numa: Move up the access pid reset logic
> > sched/numa: Add disjoint vma unconditional scan logic
> > sched/numa: Remove unconditional scan logic using mm numa_scan_seq
> > sched/numa: Allow recently accessed VMAs to be scanned
> > sched/numa: Allow scanning of shared VMAs
> >
> > include/linux/mm.h | 12 +++--
> > include/linux/mm_types.h | 5 +-
> > kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++-------
> > 3 files changed, 102 insertions(+), 24 deletions(-)
>
> So I don't immediately see anything horrible with this. Mel, do you have
> a few cycles to go over this as well?

I've been trying my best to find the necessary time and it's still on my
radar for this week. Preliminary results don't look great for the first part
of the series up to the patch "sched/numa: Add disjoint vma unconditional
scan logic" even though other reports indicate the performance may be
fixed up later in the series. For example

autonumabench
                                 6.5.0-rc6            6.5.0-rc6
                       sched-pidclear-v1r5 sched-forcescan-v1r5
Min       syst-NUMA02      1.94 (   0.00%)      1.38 (  28.87%)
Min       elsp-NUMA02     12.67 (   0.00%)     21.02 ( -65.90%)
Amean     syst-NUMA02      2.35 (   0.00%)      1.86 (  21.13%)
Amean     elsp-NUMA02     12.93 (   0.00%)     21.69 * -67.76%*
Stddev    syst-NUMA02      0.54 (   0.00%)      0.90 ( -67.67%)
Stddev    elsp-NUMA02      0.18 (   0.00%)      0.44 (-144.19%)
CoeffVar  syst-NUMA02     22.82 (   0.00%)     48.50 (-112.58%)
CoeffVar  elsp-NUMA02      1.38 (   0.00%)      2.01 ( -45.56%)
Max       syst-NUMA02      3.15 (   0.00%)      3.89 ( -23.49%)
Max       elsp-NUMA02     13.16 (   0.00%)     22.36 ( -69.91%)
BAmean-50 syst-NUMA02      2.01 (   0.00%)      1.45 (  27.69%)
BAmean-50 elsp-NUMA02     12.77 (   0.00%)     21.34 ( -67.04%)
BAmean-95 syst-NUMA02      2.22 (   0.00%)      1.52 (  31.68%)
BAmean-95 elsp-NUMA02     12.89 (   0.00%)     21.58 ( -67.39%)
BAmean-99 syst-NUMA02      2.22 (   0.00%)      1.52 (  31.68%)
BAmean-99 elsp-NUMA02     12.89 (   0.00%)     21.58 ( -67.39%)

                           6.5.0-rc6             6.5.0-rc6
                 sched-pidclear-v1r5  sched-forcescan-v1r5
Duration User                5702.00              10264.25
Duration System                17.02                 13.59
Duration Elapsed               92.57                156.30

Similar results seen across multiple machines. It's not universally bad
but the NUMA02 tests appear to suffer quite badly and while not realistic,
they are somewhat relevant because numa02 is likely an "adverse workload"
for the logic that skips VMAs based on PID accesses.
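
As a reading aid for the bracketed percentages in the autonumabench table, here
is a minimal sketch of how they appear to be derived. It assumes the figure is
the change relative to the first (baseline) column, with positive meaning less
time spent. The helper name pct_gain is made up for illustration and is not
part of any benchmark tooling.

```c
/*
 * Relative change of 'patched' against 'baseline', in percent.
 * Assumption: lower is better for time metrics, so a positive result
 * means the patched kernel spent less time and a negative result is a
 * regression.
 */
static inline double pct_gain(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

For the Min rows, pct_gain(1.94, 1.38) comes to about 28.87 and
pct_gain(12.67, 21.02) to about -65.90, which matches the table. Other rows
can differ slightly when recomputed because the printed values are rounded.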

For the rest of the series, the changelogs lacked detail on why those
changes helped. Patch 4's changelog lacks detail and patch 6 stating
"VMAs being accessed by more than two tasks are critical" is not helpful
either -- e.g. why are they critical? They are obviously shared VMAs and
therefore it may be the case that they need to be identified and interleaved
quickly but maybe not. Is the shared VMA that is critical a large malloc'd
area split into per-thread sections or something that is MAP_SHARED? The
changelog doesn't say so I have to guess. There are also a bunch of
magic variables with limited explanation (e.g. why NR_ACCESS_PID_HIST==4
and SHARED_VMA_THRESH==3?), the numab fields are not documented and the
changelogs lack supporting data. I suspect that patches 3-6 may be dealing
with regressions introduced by patch 2, particularly for NUMA02, but I'm
not certain as I didn't dedicate the necessary test time to prove that
and it's the type of information that should be in the changelog. While
there is nothing wrong with that as such, it's very hard to imagine how
patches 3-6 work in every case and be certain that the various parameters
make sense. That could cause difficulties later in terms of maintenance.

My initial thinking was "There should be a standalone series that deals
*only* with scanning VMAs that had no fault activity and skipped due to
PID hashing". These are important because there may be no fault activity
because there is no scan activity, which in turn is due to no fault activity. The
series is incomplete and without changelogs but I pushed it anyway to

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ sched-numabselective-v1r5

The first two patches simply improve the documentation on what is going
on, patch 3 adds a tracepoint for figuring out why VMAs were skipped or
not skipped. Patch 4 handles a corner case to complete the scan of a VMA
once it has started regardless of what task is doing the scanning. The
last patch scans VMAs that have seen no fault activity once active VMAs
have been scanned.
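
To make the problem being solved concrete, here is a rough, self-contained
model of PID-hash-based skipping and of a forced pass over idle VMAs. It is a
sketch only and is not the code in the branch above or in kernel/sched/fair.c:
struct vma_model, pid_bit(), vma_record_fault(), vma_skip_scan() and the
force_idle_pass flag are all hypothetical names, and the 64-bit mask with a
modulo hash is an assumed simplification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for per-VMA NUMA balancing state. */
struct vma_model {
	uint64_t pids_active;	/* one bit per hashed PID that faulted here */
};

/* Hypothetical hash: fold a PID onto one of 64 bits. */
static inline uint64_t pid_bit(int pid)
{
	return UINT64_C(1) << ((unsigned int)pid % 64);
}

/* Record that task 'pid' took a NUMA hinting fault in this VMA. */
static inline void vma_record_fault(struct vma_model *vma, int pid)
{
	vma->pids_active |= pid_bit(pid);
}

/*
 * Decide whether the scanner running in task 'pid' skips this VMA.
 * 'force_idle_pass' models a later pass, made once the active VMAs
 * have been scanned, that picks up VMAs with no fault activity at all.
 */
static inline bool vma_skip_scan(const struct vma_model *vma, int pid,
				 bool force_idle_pass)
{
	if (vma->pids_active & pid_bit(pid))
		return false;	/* this task faulted here, so scan it */
	if (force_idle_pass && vma->pids_active == 0)
		return false;	/* no faults recorded by anyone, scan anyway */
	return true;		/* otherwise skip */
}
```

In this model a VMA with an empty mask is skipped on every normal pass, so it
never receives hinting faults and its mask stays empty. Only the forced pass
breaks that cycle, which is also why it costs extra scanning.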

It has its weaknesses because it may be overly simplistic and it forces
all VMAs to be scanned on every sequence which is wasteful. It also hurts
NUMA02 performance, although not as badly as "sched/numa: Add disjoint
vma unconditional scan logic". On the plus side, it is easier to reason
about, it solves only one problem in the series and any patch on top or
modification should justify each change individually.

--
Mel Gorman
SUSE Labs