Re: [PATCH] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

From: K Prateek Nayak
Date: Fri Feb 11 2022 - 02:37:09 EST


Hello Peter,

On 2/9/2022 10:50 PM, Peter Zijlstra wrote:
>>> Where does this affinity come from?
>> The affinity comes from limiting the process to a certain subset of
>> available cpus by modifying the cpus_ptr member of task_struck
>> via taskset or numactl.
> That's obviously not an answer. Why is that done?
Sorry, I should have been more clear in my previous reply.

Currently, the scheduler is conservative while spreading tasks across
the groups of a NUMA domain: it keeps a newly woken task in the local
sched-group as long as the number of runnable tasks there is below the
allowed imbalance threshold. The imbalance threshold is 25% of the
domain's span weight in sched/tip/core. Mel's recent patchset
"Adjust NUMA imbalance for multiple LLCs"
(https://lore.kernel.org/lkml/20220208094334.16379-1-mgorman@xxxxxxxxxxxxxxxxxxx/)
makes it dependent on the number of LLCs in the NUMA domain.
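
For reference, the current upstream check boils down to something along
these lines (paraphrased sketch of the sched/fair.c logic, not the
exact code):

    /*
     * Paraphrased: allow the local group to stay "imbalanced" (i.e. keep
     * the new task local) while it runs fewer tasks than ~25% of the
     * domain span; Mel's series replaces this figure with a per-LLC count.
     */
    static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
    {
            return (dst_running < (dst_weight >> 2));
    }

    /* In the group_has_spare case of find_idlest_group(): */
    if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
            return NULL;    /* NULL => place the task in the local group */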

On AMD Zen like systems containing multiple LLCs per socket, users
want to spread bandwidth hungry applications across multiple LLCs.
Stream is one such representative workload where the best performance
is obtained by limiting one stream thread per LLC. To ensure this,
users are known to pin such bandwidth hungry tasks to a subset of the
CPUs consisting of one CPU per LLC.

Suppose we kickstart a multi-threaded task such as stream with 8
threads, using taskset or numactl, to run on a subset of CPUs of a
2 socket Zen3 server where each socket contains 128 CPUs
(0-63,128-191 in one socket, 64-127,192-255 in the other).

Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8

Here each CPU in the list is from a different LLC and 4 of those LLCs
are on one socket, while the other 4 are on another socket.

Ideally we would prefer that each stream thread runs on a different
CPU from the allowed list of CPUs. However, the current heuristics in
find_idlest_group() do not allow this during the initial placement.

Suppose the first socket (0-63,128-191) is our local group from which
we are kickstarting the stream tasks. The first four stream threads
will be placed in this socket. When it comes to placing the 5th
thread, all the allowed CPUs from the local group (0,16,32,48) would
have been taken. We can detect this by checking whether the number of
allowed CPUs in the local group is fewer than the number of tasks
running in the local group, and use this information to spread the
5th task out into the next socket (after all, the goal in this
slowpath is to find the idlest group and the idlest CPU during the
initial placement!).
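
In pseudo-C, the detection is along these lines (illustrative sketch
only, not the exact diff; it reuses select_idle_mask as a scratch mask,
as Mel suggested for V2 below):

    /*
     * Count how many of the task's allowed CPUs fall inside the local
     * group and stop keeping the task local once every allowed CPU
     * there already has a running task.
     */
    struct cpumask *mask = this_cpu_cpumask_var_ptr(select_idle_mask);
    int allowed_in_local;

    cpumask_and(mask, p->cpus_ptr, sched_group_span(local));
    allowed_in_local = cpumask_weight(mask);

    /* Local group is full w.r.t. the affinity mask: spread out */
    if (local_sgs.sum_nr_running >= allowed_in_local)
            return idlest;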

However, the current scheduler code simply checks if the number of
tasks in the local group is fewer than the allowed numa-imbalance
threshold. This threshold is 25% of the NUMA domain span
in sched/tip/core (in this case threshold = 32) and is equal to the
number of LLCs in the domain with Mel's recent v6 patchset
(https://lore.kernel.org/lkml/20220208094334.16379-1-mgorman@xxxxxxxxxxxxxxxxxxx/)
(in this case threshold = 8). For this example, the number of tasks
will always be within the threshold and thus all 8 stream threads
will be woken up on the first socket, thereby resulting in
sub-optimal performance.

Following are the results from running 8 Stream threads with and
without pinning:

               tip sched/core           tip sched/core          tip sched/core
                     + mel-v6                 + mel-v6            + this-patch
                 (no pinning)                 +pinning               + pinning

 Copy:   111309.82 (0.00 pct)    111133.84 (-0.15 pct)   151249.35 (35.88 pct)
Scale:   107391.64 (0.00 pct)    105933.51 (-1.35 pct)   144272.14 (34.34 pct)
  Add:   126090.18 (0.00 pct)    127533.88 (1.14 pct)    177142.50 (40.48 pct)
Triad:   124517.67 (0.00 pct)    126944.83 (1.94 pct)    175712.64 (41.11 pct)

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in tip/sched/core:

stream-4300    [032] d..2.   115.753321: sched_wakeup_new: comm=stream pid=4302 prio=120 target_cpu=048
stream-4300    [032] d..2.   115.753389: sched_wakeup_new: comm=stream pid=4303 prio=120 target_cpu=000
stream-4300    [032] d..2.   115.753443: sched_wakeup_new: comm=stream pid=4304 prio=120 target_cpu=016
stream-4300    [032] d..2.   115.753487: sched_wakeup_new: comm=stream pid=4305 prio=120 target_cpu=032
stream-4300    [032] d..2.   115.753539: sched_wakeup_new: comm=stream pid=4306 prio=120 target_cpu=032
stream-4300    [032] d..2.   115.753576: sched_wakeup_new: comm=stream pid=4307 prio=120 target_cpu=032
stream-4300    [032] d..2.   115.753611: sched_wakeup_new: comm=stream pid=4308 prio=120 target_cpu=032

The output from V6 of Mel's patchset is similar.
Once the first four threads are distributed among the allowed CPUs of
socket one, the rest of the threads pile up on these same CPUs even
though there are allowed CPUs on the second socket that could be used.

The following sched_wakeup_new tracepoint output shows the initial
placement of tasks after adding this fix:

stream-4733    [032] d..2.   116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
stream-4733    [032] d..2.   116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
stream-4733    [032] d..2.   116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
stream-4733    [032] d..2.   116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
stream-4733    [032] d..2.   116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
stream-4733    [032] d..2.   116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
stream-4733    [032] d..2.   116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080

We see that threads are using all of the allowed CPUs
and there is no pileup.

Output of tracepoint sched_migrate_task for sched-tip is as follows:
(output has been slightly altered for readability)

115.765048:  sched_migrate_task:  comm=stream  pid=4305  prio=120  orig_cpu=32  dest_cpu=16    START - {8}{0}
115.767042:  sched_migrate_task:  comm=stream  pid=4306  prio=120  orig_cpu=32  dest_cpu=0
115.767089:  sched_migrate_task:  comm=stream  pid=4307  prio=120  orig_cpu=32  dest_cpu=48
115.996255:  sched_migrate_task:  comm=stream  pid=4306  prio=120  orig_cpu=0  dest_cpu=64     * {7}{1}
116.039173:  sched_migrate_task:  comm=stream  pid=4304  prio=120  orig_cpu=16  dest_cpu=64    * {6}{2}
... 19 migrations
116.367329:  sched_migrate_task:  comm=stream  pid=4303  prio=120  orig_cpu=0  dest_cpu=64     * {5}{3}
... 17 migrations
116.647607:  sched_migrate_task:  comm=stream  pid=4306  prio=120  orig_cpu=64  dest_cpu=0     * {6}{2}
... 3 migrations
116.705935:  sched_migrate_task:  comm=stream  pid=4308  prio=120  orig_cpu=48  dest_cpu=80    * {5}{3}
... 15 migrations
116.921504:  sched_migrate_task:  comm=stream  pid=4300  prio=120  orig_cpu=48  dest_cpu=64    * {4}{4}
116.941469:  sched_migrate_task:  comm=stream  pid=4300  prio=120  orig_cpu=64  dest_cpu=32    * {5}{3}
... 20 migrations
117.426116:  sched_migrate_task:  comm=stream  pid=4305  prio=120  orig_cpu=32  dest_cpu=64    * {4}{4}
... 4 migrations
117.634768:  sched_migrate_task:  comm=stream  pid=4303  prio=120  orig_cpu=64  dest_cpu=16    * {5}{3}
... 5 migrations
117.775718:  sched_migrate_task:  comm=stream  pid=4303  prio=120  orig_cpu=48  dest_cpu=64    * {4}{4}
... 3 migrations
117.901872:  sched_migrate_task:  comm=stream  pid=4303  prio=120  orig_cpu=96  dest_cpu=112   END - {4}{4}

* Denotes cross NUMA migrations of task followed by number of
  stream threads running in each NUMA domain.

No output is generated for the sched_migrate_task tracepoint with
this patch: the initial placement is already optimal, which removes
the need for any later balancing for stream, both across and within
NUMA boundaries.

Based on the results above, a bad initial placement can lead to a
lot of unnecessary migrations before the optimal placement is
reached, and even that placement is unstable: in the
sched_migrate_task traces above we see a lot of ping-ponging between
the optimal and a nearly optimal distribution ({5}{3} -> {4}{4} -> {5}{3}).

Thus there is an opportunity to detect the situation when the current
NUMA group is full with respect to the task's affinity mask and spread
the remaining tasks to the CPUs of the other NUMA group.
If the task is not pinned, we fall back to the default behavior, as we
consider the minimum of:

min(number_of_allowed_cpus, calculated_imbalance_metric)
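
Put together, the check could look roughly like the following (hedged
sketch; "imb_threshold" is a placeholder for whichever imbalance metric
is in use, 25% of the span weight or the per-LLC count from Mel's
series, and "allowed_in_local" is the count of allowed CPUs in the
local group from the sketch above):

    imb = min(allowed_in_local, imb_threshold);

    /*
     * For an unpinned task allowed_in_local equals the group weight,
     * so min() picks the existing threshold and behaviour is unchanged.
     */
    if (local_sgs.sum_nr_running < imb)
            return NULL;    /* keep the task in the local group */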

Mel suggested reusing the select_idle_mask, which is now done in
my V2 (https://lore.kernel.org/lkml/20220209100534.12813-1-kprateek.nayak@xxxxxxx/).
The V2 changes are rebased on top of V6 of Mel's patchset and
contain the latest numbers for this fix.
--
Thanks and Regards,
Prateek