Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

From: Gautham R. Shenoy
Date: Tue Dec 21 2021 - 10:04:21 EST


On Mon, Dec 20, 2021 at 11:12:43AM +0000, Mel Gorman wrote:
> (sorry for the delay, was offline for a few days)
>
> On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > Hello Mel,
> >
> > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> >
> > [..SNIP..]
> >
> > > > On a 2 Socket Zen3:
> > > >
> > > > NPS=1
> > > > child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
> > > > top_p = NUMA, imb_span = 256.
> > > >
> > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> > > >
> > > > NPS=2
> > > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
> > > > top_p = NUMA, imb_span = 128.
> > > >
> > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > >
> > > > NPS=4:
> > > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
> > > > top_p = NUMA, imb_span = 128.
> > > >
> > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > >
> > > > Again, we will be more aggressively load balancing across the two
> > > > sockets in NPS=1 mode compared to NPS=2/4.
> > > >
> > >
> > > Yes, but I felt it was reasonable behaviour because we have to strike
> > > some sort of balance between allowing a NUMA imbalance up to a point
> > > to prevent communicating tasks being pulled apart and v3 broke that
> > > completely. There will always be a tradeoff between tasks that want to
> > > remain local to each other and others that prefer to spread as wide as
> > > possible as quickly as possible.
> >
> > I agree with this argument that we want to be conservative while
> > pulling tasks across NUMA domains. My point was that the threshold at
> > the NUMA domain that spans the 2 sockets is lower for NPS=1
> > (imb_numa_nr = 2) when compared to the threshold for the same NUMA
> > domain when NPS=2/4 (imb_numa_nr = 4).
> >
>
> Is that a problem though? On an Intel machine with sub-numa clustering,
> the distances are 11 and 21 for a "node" that is the split cache and the
> remote node respectively.

So, my question was: on an Intel machine with sub-NUMA clustering
enabled vs disabled, is the value of imb_numa_nr for the NUMA domain
which spans the remote nodes (distance=21) the same or different? And
if it is different, what is the rationale behind that? I am fully on
board with the idea that the imb_numa_nr values should be different
for the different NUMA levels.

Just in case I was not making myself clear earlier: on Zen3, the
NUMA-A sched-domain in the figures below has groups that each span a
socket in all the NPS configurations. However, only on NPS=1 do we
have sd->imb_numa_nr=2 for NUMA-A, while on NPS=2/4 the NUMA-A domain
has sd->imb_numa_nr=4. Thus, if we had 4 tasks sharing data, on
NPS=2/4 they would all reside on the same socket, while on NPS=1 we
would have 2 tasks on one socket and the other 2 would migrate to the
other socket.

That said, I have not been able to observe any significant difference
with a real-world workload like MongoDB run on NPS=1 with imb_numa_nr
set to 2 vs 4.
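(To make the arithmetic above concrete, here is a tiny user-space
sketch, not the kernel code, that just replays the v4.1 computation
quoted at the top of this mail for the socket-spanning NUMA-A domain.
The llc_weight/span/imb_span values are the ones from that quote.)

/*
 * Standalone illustration (plain C, not kernel code) of the v4.1
 * arithmetic for the NUMA domain spanning both sockets of a 2-socket
 * Zen3 (span_weight = 256, llc_weight = 16).
 */
#include <stdio.h>

int main(void)
{
        const unsigned int llc_weight = 16;   /* CPUs per MC (LLC) domain */
        const unsigned int numa_span  = 256;  /* socket-spanning NUMA-A   */
        /* LLC-parent span and imb_span for NPS=1, NPS=2, NPS=4 */
        const unsigned int parent_span[] = { 128, 64, 32 };
        const unsigned int imb_span[]    = { 256, 128, 128 };
        const unsigned int nps[]         = { 1, 2, 4 };

        for (int i = 0; i < 3; i++) {
                /* imb = max(2U, (llc_weight * llc_weight / span_weight) / 4) */
                unsigned int imb = (llc_weight * llc_weight / parent_span[i]) / 4;

                if (imb < 2)
                        imb = 2;

                /* sd->imb_numa_nr = imb * (sd->span_weight / imb_span) */
                printf("NPS=%u: NUMA-A imb_numa_nr = %u\n",
                       nps[i], imb * (numa_span / imb_span[i]));
        }
        return 0;       /* prints 2, 4 and 4 respectively */
}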



Zen3, NPS=1
------------------------------------------------------------------
| |
| NUMA-A : sd->imb_numa_nr = 2 |
| ------------------------ ------------------------ |
| |DIE | |DIE | |
| | | | | |
| | ------ ------ | | ------ ------ | |
| | |MC | |MC | | | |MC | |MC | | |
| | ------ ------ | | ------ ------ | |
| | ------ ------ | | ------ ------ | |
| | |MC | |MC | | | |MC | |MC | | |
| | ------ ------ | | ------ ------ | |
| | | | | |
| | ------ ------ | | ------ ------ | |
| | |MC | |MC | | | |MC | |MC | | |
| | ------ ------ | | ------ ------ | |
| | ------ ------ | | ------ ------ | |
| | |MC | |MC | | | |MC | |MC | | |
| | ------ ------ | | ------ ------ | |
| | | | | |
| ------------------------ ------------------------ |
| |
------------------------------------------------------------------



Zen3, NPS=2
------------------------------------------------------------------
| |
| NUMA-A: sd->imb_numa_nr = 4 |
| --------------------------- --------------------------- |
| |NUMA-B :sd->imb_numa_nr=2| |NUMA-B :sd->imb_numa_nr=2| |
| | ----------- ----------- | | ----------- ----------- | |
| | |NODE | |NODE | | | |NODE | |NODE | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | |MC | | | |MC | | | | | |MC | | | |MC | | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | |MC | | | |MC | | | | | |MC | | | |MC | | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | | | | | | | | | | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | |MC | | | |MC | | | | | |MC | | | |MC | | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | |MC | | | |MC | | | | | |MC | | | |MC | | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | ----------- ----------- | | ----------- ----------- | |
| | | | | |
| --------------------------- --------------------------- |
| |
------------------------------------------------------------------


Zen3, NPS=4
------------------------------------------------------------------
| |
| NUMA-A: sd->imb_numa_nr = 4 |
| --------------------------- --------------------------- |
| |NUMA-B :sd->imb_numa_nr=2| |NUMA-B :sd->imb_numa_nr=2| |
| | ----------- ----------- | | ----------- ----------- | |
| | |NODE | |NODE | | | |NODE | |NODE | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | |MC | | | |MC | | | | | |MC | | | |MC | | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | |MC | | | |MC | | | | | |MC | | | |MC | | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | ----------- ----------- | | ----------- ----------- | |
| | ----------- ----------- | | ----------- ----------- | |
| | |NODE | |NODE | | | |NODE | |NODE | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | |MC | | | |MC | | | | | |MC | | | |MC | | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | | |MC | | | |MC | | | | | |MC | | | |MC | | | |
| | | ------ | | ------ | | | | ------ | | ------ | | |
| | ----------- ----------- | | ----------- ----------- | |
| | | | | |
| --------------------------- --------------------------- |
| |
------------------------------------------------------------------




> > Irrespective of what NPS mode we are operating in, the NUMA distance
> > between the two sockets is 32 on Zen3 systems. Hence shouldn't the
> > thresholds be the same for that level of NUMA?
> >
>
> Maybe, but then it is not a property of the sched_domain and instead
> needs to account for distance when balancing between two nodes that may
> be varying distances from each other.
>
> > Would something like the following work ?
> >
> > if (sd->flags & SD_NUMA) {
> >
> > /* We are using the child as a proxy for the group. */
> > group_span = sd->child->span_weight;
> > sd_distance = /* NUMA distance at this sd level */
> >
>
> NUMA distance relative to what? On Zen, the distance to a remote node may
> be fixed but on topologies with multiple nodes that are not fully linked
> to every other node by one hop, the same is not true.

Fair enough. By "sd_distance" I was referring to the node_distance()
between the CPUs of any two groups in this NUMA domain. However, that
assumed the node_distance() between the CPUs of any two groups would
be the same, which is not the case on certain platforms. So this
wouldn't work.
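
(To illustrate what breaks: a made-up 4-node distance table, not taken
from any real machine, where two pairs of groups inside the same NUMA
domain sit at different distances from each other, so there is no
single "sd_distance" for that level.)

/*
 * Hypothetical SLIT-style distance table (values made up purely for
 * illustration). Nodes 0 and 1 are one hop apart while nodes 0 and 3
 * are two hops apart, yet both pairs can belong to the same NUMA
 * sched-domain, so a single per-domain distance does not exist.
 */
#include <stdio.h>

static const int distance[4][4] = {
        { 10, 21, 21, 32 },
        { 21, 10, 32, 21 },
        { 21, 32, 10, 21 },
        { 32, 21, 21, 10 },
};

static int node_distance(int a, int b)
{
        return distance[a][b];
}

int main(void)
{
        printf("node_distance(0, 1) = %d\n", node_distance(0, 1)); /* 21 */
        printf("node_distance(0, 3) = %d\n", node_distance(0, 3)); /* 32 */
        return 0;
}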



>
> > /* By default we set the threshold to 1/4th the sched-group span. */
> > imb_numa_shift = 2;
> >
> > /*
> > * We can be a little aggressive if the cost of migrating tasks
> > * across groups of this NUMA level is not high.
> > * Assuming
> > */
> >
> > if (sd_distance < REMOTE_DISTANCE)
> > imb_numa_shift++;
> >
>
> The distance would have to be accounted for elsewhere because here we
> are considering one node in isolation, not relative to other nodes.
>
> > /*
> > * Compute the number of LLCs in each group.
> > * More the LLCs, more aggressively we migrate across
> > * the groups at this NUMA sd.
> > */
> > nr_llcs = group_span/llc_size;
> >
> > sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift);
> > }
> >
>
> Same, any adjustment would have to happen during load balancing taking
> into account the relatively NUMA distances. I'm not necessarily opposed
> but it would be a separate patch.


Sure, we can look into this separately.


>
> > > > <SNIP>
> > > > If we retain the (2,4) thresholds from v4.1 but use them in
> > > > allow_numa_imbalance() as in v3 we get
> > > >
> > > > NPS=4
> > > > Test: mel-v4.2
> > > > Copy: 225860.12 (498.11%)
> > > > Scale: 227869.07 (572.58%)
> > > > Add: 278365.58 (624.93%)
> > > > Triad: 264315.44 (596.62%)
> > > >
> > >
> > > The potential problem with this is that it probably will work for
> > > netperf when it's a single communicating pair but may not work as well
> > > when there are multiple communicating pairs or a number of communicating
> > > tasks that exceed numa_imb_nr.
> >
> >
> > Yes that's true. I think what you are doing in v4 is the right thing.
> >
> > In case of stream in NPS=4, it just manages to hit the corner case for
> > this heuristic which results in a suboptimal behaviour. Description
> > follows:
> >
>
> To avoid the corner case, we'd need to explicitly favour spreading early
> and assume wakeup will pull communicating tasks together and NUMA
> balancing migrate the data after some time which looks like


Actually, I was able to root-cause the drop in stream performance on
NPS=4. I have already responded earlier in another thread:
https://lore.kernel.org/lkml/Ybzq%2FA+HS%2FGxGYha@xxxxxxxxxxxxxxxxxxxxxx/
Appending the patch here:


---
kernel/sched/fair.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec354bf88b0d..c1b2a422a877 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9191,13 +9191,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
return idlest;
#endif
/*
- * Otherwise, keep the task on this node to stay local
- * to its wakeup source if the number of running tasks
- * are below the allowed imbalance. If there is a real
- * need of migration, periodic load balance will take
- * care of it.
+ * Otherwise, keep the task on this node to
+ * stay local to its wakeup source if the
+ * number of running tasks (including the new
+ * one) are below the allowed imbalance. If
+ * there is a real need of migration, periodic
+ * load balance will take care of it.
*/
- if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
+ if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)
return NULL;
}

--

With this fix on top of your fix to compute the imb_numa_nr at the
relevant level (v4.1:
https://lore.kernel.org/lkml/20211213130131.GQ3366@xxxxxxxxxxxxxxxxxxx/),
the stream regression for NPS=4 is no longer there.
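
(For completeness, the boundary case the "+ 1" addresses, as I
understand it, with made-up numbers: imb_numa_nr = 2 and two tasks
already running on the local node when a third task wakes up.)

/*
 * Illustration only, with hypothetical numbers: how counting the
 * waking task changes the decision taken in find_idlest_group().
 */
#include <stdio.h>

int main(void)
{
        unsigned int imb_numa_nr    = 2; /* allowed NUMA imbalance       */
        unsigned int sum_nr_running = 2; /* tasks already on local node  */

        /* Old check: the waking task itself is not counted. */
        if (sum_nr_running <= imb_numa_nr)
                printf("old: stay local -> 3 tasks stacked on the node\n");

        /* New check: count the waking task as well. */
        if (sum_nr_running + 1 <= imb_numa_nr)
                printf("new: stay local\n");
        else
                printf("new: fall through and pick the idlest group\n");

        return 0;
}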


Test:    tip-core             v4                   v4.1                  v4.1-find_idlest_group_fix
Copy:    37762.14 (0.00%)     48748.60 (29.09%)    164765.83 (336.32%)   205963.99 (445.42%)
Scale:   33879.66 (0.00%)     48317.66 (42.61%)    159641.67 (371.20%)   218136.57 (543.85%)
Add:     38398.90 (0.00%)     54259.56 (41.30%)    184583.70 (380.70%)   257857.90 (571.52%)
Triad:   37942.38 (0.00%)     54503.74 (43.64%)    181250.80 (377.70%)   251410.28 (562.61%)

--
Thanks and Regards
gautham.