Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

From: Gautham R. Shenoy
Date: Mon Dec 06 2021 - 03:48:44 EST


Hello Peter, Mel,


On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> > + /* Calculate allowed NUMA imbalance */
> > + for_each_cpu(i, cpu_map) {
> > + int imb_numa_nr = 0;
> > +
> > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > + struct sched_domain *child = sd->child;
> > +
> > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > + (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > + int nr_groups;
> > +
> > + nr_groups = sd->span_weight / child->span_weight;
> > + imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > + (nr_groups * num_online_nodes()));
> > + }
> > +
> > + sd->imb_numa_nr = imb_numa_nr;
> > + }
>
> OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
> imb_numa_nr = 0, all domains above it will have the same value
> calculated here.
>
> So far so good I suppose :-)

Well, we will still end up with the same imb_numa_nr for NUMA domains
that have different distances!

>
> Then nr_groups is what it says on the tin; we could've equally well
> iterated sd->groups and gotten the same number, but this is simpler.
>
> Now, imb_numa_nr is where the magic happens, the way it's written
> doesn't help, but it's something like:
>
> (child->span_weight / 2) / (nr_groups * num_online_nodes())
>
> With a minimum value of 1. So the larger the system is, or the smaller
> the LLCs, the smaller this number gets, right?
>
> So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
> / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5.
>
> But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
> of how many nodes it has.

That's correct. On a Zen3 box with 2 sockets and 64 cores per socket,
we can configure it with either 1, 2 or 4 Nodes Per Socket (NPS). The
imb_numa_nr value for each of the NPS configurations is as follows:


NPS1 :
~~~~~~~~
SMT [span_wt=2]
--> MC [span_wt=16, LLC]
--> DIE[span_wt=128]
--> NUMA [span_wt=256, SD_NUMA]

sd->span = 128, child->span = 16, nr_groups = 8, num_online_nodes() = 2
imb_numa_nr = max(1, (16 >> 1)/(8*2)) = max(1, 0) = 1 (8/16 truncates to 0).



NPS2 :
~~~~~~~~
SMT [span_wt=2]
--> MC [span_wt=16,LLC]
--> NODE[span_wt=64]
--> NUMA [span_wt=128, SD_NUMA]
--> NUMA [span_wt=256, SD_NUMA]

sd->span = 64, child->span = 16, nr_groups = 4, num_online_nodes() = 4
imb_numa_nr = max(1, (16 >> 1)/(4*4)) = max(1, 0) = 1 (8/16 truncates to 0).


NPS4 :
~~~~~~~
SMT [span_wt=2]
--> MC [span_wt=16, LLC]
--> NODE [span_wt=32]
--> NUMA [span_wt=128, SD_NUMA]
--> NUMA [span_wt=256, SD_NUMA]

sd->span = 32, child->span = 16, nr_groups = 2, num_online_nodes() = 8
imb_numa_nr = max(1, (16 >> 1)/(2*8)) = max(1, 0) = 1 (8/16 truncates to 0).


While imb_numa_nr = 1 is good for the NUMA domain within a socket (the
lower NUMA domains in NPS2 and NPS4 modes), it appears to be a little
too aggressive for the NUMA domain spanning the two sockets. If we
have only a pair of communicating tasks in a socket, we will end up
spreading them across the two sockets with this patch.

>
> Now, I'm thinking this assumes (fairly reasonable) that the level above
> LLC is a node, but I don't think we need to assume this, while also not
> assuming the balance domain spans the whole machine (yay partitions!).
>
> for (top = sd; top->parent; top = top->parent)
> ;
>
> nr_llcs = top->span_weight / child->span_weight;
> imb_numa_nr = max(1, child->span_weight / nr_llcs);
>
> which for my ivb-ep gets me: 20 / (40 / 20) = 10
> and the Zen system will have: 4 / (huge number) = 1
>
> Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
> can also write the above as:
>
> (child->span_weight * child->span_weight) / top->span_weight;


Assuming that "child" here refers to the LLC domain, on Zen3 we would have
(a) child->span_weight = 16. (b) top->span_weight = 256.

So we get a^2/b = 1.

>
> Hmm?

Last week, I tried a modification on top of Mel's current patch where
we spread tasks between the LLCs of the groups within each NUMA domain
and compute the value of imb_numa_nr per NUMA domain. The idea is to set

sd->imb_numa_nr = max(1U,
(Number of LLCs in each sd group / Number of sd groups))

This won't work well for processors which have a single LLC in a
socket, since sd->imb_numa_nr will be 1, which is probably too
low. FWIW, with this heuristic, the imb_numa_nr values across the
different NPS configurations of a Zen3 server are as follows:

NPS1:
NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = max(1, 8/2) = 4.

NPS2:
1st NUMA domain: nr_llcs_per_group = 4. nr_groups = 2. imb_numa_nr = max(1, 4/2) = 2.
2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = max(1, 8/2) = 4.

NPS4:
1st NUMA domain: nr_llcs_per_group = 2. nr_groups = 4. imb_numa_nr = max(1, 2/4) = 1.
2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = max(1, 8/2) = 4.

Thus, at the highest NUMA level (socket), we don't spread across the
two sockets until there are 4 tasks within the socket. If there is
only a pair of communicating tasks in the socket, they will be left
alone within that socket. The STREAM numbers (average of 10 runs;
Triad shown below, with Copy, Scale and Add following the same trend)
are presented in the tables below. We do see some degradation for the
4-thread case in NPS2 and NPS4 modes with the aforementioned approach,
but there are gains as well for the 16- and 32-thread cases in NPS4
mode.

NPS1:

==========+===========+================+=================
| Nr | Mel v3 | tip/sched/core | Spread across|
| Stream | | | LLCs of NUMA |
| Threads | | | groups |
==========+===========+================+=================
| 4 | 111106.14 | 94849.77 | 111820.02 |
| | | (-14.63%) | (+0.64%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 8 | 175633.00 | 128268.22 | 189705.48 |
| | | (-26.97%) | (+8.01%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 16 | 252812.87 | 136745.98 | 255577.34 |
| | | (-45.91%) | (+1.09%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 32 | 248198.43 | 130120.30 | 253266.86 |
| | | (-47.57%) | (+2.04%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 64 | 244202.33 | 133773.03 | 249449.53 |
| | | (-45.22%) | (+2.15%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 128 | 248459.85 | 249450.61 | 250346.09 |
| | | (+0.40%) | (+0.76%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+

NPS2:
==========+===========+================+=================
| Nr | Mel v3 | tip/sched/core | Spread across|
| Stream | | | LLCs of NUMA |
| Threads | | | groups |
==========+===========+================+=================
| 4 | 110888.35 | 63067.26 | 104971.36 |
| | | (-43.12%) | (-5.34%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 8 | 174983.85 | 96226.39 | 177558.65 |
| | | (-45.01%) | (+1.47%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 16 | 252943.21 | 106474.30 | 260749.60 |
| | | (-57.90%) | (+3.09%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 32 | 248540.52 | 113864.09 | 254141.33 |
| | | (-54.19%) | (+2.25%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 64 | 248383.17 | 137101.85 | 255018.52 |
| | | (-44.80%) | (+2.67%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 128 | 250123.31 | 257031.29 | 254457.13 |
| | | (+2.76%) | (+1.73%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+

NPS4:
==========+===========+================+=================
| Nr | Mel v3 | tip/sched/core | Spread across|
| Stream | | | LLCs of NUMA |
| Threads | | | groups |
==========+===========+================+=================
| 4 | 108580.91 | 31746.06 | 97585.53 |
| | | (-70.76%) | (-10.12%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 8 | 150259.94 | 64841.89 | 154954.75 |
| | | (-56.84%) | (+3.12%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 16 | 234137.41 | 106780.26 | 261005.27 |
| | | (-54.39%) | (+11.48%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 32 | 241583.06 | 147572.50 | 257004.22 |
| | | (-38.91%) | (+6.38%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 64 | 248511.64 | 166183.06 | 259599.32 |
| | | (-33.12%) | (+4.46%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
| 128 | 252227.34 | 239270.85 | 259117.18 |
| | | (-5.13%) | (+2.73%) |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+

--
Thanks and Regards
gautham.