Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Aaron Lu
Date: Thu May 04 2023 - 06:28:09 EST


Hi Daniel,

Thanks a lot for collecting these data.

I had hoped to also share some data I collected on other machines after
seeing your last email, but trying to explain why only SPR showed a
benefit has slowed me down. I now have some findings on this; please see
below.

On Wed, May 03, 2023 at 03:41:25PM -0400, Daniel Jordan wrote:
> On Fri, Apr 21, 2023 at 11:05:59PM +0800, Aaron Lu wrote:
> > On Thu, Apr 20, 2023 at 04:52:01PM -0400, Daniel Jordan wrote:
> > > AMD EPYC 7J13 64-Core Processor (NPS1)
> > > 2 sockets * 64 cores * 2 threads = 256 CPUs
> > >
> > >                     update_load_avg profile%   update_cfs_group profile%
> > > affinity nr_threads   base   test   diff         base   test   diff
> > > unbound          96    0.7    0.6   -0.1          0.3    0.6    0.4
> > > unbound         128    0.8    0.7    0.0          0.3    0.7    0.4
> > > unbound         160    2.4    1.7   -0.7          1.2    2.3    1.1
> > > unbound         192    2.3    1.7   -0.6          0.9    2.4    1.5
> > > unbound         224    0.9    0.9    0.0          0.3    0.6    0.3
> > > unbound         256    0.4    0.4    0.0          0.1    0.2    0.1
> >
> > Is it possible to show per-node profile for the two functions? I wonder
> > how the per-node profile changes with and without this patch on Milan.
> > And for vanilla kernel, it would be good to know on which node the struct
> > task_group is allocated. I used below script to fetch this info:
> > kretfunc:sched_create_group
> > {
> >     $root = kaddr("root_task_group");
> >     if (args->parent == $root) {
> >         return;
> >     }
> >
> >     printf("cpu%d, node%d: tg=0x%lx, parent=%s\n", cpu, numaid,
> >            retval, str(args->parent->css.cgroup->kn->name));
> > }
>
> That's helpful, nid below comes from this. The node happened to be different
> between base and test kernels on both machines, so that's one less way the
> experiment is controlled but for the unbound case where tasks are presumably
> spread fairly evenly I'm not sure how much it matters, especially given that
> the per-node profile numbers are fairly close to each other.
>
>
> Data below, same parameters and times as the last mail.
>
> > BTW, is the score(transactions) of the workload stable? If so, how the
> > score change when the patch is applied?
>
> Transactions seem to be mostly stable but unfortunately regress overall on both
> machines.

Yeah, I noticed your result is pretty stable in that the stddev% is
mostly zero. Mine are not that stable though. And it looks like there
are some wins in the node1 case :)

> FWIW, t-test compares the two sets of ten iterations apiece. The higher the
> percentage, the higher the confidence that the difference is significant.
>
>
> AMD EPYC 7J13 64-Core Processor (NPS1)
> 2 sockets * 64 cores * 2 threads = 256 CPUs
>
> transactions per second
>
>                         diff                base                 test
>                    ----------------- ------------------ ------------------
>                      tps
> affinity nr_threads (%diff) (t-test)   tps    std% nid     tps    std% nid
> unbound          96  -0.8%    100%   128,450   0%   1    127,433   0%   0
> unbound         128  -1.0%    100%   138,471   0%   1    137,099   0%   0
> unbound         160  -1.2%    100%   136,829   0%   1    135,170   0%   0
> unbound         192   0.4%     95%   152,767   0%   1    153,336   0%   0
> unbound         224  -0.2%     81%   179,946   0%   1    179,620   0%   0
> unbound         256  -0.2%     71%   203,920   0%   1    203,583   0%   0
> node0            48   0.1%     46%    69,635   0%   0     69,719   0%   0
> node0            64  -0.1%     69%    75,213   0%   0     75,163   0%   0
> node0            80  -0.4%    100%    72,520   0%   0     72,217   0%   0
> node0            96  -0.2%     89%    81,345   0%   0     81,210   0%   0
> node0           112  -0.3%     98%    96,174   0%   0     95,855   0%   0
> node0           128  -0.7%    100%   111,813   0%   0    111,045   0%   0
> node1            48   0.3%     78%    69,985   1%   1     70,200   1%   1
> node1            64   0.6%    100%    75,770   0%   1     76,231   0%   1
> node1            80   0.3%    100%    73,329   0%   1     73,567   0%   1
> node1            96   0.4%     99%    82,222   0%   1     82,556   0%   1
> node1           112   0.1%     62%    96,573   0%   1     96,689   0%   1
> node1           128  -0.2%     69%   111,614   0%   1    111,435   0%   1
>
> update_load_avg profile%
>
>                        all_nodes         node0             node1
>                      ----------------  ----------------  ----------------
> affinity nr_threads  base  test  diff  base  test  diff  base  test  diff
> unbound          96   0.7   0.6  -0.1   0.7   0.6  -0.1   0.7   0.6  -0.1
> unbound         128   0.8   0.7  -0.1   0.8   0.7  -0.1   0.8   0.7  -0.1
> unbound         160   2.3   1.7  -0.7   2.5   1.7  -0.8   2.2   1.6  -0.5
> unbound         192   2.2   1.6  -0.6   2.5   1.8  -0.7   2.0   1.4  -0.6
> unbound         224   0.9   0.8  -0.1   1.1   0.7  -0.3   0.8   0.8   0.0
> unbound         256   0.4   0.4   0.0   0.4   0.4   0.0   0.4   0.4   0.0
> node0            48                     0.7   0.6  -0.1
> node0            64                     0.8   0.7  -0.2
> node0            80                     2.0   1.4  -0.7
> node0            96                     2.3   1.4  -0.9
> node0           112                     1.0   0.8  -0.2
> node0           128                     0.5   0.4   0.0
> node1            48                                       0.7   0.6  -0.1
> node1            64                                       0.8   0.6  -0.1
> node1            80                                       1.4   1.2  -0.2
> node1            96                                       1.5   1.3  -0.2
> node1           112                                       0.8   0.7  -0.1
> node1           128                                       0.4   0.4  -0.1
>
> update_cfs_group profile%
>
>                        all_nodes         node0             node1
>                      ----------------  ----------------  ----------------
> affinity nr_threads  base  test  diff  base  test  diff  base  test  diff
> unbound          96   0.3   0.6   0.3   0.3   0.6   0.3   0.3   0.6   0.3
> unbound         128   0.3   0.6   0.3   0.3   0.6   0.3   0.3   0.7   0.4
> unbound         160   1.1   2.5   1.4   1.3   2.2   0.9   0.9   2.8   1.9
> unbound         192   0.9   2.6   1.7   1.1   2.4   1.3   0.7   2.8   2.1
> unbound         224   0.3   0.8   0.5   0.4   0.6   0.3   0.2   0.9   0.6
> unbound         256   0.1   0.2   0.1   0.1   0.2   0.1   0.1   0.2   0.1
> node0            48                     0.4   0.6   0.2
> node0            64                     0.3   0.6   0.3
> node0            80                     0.7   0.6  -0.1
> node0            96                     0.6   0.6   0.0
> node0           112                     0.3   0.4   0.1
> node0           128                     0.1   0.2   0.1
> node1            48                                       0.3   0.6   0.3
> node1            64                                       0.3   0.6   0.3
> node1            80                                       0.3   0.6   0.3
> node1            96                                       0.3   0.6   0.3
> node1           112                                       0.2   0.3   0.2
> node1           128                                       0.1   0.2   0.1
>

update_load_avg()'s cost dropped while update_cfs_group()'s cost
increased. I think this is reasonable: the write side now only has to
deal with local data, while the read side has to iterate over the
per-node tg->load_avg on all nodes.

>
> Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
> 2 sockets * 32 cores * 2 thread = 128 CPUs
>
> transactions per second
>
>                         diff                base                 test
>                    ----------------- ------------------ ------------------
>                      tps
> affinity nr_threads (%diff) (t-test)   tps    std% nid     tps    std% nid
> unbound          48  -0.9%    100%    75,500   0%   1     74,834   0%   0
> unbound          64  -0.4%    100%    81,687   0%   1     81,368   0%   0
> unbound          80  -0.4%    100%    78,620   0%   1     78,281   0%   0
> unbound          96  -0.5%     74%    78,949   1%   1     78,580   1%   0
> unbound         112  -2.9%     87%    94,189   3%   1     91,458   5%   0
> unbound         128  -1.4%    100%   117,557   0%   1    115,921   0%   0
> node0            24  -0.7%    100%    38,601   0%   0     38,333   0%   0
> node0            32  -1.2%    100%    41,539   0%   0     41,038   0%   0
> node0            40  -1.6%    100%    42,325   0%   0     41,662   0%   0
> node0            48  -1.3%    100%    41,956   0%   0     41,404   0%   0
> node0            56  -1.3%    100%    42,115   0%   0     41,569   0%   0
> node0            64  -1.0%    100%    62,431   0%   0     61,784   0%   0
> node1            24   0.0%      1%    38,752   0%   1     38,752   0%   1
> node1            32   0.9%    100%    42,568   0%   1     42,943   0%   1
> node1            40  -0.2%     87%    43,452   0%   1     43,358   0%   1
> node1            48  -0.5%    100%    43,047   0%   1     42,831   0%   1
> node1            56  -0.5%    100%    43,464   0%   1     43,259   0%   1
> node1            64   0.5%    100%    64,111   0%   1     64,450   0%   1

This looks like mostly a loss for Icelake.

I also tested on the same Icelake 8358 and my results are not entirely
the same as yours:

nr_thread=128
              score(tps)     update_cfs_group%   update_load_avg%
6.2.0        97418±0.17%       0.50% - 0.74%       0.69% - 0.93%
this_patch   97029±0.32%       0.68% - 0.89%       0.70% - 0.89%

For the above nr_thread=128 unbound case, the score (tps) is within the
noise range, instead of the 1.4% loss in your run. Profile wise, the
write side's cost dropped slightly while the read side's cost increased
slightly. Overall, no big change for nr_thread=128 on this Icelake. I
think this is also expected: in the nr_thread=128 case there are very
few migrations on wakeup because cpu utilization is almost 100%, so this
patch shouldn't make an obvious difference.

nr_thread=96
              score(tps)     update_cfs_group%   update_load_avg%
6.2.0        59183±0.21%       2.81% - 3.57%       3.48% - 3.76%
this_patch   58397±0.35%       2.70% - 3.01%       2.82% - 3.24%

For this case, there are enough task migrations on wakeup for this patch
to make a difference: the tps dropped about 1.3%, worse than in your
run. Profile wise, both the write side and the read side dropped, but
these drops do not translate into performance gains. Judging from the
profile, this patch is doing something good; it's just that the tps
suggests otherwise.

On another 2S, 224 cpu Sapphire Rapids:

nr_thread=224
               score         update_cfs_group%   update_load_avg%
6.2.0        93504±4.79%      11.63% - 15.12%      7.00% - 10.31%
this_patch  103040±0.46%       7.08% -  9.08%      4.82% -  6.73%

The above is where this patch helps the most; both profile and score
show improvement. As for why only SPR shows a benefit, my finding is
that it has something to do with SPR's "Ingress Queue overflow": when
many cpus access the same cache line and that overflow happens, all the
accessing cpus have their memory operations slowed down. This is
described in section 3.11 of Intel's optimization reference manual.

To confirm the above explanation, I did the nr_thread=96 run on SPR. To
make sure task migration still happens in this case, some cpus were
offlined so that only 128 cpus were left. With fewer threads, the chance
of ingress queue overflow is much lower:

nr_thread=96 with cpus offlined to 128 left (32 cores/64 cpus on each socket)

               score         update_cfs_group%   update_load_avg%
6.2.0        74878±0.58%       3.47% - 4.90%       3.59% - 4.42%
this_patch   75671±0.24%       2.66% - 3.55%       2.71% - 3.44%

Profile wise, the two functions dropped to near Icelake level: still
higher than on Icelake, but much better than in the nr_thread=224 case.
Comparing the baseline with this patch, it follows what I saw on
Icelake: the two functions' cost dropped, but that did not translate
into a performance increase (the average is slightly higher, but within
the noise range).

Based on my current understanding, the summary is:
- Running this workload with nr_thread=224 on SPR, the ingress queue
  will overflow and that slows things down. This patch helps performance
  mainly because it transforms the "many cpus accessing the same
  cacheline" scenario into "many cpus accessing two cachelines", which
  reduces the likelihood of ingress queue overflow and thus helps
  performance;
- On Icelake with nr_threads high, but not so high as to cause 100% cpu
  utilization, the two functions' cost drops a little but performance
  does not improve (it actually regresses a little);
- On SPR when there is no ingress queue overflow, it's similar to
  Icelake: the two functions' cost drops but performance does not
  improve.