Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Aaron Lu
Date: Tue May 16 2023 - 03:50:35 EST


On Thu, May 04, 2023 at 06:27:46PM +0800, Aaron Lu wrote:
> Based on my current understanding, the summary is:
> - Running this workload with nr_thread=224 on SPR, the ingress queue
> will overflow and that will slow things down. This patch helps
> performance mainly because it transforms the "many cpus accessing the
> same cacheline" scenario to "many cpus accessing two cachelines" and
> that can reduce the likelihood of ingress queue overflow and thus,
> helps performance;
> - On Icelake with high nr_threads but not too high that would cause
> 100% cpu utilization, the two functions' cost will drop a little but
> performance did not improve (it actually regressed a little);
> - On SPR when there is no ingress queue overflow, it's similar to
> Icelake: the two functions' cost will drop but performance did not
> improve.
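
To illustrate the idea in the quoted summary, here is a minimal
user-space sketch of the per-node scheme (not the actual kernel patch;
NR_NODES, struct tg_load and the helpers below are made-up names):
writers on different nodes update different cachelines, and a reader
sums the per-node counters to get the group-wide value. With two nodes,
the one heavily contended line becomes two lines, which is the "two
cachelines" scenario described above.

#include <stdatomic.h>
#include <stdio.h>

#define NR_NODES	2	/* assumed 2-node system */
#define CACHELINE	64

/* one counter per node, each padded out to its own cacheline */
struct node_counter {
	_Alignas(CACHELINE) atomic_long val;
};

struct tg_load {
	struct node_counter node[NR_NODES];
};

/* writers only touch the counter of their own node */
static void tg_add_load(struct tg_load *tg, int node, long delta)
{
	atomic_fetch_add_explicit(&tg->node[node].val, delta,
				  memory_order_relaxed);
}

/* readers pay for summing NR_NODES cachelines instead of reading one */
static long tg_read_load(struct tg_load *tg)
{
	long sum = 0;

	for (int n = 0; n < NR_NODES; n++)
		sum += atomic_load_explicit(&tg->node[n].val,
					    memory_order_relaxed);
	return sum;
}

int main(void)
{
	struct tg_load tg = { 0 };

	tg_add_load(&tg, 0, 100);	/* update from a CPU on node 0 */
	tg_add_load(&tg, 1, -40);	/* update from a CPU on node 1 */
	printf("group load = %ld\n", tg_read_load(&tg));
	return 0;
}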

More results from running hackbench and netperf on Sapphire Rapids, as
well as on 2-socket Icelake and 2-socket Cascade Lake.

The summary is:
- on SPR, hackbench time reduced ~8% and netperf (UDP_RR/nr_thread=100%)
  performance increased ~50%;
- on Icelake, performance regressed about 1%-2% for postgres_sysbench
  and hackbench; netperf shows no performance change;
- on Cascade Lake, netperf/UDP_RR/nr_thread=50% sees a performance drop
  of ~3%; the others show no performance change.

Together with the results kindly collected by Daniel, it looks like this
patch helps most on SPR, while on other machines it is either flat or
regresses 1%-3% for some workloads. With these results, I'm considering
an alternative solution to reduce the cost of accessing tg->load_avg.

There are two main reasons to access tg->load_avg. One is driven by
PELT decay, which has a fixed frequency and is not a concern; the other
is enqueue_entity()/dequeue_entity() triggered by task migration. The
number of migrations can be unbounded, so the accesses to tg->load_avg
caused by them can be huge; this frequent task migration is the real
problem for tg->load_avg. One thing I noticed is that, on task
migration, the load is carried from the old per-cpu cfs_rq to the new
per-cpu cfs_rq. While the cfs_rq's load_avg and tg_load_avg_contrib have
to change accordingly to reflect this so that its corresponding sched
entity can get a correct weight, the task group's load_avg should stay
unchanged. So instead of the src cfs_rq removing a delta from
tg->load_avg and the target cfs_rq then adding the same delta back, the
two updates to the tg's load_avg could be avoided altogether. With this
change, the updates to tg->load_avg would be greatly reduced, the
problem should be solved, and it is likely to be a win for most
machines/workloads. I'm not sure if I understand this correctly, though;
I'm going to pursue a solution based on this, so feel free to let me
know if you see anything wrong here, thanks.
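
To make the above concrete, here is a minimal user-space sketch of that
idea (assumptions only, not the kernel implementation; the struct and
function names are made up, while the real update path goes through
update_tg_load_avg()): the group-wide load_avg is the sum of the per-cpu
cfs_rq contributions, so a migration that only moves load between two
cfs_rqs of the same group leaves that sum unchanged, and the two updates
to the shared counter could be skipped.

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS	4

/* made-up names; the kernel's tg->load_avg is an atomic_long too */
struct group_sketch {
	atomic_long load_avg;	/* shared, contended group-wide sum */
	long contrib[NR_CPUS];	/* per-cpu cfs_rq contributions */
};

/* today: both the source and the target cfs_rq update the shared sum */
static void migrate_with_tg_update(struct group_sketch *g,
				   int src, int dst, long load)
{
	g->contrib[src] -= load;
	atomic_fetch_add_explicit(&g->load_avg, -load, memory_order_relaxed);
	g->contrib[dst] += load;
	atomic_fetch_add_explicit(&g->load_avg, load, memory_order_relaxed);
}

/* idea: the group-wide sum is unchanged, so skip both shared updates */
static void migrate_without_tg_update(struct group_sketch *g,
				      int src, int dst, long load)
{
	g->contrib[src] -= load;
	g->contrib[dst] += load;
	/* g->load_avg is untouched; it still equals the sum of contrib[] */
}

int main(void)
{
	struct group_sketch g = { .contrib = { 512, 512, 0, 0 } };

	atomic_store_explicit(&g.load_avg, 1024, memory_order_relaxed);

	migrate_with_tg_update(&g, 0, 2, 256);	   /* two atomic RMWs */
	migrate_without_tg_update(&g, 1, 3, 256);  /* no shared-counter traffic */
	printf("group load_avg = %ld\n",
	       atomic_load_explicit(&g.load_avg, memory_order_relaxed));
	return 0;
}

(The real code also applies a threshold before folding a delta into
tg->load_avg; the sketch only shows that the migration-driven pair of
updates cancels out.)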

Below are the test result details of the current patch.
=======================================================================
Details for SPR (2 sockets, 96 cores, 192 CPUs):
- postgres_sysbench score increased 6.5%;
- hackbench (threads, pipe) time reduced from 45s to 41s (less is better);
- netperf (UDP_RR, nr_thread=100%=nr_cpu) throughput increased from 10105
  to 15121.

postgres_sysbench:
nr_thread=192
             score           update_cfs_group%    update_load_avg%
6.2.0        92440±2.62%     8.11% - 13.48%       7.07% - 9.54%
this_patch   98425±0.62%     5.73% - 7.56%        4.47% - 5.96%
note: performance increased 6.5% and the two functions' cost also
dropped.

nr_thread=96 with cpu offlined to 128c (2 sockets/64cores)
             score           update_cfs_group%    update_load_avg%
6.2.0        75726±0.12%     3.56% - 4.49%        3.58% - 4.42%
this_patch   76736±0.17%     2.95% - 3.32%        2.80% - 3.29%
note: this test is mainly to see whether the performance increase is due
to ingress queue overflow or not, and the result suggests the performance
increase on SPR is mainly due to ingress queue overflow.

hackbench (threads, pipe, groups=10, fds=20, 400 tasks):
             time            update_cfs_group%    update_load_avg%
6.2.0        45.51±0.36%     12.68% - 20.22%      7.73% - 11.01%
this_patch   41.41±0.43%     7.73% - 13.15%       4.31% - 6.91%
note: there is a clear split in the profiles between node 0 and node 1 -
e.g. on v6.2.0, the cost of update_cfs_group() is about 13% on node 0 and
20% on node 1; on the patched kernel, it is about 8% on node 0 and 12% on
node 1; update_load_avg() behaves similarly.

netperf (UDP_RR, nr_thread=100%=192):
             throughput      update_cfs_group%    update_load_avg%
6.2.0        10105±2.91%     26.43% - 27.90%      17.51% - 18.31%
this_patch   15121±3.25%     25.12% - 26.50%      12.47% - 16.02%
note: performance increased a lot, although the two functions' cost didn't
drop much.

=======================================================================
Details for Icelake (2 sockets, 64 cores, 128 CPUs):
- postgres_sysbench:
  nr_thread=128 does not show any performance change;
  nr_thread=96 performance regressed 1.3% after the patch, though the two
  update functions' cost reduced a bit;
- hackbench (pipe/threads):
  no obvious performance change after the patch; the two update functions'
  cost reduced ~2%;
- netperf (UDP_RR/nr_thread=100%=nr_cpu):
  results are in the noise range and very unstable on the vanilla kernel;
  the two functions' cost dropped somewhat after the patch.

postgres_sysbench:
nr_thread=128
             score           update_cfs_group%    update_load_avg%
6.2.0        97418±0.17%     0.50% - 0.74%        0.69% - 0.93%
this_patch   97029±0.32%     0.68% - 0.89%        0.70% - 0.89%
note: score in noise

nr_thread=96
             score           update_cfs_group%    update_load_avg%
6.2.0        59183±0.21%     2.81% - 3.57%        3.48% - 3.76%
this_patch   58397±0.35%     2.70% - 3.01%        2.82% - 3.24%
note: score is 1.3% worse when patched.
The two update functions' cost dropped but that does not translate into
a performance increase.

hackbench (pipe, threads):
             time            update_cfs_group%    update_load_avg%
6.2.0        41.80±0.65      5.90% - 7.36%        4.37% - 5.28%
this_patch   40.48±1.85      3.36% - 4.34%        2.89% - 3.35%
note: the two update functions' cost dropped but that does not translate
into a performance increase.

netperf (UDP_RR, nr_thread=100%=128):
             throughput      update_cfs_group%    update_load_avg%
6.2.0        31146±26%       11% - 33%*           2.30% - 17.7%*
this_patch   24900±2%        14% - 18%            8.67% - 12.03%
note: performance in noise;
update_cfs_group()% on vanilla can show a big difference between the two
nodes, and also a big difference across runs;
update_load_avg()%: for some runs, one node shows a very low cost like
2.x% while the other node has 10+%. This is probably because one node's
CPU utilization is approaching 100% and that inhibits task migrations;
for other runs, both nodes have 10+%.


=======================================================================
Details for Cascade Lake (2 sockets, 48 cores, 96 CPUs):
- netperf (TCP_STREAM/UDP_RR):
  - Most tests have no performance change;
  - UDP_RR/nr_thread=50% sees a performance drop of about 3% on the
    patched kernel;
  - UDP_RR/nr_thread=100%: results are unstable for both kernels.
- hackbench (pipe/threads):
  - performance is in the noise range after the patch.

netperf/UDP_RR/nr_thread=100%=96
             Throughput      update_cfs_group%    update_load_avg%
v6.2.0       41593±8%        10.94%±20%           10.23%±27%
this_patch   38603±8         9.53%                8.66%
note: performance in noise range; profile-wise, the two functions' cost
becomes stable with the patch.

netperf/UDP_RR/nr_thread=50%=48
             Throughput      update_cfs_group%    update_load_avg%
v6.2.0       70489           0.59±8%              1.60
this_patch   68457 -2.9%     1.39                 1.62
note: performance dropped ~3%; update_cfs_group()'s cost rises with the
patch.

netperf/TCP_STREAM/nr_thread=100%=96
             Throughput      update_cfs_group%    update_load_avg%
v6.2.0       12011           0.57%                2.45%
this_patch   11743           1.44%                2.30%
note: performance in noise range.

netperf/TCP_STREAM/nr_thread=50%=48
             Throughput      update_cfs_group%    update_load_avg%
v6.2.0       16409±12%       0.20±2%              0.54±2%
this_patch   19295           0.47±4%              0.54±2%
note: results are unstable for v6.2.0; performance in noise range.

hackbench/threads/pipe:
             Throughput      update_cfs_group%    update_load_avg%
v6.2.0       306321±12%      2.80±58%             3.07±38%
this_patch   322967±10%      3.60±36%             3.56±30%