Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

From: Chen Yu
Date: Thu May 04 2023 - 07:08:11 EST


On 2023-05-02 at 13:54:08 +0200, Peter Zijlstra wrote:
> On Mon, May 01, 2023 at 11:52:47PM +0800, Chen Yu wrote:
>
> > > So,... I've been poking around with this a bit today and I'm not seeing
> > > it. On my ancient IVB-EP (2*10*2) with the code as in
> > > queue/sched/core I get:
> > >
> > > netperf                 NO_SIS_CURRENT   SIS_CURRENT          %
> > > ----------------------- ------------------------------------------
> > > TCP_SENDFILE-1 : Avg: 42001 40783.4 -2.89898
> > > TCP_SENDFILE-10 : Avg: 37065.1 36604.4 -1.24295
> > > TCP_SENDFILE-20 : Avg: 21004.4 21356.9 1.67822
> > > TCP_SENDFILE-40 : Avg: 7079.93 7231.3 2.13802
> > > TCP_SENDFILE-80 : Avg: 3582.98 3615.85 0.917393
>
> > > TCP_STREAM-1 : Avg: 37134.5 35095.4 -5.49112
> > > TCP_STREAM-10 : Avg: 31260.7 31588.1 1.04732
> > > TCP_STREAM-20 : Avg: 17996.6 17937.4 -0.328951
> > > TCP_STREAM-40 : Avg: 7710.4 7790.62 1.04041
> > > TCP_STREAM-80 : Avg: 2601.51 2903.89 11.6232
>
> > > TCP_RR-1 : Avg: 81167.8 83541.3 2.92419
> > > TCP_RR-10 : Avg: 71123.2 69447.9 -2.35549
> > > TCP_RR-20 : Avg: 50905.4 52157.2 2.45907
> > > TCP_RR-40 : Avg: 46289.2 46350.7 0.13286
> > > TCP_RR-80 : Avg: 22024.4 22229.2 0.929878
>
> > > UDP_RR-1 : Avg: 95997.2 96553.3 0.579288
> > > UDP_RR-10 : Avg: 83878.5 78998.6 -5.81782
> > > UDP_RR-20 : Avg: 61838.8 62926 1.75812
> > > UDP_RR-40 : Avg: 56456.1 57115.2 1.16746
> > > UDP_RR-80 : Avg: 27635.2 27784.8 0.541339
>
> > > UDP_STREAM-1 : Avg: 52808.2 51908.6 -1.70352
> > > UDP_STREAM-10 : Avg: 43115 43561.2 1.03491
> > > UDP_STREAM-20 : Avg: 18798.7 20066 6.74142
> > > UDP_STREAM-40 : Avg: 13070.5 13110.2 0.303737
> > > UDP_STREAM-80 : Avg: 6248.86 6413.09 2.62816
>
>
> > > tbench
>
> > > WA_WEIGHT, WA_BIAS, NO_SIS_CURRENT (aka, mainline)
> > >
> > > Throughput 649.46 MB/sec 2 clients 2 procs max_latency=0.092 ms
> > > Throughput 1370.93 MB/sec 5 clients 5 procs max_latency=0.140 ms
> > > Throughput 1904.14 MB/sec 10 clients 10 procs max_latency=0.470 ms
> > > Throughput 2406.15 MB/sec 20 clients 20 procs max_latency=0.276 ms
> > > Throughput 2419.40 MB/sec 40 clients 40 procs max_latency=0.414 ms
> > > Throughput 2426.00 MB/sec 80 clients 80 procs max_latency=1.366 ms
> > >
> > > WA_WEIGHT, WA_BIAS, SIS_CURRENT (aka, with patches on)
> > >
> > > Throughput 646.55 MB/sec 2 clients 2 procs max_latency=0.104 ms
> > > Throughput 1361.06 MB/sec 5 clients 5 procs max_latency=0.100 ms
> > > Throughput 1889.82 MB/sec 10 clients 10 procs max_latency=0.154 ms
> > > Throughput 2406.57 MB/sec 20 clients 20 procs max_latency=3.667 ms
> > > Throughput 2318.00 MB/sec 40 clients 40 procs max_latency=0.390 ms
> > > Throughput 2384.85 MB/sec 80 clients 80 procs max_latency=1.371 ms
> > >
> > >
> > > So what's going on here? I don't see anything exciting happening at the
> > > 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> > > up either :/
> > >
> > Thank you very much for trying this patch. This patch was found to mainly
> > benefit system with large number of CPUs in 1 LLC. Previously I tested
> > it on Sapphire Rapids(2x56C/224T) and Ice Lake Server(2x32C/128T)[1], it
> > seems to have benefit on them. The benefit seems to come from:
> > 1. reducing the waker stacking among many CPUs within 1 LLC
>
> I should be seeing that at 10 cores per LLC. And when we look at the
> tbench results (never the most stable -- let me run a few more of those)
> it looks like SIS_CURRENT is actually making that worse.
>
> That latency spike at 20 seems stable for me -- and 3ms is rather small,
> I've seen it up to 11ms (but typical in the 4-6 range). This does not
> happen with NO_SIS_CURRENT and is a fairly big point against these
> patches.
>
I tried to reproduce the issue on the same type of platform as yours, so I
launched tbench with nr_thread = 50% on an Ivy Bridge-EP, but I could not
reproduce it (one difference is that my default testing runs with perf record
enabled).

I launched netperf/tbench at 50%/75%/100%/125% load on some platforms with
a smaller number of CPUs, including:
Ivy Bridge-EP, nr_node: 2, nr_cpu: 48
Ivy Bridge, nr_node: 1, nr_cpu: 4
Coffee Lake, nr_node: 1, nr_cpu: 12
Comet Lake, nr_node: 1, nr_cpu: 20
Kaby Lake, nr_node: 1, nr_cpu: 8

All platforms are tested with the CPU frequency governor set to performance to
get stable results. Each test lasts for 60 seconds.

Per the test results, no obvious netperf/tbench throughput regression was
detected on these platforms (within 3%), and some platforms such as Comet Lake
show some improvement.

The tbench.max_latency results show both improvements and degradations, and
this value appears to be unstable with or without the patch applied.
I'm not sure how to interpret it (should we look at the tail latency, as
schbench does?), and the latency variance itself seems to be another issue
to look into.


netperf.Throughput_total_tps (higher is better):

Ivy Bridge-EP, nr_node: 2, nr_cpu: 48:
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 990828 -1.0% 980992
50%+UDP_RR: 1282489 +1.0% 1295717
75%+TCP_RR: 935827 +8.9% 1019470
75%+UDP_RR: 1164074 +11.6% 1298844
100%+TCP_RR: 1846962 -0.1% 1845311
100%+UDP_RR: 2557455 -2.3% 2497846
125%+TCP_RR: 1771652 -1.4% 1747653
125%+UDP_RR: 2415665 -1.1% 2388459

Ivy Bridge, nr_node: 1, nr_cpu: 4
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 52697 -1.2% 52088
50%+UDP_RR: 135397 -0.1% 135315
75%+TCP_RR: 135613 -0.6% 134777
75%+UDP_RR: 183439 -0.3% 182853
100%+TCP_RR: 183255 -1.3% 180859
100%+UDP_RR: 245836 -0.6% 244345
125%+TCP_RR: 174957 -2.1% 171258
125%+UDP_RR: 232509 -1.1% 229868


Coffee Lake, nr_node: 1, nr_cpu: 12
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 429718 -1.2% 424359
50%+UDP_RR: 536240 +0.1% 536646
75%+TCP_RR: 450310 -1.2% 444764
75%+UDP_RR: 538645 -1.0% 532995
100%+TCP_RR: 774423 -0.3% 771764
100%+UDP_RR: 971805 -0.3% 969223
125%+TCP_RR: 720546 +0.6% 724593
125%+UDP_RR: 911169 +0.2% 912576

Comet Lake, nr_node: 1, nr_cpu: 20
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+UDP_RR: 1174505 +4.6% 1228945
75%+TCP_RR: 833303 +20.2% 1001582
75%+UDP_RR: 1149171 +13.4% 1303623
100%+TCP_RR: 1928064 -0.5% 1917500
125%+TCP_RR: 74389 -0.1% 74304
125%+UDP_RR: 2564210 -1.1% 2535377


Kaby Lake, nr_node: 1, nr_cpu: 8
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%+TCP_RR: 303956 -1.7% 298749
50%+UDP_RR: 382059 -0.8% 379176
75%+TCP_RR: 368399 -1.5% 362742
75%+UDP_RR: 459285 -0.3% 458020
100%+TCP_RR: 544630 -1.1% 538901
100%+UDP_RR: 684498 -0.6% 680450
125%+TCP_RR: 514266 +0.0% 514367
125%+UDP_RR: 645970 +0.2% 647473



tbench.max_latency (lower is better):

Ivy Bridge-EP, nr_node: 2, nr_cpu: 48:
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 45.31 -26.3% 33.41
75%: 269.36 -87.5% 33.72
100%: 274.76 -66.6% 91.85
125%: 723.34 -49.1% 368.29

Ivy Bridge, nr_node: 1, nr_cpu: 4
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 10.04 -70.5% 2.96
75%: 10.12 +63.0% 16.49
100%: 73.97 +148.1% 183.55
125%: 138.31 -39.9% 83.09


Comet Lake, nr_node: 1, nr_cpu: 20
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 10.59 +24.5% 13.18
75%: 11.53 -0.5% 11.47
100%: 414.65 -13.9% 356.93
125%: 411.51 -81.9% 74.56

Coffee Lake, nr_node: 1, nr_cpu: 12
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 452.07 -99.5% 2.06
75%: 4.42 +81.2% 8.00
100%: 76.11 -44.7% 42.12
125%: 47.06 +280.6% 179.09


Kaby Lake, nr_node: 1, nr_cpu: 8
NO_SIS_CURRENT SIS_CURRENT
---------------- ---------------------------
50%: 10.52 +0.1% 10.53
75%: 12.95 +62.1% 20.99
100%: 25.63 +181.1% 72.05
125%: 94.05 -17.0% 78.06

> > 2. reducing the C2C overhead within 1 LLC
>
> This is due to how L3 became non-inclusive with Skylake? I can't see
> that because I don't have anything that recent :/
>
I checked with colleagues and it seems this is not related to the non-inclusive
L3, but rather to the number of CPUs: more CPUs make the distances across the
die longer, which adds to the latency.
> > So far I did not received performance difference from LKP on desktop
> > test boxes. Let me queue the full test on some desktops to confirm
> > if this change has any impact on them.
>
> Right, so I've updated my netperf results above to have a relative
> difference between NO_SIS_CURRENT and SIS_CURRENT and I see some losses
> at the low end. For servers that gets compensated at the high end, but
> desktops tend to not get there much.
>
>
Since a large number of CPUs is one major motivation to wake up the task
locally, may I have your opinion on whether it would be applicable to add
llc_size as a factor when deciding whether to wake up the wakee on the current
CPU? The smaller the llc_size, the harder it would be for the wakee to be woken
up on the current CPU, e.g. something along the lines of the sketch below.
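
An untested sketch, just to illustrate the idea; sd_llc_size is the existing
per-CPU LLC size, and the reference size of 64 is an arbitrary placeholder:

	static inline bool is_short_task(struct task_struct *p, int cpu)
	{
		unsigned long thresh = sysctl_sched_migration_cost;
		int llc_size = per_cpu(sd_llc_size, cpu);

		/* smaller LLC -> lower threshold -> less likely to pass */
		thresh = thresh * llc_size / 64; /* 64: arbitrary reference size */

		return sched_feat(SIS_CURRENT) && p->se.dur_avg &&
		       p->se.dur_avg < thresh;
	}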


thanks,
Chenyu