Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

From: Chen Yu
Date: Mon May 01 2023 - 11:53:44 EST


Hi Peter,
On 2023-05-01 at 15:48:27 +0200, Peter Zijlstra wrote:
> On Sat, Apr 29, 2023 at 07:16:56AM +0800, Chen Yu wrote:
> > netperf
> > =======
> > case load baseline(std%) compare%( std%)
> > TCP_RR 56-threads 1.00 ( 1.96) +15.23 ( 4.67)
> > TCP_RR 112-threads 1.00 ( 1.84) +88.83 ( 4.37)
> > TCP_RR 168-threads 1.00 ( 0.41) +475.45 ( 4.45)
> > TCP_RR 224-threads 1.00 ( 0.62) +806.85 ( 3.67)
> > TCP_RR 280-threads 1.00 ( 65.80) +162.66 ( 10.26)
> > TCP_RR 336-threads 1.00 ( 17.30) -0.19 ( 19.07)
> > TCP_RR 392-threads 1.00 ( 26.88) +3.38 ( 28.91)
> > TCP_RR 448-threads 1.00 ( 36.43) -0.26 ( 33.72)
> > UDP_RR 56-threads 1.00 ( 7.91) +3.77 ( 17.48)
> > UDP_RR 112-threads 1.00 ( 2.72) -15.02 ( 10.78)
> > UDP_RR 168-threads 1.00 ( 8.86) +131.77 ( 13.30)
> > UDP_RR 224-threads 1.00 ( 9.54) +178.73 ( 16.75)
> > UDP_RR 280-threads 1.00 ( 15.40) +189.69 ( 19.36)
> > UDP_RR 336-threads 1.00 ( 24.09) +0.54 ( 22.28)
> > UDP_RR 392-threads 1.00 ( 39.63) -3.90 ( 33.77)
> > UDP_RR 448-threads 1.00 ( 43.57) +1.57 ( 40.43)
> >
> > tbench
> > ======
> > case load baseline(std%) compare%( std%)
> > loopback 56-threads 1.00 ( 0.50) +10.78 ( 0.52)
> > loopback 112-threads 1.00 ( 0.19) +2.73 ( 0.08)
> > loopback 168-threads 1.00 ( 0.09) +173.72 ( 0.47)
> > loopback 224-threads 1.00 ( 0.20) -2.13 ( 0.42)
> > loopback 280-threads 1.00 ( 0.06) -0.77 ( 0.15)
> > loopback 336-threads 1.00 ( 0.14) -0.08 ( 0.08)
> > loopback 392-threads 1.00 ( 0.17) -0.27 ( 0.86)
> > loopback 448-threads 1.00 ( 0.37) +0.32 ( 0.02)
>
> So,... I've been poking around with this a bit today and I'm not seeing
> it. On my ancient IVB-EP (2*10*2) with the code as in
> queue/sched/core I get:
>
> netperf NO_WA_WEIGHT NO_SIS_CURRENT
> NO_WA_BIAS SIS_CURRENT
> -------------------------------------------------------------------
> TCP_SENDFILE-1 : Avg: 40495.7 41899.7 42001 40783.4
> TCP_SENDFILE-10 : Avg: 37218.6 37200.1 37065.1 36604.4
> TCP_SENDFILE-20 : Avg: 21495.1 21516.6 21004.4 21356.9
> TCP_SENDFILE-40 : Avg: 6947.24 7917.64 7079.93 7231.3
> TCP_SENDFILE-80 : Avg: 4081.91 3572.48 3582.98 3615.85
> TCP_STREAM-1 : Avg: 37078.1 34469.4 37134.5 35095.4
> TCP_STREAM-10 : Avg: 31532.1 31265.8 31260.7 31588.1
> TCP_STREAM-20 : Avg: 17848 17914.9 17996.6 17937.4
> TCP_STREAM-40 : Avg: 7844.3 7201.65 7710.4 7790.62
> TCP_STREAM-80 : Avg: 2518.38 2932.74 2601.51 2903.89
> TCP_RR-1 : Avg: 84347.1 81056.2 81167.8 83541.3
> TCP_RR-10 : Avg: 71539.1 72099.5 71123.2 69447.9
> TCP_RR-20 : Avg: 51053.3 50952.4 50905.4 52157.2
> TCP_RR-40 : Avg: 46370.9 46477.5 46289.2 46350.7
> TCP_RR-80 : Avg: 21515.2 22497.9 22024.4 22229.2
> UDP_RR-1 : Avg: 96933 100076 95997.2 96553.3
> UDP_RR-10 : Avg: 83937.3 83054.3 83878.5 78998.6
> UDP_RR-20 : Avg: 61974 61897.5 61838.8 62926
> UDP_RR-40 : Avg: 56708.6 57053.9 56456.1 57115.2
> UDP_RR-80 : Avg: 26950 27895.8 27635.2 27784.8
> UDP_STREAM-1 : Avg: 52808.3 55296.8 52808.2 51908.6
> UDP_STREAM-10 : Avg: 45810 42944.1 43115 43561.2
> UDP_STREAM-20 : Avg: 19212.7 17572.9 18798.7 20066
> UDP_STREAM-40 : Avg: 13105.1 13096.9 13070.5 13110.2
> UDP_STREAM-80 : Avg: 6372.57 6367.96 6248.86 6413.09
>
>
> tbench
>
> NO_WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT
>
> Throughput 626.57 MB/sec 2 clients 2 procs max_latency=0.095 ms
> Throughput 1316.08 MB/sec 5 clients 5 procs max_latency=0.106 ms
> Throughput 1905.19 MB/sec 10 clients 10 procs max_latency=0.161 ms
> Throughput 2428.05 MB/sec 20 clients 20 procs max_latency=0.284 ms
> Throughput 2323.16 MB/sec 40 clients 40 procs max_latency=0.381 ms
> Throughput 2229.93 MB/sec 80 clients 80 procs max_latency=0.873 ms
>
> WA_WEIGHT, NO_WA_BIAS, NO_SIS_CURRENT
>
> Throughput 575.04 MB/sec 2 clients 2 procs max_latency=0.093 ms
> Throughput 1285.37 MB/sec 5 clients 5 procs max_latency=0.122 ms
> Throughput 1916.10 MB/sec 10 clients 10 procs max_latency=0.150 ms
> Throughput 2422.54 MB/sec 20 clients 20 procs max_latency=0.292 ms
> Throughput 2361.57 MB/sec 40 clients 40 procs max_latency=0.448 ms
> Throughput 2479.70 MB/sec 80 clients 80 procs max_latency=1.249 ms
>
> WA_WEIGHT, WA_BIAS, NO_SIS_CURRENT (aka, mainline)
>
> Throughput 649.46 MB/sec 2 clients 2 procs max_latency=0.092 ms
> Throughput 1370.93 MB/sec 5 clients 5 procs max_latency=0.140 ms
> Throughput 1904.14 MB/sec 10 clients 10 procs max_latency=0.470 ms
> Throughput 2406.15 MB/sec 20 clients 20 procs max_latency=0.276 ms
> Throughput 2419.40 MB/sec 40 clients 40 procs max_latency=0.414 ms
> Throughput 2426.00 MB/sec 80 clients 80 procs max_latency=1.366 ms
>
> WA_WEIGHT, WA_BIAS, SIS_CURRENT (aka, with patches on)
>
> Throughput 646.55 MB/sec 2 clients 2 procs max_latency=0.104 ms
> Throughput 1361.06 MB/sec 5 clients 5 procs max_latency=0.100 ms
> Throughput 1889.82 MB/sec 10 clients 10 procs max_latency=0.154 ms
> Throughput 2406.57 MB/sec 20 clients 20 procs max_latency=3.667 ms
> Throughput 2318.00 MB/sec 40 clients 40 procs max_latency=0.390 ms
> Throughput 2384.85 MB/sec 80 clients 80 procs max_latency=1.371 ms
>
>
> So what's going on here? I don't see anything exciting happening at the
> 40 mark. At the same time, I can't seem to reproduce Mike's latency pile
> up either :/
>
Thank you very much for trying this patch. This patch was found to mainly
benefit systems with a large number of CPUs in 1 LLC. Previously I tested
it on Sapphire Rapids (2x56C/224T) and Ice Lake Server (2x32C/128T)[1],
and it showed benefit on both. The benefit seems to come from:
1. reducing the waker stacking among many CPUs within 1 LLC
2. reducing the C2C overhead within 1 LLC
As a comparison, Prateek has tested this patch on a Zen3 platform, which
has 16 threads per LLC, smaller than Sapphire Rapids and Ice Lake Server.
He did not observe much difference with this patch applied, only some
limited improvement on tbench and Spec. So far I have not received any
performance difference report from LKP on desktop test boxes. Let me
queue the full test on some desktops to confirm whether this change has
any impact on them.
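
To make points 1 and 2 above a bit more concrete, below is a minimal
userspace sketch of the kind of wakeup decision involved. It is only an
illustration of the heuristic (all structure names, fields and the
duration threshold are made up here), not the actual kernel patch:

/*
 * Toy userspace model of the SIS_CURRENT idea. All structure names,
 * fields and the duration threshold below are illustrative; this is
 * not the actual kernel implementation.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_task {
	const char *name;
	unsigned long long avg_dur_ns;	/* average run duration per activation */
};

struct toy_cpu {
	int id;
	int nr_running;			/* runnable tasks, including the waker */
};

/* Hypothetical cutoff: tasks running shorter than this count as "short". */
#define SHORT_TASK_NS	500000ULL	/* 0.5 ms */

static bool is_short_task(const struct toy_task *p)
{
	return p->avg_dur_ns < SHORT_TASK_NS;
}

/*
 * Wake the wakee on the waker's CPU only if both tasks are short and
 * nothing else is queued there, so the wakee will run almost at once
 * and both stay cache-hot, instead of bouncing data across the LLC.
 */
static bool wake_on_current(const struct toy_task *waker,
			    const struct toy_task *wakee,
			    const struct toy_cpu *this_cpu)
{
	if (this_cpu->nr_running > 1)
		return false;
	return is_short_task(waker) && is_short_task(wakee);
}

int main(void)
{
	struct toy_task waker = { "netserver", 200000 };	/* 0.20 ms */
	struct toy_task wakee = { "netperf",   150000 };	/* 0.15 ms */
	struct toy_cpu cpu0 = { 0, 1 };

	printf("wake %s on current CPU%d: %s\n", wakee.name, cpu0.id,
	       wake_on_current(&waker, &wakee, &cpu0) ? "yes" : "no");
	return 0;
}

As far as I understand the patch, the real check on the wakeup path works
on the task's tracked average duration and the current rq state; the
sketch only captures the shape of the decision.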

[1] https://lore.kernel.org/lkml/202211021600.ceb04ba9-yujie.liu@xxxxxxxxx/

thanks,
Chenyu


The original symptom I found was that there was quite some idle time (up
to 30%) when running the will-it-scale context switch test with as many
workers as there are online CPUs. Waking the task up locally reduces the
race condition and the C2C overhead within 1 LLC, which is more severe on
a system with a large number of CPUs in 1 LLC.
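
For reference, the pattern that exposed the idle time can be reproduced
with something as simple as pairs of tasks waking each other through
pipes, which is roughly the spirit of the will-it-scale context switch
test. The snippet below is only a minimal illustration of that pattern
(not the actual will-it-scale source); the symptom showed up when the
number of such workers matched the number of online CPUs:

/*
 * Minimal pipe ping-pong between two processes: each round trip is two
 * wakeups of a task that runs only briefly.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define ITERATIONS 1000000

int main(void)
{
	int p2c[2], c2p[2];	/* parent->child and child->parent pipes */
	char buf = 'x';
	pid_t pid;

	if (pipe(p2c) || pipe(c2p)) {
		perror("pipe");
		return 1;
	}

	pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}

	if (pid == 0) {
		/* Child: wait for a byte, answer, go back to sleep. */
		for (int i = 0; i < ITERATIONS; i++) {
			if (read(p2c[0], &buf, 1) != 1)
				break;
			if (write(c2p[1], &buf, 1) != 1)
				break;
		}
		_exit(0);
	}

	/* Parent: poke the child and wait for the reply. */
	for (int i = 0; i < ITERATIONS; i++) {
		if (write(p2c[1], &buf, 1) != 1)
			break;
		if (read(c2p[0], &buf, 1) != 1)
			break;
	}
	wait(NULL);
	printf("%d round trips completed\n", ITERATIONS);
	return 0;
}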