Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance

From: Chen Yu
Date: Wed Aug 30 2023 - 15:18:47 EST


Hi Shrikanth,

On 2023-08-25 at 13:18:56 +0530, Shrikanth Hegde wrote:
>
> On 7/27/23 8:03 PM, Chen Yu wrote:
>
> Hi Chen. It is a nice patch series in an effort to reduce the newidle cost.
> This gives the idea of reusing the calculations done in load_balance among
> the different idle types.
>

Thanks for taking a look at this patch set.

> It was interesting to see how this would work on Power Systems. The reason being we have
> a large core count and the LLC size is small, i.e. at small core level (llc_weight=4). This would
> mean quite frequent access to sd_share at different levels, which would reside on the first_cpu of
> the sched domain, which might result in more cache misses. But perf stats didn't show the same.
>

Do you mean that one large domain (the Die domain?) has many LLC sched domains as its children,
so accessing the large domain's sd_share field crosses different LLCs and the
latency is high? Yes, this could be a problem, and it depends on how fast the hardware
lets different LLCs snoop data from each other.
On the other hand, the periodic load balance is the writer of sd_share, and its
interval is based on the cpu_weight of that domain. So writes should be less frequent
on large domains, and most accesses to sd_share would be reads issued by newidle balance,
which are less costly.
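
To put rough numbers on the write side: assuming the sd_init() defaults,
where min_interval equals the domain weight and max_interval is twice that
(both in ms), the sketch below compares an LLC-sized domain with one spanning
all 768 CPUs. sd_interval_ms is a made-up helper, not kernel code, and it
ignores the busy factor and the global cap on the balance interval.

```sh
# Hypothetical helper, not kernel code: print the assumed default
# min/max periodic balance interval for a domain spanning $1 CPUs.
# Assumption: min_interval = weight, max_interval = 2 * weight (ms).
sd_interval_ms() {
    weight=$1
    echo "min=${weight}ms max=$((2 * weight))ms"
}

sd_interval_ms 4      # LLC-sized domain (llc_weight=4)
sd_interval_ms 768    # domain spanning the whole system
```

Under these assumptions the largest domain is rewritten roughly two orders of
magnitude less often than the LLC domain.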

> Another concern is the larger number of sched groups at DIE level, which might take a hit if
> the balancing takes longer for the system to stabilize.

Do you mean that if newidle balance does not pull tasks hard enough, the imbalance between groups
would last longer? Yes, Prateek has mentioned this point. ILB_UTIL has this problem, and I'll
think more about it. We want to find a way for newidle balance to scan less but still pull
tasks as hard as before.

>
> tl;dr
>
> Tested with micro-benchmarks on a system with 96 cores with SMT=8, a total of 768 CPUs. There is some amount

May I know the sched domain hierarchy of this platform?
grep . /sys/kernel/debug/sched/domains/cpu0/domain*/*
cat /proc/schedstat | grep cpu0 -A 4 (4 domains?)
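
If the grep output is hard to read, something like the loop below prints one
line per domain. This is only a sketch: dump_sched_domains is a hypothetical
helper name, it assumes debugfs is mounted at /sys/kernel/debug (usually
root only), and it assumes the name, min_interval and max_interval files are
present for each domain, as they are on recent kernels.

```sh
# Hypothetical helper: print one summary line per sched domain of a CPU.
# $1 is the per-CPU debugfs directory; the default assumes debugfs is
# mounted at /sys/kernel/debug.
dump_sched_domains() {
    base=${1:-/sys/kernel/debug/sched/domains/cpu0}
    for d in "$base"/domain*; do
        # Skip the unexpanded glob when nothing matches or is readable.
        [ -d "$d" ] || continue
        printf '%s name=%s min_interval=%s max_interval=%s\n' \
            "${d##*/}" \
            "$(cat "$d/name")" \
            "$(cat "$d/min_interval")" \
            "$(cat "$d/max_interval")"
    done
}

dump_sched_domains
```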

> of regression with hackbench and schbench. Haven't looked into why. Any pointers to check would be helpful.

May I know the commands used to run hackbench and schbench below? For example,
the number of fds, the message size and the loop count for hackbench, and
the number of message threads and worker threads for schbench. I assume
you are using the old schbench? The latest schbench tracks other metrics
besides tail latency.


> Did a test with a more real-world workload that we have, called daytrader. It is a DB workload which gives the total
> transactions done per second. That doesn't show any regression.
>
> It's true that all benchmarks will not be happy.
> Maybe in below cases, newidle may not be that costly. Do you have any specific benchmark to be tried?
>

Previously I tested schbench/hackbench/netperf/tbench/sqlite, and I'm also planning
to try an OLTP workload.

> -----------------------------------------------------------------------------------------------------
>                 6.5.rc4      6.5.rc4 + PATCH_V2      gain
> Daytrader:      55049        55378                   0.59%
>
> -----------------------------------------------------------------------------------------------------
> hackbench(50 iterations): 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>
>
> Process 10 groups : 0.19, 0.19(0.00)
> Process 20 groups : 0.23, 0.24(-4.35)
> Process 30 groups : 0.28, 0.30(-7.14)
> Process 40 groups : 0.38, 0.40(-5.26)
> Process 50 groups : 0.43, 0.45(-4.65)
> Process 60 groups : 0.51, 0.51(0.00)
> thread 10 Time : 0.21, 0.22(-4.76)
> thread 20 Time : 0.27, 0.32(-18.52)
> Process(Pipe) 10 Time : 0.17, 0.17(0.00)
> Process(Pipe) 20 Time : 0.23, 0.23(0.00)
> Process(Pipe) 30 Time : 0.28, 0.28(0.00)
> Process(Pipe) 40 Time : 0.33, 0.32(3.03)
> Process(Pipe) 50 Time : 0.38, 0.36(5.26)
> Process(Pipe) 60 Time : 0.40, 0.39(2.50)
> thread(Pipe) 10 Time : 0.14, 0.14(0.00)
> thread(Pipe) 20 Time : 0.20, 0.19(5.00)
>
> Observation: lower is better. Socket-based runs show quite a bit of regression,
> pipe shows slight improvement.
>
>
> -----------------------------------------------------------------------------------------------------
> Unixbench(10 iterations): 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>
> 1 X Execl Throughput : 4280.15, 4398.30(2.76)
> 4 X Execl Throughput : 8171.60, 8061.60(-1.35)
> 1 X Pipe-based Context Switching : 172455.50, 174586.60(1.24)
> 4 X Pipe-based Context Switching : 633708.35, 664659.85(4.88)
> 1 X Process Creation : 6891.20, 7056.85(2.40)
> 4 X Process Creation : 8826.20, 8996.25(1.93)
> 1 X Shell Scripts (1 concurrent) : 9272.05, 9456.10(1.98)
> 4 X Shell Scripts (1 concurrent) : 27919.60, 25319.75(-9.31)
> 1 X Shell Scripts (8 concurrent) : 4462.70, 4392.75(-1.57)
> 4 X Shell Scripts (8 concurrent) : 11852.30, 10820.70(-8.70)
>
> Observation: higher is better. Results are somewhat mixed.
>
>
> -----------------------------------------------------------------------------------------------------
> schbench(10 iterations) 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>
> 1 Threads
> 50.0th: 8.00, 7.00(12.50)
> 75.0th: 8.00, 7.60(5.00)
> 90.0th: 8.80, 8.00(9.09)
> 95.0th: 10.20, 8.20(19.61)
> 99.0th: 13.60, 11.00(19.12)
> 99.5th: 14.00, 12.80(8.57)
> 99.9th: 15.80, 35.00(-121.52)
> 2 Threads
> 50.0th: 8.40, 8.20(2.38)
> 75.0th: 9.00, 8.60(4.44)
> 90.0th: 10.20, 9.60(5.88)
> 95.0th: 11.20, 10.20(8.93)
> 99.0th: 14.40, 11.40(20.83)
> 99.5th: 14.80, 12.80(13.51)
> 99.9th: 17.60, 14.80(15.91)
> 4 Threads
> 50.0th: 10.60, 10.40(1.89)
> 75.0th: 12.20, 11.60(4.92)
> 90.0th: 13.60, 12.60(7.35)
> 95.0th: 14.40, 13.00(9.72)
> 99.0th: 16.40, 15.60(4.88)
> 99.5th: 16.80, 16.60(1.19)
> 99.9th: 22.00, 29.00(-31.82)
> 8 Threads
> 50.0th: 12.00, 11.80(1.67)
> 75.0th: 14.40, 14.40(0.00)
> 90.0th: 17.00, 18.00(-5.88)
> 95.0th: 19.20, 19.80(-3.13)
> 99.0th: 23.00, 24.20(-5.22)
> 99.5th: 26.80, 29.20(-8.96)
> 99.9th: 68.00, 97.20(-42.94)
> 16 Threads
> 50.0th: 18.00, 18.20(-1.11)
> 75.0th: 23.20, 23.60(-1.72)
> 90.0th: 28.00, 27.40(2.14)
> 95.0th: 31.20, 30.40(2.56)
> 99.0th: 38.60, 38.20(1.04)
> 99.5th: 50.60, 50.40(0.40)
> 99.9th: 122.80, 108.00(12.05)
> 32 Threads
> 50.0th: 30.00, 30.20(-0.67)
> 75.0th: 42.20, 42.60(-0.95)
> 90.0th: 52.60, 55.40(-5.32)
> 95.0th: 58.60, 63.00(-7.51)
> 99.0th: 69.60, 78.20(-12.36)
> 99.5th: 79.20, 103.80(-31.06)
> 99.9th: 171.80, 209.60(-22.00)
>
> Observation: lower is better. tail latencies seem to go up. schbench also has run to run variations.
>
> -----------------------------------------------------------------------------------------------------
> stress-ng(20 iterations) 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
> ( 100000 cpu-ops)
>
> --cpu=768 Time : 1.58, 1.53(3.16)
> --cpu=384 Time : 1.66, 1.63(1.81)
> --cpu=192 Time : 2.67, 2.77(-3.75)
> --cpu=96 Time : 3.70, 3.69(0.27)
> --cpu=48 Time : 5.73, 5.69(0.70)
> --cpu=24 Time : 7.27, 7.26(0.14)
> --cpu=12 Time : 14.25, 14.24(0.07)
> --cpu=6 Time : 28.42, 28.40(0.07)
> --cpu=3 Time : 56.81, 56.68(0.23)
> --cpu=768 -util=10 Time : 3.69, 3.70(-0.27)
> --cpu=768 -util=20 Time : 5.67, 5.70(-0.53)
> --cpu=768 -util=30 Time : 7.08, 7.12(-0.56)
> --cpu=768 -util=40 Time : 8.23, 8.27(-0.49)
> --cpu=768 -util=50 Time : 9.22, 9.26(-0.43)
> --cpu=768 -util=60 Time : 10.09, 10.15(-0.59)
> --cpu=768 -util=70 Time : 10.93, 10.98(-0.46)
> --cpu=768 -util=80 Time : 11.79, 11.79(0.00)
> --cpu=768 -util=90 Time : 12.63, 12.60(0.24)
>
>
> Observation: lower is better. Almost no difference.
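
So that my numbers are comparable with yours, I am assuming the gain% columns
above are computed relative to the 6.5.rc4 baseline, with the sign flipped for
lower-is-better results. A minimal sketch of that assumption (both function
names are hypothetical):

```sh
# Hypothetical helpers reproducing the gain% columns quoted above.
# Time and latency results: lower is better, so a drop is a positive gain.
gain_lower_is_better() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.2f\n", (base - new) / base * 100 }'
}

# Throughput results: higher is better, so a rise is a positive gain.
gain_higher_is_better() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.2f\n", (new - base) / base * 100 }'
}

gain_lower_is_better 0.27 0.32          # hackbench, thread 20
gain_lower_is_better 15.80 35.00        # schbench, 1 thread, 99.9th
gain_higher_is_better 4280.15 4398.30   # Unixbench, 1 X Execl Throughput
```

These three calls print -18.52, -121.52 and 2.76, which match the quoted
tables.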

I'll try to run the same hackbench/schbench tests on my machine to
see if I can find any clue to the regression.


thanks,
Chenyu