Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods

From: K Prateek Nayak
Date: Mon Jul 10 2023 - 23:04:45 EST

Hello Tejun,

On 7/6/2023 12:09 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jul 05, 2023 at 12:34:48PM +0530, K Prateek Nayak wrote:
>> - Apart from tbench and netperf, the rest of the benchmarks show no
>> difference out of the box.
>
> Just looking at the data, it's a bit difficult for me to judge. I suppose
> most of the differences are due to run-to-run variance? It'd be really
> useful if the data contained standard deviation (whether historical or
> directly from multiple runs).

I'll make sure to include this from now on.
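
For reference, a minimal sketch of how per-run results could be summarized so
that each table entry carries its spread. The sample values are placeholders,
not measured data, and `summarize` is a hypothetical helper name:

```python
import statistics

def summarize(runs):
    """Return (mean, sample stdev, coefficient of variation in percent)."""
    mean = statistics.mean(runs)
    # The sample standard deviation needs at least two data points.
    stdev = statistics.stdev(runs) if len(runs) > 1 else 0.0
    cv = 100.0 * stdev / mean if mean else 0.0
    return mean, stdev, cv

# Placeholder throughput numbers for illustration only (not from any run).
mean, stdev, cv = summarize([98.0, 100.0, 102.0])
print(f"{mean:.2f} (stdev {stdev:.2f}, CV {cv:.2f} pct)")
```

Reporting the coefficient of variation next to each mean makes it easier to
tell whether a given pct delta is larger than the noise of the benchmark.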

>
>> - SPECjbb2015 Multi-JVM sees a small uplift to max-jOPS with certain
>> affinity scopes.
>>
>> - tbench and netperf seem to be unhappy throughout. None of the affinity
>> scopes seem to bring back the performance. I'll dig more into this.
>
> Yeah, that seems pretty consistent.
>
>> ~~~~~~~~~~
>> ~ stream ~
>> ~~~~~~~~~~
>>
>> o NPS1
>>
>> - 10 Runs:
>>
>> Test: base affinity_scopes
>> Copy: 245676.59 (0.00 pct) 333646.71 (35.80 pct)
>> Scale: 206545.41 (0.00 pct) 205706.04 (-0.40 pct)
>> Add: 213506.82 (0.00 pct) 236739.07 (10.88 pct)
>> Triad: 217679.43 (0.00 pct) 249263.46 (14.50 pct)
>>
>> - 100 Runs:
>>
>> Test: base affinity_scopes
>> Copy: 318060.91 (0.00 pct) 326025.89 (2.50 pct)
>> Scale: 213943.40 (0.00 pct) 207647.37 (-2.94 pct)
>> Add: 237892.53 (0.00 pct) 232164.59 (-2.40 pct)
>> Triad: 245672.84 (0.00 pct) 246333.21 (0.26 pct)
>>
>> o NPS2
>>
>> - 10 Runs:
>>
>> Test: base affinity_scopes
>> Copy: 296632.20 (0.00 pct) 291153.63 (-1.84 pct)
>> Scale: 206193.90 (0.00 pct) 216368.42 (4.93 pct)
>> Add: 240363.50 (0.00 pct) 245954.23 (2.32 pct)
>> Triad: 242748.60 (0.00 pct) 238606.20 (-1.70 pct)
>>
>> - 100 Runs:
>>
>> Test: base affinity_scopes
>> Copy: 322535.79 (0.00 pct) 315020.03 (-2.33 pct)
>> Scale: 217723.56 (0.00 pct) 220172.32 (1.12 pct)
>> Add: 248117.72 (0.00 pct) 250557.17 (0.98 pct)
>> Triad: 257768.66 (0.00 pct) 248264.00 (-3.68 pct)
>>
>> o NPS4
>>
>> - 10 Runs:
>>
>> Test: base affinity_scopes
>> Copy: 274067.54 (0.00 pct) 302804.77 (10.48 pct)
>> Scale: 224944.53 (0.00 pct) 230112.39 (2.29 pct)
>> Add: 229318.09 (0.00 pct) 241939.54 (5.50 pct)
>> Triad: 230175.89 (0.00 pct) 253613.85 (10.18 pct)
>>
>> - 100 Runs:
>>
>> Test: base affinity_scopes
>> Copy: 338922.96 (0.00 pct) 348183.65 (2.73 pct)
>> Scale: 240262.45 (0.00 pct) 245939.67 (2.36 pct)
>> Add: 256968.24 (0.00 pct) 260657.01 (1.43 pct)
>> Triad: 262644.16 (0.00 pct) 262286.46 (-0.13 pct)
>
> The differences seem more consistent and pronounced for this benchmark too.
> Is this just expected variance for this benchmark?

Stream's changes are mostly due to run-to-run variance.
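
One way to see this from the tables above: the base kernel's own NPS1 Copy
result moves by roughly 29 pct between the 10-run and the 100-run summaries,
which is in the same range as the largest reported gain. A small sketch of
that arithmetic, assuming the usual (new - base) / base definition for the
pct columns (the tables do not state the exact formula or rounding):

```python
def pct_change(base, new):
    """Relative change of new versus base, in percent."""
    return 100.0 * (new - base) / base

# NPS1 Copy numbers taken from the stream tables quoted above.
base_10_runs = 245676.59
base_100_runs = 318060.91
scopes_10_runs = 333646.71

print(f"base, 10 vs 100 runs:    {pct_change(base_10_runs, base_100_runs):.2f} pct")
print(f"base vs scopes, 10 runs: {pct_change(base_10_runs, scopes_10_runs):.2f} pct")
```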

>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> ~ Benchmarks run with multiple affinity scopes ~
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> - tbench
>>
>> Clients:  base                   cpu                    cache                  numa                   system
>> 1           450.40   (0.00 pct)    459.44   (2.00 pct)    457.12   (1.49 pct)    456.36   (1.32 pct)    456.75   (1.40 pct)
>> 2           872.50   (0.00 pct)    869.68  (-0.32 pct)    890.59   (2.07 pct)    878.87   (0.73 pct)    890.14   (2.02 pct)
>> 4          1630.13   (0.00 pct)   1621.24  (-0.54 pct)   1634.74   (0.28 pct)   1628.62  (-0.09 pct)   1646.57   (1.00 pct)
>> 8          3139.90   (0.00 pct)   3044.58  (-3.03 pct)   3099.49  (-1.28 pct)   3081.43  (-1.86 pct)   3151.16   (0.35 pct)
>> 16         6113.51   (0.00 pct)   5555.17  (-9.13 pct)   5465.09 (-10.60 pct)   5661.31  (-7.39 pct)   5742.58  (-6.06 pct)
>> 32        11024.64   (0.00 pct)   9574.62 (-13.15 pct)   9282.62 (-15.80 pct)   9542.00 (-13.44 pct)   9916.66 (-10.05 pct)
>> 64        19081.96   (0.00 pct)  15656.53 (-17.95 pct)  15176.12 (-20.46 pct)  16527.77 (-13.38 pct)  15097.97 (-20.87 pct)
>> 128       30956.07   (0.00 pct)  28277.80  (-8.65 pct)  27662.76 (-10.63 pct)  27817.94 (-10.13 pct)  28925.78  (-6.55 pct)
>> 256       42829.46   (0.00 pct)  38646.48  (-9.76 pct)  38355.27 (-10.44 pct)  37073.24 (-13.43 pct)  34391.01 (-19.70 pct)
>> 512       42395.69   (0.00 pct)  36931.34 (-12.88 pct)  39259.49  (-7.39 pct)  36571.62 (-13.73 pct)  36245.55 (-14.50 pct)
>> 1024      41973.51   (0.00 pct)  38817.07  (-7.52 pct)  38733.15  (-7.72 pct)  38864.45  (-7.40 pct)  35728.70 (-14.87 pct)
>>
>> - netperf
>>
>>               base                    cpu                     cache                   numa                    system
>> 1-clients:    100910.82   (0.00 pct)  103440.72   (2.50 pct)  102592.36   (1.66 pct)  103199.49   (2.26 pct)  103561.90   (2.62 pct)
>> 2-clients:     99777.76   (0.00 pct)  100414.00   (0.63 pct)  100305.89   (0.52 pct)   99890.90   (0.11 pct)  101512.46   (1.73 pct)
>> 4-clients:     97676.17   (0.00 pct)   96624.28  (-1.07 pct)   95966.77  (-1.75 pct)   97105.22  (-0.58 pct)   97972.11   (0.30 pct)
>> 8-clients:     95413.11   (0.00 pct)   89926.72  (-5.75 pct)   89977.14  (-5.69 pct)   91020.10  (-4.60 pct)   92458.94  (-3.09 pct)
>> 16-clients:    88961.66   (0.00 pct)   81295.02  (-8.61 pct)   79144.83 (-11.03 pct)   80216.42  (-9.83 pct)   85439.68  (-3.95 pct)
>> 32-clients:    82199.83   (0.00 pct)   77914.00  (-5.21 pct)   75055.66  (-8.69 pct)   76813.94  (-6.55 pct)   80768.87  (-1.74 pct)
>> 64-clients:    66094.87   (0.00 pct)   64419.91  (-2.53 pct)   63718.37  (-3.59 pct)   60370.40  (-8.66 pct)   66179.58   (0.12 pct)
>> 128-clients:   43833.63   (0.00 pct)   42936.08  (-2.04 pct)   44554.69   (1.64 pct)   42666.82  (-2.66 pct)   45543.69   (3.90 pct)
>> 256-clients:   38917.58   (0.00 pct)   24807.28 (-36.25 pct)   20517.01 (-47.28 pct)   21651.40 (-44.36 pct)   23778.87 (-38.89 pct)
>>
>> - SPECjbb2015 Multi-JVM
>>
>>            max-jOPS    critical-jOPS
>> base:         0.00%            0.00%
>> smt:         -1.11%           -1.84%
>> cpu:          2.86%           -1.35%
>> cache:        2.86%           -1.66%
>> numa:         1.43%           -1.49%
>> system:       0.08%           -0.41%
>>
>>
>> I'll go dig deeper into the tbench and netperf regressions. I'm not sure
>> why the regression is observed for all the affinity scopes. I'll look
>> into the IBS profile and see if something obvious pops up. Meanwhile, if
>> there is any specific data you would like me to collect or any benchmark
>> you would like me to test, let me know.
>
> Yeah, that's a bit surprising given that in terms of affinity behavior
> "numa" should be identical to base. The only meaningful differences that I
> can think of are when the work item is assigned to its worker and maybe how
> the pwq max_active limit is applied. Hmm... can you monitor the number of
> kworker kthreads while running the benchmark? No need to do the whole
> matrix, just comparing base against numa should be enough.

Sure. I'll get back to you with the data soon.
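
A rough sketch of how the kworker count could be sampled during a run,
assuming a Linux /proc layout where each numeric directory has a `comm` file.
The helper names and the sampling interval are illustrative choices, not
something taken from the patch series; unbound workers are identified by the
`kworker/u` name prefix:

```python
import os
import time

def count_kworkers(proc_root="/proc"):
    """Return (total kworkers, unbound kworkers) visible under proc_root."""
    total = unbound = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # the task exited between listdir() and open()
        if comm.startswith("kworker/"):
            total += 1
            if comm.startswith("kworker/u"):
                unbound += 1
    return total, unbound

def sample(samples, interval_s=1.0, proc_root="/proc"):
    """Collect `samples` readings, sleeping interval_s seconds between them."""
    readings = []
    for i in range(samples):
        readings.append(count_kworkers(proc_root))
        if i + 1 < samples:
            time.sleep(interval_s)
    return readings
```

Calling `sample(60)` alongside the benchmark on both kernels and comparing the
peak unbound count should be enough for the base versus numa comparison.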

>
> Thanks.
>

--
Thanks and Regards,
Prateek
