[PATCH 0/4] powerpc/smp: Shared processor sched optimizations

From: Srikar Dronamraju
Date: Wed Aug 30 2023 - 15:14:33 EST


PowerVM systems configured in shared processor mode have some unique
challenges. Some device-tree properties are missing on a shared
processor, so some sched domains may not make sense for shared processor
systems.

Most shared processor systems are over-provisioned. The underlying PowerVM
hypervisor schedules at Big Core granularity, and a Big Core on the most
recent Power processors consists of two almost independent cores. Under
light load, overall system performance improves if tasks are packed onto
fewer Big Cores.

System Configuration
type=Shared mode=Uncapped smt=8 lcpu=128 mem=1066732224 kB cpus=96 ent=40.00
This is the *40 Entitled cores / 128 Virtual processors* scenario.

lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 1024
On-line CPU(s) list: 0-1023
Model name: POWER10 (architected), altivec supported
Model: 2.0 (pvr 0080 0200)
Thread(s) per core: 8
Core(s) per socket: 16
Socket(s): 8
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 8 MiB (256 instances)
L1i cache: 12 MiB (256 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-7,64-71,128-135,192-199,256-263,320-327,384-391,448-455,512-519,576-583,640-647,704-711,768-775,832-839,896-903,960-967
NUMA node1 CPU(s): 8-15,72-79,136-143,200-207,264-271,328-335,392-399,456-463,520-527,584-591,648-655,712-719,776-783,840-847,904-911,968-975
NUMA node2 CPU(s): 16-23,80-87,144-151,208-215,272-279,336-343,400-407,464-471,528-535,592-599,656-663,720-727,784-791,848-855,912-919,976-983
NUMA node3 CPU(s): 24-31,88-95,152-159,216-223,280-287,344-351,408-415,472-479,536-543,600-607,664-671,728-735,792-799,856-863,920-927,984-991
NUMA node4 CPU(s): 32-39,96-103,160-167,224-231,288-295,352-359,416-423,480-487,544-551,608-615,672-679,736-743,800-807,864-871,928-935,992-999
NUMA node5 CPU(s): 40-47,104-111,168-175,232-239,296-303,360-367,424-431,488-495,552-559,616-623,680-687,744-751,808-815,872-879,936-943,1000-1007
NUMA node6 CPU(s): 48-55,112-119,176-183,240-247,304-311,368-375,432-439,496-503,560-567,624-631,688-695,752-759,816-823,880-887,944-951,1008-1015
NUMA node7 CPU(s): 56-63,120-127,184-191,248-255,312-319,376-383,440-447,504-511,568-575,632-639,696-703,760-767,824-831,888-895,952-959,1016-1023

ebizzy -t 40 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel   N  Min       Max       Median    Avg        Stddev     %Change
v6.5     5  4664647   5148125   5130549   5043050.2  211756.06
+patch   5  4769453   5220808   5137476   5040333.8  193586.43  -0.0538642
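
The %Change column is consistent with the relative difference between the
+patch and v6.5 averages. A minimal sketch, assuming that definition (the
helper name pct_change is illustrative and not part of the series), using
the averages from the 128 Virtual processor ebizzy tables in this mail:

```python
def pct_change(baseline_avg, patched_avg):
    """Relative change of the patched average over the baseline, in percent."""
    return (patched_avg - baseline_avg) / baseline_avg * 100.0


# Averages copied from the 128-VP ebizzy tables: threads -> (v6.5, +patch).
runs = {
    40: (5043050.2, 5040333.8),
    80: (8967125.6, 9770081.8),
    160: (12682369.0, 17341585.0),
}

for threads, (base, patched) in runs.items():
    print(f"ebizzy -t {threads}: {pct_change(base, patched):+.4f}%")
```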

From lparstat (when the workload stabilized)
Kernel   %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw       phint
v6.5     6.23   0.00  0.00   93.77  40.06  100.15  6.23   55.92  138699651  100
+patch   6.26   0.01  0.00   93.73  21.15  52.87   6.27   74.78  71743299   148
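
For readers less familiar with lparstat: %entc in these tables matches
physc (physical cores consumed) expressed as a percentage of the entitled
capacity (ent=40.00). A small sketch under that reading (the function name
entc is an assumption made for illustration):

```python
ENTITLED_CORES = 40.0  # "ent=40.00" from the System Configuration line


def entc(physc, entitled=ENTITLED_CORES):
    """Physical cores consumed as a percentage of entitled capacity."""
    return physc / entitled * 100.0


# physc values copied from the "ebizzy -t 40" lparstat rows above.
for kernel, physc in (("v6.5", 40.06), ("+patch", 21.15)):
    print(f"{kernel}: physc={physc} -> %entc ~ {entc(physc):.2f}")
```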

ebizzy -t 80 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel   N  Min       Max       Median    Avg        Stddev     %Change
v6.5     5  8735907   9121401   8986218   8967125.6  152793.38
+patch   5  9636679   9990229   9765958   9770081.8  143913.29  8.95444

From lparstat (when the workload stabilized)
Kernel   %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw       phint
v6.5     12.40  0.01  0.00   87.60  71.05  177.62  12.40  24.61  98047012   85
+patch   12.47  0.02  0.00   87.50  41.06  102.65  12.50  54.90  77821678   158

ebizzy -t 160 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel   N  Min       Max       Median    Avg        Stddev     %Change
v6.5     5  12378356  12946633  12780732  12682369   266135.73
+patch   5  16756702  17676670  17406971  17341585   346054.89  36.7377

From lparstat (when the workload stabilized)
Kernel   %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw       phint
v6.5     24.56  0.09  0.15   75.19  77.42  193.55  24.65  17.94  135625276  98
+patch   24.78  0.03  0.00   75.19  78.33  195.83  24.81  17.17  107826112  215
-------------------------------------------------------------------------

System Configuration
type=Shared mode=Uncapped smt=8 lcpu=40 mem=1066732672 kB cpus=96 ent=40.00
This is the *40 Entitled cores / 40 Virtual processors* scenario.

lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 320
On-line CPU(s) list: 0-319
Model name: POWER10 (architected), altivec supported
Model: 2.0 (pvr 0080 0200)
Thread(s) per core: 8
Core(s) per socket: 10
Socket(s): 4
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 2.5 MiB (80 instances)
L1i cache: 3.8 MiB (80 instances)
NUMA node(s): 4
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279,304-311
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287,312-319

ebizzy -t 40 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel   N  Min       Max       Median    Avg        Stddev     %Change
v6.5     5  4966196   5148045   5078348   5072977.4  66572.122
+patch   5  5035210   5232882   5158456   5151734    78906.893  1.55247

From lparstat (when the workload stabilized)
Kernel   %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw       phint
v6.5     12.58  0.02  0.00   87.41  40.00  100.00  12.59  55.97  1029603    82
+patch   12.58  0.02  0.00   87.40  21.16  52.90   12.60  74.82  1188571    657

ebizzy -t 80 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel   N  Min       Max       Median    Avg        Stddev     %Change
v6.5     5  10081713  10162128  10145721  10128119   35603.196
+patch   5  9928483   10430256  10338097  10218466   221155.16  0.892041

From lparstat (when the workload stabilized)
Kernel   %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw       phint
v6.5     25.02  0.06  0.00   74.93  40.00  100.00  25.07  55.99  1530297    92
+patch   25.03  0.04  0.00   74.93  40.00  100.00  25.07  55.99  2475875    667

ebizzy -t 160 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel   N  Min       Max       Median    Avg        Stddev     %Change
v6.5     5  9064802   9169798   9115250   9123968.2  44901.261
+patch   5  9064533   9235200   9072374   9119558.2  76260.411  -0.0483342

From lparstat (when the workload stabilized)
Kernel   %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw       phint
v6.5     49.94  0.03  0.00   50.03  40.06  100.15  49.97  55.99  2058879    93
+patch   49.94  0.03  0.00   50.03  40.06  100.15  49.97  55.99  2058879    93
-------------------------------------------------------------------------

Observation:
We see improved ebizzy throughput even with lower core utilization
(almost half the cores consumed) in low utilization scenarios, while
throughput is retained in mid and higher utilization scenarios.
Note: These numbers are for the Uncapped + no-noise case. In the Capped
and/or noisy case, contention on the cores is expected to make the
improvement larger.
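
One way to read the "almost half the core utilization" remark is to compare
physc between the two kernels and to normalize throughput by cores consumed.
The sketch below does this for the "ebizzy -t 40" run on the 128 Virtual
processor LPAR. Records per second per consumed core is a derived metric
introduced here for illustration only; it is not reported by ebizzy or
lparstat.

```python
# kernel -> (avg records/s, physc), copied from the "ebizzy -t 40" tables
# for the 128 Virtual processor configuration in this mail.
samples = {
    "v6.5": (5043050.2, 40.06),
    "+patch": (5040333.8, 21.15),
}


def records_per_core(avg_records, physc):
    """Throughput per consumed physical core (derived, illustrative metric)."""
    return avg_records / physc


physc_ratio = samples["+patch"][1] / samples["v6.5"][1]
print(f"physc ratio (+patch / v6.5): {physc_ratio:.2f}")

for kernel, (avg, physc) in samples.items():
    print(f"{kernel}: {records_per_core(avg, physc):.0f} records/s per core")
```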

Srikar Dronamraju (4):
powerpc/smp: Cache CPU has Asymmetric SMP
powerpc/smp: Move shared_processor static key to smp.h
powerpc/smp: Enable Asym packing for cores on shared processor
powerpc/smp: Disable MC domain for shared processor

arch/powerpc/include/asm/paravirt.h | 12 -----------
arch/powerpc/include/asm/smp.h | 14 +++++++++++++
arch/powerpc/kernel/smp.c | 31 +++++++++++++++++++----------
3 files changed, 35 insertions(+), 22 deletions(-)

--
2.41.0