[RFC 00/12] ARM: MPAM: add support for priority partitioning control

From: Amit Singh Tomar
Date: Tue Aug 15 2023 - 11:28:40 EST


Arm Memory System Resource Partitioning and Monitoring (MPAM) supports
different controls that can be applied to different resources in the system.
One example is the optional priority partitioning control, where a priority
value generated by one MSC propagates over the interconnect to another MSC
(known as downstream priority), or is applied within an MSC for internal
operations.

The Marvell implementation of ARM MPAM supports a priority partitioning control
that allows the LLC MSC to generate priority values that are propagated (along
with the read/write requests from upstream) to the DDR block. Within the DDR
block, the priority values are mapped to different traffic classes under the
DDR QoS strategy. The link[1] gives some idea about the DDR QoS strategy and
terms like LPR, VPR and HPR.

Set up priority partitioning control under Resource control
-----------------------------------------------------------
At present, resource control (resctrl) provides a basic interface to configure
CAT (Cache Allocation Technology) and MBA (Memory Bandwidth Allocation)
capabilities. ARM MPAM uses it to support controls such as cache portion
partitioning (CPOR) and MPAM bandwidth partitioning.

For example, the "schemata" file in a resource control group contains the cache
portion bitmaps and the memory bandwidth allocation, which are used to configure
the cache portion partitioning (CPOR) and MPAM bandwidth partitioning controls:

MB:0=0100
L3:0=ffff

However, resctrl doesn't provide a way to set up the other controls that ARM
MPAM provides (for instance, the priority partitioning control mentioned above).
To support this, James has suggested using the existing schemata file, to stay
compatible with portable software. The main idea behind this RFC is to start a
discussion on how resctrl can be extended to support priority partitioning
control.

To support priority partitioning control, the "schemata" file is updated to
accommodate a priority field (when the priority partitioning capability is
detected), separated from the CPBM by the delimiter ",".

L3:0=ffff,f

where "f" is the downstream priority (dspri) value, here the maximum.

The dspri value is programmed per partition and can be used to override the
QoS value coming from upstream (CPU).
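
As a rough illustration of the proposed layout, the sketch below splits one
extended schemata line into its fields. It assumes the form
<resource>:<domain>=<cpbm>[,<dspri>] described above and handles a single
domain per line only. The function name parse_schemata_line is hypothetical
and is not part of the patch set or of resctrl.

```shell
#!/bin/sh
# Hypothetical helper: split "<resource>:<domain>=<cpbm>[,<dspri>]".
# Only one domain per line is handled in this sketch.
parse_schemata_line() {
    line=$1
    resource=${line%%:*}            # text before the first ':'  -> L3
    rest=${line#*:}                 # text after the first ':'   -> 0=ffff,f
    domain=${rest%%=*}              # text before '='            -> 0
    value=${rest#*=}                # text after '='             -> ffff,f
    cpbm=${value%%,*}               # cache portion bitmap       -> ffff
    case $value in
        *,*) dspri=${value#*,} ;;   # priority field present     -> f
        *)   dspri= ;;              # no priority field
    esac
    echo "resource=$resource domain=$domain cpbm=$cpbm dspri=${dspri:-none}"
}

parse_schemata_line "L3:0=ffff,f"   # resource=L3 domain=0 cpbm=ffff dspri=f
parse_schemata_line "L3:0=ffff"     # resource=L3 domain=0 cpbm=ffff dspri=none
```

A line without the "," delimiter is reported with dspri=none, which mirrors
the case where the priority partitioning capability is not detected.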

The RFC patch set[2] is based on James Morse's MPAM snapshot[3] for 6.2, and the
ACPI table is based on DEN0065A_MPAM_ACPI_2.0.

Test set-up and results:
------------------------

The downstream priority value feeds into the DRAM controller. One of the
important things the controller does with this value is to service requests
sooner (based on the traffic class), reducing latency without affecting
performance.

The DDR QoS traffic classes cover the following priority values:

 0-5  ----> Low priority value (LPR)
 6-10 ----> Medium priority value (VPR)
11-15 ----> High priority value (HPR)
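
The ranges above can be expressed as a small lookup. This is only a sketch of
the mapping listed in this cover letter, not of the DRAM controller logic. The
function name dspri_class is hypothetical, and the argument is assumed to be a
hexadecimal value without prefix, as written in the schemata file.

```shell
#!/bin/sh
# Hypothetical helper: map a hex DSPRI value (0..f) to its traffic class.
dspri_class() {
    val=$(( 0x$1 ))                 # schemata values are hex: a -> 10, f -> 15
    if [ "$val" -le 5 ]; then
        echo low
    elif [ "$val" -le 10 ]; then
        echo medium
    elif [ "$val" -le 15 ]; then
        echo high
    else
        echo "out of range"
        return 1
    fi
}

dspri_class 5    # low
dspri_class a    # medium
dspri_class f    # high
```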

The benchmark used is multichase[4].

Two partitions, P1 and P2:

Partition P1:
-------------
Assigned core 0
100% BW assignment

Partition P2:
-------------
Assigned cores 1-79
100% BW assignment
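
The "cpus" file of a resctrl group takes a hexadecimal CPU bitmask written as
comma-separated 32-bit groups, most significant group first. The sketch below
shows one way to derive the masks for the two partitions from a core range.
The function name cpu_mask is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the CPU bitmask for cores <first>..<last>
# as comma-separated 32-bit hex groups, most significant group first.
cpu_mask() {
    first=$1
    last=$2
    word=$(( last / 32 ))           # index of the most significant 32-bit group
    mask=
    while [ "$word" -ge 0 ]; do
        bits=0
        bit=0
        while [ "$bit" -lt 32 ]; do
            cpu=$(( word * 32 + bit ))
            if [ "$cpu" -ge "$first" ] && [ "$cpu" -le "$last" ]; then
                bits=$(( bits | (1 << bit) ))
            fi
            bit=$(( bit + 1 ))
        done
        if [ -z "$mask" ]; then
            mask=$(printf '%x' "$bits")             # top group: no zero padding
        else
            mask="$mask,$(printf '%08x' "$bits")"   # lower groups: 8 hex digits
        fi
        word=$(( word - 1 ))
    done
    echo "$mask"
}

cpu_mask 0 0     # P1, core 0      -> 1
cpu_mask 1 79    # P2, cores 1-79  -> ffff,ffffffff,fffffffe
```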

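Test Script:
------------
mkdir /sys/fs/resctrl/p1
echo 1 > /sys/fs/resctrl/p1/cpus                        # CPU mask: core 0
echo "L3:1=8000,5" > /sys/fs/resctrl/p1/schemata        ##### DSPRI set to 5 (LPR)
echo "MB:0=100" > /sys/fs/resctrl/p1/schemata

mkdir /sys/fs/resctrl/p2
echo ffff,ffffffff,fffffffe > /sys/fs/resctrl/p2/cpus   # CPU mask: cores 1-79
echo "L3:1=8000,0" > /sys/fs/resctrl/p2/schemata
echo "MB:0=100" > /sys/fs/resctrl/p2/schemata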

### Loaded latency run: core 0 does chaseload (pointer chase) with low priority value 5, and cores 1-79 do the memory bandwidth run ###
./multiload -v -n 10 -t 80 -m 1G -c chaseload

echo "L3:1=8000,a" > /sys/fs/resctrl/p1/schemata        ##### DSPRI set to 0xa (VPR)

### Loaded latency run: core 0 does chaseload (pointer chase) with medium priority value 0xa, and cores 1-79 do the memory bandwidth run ###
./multiload -v -n 10 -t 80 -m 1G -c chaseload

echo "L3:1=8000,f" > /sys/fs/resctrl/p1/schemata        ##### DSPRI set to 0xf (HPR)

### Loaded latency run: core 0 does chaseload (pointer chase) with high priority value 0xf, and cores 1-79 do the memory bandwidth run ###
./multiload -v -n 10 -t 80 -m 1G -c chaseload

Results[5]:

LPR latency is 204.862 ns vs. VPR latency of 161.018 ns vs. HPR latency of
134.210 ns (ChaseNS column, which matches ChasBEST; the corresponding ChasAVG
values are 205.848 ns, 161.458 ns and 134.548 ns).
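
To put the three figures in relation, the relative latency reduction against
the LPR run can be computed as (base - new) / base. The snippet below is a
minimal sketch using awk for the floating-point arithmetic. The function name
reduction is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: percentage reduction of <new> relative to <base>.
reduction() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

reduction 204.862 161.018    # VPR vs. LPR -> 21.4
reduction 204.862 134.210    # HPR vs. LPR -> 34.5
```

With the numbers above this gives roughly 21.4% lower latency for VPR and
34.5% lower latency for HPR, both relative to LPR.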

[1]: https://drops.dagstuhl.de/opus/volltexte/2021/13934/pdf/LIPIcs-ECRTS-2021-3.pdf
[2]: https://github.com/Amit-Radur/linux/commits/mpam_downstream_priority_work
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.2
[4]: https://github.com/google/multichase
[5]:

root@localhost:# ./dspri_test.sh
Info: Loaded Latency chase selected. A -l memload can be used to select a specific memory load
nr_threads = 80
page_size = 4096 bytes
total_memory = 1073741824 (1024.0 MiB)
stride = 256
tlb_locality = 262144
chase = chaseload
memload = stream-sum
run_test_type = RUN_CHASE_LOADED
main: sample_no=0
main: sample_no=1 avg=204.9(ns)
main: threads=79, Total(MiB/s)=343018.0, PerThread=4342
main: sample_no=2 avg=206.0(ns)
main: threads=79, Total(MiB/s)=343038.0, PerThread=4342
main: sample_no=3 avg=206.4(ns)
main: threads=79, Total(MiB/s)=342443.0, PerThread=4335
main: sample_no=4 avg=206.3(ns)
main: threads=79, Total(MiB/s)=345156.0, PerThread=4369
main: sample_no=5 avg=205.6(ns)
main: threads=79, Total(MiB/s)=343807.0, PerThread=4352
main: sample_no=6 avg=205.9(ns)
main: threads=79, Total(MiB/s)=343593.0, PerThread=4349
main: sample_no=7 avg=206.3(ns)
main: threads=79, Total(MiB/s)=344770.0, PerThread=4364
main: sample_no=8 avg=205.7(ns)
main: threads=79, Total(MiB/s)=344935.0, PerThread=4366
main: sample_no=9 avg=205.3(ns)
main: threads=79, Total(MiB/s)=343189.0, PerThread=4344
main: sample_no=10 avg=206.1(ns)
main: threads=79, Total(MiB/s)=344455.0, PerThread=4360
ChasAVG=205.848485, ChasGEO=205.847944, ChasBEST=204.861518, ChasWORST=206.443386, ChasDEV=0.008
LdAvgMibs=343840.400000, LdMaxMibs=345156.000000, LdMinMibs=342443.000000, LdDevMibs=0.008
Samples , Byte/thd , ChaseThds , ChaseNS , ChaseMibs , ChDeviate , LoadThds , LdMaxMibs , LdAvgMibs , LdDeviate , ChaseArg , MemLdArg
10 , 1073741824 , 1 , 204.862 , 37 , 0.008 , 79 , 345156 , 343840 , 0.008 , chaseload , stream-sum
Info: Loaded Latency chase selected. A -l memload can be used to select a specific memory load
nr_threads = 80
page_size = 4096 bytes
total_memory = 1073741824 (1024.0 MiB)
stride = 256
tlb_locality = 262144
chase = chaseload
memload = stream-sum
run_test_type = RUN_CHASE_LOADED
main: sample_no=0
main: sample_no=1 avg=161.4(ns)
main: threads=79, Total(MiB/s)=342023.0, PerThread=4329
main: sample_no=2 avg=161.3(ns)
main: threads=79, Total(MiB/s)=341773.0, PerThread=4326
main: sample_no=3 avg=161.4(ns)
main: threads=79, Total(MiB/s)=342780.0, PerThread=4339
main: sample_no=4 avg=161.6(ns)
main: threads=79, Total(MiB/s)=341275.0, PerThread=4320
main: sample_no=5 avg=161.0(ns)
main: threads=79, Total(MiB/s)=342680.0, PerThread=4338
main: sample_no=6 avg=161.9(ns)
main: threads=79, Total(MiB/s)=341538.0, PerThread=4323
main: sample_no=7 avg=161.5(ns)
main: threads=79, Total(MiB/s)=345302.0, PerThread=4371
main: sample_no=8 avg=161.5(ns)
main: threads=79, Total(MiB/s)=341352.0, PerThread=4321
main: sample_no=9 avg=161.5(ns)
main: threads=79, Total(MiB/s)=341200.0, PerThread=4319
main: sample_no=10 avg=161.5(ns)
main: threads=79, Total(MiB/s)=341874.0, PerThread=4328
ChasAVG=161.458012, ChasGEO=161.457856, ChasBEST=161.017587, ChasWORST=161.935907, ChasDEV=0.006
LdAvgMibs=342179.700000, LdMaxMibs=345302.000000, LdMinMibs=341200.000000, LdDevMibs=0.012
Samples , Byte/thd , ChaseThds , ChaseNS , ChaseMibs , ChDeviate , LoadThds , LdMaxMibs , LdAvgMibs , LdDeviate , ChaseArg , MemLdArg
10 , 1073741824 , 1 , 161.018 , 47 , 0.006 , 79 , 345302 , 342180 , 0.012 , chaseload , stream-sum
Info: Loaded Latency chase selected. A -l memload can be used to select a specific memory load
nr_threads = 80
page_size = 4096 bytes
total_memory = 1073741824 (1024.0 MiB)
stride = 256
tlb_locality = 262144
chase = chaseload
memload = stream-sum
run_test_type = RUN_CHASE_LOADED
main: sample_no=0
main: sample_no=1 avg=134.3(ns)
main: threads=79, Total(MiB/s)=345284.0, PerThread=4371
main: sample_no=2 avg=134.7(ns)
main: threads=79, Total(MiB/s)=345295.0, PerThread=4371
main: sample_no=3 avg=134.4(ns)
main: threads=79, Total(MiB/s)=344421.0, PerThread=4360
main: sample_no=4 avg=134.9(ns)
main: threads=79, Total(MiB/s)=343273.0, PerThread=4345
main: sample_no=5 avg=134.5(ns)
main: threads=79, Total(MiB/s)=345518.0, PerThread=4374
main: sample_no=6 avg=134.5(ns)
main: threads=79, Total(MiB/s)=346052.0, PerThread=4380
main: sample_no=7 avg=134.5(ns)
main: threads=79, Total(MiB/s)=342852.0, PerThread=4340
main: sample_no=8 avg=134.7(ns)
main: threads=79, Total(MiB/s)=345818.0, PerThread=4377
main: sample_no=9 avg=134.2(ns)
main: threads=79, Total(MiB/s)=344045.0, PerThread=4355
main: sample_no=10 avg=134.7(ns)
main: threads=79, Total(MiB/s)=344345.0, PerThread=4359
ChasAVG=134.547983, ChasGEO=134.547841, ChasBEST=134.210254, ChasWORST=134.863073, ChasDEV=0.005
LdAvgMibs=344690.300000, LdMaxMibs=346052.000000, LdMinMibs=342852.000000, LdDevMibs=0.009
Samples , Byte/thd , ChaseThds , ChaseNS , ChaseMibs , ChDeviate , LoadThds , LdMaxMibs , LdAvgMibs , LdDeviate , ChaseArg , MemLdArg
10 , 1073741824 , 1 , 134.210 , 57 , 0.005 , 79 , 346052 , 344690 , 0.009 , chaseload , stream-sum
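
The ChaseNS figures quoted in the results can be pulled out of such output
mechanically. The sketch below assumes the output was saved to a file (the
name dspri_test.log is hypothetical) and that each summary row directly
follows a header line starting with "Samples", as in the logs above. The
function name chase_ns is also hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the ChaseNS value (4th column) of every
# summary row that follows a "Samples , Byte/thd , ..." header line.
chase_ns() {
    awk -F' *, *' '
        /^Samples/ { want = 1; next }       # header: values are on the next line
        want       { print $4; want = 0 }   # 4th comma-separated column is ChaseNS
    '
}

# Example with one header/row pair taken from the LPR run:
printf '%s\n' \
    'Samples , Byte/thd , ChaseThds , ChaseNS , ChaseMibs , ChDeviate' \
    '10 , 1073741824 , 1 , 204.862 , 37 , 0.008' | chase_ns
```

For a saved log, the equivalent call would be: chase_ns < dspri_test.log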

Amit Singh Tomar (12):
arm_mpam: Handle resource instances mapped to different controls
arm_mpam: resctrl: Detect priority partitioning capability
arm_mpam: resctrl: Define new schemata format for priority partition
fs/resctrl: Obtain CPBM upon priority partition presence
fs/resctrl: Set-up downstream priority partition resources
fs/resctrl: Extend schemata read for priority partition control
arm_mpam: resctrl: Retrieve priority values from arch code
fs/resctrl: Schemata write only for intended resource
fs/resctrl: Extend schemata write for priority partition control
arm_mpam: resctrl: Facilitate writing downstream priority value
arm_mpam: Fix Downstream priority mask
arm_mpam: Program Downstream priority value

drivers/platform/mpam/mpam_devices.c | 38 +++++++--
drivers/platform/mpam/mpam_internal.h | 1 +
drivers/platform/mpam/mpam_resctrl.c | 64 +++++++++++---
fs/resctrl/ctrlmondata.c | 118 ++++++++++++++++++++++++--
fs/resctrl/rdtgroup.c | 30 +++++++
include/linux/resctrl.h | 12 +++
6 files changed, 235 insertions(+), 28 deletions(-)

--
2.25.1