Re: Initial testing of MPAM patches

From: James Morse
Date: Fri Aug 18 2023 - 11:52:05 EST


Hi Carl,

(CC: +Akanksha)

On 18/08/2023 15:13, Carl Worth wrote:
> I'm just getting the chance to start testing your MPAM patches on an
> Ampere implementation. Specifically, I was working with your
> mpam/snapshot/v6.3 branch [1], an earlier version of which you posted to
> the list [2].
>
> I have a few initial comments/questions. These are mostly from a
> black-box point of view, (without yet having looked at the code
> much). Later on, when I do dig into the code, I can follow up on some
> of these issues within the context of the relevant patches.


> 1. Is there a way to query the MPAM PARTID for a particular resctrl group?

Deliberately: no.

This is private to the kernel, the way these get allocated may change in the future, so
user-space must not depend on it.

AMD want to add debug hooks for this - in a way that can't work properly on MPAM as there
is no value that directly corresponds with RMID.


> I see a file named "id" in the directory for each resource group,
> but I get "Operation not supported" if I try to cat it.

Try mounting with '-o this_is_not_abi'.
That file is for passing an identifier from resctrl to resctrl_pmu so that the kernel can
query the control-group properties internally.
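
Untested, but something like this should let you read it back - the group name here is just
an example:

  mount -t resctrl resctrl -o this_is_not_abi /sys/fs/resctrl
  mkdir /sys/fs/resctrl/my_group
  cat /sys/fs/resctrl/my_group/id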


> Meanwhile,
> in the code it looks like PARTIDs are being XORed with some random
> number. Is there a deliberate attempt to obscure the PARTID?

There is. Otherwise some joker will hard-code the values in their user-space script instead
of reading the id from resctrl - and scream regression if any of the kernel internals that
allocate a PARTID change.

If you 'know' the default control group has all zeros for its 'id', you can use that to
xor out the set - but you are on your own!
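
Roughly, assuming the default group's 'id' is the file at the top of the mount and both
values read back as plain numbers (check what the file actually prints):

  mask=$(cat /sys/fs/resctrl/id)               # default group is partid 0, so this is the mask
  obscured=$(cat /sys/fs/resctrl/my_group/id)
  echo $(( mask ^ obscured ))                  # the 'real' partid, probably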


The general theme here is I don't trust user-space not to depend on any value exposed.


> I don't know how much an end user will care about PARTID values,
> (so it's nice that the driver manages these implicitly), but for
> me, while debugging this stuff, it would be nice to be able to
> query them.

This would only matter if you could somehow inspect the hardware - which you probably can
- but users of deployed systems can't.

Sorry if this isn't the answer you want, but I'm trying to only publish patches to
kernel.org that I intend to upstream in some form.
There are a few one-offs if some debug needs something complicated, but I don't carry them
around as the tree is big enough as it is. Rebasing feedback for the series on the list is
painful enough already!

You can probably just hack out the safety-nets around the 'id' file and use that for your
platform bringup.


> 2. Top-level resctrl resource group is not being made to take effect
>
> When I poke at the schemata of the top-level resource group, I can
> see that it is associated with PARTID 1.

That was a transient attempt to allow interrupts to switch to a reserved PARTID to avoid
inheritance. The folk who wanted that behaviour said it doesn't solve their problem, so
I've dropped it in later versions.


> That is, if I do something like:
>
> echo L3:1=face > /sys/fs/resctrl/schemata
>
> I can see the value 0xface showing up in the cache portion bitmap
> registers associated with PARTID 1. So far, so good.
>
> But meanwhile, I am not seeing the MPAM0_EL1 system register being
> modified to associate the various cores in this resource group with
> PARTID 1.

This should have already been done by the call to mpam_resctrl_move_all_users_tasks().


> In contrast, for any lower-level resource group I create, I do see
> MPAM0_EL1 being set with PARTID values for the cores that I put
> into the 'cpus' node.
>
> So it appears that PARTID 1 and its top-level resource group will
> have no effect currently. Am I seeing that correctly?

This sounds like a bug in the patches that shift the 'default' resctrl group. They only
appeared in one version of the tree, so it's probably best to just rebase or revert them.

That code may come back if the behaviour of hardware the mpam driver doesn't know how to
program is severely affected by resctrl messing with the configuration of PARTID-0. The
scary case is slow delivery of LPIs because the GIC ITS is being throttled.

My current view is that platforms are likely to be short of PARTIDs, as some folk are also
using them for monitoring, so reserving a PARTID for this probably isn't the best use of a
scarce resource.


> 3. Is there special treatment of cpu 0?
>
> When I put cpu 0 into a resource group I see both MPAM2_EL2 as well
> as MPAM0_EL1 for cpu 0 being set to the group's corresponding
> PARTID. But when I set any cpu other than 0, only MPAM0_EL1 is
> being set for that cpu. Is this the desired behavior?

That sounds strange.

The versions of mpam_thread_switch() you have always touch MPAM0_EL1 and MPAM1_EL1
together.

How are you identifying that it's MPAM2_EL2 being touched? The switch_to() code only uses
MPAM1_EL1; these have different encodings but access the same register if HCR_EL2.E2H is
set. If it really is the MPAM2_EL2 encoding, that suggests the KVM code is being run.

> I know that PARTID 0 is treated as reserved by the code, but is cpu
> 0 given any special treatment?

No - can you reproduce this on the latest branch?


> 4. The current schemata allows for cache portion, but not cache capacity

See KNOWN_ISSUES:
| Only features that match what resctrl already supports are supported.
| This is very deliberate.


> The schemata file allows me to specify a bitmask to control
> cache-portion partitioning. But I don't see any mechanism (nor
> backing code) to control cache capacity partitioning.
>
> Is this due to a limitation in mapping MPAM to the current resctrl
> interface?

It is. Getting feature parity with x86 is the critical path to getting this upstream.
Supporting other bits of MPAM can come next - we'd need a discussion with Intel about how
any changes should be done, so that they can support them too if they ever have similar
features.

This conversation can't happen until we have some support upstream.


> Or would it be easy to extend the schemata to include a
> cache capacity field as well?

Probably - it just needs doing in a way that stands a chance of being portable to other
architectures.


> I see that Amit has proposed patches recently for extending the
> schemata to add priority partitioning control. So I'd like to do
> something similar for capacity partitioning control as well.

Beware this may be premature, as you can't assume mainline will merge the same
user-visible interface. Having out of tree users won't be a compelling argument.

To quote KNOWN_ISSUES:
| Until this code is merged by mainline, we should rigidly stick to the user-space
| interface as it is supported by Intel RDT. This is the only way to ensure
| user-space code remains portable.
|
| Once the code is merged, adding support for the missing pieces, and how
| this impacts user-space can be discussed.


I haven't got to Amit's patches yet, but I plan to pick things like this into the tree on
kernel.org so it's all in one place.


Different ways of configuring an existing resource should be low risk, as where the
domain-ids come from should be the same as for the existing schema.


> 5. Linked-list corruption with missing cache entries in PPTT
>
> At one point, I tried booting with the MPAM ACPI table populated
> for my L3 cache, but without the necessary entries in the PPTT ACPI
> table. The driver fell over with linked-list corruption, halting
> Linux boot. I'll follow up this report with more details.

This kind of thing won't have seen much testing. Any details you can share would help!


> 6. No support for memory-side caches
>
> MPAM allows for controlling memory-side caches, and the ACPI
> specification allows them to be described. But the current code
> appears to ignore any such caches, and won't expose them via
> resctrl. I'm very interested to know what work would need to be
> done to allow a memory-side cache to be supported.

This is another case of waiting for the code to be upstream so we can have a discussion
about new schema with other architectures.

I'd argue these should use some identifier that is visible elsewhere in sysfs as the
domain-id. This needs discussing with Intel/AMD as the way they would build a similar
feature may have a bearing on the user-space interface.


> 7. Cache portion configuration for incorrect PARTID is applied
>
> I've seen a case where, when trying to use a PARTID other than 1,
> (that is, a resource group other than the top-level), the
> configuration I've set in that resource group does not take
> effect. Instead, the configuration for PARTID 1 takes effect.
>
> Querying the controlling MPAM registers does not obviously explain
> the buggy behavior. Things look correct. I'll be examining this
> case more closely to see what's happening.

I'd be curious what the monitors report when the control group with PARTID 1 is empty -
that should give you a hint as to whether they are seeing traffic with the wrong label, or
something else is going awry.
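
e.g. if your platform has the CSU monitors wired up, something like this (the mon_L3_00
directory name depends on the cache-ids on your system):

  cat /sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy
  cat /sys/fs/resctrl/my_group/mon_data/mon_L3_00/llc_occupancy

Comparing the two after moving everything into the other group should show where the
traffic is being counted.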

There are some one-off cranky debug patches on my latest tree that may help debug this. If
you end up using them I can share the log of what I get as the expected behaviour on the
FVP/FPGA.


> So, that's what I've encountered on an initial look. I didn't call out
> all the things that are obviously working well, but there's a lot
> there. So that hasn't gone unnoticed.
>
> Thanks again for this series, and I'm looking forward to engaging on
> it more going forward.



Thanks,

James