Re: [PATCH 1/2] platform/x86/amd: Introduce AMD Address Translation Library

From: Yazen Ghannam
Date: Tue Aug 08 2023 - 15:08:27 EST


On 8/8/2023 11:58 AM, Borislav Petkov wrote:
> On Tue, Aug 08, 2023 at 11:18:07AM -0400, Yazen Ghannam wrote:
>> I think it would be better to avoid dependencies between independent things.
>
> If they really are independent then I guess. Not that it all ends up in
> a twisty dependency where you wish you had merged the two
> together. So think about all deps before you design this - it needs to
> handle all cases without hackery.


I agree. My goal with this is to avoid the hackery. :)

I guess we'll see how it goes...

>> For example, amd_smn_read() is mostly used in amd64_edac. EDAC was the
>> original user of SMN accesses, and all the SMN stuff could have been
>> included in EDAC. However, SMN is not specifically for EDAC, so it was added
>> to amd_nb.c to be commonly available. Currently, SMN accesses are done in
>> other modules. I don't think it would have been a good idea to force other
>> modules or subsystems to require EDAC to be used.
>
> What does that have to do with this? SMN access is generic and should be
> in amd_nb.c as it is needed by other stuff. EDAC, RAS, whatever are all
> users of that thing.


Right, but that's my point. The translation is "generic" and tied to the Data Fabric, which is the central "thing" in a modern AMD SoC. Anything that needs memory address translation for the fabric will use it.

>> This is my reasoning for a separate, independent module for the translation.
>> EDAC is the first user of this. But there will be future code that can
>> leverage this, like CXL, and even the MCE subsystem. And, yes, mce_amd may
>> be already loaded, but this isn't a given. A person may want MCE and CXL
>> support without wanting to use EDAC.
>
> Is that a real use case or just a hypothetical thing?


Real. There are actually two use cases for CXL. The first is memory error reporting and page offlining, which is analogous to MCA. The second is general memory online/offline support, which will be used for hot-plug/hot-swap cases.

In actuality, both CXL cases need the same functionality: take an AMD Data Fabric "normalized" address and translate it to a system physical address so the OS can take action.

>> Furthermore, some things using the translation will be built-in, so the
>> translation module will need to be built-in.
>
> This sounds weird.


Yes, and this is my intention for using the "imply" Kconfig thing you see in the second patch. Any config options that need this code will "imply" it. And the default option for this code will take the strictest setting of all the options that "imply" it.
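
A rough sketch of that arrangement follows. The symbol names, prompts, and the second user are illustrative assumptions, not the contents of the second patch:

```
# Hypothetical sketch; symbol names and prompts are illustrative only.
config AMD_ATL
	tristate "AMD Address Translation Library"
	help
	  Translates AMD Data Fabric "normalized" addresses to system
	  physical addresses.

# A user that may be modular: when enabled, it suggests AMD_ATL
# at the same level (m or y).
config EDAC_AMD64
	tristate "AMD64 EDAC driver (sketch)"
	imply AMD_ATL

# A hypothetical built-in-only user: when enabled, it suggests AMD_ATL=y.
config EXAMPLE_BUILTIN_USER
	bool "Hypothetical built-in user of the translation"
	imply AMD_ATL
```

Unlike "select", "imply" still lets AMD_ATL be turned off through its own prompt or dependencies, so users of the library would typically need stubs or an IS_REACHABLE() check for that case.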

>> I agree. And I don't think much of the existing things in EDAC should be
>> moved out. But this is new code, so there's an opportunity to have it in a
>> more appropriate place.
>>
>> And, thinking on it more, this could be another example for future "common
>> RAS" functionality. Isn't that why the CEC is in drivers/ras?
>
> It is there because it doesn't need EDAC at all. If your translation
> doesn't need EDAC and EDAC is going to be only a user of it, then good.


Yep, that's the intent.

> But if you're going to have to need the MCA error decoded by EDAC and
> then the error translation done by this thing, then you'd need to
> synchronize between the two. I'm not saying it is impossible - it should
> be well thought out first though before you go coding.


Yes, I agree. That's the reason for this code to take a raw struct mce, and do the translation independent of EDAC, etc. (for the MCA case).

The same will be the goal for CXL. This code will take a raw struct pci_dev, or whatever, and do the translation independent of other CXL code.

>> It seems like things go into EDAC because it's thought of as the de
>> facto RAS location. But why have something in EDAC if it doesn't
>> provide EDAC functionality? Other RAS things, like AER, APEI, etc.,
>> don't live in EDAC.
>
> AER is part of PCI so we haven't considered tying it into EDAC. And
> there wasn't any desire to do so.
>
> As to APEI, there's ghes_edac...


True. Though the intent for ghes_edac is for it to provide the EDAC functionality.

The other features that don't need EDAC functionality aren't in EDAC.


Going back to the dependency concerns... there are a lot of inter-dependencies between RAS code in various subsystems. Maybe some of these can be streamlined by moving common things to drivers/ras? But I guess this can be a discussion for a later time...

Thanks,
Yazen