Re: [PATCH v6 5/5] EDAC/amd64: Enumerate memory on Aldebaran GPU nodes

From: Chatradhi, Naveen Krishna
Date: Mon Nov 15 2021 - 10:25:33 EST


Hi Boris

On 11/11/2021 6:42 PM, Borislav Petkov wrote:
[CAUTION: External Email]

On Thu, Oct 28, 2021 at 06:31:06PM +0530, Naveen Krishna Chatradhi wrote:
On newer heterogeneous systems with AMD CPUs, the data fabrics of the GPUs
are connected directly via custom links.

One such system, where Aldebaran GPU nodes are connected to the
Family 19h, model 30h family of CPU nodes, the Aldebaran GPUs can report
memory errors via SMCA banks.

Aldebaran GPU support was added to DRM framework
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Famd-gfx%2F2021-February%2F059694.html&data=04%7C01%7CNaveenKrishna.Chatradhi%40amd.com%7Ca6aa66df219641c880d008d9a514e5bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637722331481685552%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Pnx%2FQxgmXS2MfWzu8lVEs22As3yJSWoAsjOvp%2FFOJYw%3D&reserved=0

The GPU nodes comes with HBM2 memory in-built, ECC support is enabled by
default and the UMCs on GPU node are different from the UMCs on CPU nodes.

GPU specific ops routines are defined to extend the amd64_edac
module to enumerate HBM memory leveraging the existing edac and the
amd64 specific data structures.

Note: The UMC Phys on GPU nodes are enumerated as csrows and the UMC
channels connected to HBM banks are enumerated as ranks.
For all your future commit messages, from
Documentation/process/submitting-patches.rst:

"Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
to do frotz", as if you are giving orders to the codebase to change
its behaviour."

Also, do not talk about what your patch does - that should hopefully be
visible in the diff itself. Rather, talk about *why* you're doing what
you're doing.
Sure.

Cc: Yazen Ghannam <yazen.ghannam@xxxxxxx>
Co-developed-by: Muralidhara M K <muralimk@xxxxxxx>
Signed-off-by: Muralidhara M K <muralimk@xxxxxxx>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@xxxxxxx>
Link: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20210823185437.94417-4-nchatrad%40amd.com&amp;data=04%7C01%7CNaveenKrishna.Chatradhi%40amd.com%7Ca6aa66df219641c880d008d9a514e5bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637722331481685552%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=aD3jFWS4RrPrIRF5mFoggBs%2BYqUseU3adJFqlrbD2D8%3D&amp;reserved=0
---
Changes since v5:
Removed else condition in per_family_init for 19h family

Changes since v4:
Split "f17_addr_mask_to_cs_size" instead as did in 3rd patch earlier

Changes since v3:
1. Bifurcated the GPU code from v2

Changes since v2:
1. Restored line deletions and handled minor comments
2. Modified commit message and some of the function comments
3. variable df_inst_id is introduced instead of umc_num

Changes since v1:
1. Modifed the commit message
2. Change the edac_cap
3. kept sizes of both cpu and noncpu together
4. return success if the !F3 condition true and remove unnecessary validation

drivers/edac/amd64_edac.c | 298 +++++++++++++++++++++++++++++++++-----
drivers/edac/amd64_edac.h | 27 ++++
2 files changed, 292 insertions(+), 33 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 7953ffe9d547..b404fa5b03ce 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1121,6 +1121,20 @@ static void debug_display_dimm_sizes_df(struct amd64_pvt *pvt, u8 ctrl)
}
}

+static void debug_display_dimm_sizes_gpu(struct amd64_pvt *pvt, u8 ctrl)
+{
+ int size, cs = 0, cs_mode;
+
+ edac_printk(KERN_DEBUG, EDAC_MC, "UMC%d chip selects:\n", ctrl);
+
+ cs_mode = CS_EVEN_PRIMARY | CS_ODD_PRIMARY;
+
+ for_each_chip_select(cs, ctrl, pvt) {
+ size = pvt->ops->dbam_to_cs(pvt, ctrl, cs_mode, cs);
+ amd64_info(EDAC_MC ": %d: %5dMB\n", cs, size);
+ }
+}
+
static void __dump_misc_regs_df(struct amd64_pvt *pvt)
{
struct amd64_umc *umc;
@@ -1165,6 +1179,27 @@ static void __dump_misc_regs_df(struct amd64_pvt *pvt)
pvt->dhar, dhar_base(pvt));
}

+static void __dump_misc_regs_gpu(struct amd64_pvt *pvt)
The function pointer this gets assigned to is called *get*_misc_regs.
But this function is is doing *dump*.

When I see __dump_misc_regs_gpu() I'd expect this function to be called
by a higher level dump_misc_regs() as the "__" denotes layering usually.

There is, in fact, dump_misc_regs() but that one calls
->get_misc_regs().

So this all needs fixing - right now I see a mess.

Also, there's <function_name>_gpu and gpu_<function_name> with the "gpu"
being either prefix or suffix. You need to fix the current ones to be
only prefixes - in a pre-patch - and then add yours as prefixes only too.
Will modify the names to use prefixes

And in talking about pre-patches - this one is doing a bunch of things
at once and needs splitting. You do the preparatory work like carving
out common functionality and other refactoring in pre-patches, and then
you add the new functionality ontop.
Sure, i will split this.

Also, I don't like ->is_gpu one bit, it being sprinkled like that around
the code. This says that the per-family attrs and ops assignment is
lacking.
Yes, adding few more ops routines should help get rid of this if conditions.


@@ -2982,7 +3132,17 @@ static void decode_umc_error(int node_id, struct mce *m)

pvt->ops->get_umc_err_info(m, &err);

- if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, err.channel, &sys_addr)) {
+ /*
+ * GPU node has #phys[X] which has #channels[Y] each.
+ * On GPUs, df_inst_id = [X] * num_ch_per_phy + [Y].
+ * On CPUs, "Channel"="UMC Number"="DF Instance ID".
+ */
This comment doesn't look like it is destined for human beings to read
but maybe to be parsed by programs?

The present system supports the following setup

A GPU node includes four Unified Memory Controllers (UMC). Each UMC contains eight UMCCH instances.

Each UMCCH controls one 128-bit HBM2e (2GB) channel. The UMC interfaces to the system via the Data Fabric (DF) Coherent Slave.


For a generalized function, i will modify the comment as

"A GPU node has 'X' number of UMC PHYs and 'Y' number of UMCCH instances each. This results in 'X*Y' number of

instances in the Data Fabric. Therefore the Data Fabric ID of an instance can be found with the following formula:

df_inst_id = 'X' * number of channels per PHY + 'Y'


+ if (pvt->is_gpu)
+ df_inst_id = (err.csrow * pvt->channel_count / mci->nr_csrows) + err.channel;
I'm not sure we want to log ECCs from the GPUs: what is going to happen
to them, how does the further error recovery look like?

AMD GPU has support for memory error reporting via the SMCA register banks, functionality is similar to the CPU dies.

The customer/end users of this product want to count memory errors and they want to do this in the OS using similar

mechanisms as they are used to on the CPU side.


Also, EDAC sysfs structure currently has only DIMMs, below's from my
box.

I don't see how that structure fits the GPUs so here's the $10^6
question: why does EDAC need to know about GPUs?

The errors are not specific to GPUs, the errors are originating from HBM2e memory chips on the GPU.

As a first step, I'm trying to leverage the existing EDAC interfaces to report the HBM errors.

Page retirement and storing the bad pages info on a persistent storage can be the next steps.


What is the strategy here?

Your 0th message talks about the "what" but what gets added is not
important - the "why" is.
Sure, i will update with the why.

$ grep -r . /sys/devices/system/edac/mc/ 2>/dev/null
/sys/devices/system/edac/mc/power/runtime_active_time:0
/sys/devices/system/edac/mc/power/runtime_status:unsupported
/sys/devices/system/edac/mc/power/runtime_suspended_time:0
/sys/devices/system/edac/mc/power/control:auto
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc0/rank7/dimm_ue_count:0
/sys/devices/system/edac/mc/mc0/rank7/dimm_mem_type:Unbuffered-DDR4
/sys/devices/system/edac/mc/mc0/rank7/power/runtime_active_time:0
/sys/devices/system/edac/mc/mc0/rank7/power/runtime_status:unsupported
/sys/devices/system/edac/mc/mc0/rank7/power/runtime_suspended_time:0
/sys/devices/system/edac/mc/mc0/rank7/power/control:auto
/sys/devices/system/edac/mc/mc0/rank7/dimm_dev_type:Unknown
/sys/devices/system/edac/mc/mc0/rank7/size:8192
/sys/devices/system/edac/mc/mc0/rank7/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/rank7/dimm_label:mc#0csrow#3channel#1
/sys/devices/system/edac/mc/mc0/rank7/dimm_location:csrow 3 channel 1
/sys/devices/system/edac/mc/mc0/rank7/dimm_edac_mode:SECDED
/sys/devices/system/edac/mc/mc0/topmem:0x00000000e0000000
/sys/devices/system/edac/mc/mc0/mc_name:F17h
/sys/devices/system/edac/mc/mc0/rank5/dimm_ue_count:0
/sys/devices/system/edac/mc/mc0/rank5/dimm_mem_type:Unbuffered-DDR4
/sys/devices/system/edac/mc/mc0/rank5/power/runtime_active_time:0
/sys/devices/system/edac/mc/mc0/rank5/power/runtime_status:unsupported
/sys/devices/system/edac/mc/mc0/rank5/power/runtime_suspended_time:0
/sys/devices/system/edac/mc/mc0/rank5/power/control:auto
/sys/devices/system/edac/mc/mc0/rank5/dimm_dev_type:Unknown
/sys/devices/system/edac/mc/mc0/rank5/size:8192
/sys/devices/system/edac/mc/mc0/rank5/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/rank5/dimm_label:mc#0csrow#2channel#1
/sys/devices/system/edac/mc/mc0/rank5/dimm_location:csrow 2 channel 1
/sys/devices/system/edac/mc/mc0/rank5/dimm_edac_mode:SECDED
/sys/devices/system/edac/mc/mc0/dram_hole:0 0 0
/sys/devices/system/edac/mc/mc0/ce_noinfo_count:0
...

--
Regards/Gruss,
Boris.

Regards,

Naveen


https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.kernel.org%2Ftglx%2Fnotes-about-netiquette&amp;data=04%7C01%7CNaveenKrishna.Chatradhi%40amd.com%7Ca6aa66df219641c880d008d9a514e5bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637722331481685552%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=fw%2B%2F8rCVRhhZ0xp8XLUW5rvoDUnfNQzwJleHVoPmDkw%3D&amp;reserved=0