Re: [PATCH 6/7] EDAC/amd64: Add error instance get_err_info() to pvt->ops

From: Yazen Ghannam
Date: Fri Jul 21 2023 - 10:48:09 EST


On 7/20/2023 8:54 AM, Muralidhara M K wrote:
From: Muralidhara M K <muralidhara.mk@xxxxxxx>

On CPUs, the data fabric ID of an instance is equal to the UMC number.
Since the UMC number and channel are equal in CPU nodes, the channel
can be used as the data fabric ID of the instance.

A GPU node has 'X' PHYs and 'Y' channels per PHY, which results in
'X*Y' instances in the data fabric. The data fabric ID of an instance
on a GPU is therefore:
df_inst_id = PHY index * number of channels per PHY + channel index
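
A minimal sketch of the GPU calculation described above, assuming the instance ID is formed from a zero-based PHY index and a zero-based channel index within that PHY. The function name and its parameters are hypothetical and are not taken from the patch; they only illustrate the arithmetic.

```c
/*
 * Hypothetical helper, not part of the patch: flatten a (PHY, channel)
 * pair into a single data fabric instance ID.
 *
 * Assumptions: phy and channel are zero-based, and every PHY has the
 * same number of channels (channels_per_phy).
 */
static unsigned int gpu_df_inst_id(unsigned int phy, unsigned int channel,
				   unsigned int channels_per_phy)
{
	return phy * channels_per_phy + channel;
}
```

With 8 channels per PHY as an example value, channel 3 on PHY 2 would map to instance 2 * 8 + 3 = 19 under this reading.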

Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@xxxxxxx>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@xxxxxxx>
Signed-off-by: Muralidhara M K <muralidhara.mk@xxxxxxx>
---
drivers/edac/amd64_edac.c | 36 +++++++++++++++++++++++++++++++++++-
drivers/edac/amd64_edac.h | 2 ++
2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 45d8093c117a..74b2b47cc22a 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3047,6 +3047,17 @@ static inline void decode_bus_error(int node_id, struct mce *m)
__log_ecc_error(mci, &err, ecc_type);
}
+/*
+ * On CPUs, the data fabric ID of an instance is equal to the UMC number.
+ * Since the UMC number and channel are equal in CPU nodes, the channel can be
+ * used as the data fabric ID of the instance.
+ */
+static int umc_inst_id(struct mem_ctl_info *mci, struct amd64_pvt *pvt,
+ struct err_info *err)
+{
+ return err->channel;
+}
+
/*
* To find the UMC channel represented by this bank we need to match on its
* instance_id. The instance_id of a bank is held in the lower 32 bits of its
@@ -3071,6 +3082,7 @@ static void decode_umc_error(int node_id, struct mce *m)
struct mem_ctl_info *mci;
struct amd64_pvt *pvt;
struct err_info err;
+ u8 df_inst_id;
u64 sys_addr;
node_id = fixup_node_id(node_id, m);
@@ -3101,8 +3113,9 @@ static void decode_umc_error(int node_id, struct mce *m)
}
pvt->ops->get_err_info(m, &err);
+ df_inst_id = pvt->ops->get_inst_id(mci, pvt, &err);
- if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, err.channel, &sys_addr)) {
+ if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, df_inst_id, &sys_addr)) {
err.err_code = ERR_NORM_ADDR;
goto log_error;
}

This patch is not useful until the address translation is updated, so let's drop it for now. These changes can be included as part of the address translation updates.

Thanks,
Yazen