RE: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM

From: wufan
Date: Fri Aug 24 2018 - 10:30:19 EST


Hi James,

> Why get avoid the layer stuff? Isn't counting DIMM/memory-devices what
> EDAC_MC_LAYER_SLOT is for?

Borislav has explained it in his response. Here let me elaborate a little more. To use the layer information you need an accurate way to pinpoint each component in the layer and the parent components in the layers above. For example, to use EDAC_MC_LAYER_SLOT you also need information for the parent layer say EDAC_MC_LAYER_CHANNEL, or another layer on top say EDAC_MC_LAYER_BRANCH. There are no clear ways to get the information from SMBIOS table. In the case of "memory channel" we looked at type 37 which has the exact spelling but it was introduced to support RamBus and Synclink. Not sure we can readily use it for modern architecture concept of "channel/slot".

I think it is good enough if we can pin each error to the corresponding DIMM. At the end of the day DIMMs are what customer can replace in the memory system and that's all that they care about. For the manufacturers of the board/chips they have the knowledge to map the specific DIMMs to the upper layer components, so they can easily collect error counter data for upper layers.

> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE
> should provide the information necessary to identify the failing FRU". As
> EDAC has three 'levels', these are what they should correspond to for ghes-
> edac.
>
> I assume NODE means rack/chassis in some distributed system. Lets ignore it
> as it doesn't seem to map to anything in the SMBIOS table.

How about type 4 "Processor Information"?

> 'Card' doesn't mean much to me, but it maps to SMBIOS:17 "Memory Array
> Structure", which the Memory Device structure also points to.
> Card then must mean "a collection of memory devices (DIMMs) that operate
> together to form an address space".
>
> This might be what I think of as a memory-controller, or it might be
> something more complicated. Regardless, the CPER records think its relevant.

Originally I thought "Card" were memory channel. But looking at the definition of "Card Handle" in CPER: "... this field contains the SMBIOS handle for the Type 16 Memory Array Structure that represents the memory card". So Card is memory controller or something similar to that. Right now ghes-edac assumes one mc. We probably need to map mc(s) to the type 16 instances in SMBIOS table.

Thanks,
Fan