RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

From: Elliott, Robert (Servers)
Date: Fri Aug 19 2022 - 14:58:17 EST




> -----Original Message-----
> From: Kani, Toshi <toshi.kani@xxxxxxx>
> Sent: Friday, August 19, 2022 12:49 PM
> To: Justin He <Justin.He@xxxxxxx>
> Cc: linux-acpi@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-
> edac@xxxxxxxxxxxxxxx; devel@xxxxxxxxxx; Rafael J . Wysocki
> <rafael@xxxxxxxxxx>; Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>; Jarkko
> Sakkinen <jarkko@xxxxxxxxxx>; linux-efi@xxxxxxxxxxxxxxx; nd
> <nd@xxxxxxx>; Ard Biesheuvel <ardb@xxxxxxxxxx>; Len Brown
> <lenb@xxxxxxxxxx>; James Morse <James.Morse@xxxxxxx>; Tony Luck
> <tony.luck@xxxxxxxxx>; Borislav Petkov <bp@xxxxxxxxx>; Mauro Carvalho
> Chehab <mchehab@xxxxxxxxxx>; Robert Richter <rric@xxxxxxxxxx>; Robert
> Moore <robert.moore@xxxxxxxxx>; Qiuxu Zhuo <qiuxu.zhuo@xxxxxxxxx>;
> Yazen Ghannam <yazen.ghannam@xxxxxxx>
> Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac
> from loading after ghes_edac is registered
>
> On Friday, August 19, 2022 4:35 AM, Justin He wrote:
> > > > @@ -1382,6 +1395,7 @@ static int ghes_probe(struct
> platform_device
> > > > *ghes_dev)
> > > > platform_set_drvdata(ghes_dev, ghes);
> > > >
> > > > ghes->dev = &ghes_dev->dev;
> > > > + set_ghes_devs_registered(false);
> > >
> > > This does not look right to me.
> > >
> > > The condition of using ghes_edac is (ghes-present) && (ghes-
> preferred),
> > > where:
> > > - ghes-present is latched on ghes_probe()
> > > - ghes-preferred is true if the platform-check passes.
> > >
> > > ghes_get_device() introduced in the previous patch works as the
> > > ghes-preferred check.
> > >
> > > We cannot use ghes_edac registration as the ghes-present check in
> this
> > > scheme since it is deferred to module_init().
> >
> > What is the logic for ghes-present check? In this patch, I assumed it
> is equal to
> > "ghes_edac devices have been registered". Seems it is not correct.
>
> Using (ghes_edac-registered) is a wrong check in this scheme
> since ghes_edac registration is deferred. This check is fine in
> the current scheme since ghes_edac gets registered before
> any other chipset-specific edac drivers.
>
> > But should we consider the case as follows:
> > What if sbridge_init () is invoked before ghes_edac_init()? i.e.
> Should we get
> > sb_edac driver selected when ghes_edac is not loaded yet (e.g. on
> HPE)?
>
> No. The point is that ghes_edac driver needs to be selected,
> "regardless of the module ordering", on a system with GHES
> present & preferred.
>
> Note that this new scheme leads to the following condition
> change:
> - Current: (ghes-present) && (ghes-preferred) && (ghes_edac registered)
> - New: (ghes-present) && (ghes-preferred)
>
> The option I suggested previously keeps the current condition,
> but this new scheme does not for better modularity.
>
> What this means is that if ghes_edac is not enabled (but ghes
> is enabled) on a system with GHES present & preferred, no edac
> driver gets registered. This change is fine from my (HPE) perspective
> and should be fine for other GHES systems. GHES systems have
> chipset-specific edac driver in FW. OS-based chipset-specific edac
> driver is not necessary and may lead to a conflict of chipset register
> ownership.

Currently, running with this on the kernel command line
ghes.disable

causes the ACPI ghes driver to quit early in acpi_ghes_init():

/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
* compatible with existing boot arg use cases.
*/
bool ghes_disable;
module_param_named(disable, ghes_disable, bool, 0);

which results in the skx_edac module assuming it should run:
[ 33.628140] calling skx_init+0x0/0xe5a [skx_edac] @ 1444
[ 33.628531] EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:36:0a.0 (INTERRUPT)
[ 33.641432] EDAC MC1: Giving out device to module skx_edac controller Skylake Socket#0 IMC#1: DEV 0000:36:0c.0 (INTERRUPT)
[ 33.653256] EDAC MC2: Giving out device to module skx_edac controller Skylake Socket#1 IMC#0: DEV 0000:ae:0a.0 (INTERRUPT)
[ 33.665055] EDAC MC3: Giving out device to module skx_edac controller Skylake Socket#1 IMC#1: DEV 0000:ae:0c.0 (INTERRUPT)
[ 33.676801] initcall skx_init+0x0/0xe5a [skx_edac] returned 0 after 36343 usecs

We might need to differentiate between the system ROM really not
offering GHES vs. the ghes module not running.