RE: [PATCH v2 4/4] x86/mce: Add AMD SMCA support to SRAO notifier

From: Ghannam, Yazen
Date: Wed Mar 22 2017 - 17:41:09 EST


> -----Original Message-----
> From: Borislav Petkov [mailto:bp@xxxxxxxxx]
> Sent: Wednesday, March 22, 2017 5:13 PM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; Tony Luck <tony.luck@xxxxxxxxx>;
> x86@xxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v2 4/4] x86/mce: Add AMD SMCA support to SRAO
> notifier
>
> On Mon, Mar 20, 2017 at 03:26:54PM -0500, Yazen Ghannam wrote:
> > From: Yazen Ghannam <yazen.ghannam@xxxxxxx>
> >
> > Deferred errors on AMD systems may get an Action Optional severity
> > with the goal of being handled by the SRAO notifier block. However,
> > the process of determining if an address is usable is different
> > between Intel and AMD. So define vendor-specific functions for this.
> >
> > Also, check for the AO severity before determining if an address is
> > usable to possibly save some cycles.
> >
> > Signed-off-by: Yazen Ghannam <yazen.ghannam@xxxxxxx>
> > ---
> > Link:
> > http://lkml.kernel.org/r/1486760120-60944-3-git-send-email-Yazen.Ghann
> > am@xxxxxxx
> >
> > v1->v2:
> > - New in v2. Based on v1 patch 3.
> > - Update SRAO notifier block to handle errors from SMCA systems.
> >
> > arch/x86/kernel/cpu/mcheck/mce.c | 52
> > ++++++++++++++++++++++++++++++----------
> > 1 file changed, 40 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c
> > b/arch/x86/kernel/cpu/mcheck/mce.c
> > index 5e365a2..1a2669d 100644
> > --- a/arch/x86/kernel/cpu/mcheck/mce.c
> > +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> > @@ -547,19 +547,49 @@ static void mce_report_event(struct pt_regs
> *regs)
> > * be somewhat complicated (e.g. segment offset would require an
> instruction
> > * parser). So only support physical addresses up to page granuality for
> now.
> > */
> > -static int mce_usable_address(struct mce *m)
> > +static int mce_usable_address_intel(struct mce *m, unsigned long
> > +*pfn)
>
> So this function is basically saying whether the address is usable but then it is
> also returning it if so.
>
> And then it is using an I/O argument. Yuck.
>
> So it sounds to me like this functionality needs redesign: something like have a
> get_usable_address() function (the "mce_" prefix is not really needed as it is
> static) which returns an invalid value when it determines that it doesn't have a
> usable address and the address itself if it succeeds.
>

Okay, I'll redo it.

> > {
> > - if (!(m->status & MCI_STATUS_MISCV) || !(m->status &
> MCI_STATUS_ADDRV))
> > + if (!(m->status & MCI_STATUS_MISCV))
> > return 0;
> > -
> > - /* Checks after this one are Intel-specific: */
> > - if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> > - return 1;
> > -
> > if (MCI_MISC_ADDR_LSB(m->misc) > PAGE_SHIFT)
> > return 0;
> > if (MCI_MISC_ADDR_MODE(m->misc) != MCI_MISC_ADDR_PHYS)
> > return 0;
> > +
> > + *pfn = m->addr >> PAGE_SHIFT;
> > + return 1;
> > +}
> > +
> > +/* Only support this on SMCA systems and errors logged from a UMC. */
> > +static int mce_usable_address_amd(struct mce *m, unsigned long *pfn)
> > +{
> > + u8 umc;
> > + u16 nid = cpu_to_node(m->extcpu);
> > + u64 addr;
> > +
> > + if (!mce_flags.smca)
> > + return 0;
>
> So on !SMCA systems there'll be no usable address ever! Even with
> MCI_STATUS_ADDRV set.
>
> Please *test* your stuff on all affected hardware before submitting.
>

I was thinking of this for use with the SRAO notifier. But since this can be
used in other places then the SMCA check should be grouped with the
code below.

> > +
> > + umc = find_umc_channel(m);
> > +
> > + if (umc < 0 || umc_normaddr_to_sysaddr(m->addr, nid, umc, &addr))
> > + return 0;
> > +
> > + *pfn = addr >> PAGE_SHIFT;
> > + return 1;
> > +}
> > +
> > +static int mce_usable_address(struct mce *m, unsigned long *pfn) {
> > + if (!(m->status & MCI_STATUS_ADDRV))
> > + return 0;
>
> What happened to the MCI_STATUS_MISCV bit check?
>

It was moved into the Intel function. It's not necessary on AMD.

> > + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> > + return mce_usable_address_intel(m, pfn);
> > +
> > + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
> > + return mce_usable_address_amd(m, pfn);
> > +
> > return 1;
>
> We definitely don't want to say that the address is usable on a third vendor. It
> would be most likely a lie even if we never reach this code on a third vendor.
>

Okay.

Thanks,
Yazen