Re: [PATCH] x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks

From: Yazen Ghannam
Date: Tue Nov 01 2022 - 22:36:29 EST


On Wed, Oct 26, 2022 at 10:12:15PM +0200, Borislav Petkov wrote:
> On Wed, Oct 26, 2022 at 07:44:17PM +0000, Yazen Ghannam wrote:
> > 1) Apply the patch I submitted as a simple fix/workaround for the presented
> > symptom. I tried to keep it small and well described to be a stable backport.
> > Obviously I wrote it without knowing the shared kobject behavior isn't ideal.
>
> We'll see.
>
> > 2) Address the shared kobject thing.
> > Here are some options:
> > a. Only set up the thresholding kobject on a single CPU per "AMD Node".
> > Technically MCA Bank 4 is "shared" on legacy systems. But AFAICT from
> > looking at old BKDG docs, in practice only the "Node Base Core" can access
> > the registers. This behavior is controlled by a bit in NB which BIOS is
> > supposed to set. Maybe some BIOSes don't do this, but I think that's a
> > "broken BIOS on legacy system" issue if so.
>
> I guess we can do that. And I even think we have some code which finds
> out which the NBC is...
>
> /me greps a bit:
>
> ah, there it is: get_nbc_for_node() in arch/x86/kernel/cpu/mce/inject.c.
>
>
> > b. Disable the MCA Thresholding interface for Families before 0x17.
>
> Can't. It is user-visible and you don't know for sure whether someone is
> using it or not.
>
> Believe me, I have been wanting to disable this thing forever. I've
> never heard of anyone using it and all the energy we put in it was for
> nothing. :-\
>
> We could try to deprecate it, though, make it default=n in Kconfig and
> see who complains. And after a couple of releases, kill it.
>
> > This is an undocumented interface,
>
> Of course it is documented - it is in the old BKDGs.
>
> > and I don't know if anyone is using it on older systems.
>
> Yap.
>
> > The issue we're discussing here started because of a splat during
> > suspend/resume/CPU hotplug. In disable_err_thresholding(), we disable
> > MCA Thresholding for bank 4 on Family 15h, so there's some precedent.
> > c. Do nothing at the moment. I *really* want to clean up the MCA
> > Thresholding interface, and the shared kobject thing may get resolved
> > in that.
>
> Clean it up how exactly?
>
> Put it behind a Kconfig item, disable it and remove it after a while?
>
> :-)
>
> If so, I wouldn't mind. No one's using this. At least I haven't heard of
> a single bug report or of a use case. Only when CPU hotplug explodes and
> that thing is involved, only then.
>
> Might as well remove it. And then remove it in the hardware too. RAS
> folks would love to get rid of some of that crap which takes up verif
> resources for no good reason.
>
> :-)
>

Cool beans. I think this'll be a long process, so let me start by removing the
shared bank stuff. Thanks!

-Yazen
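
For readers following the thread, the refcount issue named in the subject line can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code or the submitted patch: struct shared_bank, bank_get() and bank_put() are hypothetical names, and the real code uses a kobject and kernel refcount helpers rather than a plain int. The point it shows is that every CPU that takes a reference on the shared bank must drop it on removal, and only the last drop frees the bank.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bank shared by all CPUs on one node. */
struct shared_bank {
	int refcount;	/* number of CPUs currently holding the bank */
	int nr_blocks;	/* threshold blocks attached to the bank */
};

/*
 * The first CPU on the node allocates the bank; every later CPU only
 * takes another reference on the existing one.
 */
static struct shared_bank *bank_get(struct shared_bank **slot)
{
	if (!*slot) {
		*slot = calloc(1, sizeof(**slot));
		if (!*slot)
			return NULL;
		(*slot)->nr_blocks = 1;
	}
	(*slot)->refcount++;
	return *slot;
}

/*
 * Drop one reference. Returns true if this was the last reference and
 * the bank was freed. Skipping this call on one removal path leaves
 * the count too high, so the bank is never freed.
 */
static bool bank_put(struct shared_bank **slot)
{
	struct shared_bank *b = *slot;

	assert(b && b->refcount > 0);
	if (--b->refcount > 0)
		return false;

	free(b);
	*slot = NULL;
	return true;
}
```

With two CPUs on a node, two bank_get() calls leave the count at 2; the first bank_put() keeps the bank alive and the second one frees it.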