Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

From: Borislav Petkov
Date: Thu Dec 14 2023 - 11:59:08 EST


On Mon, Nov 27, 2023 at 04:42:02PM -0800, Tony Luck wrote:
> On Mon, Nov 27, 2023 at 12:14:28PM -0800, Tony Luck wrote:
> > On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote:
> > > On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
> > > > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> > > But it isn't doing the same thing. The timer calls:
> > >
> > > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> > >
> > > and cmci_mc_poll_banks() calls:
> > >
> > > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
>
> machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));

Hmm, so I applied your v10 and this call with mce_banks_owned is done in
cmci_recheck() only. Which is on some init path.

The thresholding interrupt calls it too.

The timer ends up calling mc_poll_banks_default() which does

machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

I presume we don't do:

	if (!cmci_supported(&banks)) {
		mc_poll_banks = cmci_mc_poll_banks;
		return;
	}

usually on Intel. And even if we did, cmci_mc_poll_banks() calls

machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

too.

So regardless of what machine you have, you do call the mc_poll_banks
pointer, which in both cases does

machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

The *thresholding* interrupt does

machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));

and you're saying

mce_poll_banks and mce_banks_owned

are disjoint.

That's what you mean, right?

Because if so, yes, that makes sense. If the sets of MCA banks polled
and handled in the thresholding interrupt are disjoint, we should be ok.
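
To make the disjointness argument concrete, here is a minimal userspace
sketch. It is not the kernel code: struct cpu_banks, claim_bank(),
banks_disjoint() and visits() are hypothetical names, NR_BANKS is an
arbitrary size, and the two 32-bit masks only stand in for the per-CPU
mce_poll_banks and mce_banks_owned bitmaps. It assumes a bank is cleared
from the polled set at the moment it is added to the owned set, so no
bank can be visited by both the timer and the thresholding interrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_BANKS 32	/* hypothetical bank count for this sketch */

/* Hypothetical stand-ins for the per-CPU bank bitmaps. */
struct cpu_banks {
	uint32_t poll_banks;	/* banks scanned from the timer */
	uint32_t owned_banks;	/* banks scanned from the thresholding interrupt */
};

/* Move one bank into the owned set and drop it from the polled set. */
static void claim_bank(struct cpu_banks *b, unsigned int bank)
{
	assert(bank < NR_BANKS);
	b->owned_banks |= UINT32_C(1) << bank;
	b->poll_banks &= ~(UINT32_C(1) << bank);
}

/* True when no bank is a member of both sets. */
static bool banks_disjoint(const struct cpu_banks *b)
{
	return (b->poll_banks & b->owned_banks) == 0;
}

/* Number of paths (timer, interrupt) that would visit the given bank. */
static unsigned int visits(const struct cpu_banks *b, unsigned int bank)
{
	uint32_t bit = UINT32_C(1) << bank;

	return !!(b->poll_banks & bit) + !!(b->owned_banks & bit);
}
```

As long as every transfer goes through something like claim_bank(),
visits() never exceeds 1 for any bank, which is the property being
relied on above.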

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette