Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

From: Tony Luck
Date: Tue Nov 14 2023 - 17:04:52 EST


On Tue, Nov 14, 2023 at 08:23:24PM +0100, Borislav Petkov wrote:
> On Mon, Oct 23, 2023 at 11:14:04AM -0700, Tony Luck wrote:
> > I want to track whether each bank is in storm mode, or not. But there's
> > no indication when a CMCI is delivered which bank was the source. Code
> > just has to scan all the banks, and might find more than one with an
> > error. While no bank is in polling mode, there isn't a set time interval
> > between scanning banks.
>
> Well, if no bank is in *storm* polling mode - I presume you mean that
> when you say "polling mode" - then we have the default polling interval
> of machine_check_poll() of INITIAL_CHECK_INTERVAL, i.e., 5 mins, I'd
> say.

Initial state is that no banks are in storm mode. Errors logged in any
bank are signaled with a CMCI to all logical CPUs that share that bank.
But the "mce_banks_owned" per-cpu bitmask means that only one of those
CPUs will get past the test_bit() check at the top of the loop in
machine_check_poll():

for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
if (!mce_banks[i].ctl || !test_bit(i, *b))
continue;

Before any storm happens, machine_check_poll() may only be called once
a month, or less, when errors occur.

When a storm is detected for a bank, that bank (and any others in storm
mode) will be checked once per second.

> > A scan is just triggered when a CMCI happens. So it's non-trivial to
> > compute a rate. Would require saving a timestamp for every logged
> > error.
>
> So what I'm trying to establish first is, what our entry vectors into
> the storm code are.
>
> 1. You can enter storm tracking when you poll normally. I.e.,
> machine_check_poll() each 5 mins once.

For a bank that doesn't support CMCI, then polling is the only way
to find errors. You are right, these will feed into the history
tracker, but while at 5-minute interval will not be able to trigger
a storm. Since that 5-minute interval is halved every time error is
found consecutively, it is possible at the 1-second poll interval to
fill out enough bits to indicate a storm. I think I need to add some
code to handle that case as it makes no sense to mess with the CMCI
threshold in IA32_MCi_CTL2 for a bank that doesn't support CMCI.
Probably will just skip tracking for any such banks.

> mce_track_storm() tracks history only for MCI_STATUS_VAL=1b CEs, which
> is what I was wondering. It is hidden a bit down in that function.
>
> 2. As a result of a CMCI interrupt. That will call machine_check_poll()
> too and go down the same path.
>
> So that flow should be called out in the commit message so that the big
> picture is clear - how we're doing that storm tracking.

Yes. Those are both inputs to the storm tracker. I'll update the commit
message to describe that.

> Now, what is protecting this against concurrent runs of
> machine_check_poll()? Imagine the timer fires and a CMCI happens at the
> same time and on the same core.

Aren't interrupts disabled while running the code after the timer fires?
Whichever of the timer and the CMCI happens first will run. Second to
arrive will pend the interrupt and be handled when interrupts are
enabled as the first completes.

... rest of message trimmed, no open discussion items there.

-Tony