Re: [RFC PATCH 1/2] x86/mce: Handle AMD threshold interrupt storms

From: Luck, Tony
Date: Thu Feb 17 2022 - 12:28:15 EST


On Thu, Feb 17, 2022 at 08:16:08AM -0600, Smita Koralahalli wrote:
> Extend the logic of handling CMCI storms to AMD threshold interrupts.
>
> Similar to CMCI storm handling, keep track of the rate at which each
> processor sees interrupts. If it exceeds threshold, disable interrupts
> and switch to polling of machine check banks.

I've been sitting on some partially done patches to rework
storm handling for Intel ... which rip out all the existing
storm bits and replace them with something all new. I'll post
the 2-part series as replies to this.

Two-part motivation:

1) Disabling CMCI globally is an overly big hammer (as you note
in your patches, which switch to a gentler per-CPU disable).

2) Intel signals some UNCORRECTED errors using CMCI (yes, turns
out that was a poorly chosen name given the later evolution of
the architecture). Since we don't want to miss those, the proposed
storm code just bumps the threshold to (almost) maximum to mitigate,
but not eliminate, the storm. Note that the threshold only applies
to corrected errors.
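
To make the two points concrete, here is a minimal userspace sketch of
per-CPU storm tracking that raises the corrected-error threshold
instead of disabling the interrupt. It is not code from either patch
series. Every name and constant in it (cpu_storm_state, storm_init,
storm_note_interrupt, storm_poll, the window length, the trigger count
and both threshold values) is hypothetical and chosen only for
illustration, and the "threshold" field just stands in for whatever
the hardware register would hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning values, not taken from any posted patch. */
#define STORM_WINDOW_TICKS 100u   /* length of one rate-measurement window */
#define STORM_ENTER_COUNT  15u    /* more interrupts than this per window is a storm */
#define THRESHOLD_NORMAL   1u     /* interrupt on every corrected error */
#define THRESHOLD_STORM    32000u /* placeholder for "(almost) maximum" */

/* One instance per CPU, so a storm on one CPU does not affect the others. */
struct cpu_storm_state {
	uint64_t window_start;  /* tick at which the current window began */
	uint64_t last_irq;      /* tick of the most recent interrupt */
	unsigned int count;     /* interrupts seen in the current window */
	unsigned int threshold; /* corrected-error threshold currently in effect */
	bool in_storm;
};

static void storm_init(struct cpu_storm_state *s, uint64_t now)
{
	assert(s != NULL);
	s->window_start = now;
	s->last_irq = now;
	s->count = 0;
	s->threshold = THRESHOLD_NORMAL;
	s->in_storm = false;
}

/*
 * Call from the interrupt path. Returns true while this CPU is in a storm.
 * The interrupt stays enabled so uncorrected errors are still signaled;
 * only the corrected-error threshold is raised.
 */
static bool storm_note_interrupt(struct cpu_storm_state *s, uint64_t now)
{
	assert(s != NULL);
	if (now - s->window_start >= STORM_WINDOW_TICKS) {
		/* Previous window expired: start counting afresh. */
		s->window_start = now;
		s->count = 0;
	}
	s->count++;
	s->last_irq = now;
	if (!s->in_storm && s->count > STORM_ENTER_COUNT) {
		s->in_storm = true;
		s->threshold = THRESHOLD_STORM;
	}
	return s->in_storm;
}

/* Call periodically. Ends the storm after one full quiet window. */
static void storm_poll(struct cpu_storm_state *s, uint64_t now)
{
	assert(s != NULL);
	if (s->in_storm && now - s->last_irq >= STORM_WINDOW_TICKS) {
		s->in_storm = false;
		s->threshold = THRESHOLD_NORMAL;
		s->window_start = now;
		s->count = 0;
	}
}
```

Because the state is per CPU, one noisy CPU only changes its own
threshold. Raising the threshold rather than masking the interrupt is
the trade-off described in (2): the storm is damped, not eliminated.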

-Tony