Re: [PATCH v7 2/3] x86/mce: Add per-bank CMCI storm mitigation

From: Yazen Ghannam
Date: Wed Sep 20 2023 - 11:56:34 EST


On 7/18/23 5:08 PM, Tony Luck wrote:
> This is the core functionality to track CMCI storms at the
> machine check bank granularity. Subsequent patches will add
> the vendor specific hooks to supply input to the storm
> detection and take actions on the start/end of a storm.
>
> Maintain a bitmap history for each bank showing whether the bank
> logged a corrected error or not each time it is polled.
>
> In normal operation the interval between polls of this bank
> determines how far to shift the history. The 64-bit width corresponds
> to about one second.
>
> When a storm is observed a CPU vendor specific action is taken to reduce
> or stop CMCI from the bank that is the source of the storm. The bank
> is added to the bitmap of banks for this CPU to poll. The polling rate
> is increased to once per second. During a storm each bit in the history
> indicates the status of the bank each time it is polled. Thus the history
> covers just over a minute.
>
> Declare a storm for that bank if the number of corrected interrupts
> seen in that history is above some threshold (defined as 5 in this
> series, could be tuned later if there is data to suggest a better
> value).
>
> A storm on a bank ends if enough consecutive polls of the bank show
> no corrected errors (defined as 30, may also change). That calls the
> CPU vendor specific function to revert to normal operational mode,
> and changes the polling rate back to the default.
>
> Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> ---
> arch/x86/kernel/cpu/mce/internal.h | 41 ++++++++++-
> arch/x86/kernel/cpu/mce/core.c | 108 ++++++++++++++++++++++++++---
> 2 files changed, 140 insertions(+), 9 deletions(-)
>

I was just thinking, could we put all this code in threshold.c? That is
the place for common thresholding support. And the CMCI storm handling
seems like it'd be part of that.
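
For reference, the tracking described in the quoted commit message can be
sketched roughly as below. This is a standalone user-space illustration, not
code from the patch: the struct, the function names, and the `shift` parameter
are hypothetical, and only the two thresholds (5 and 30) come from the
description. It reads "above some threshold" literally as a strict comparison
and clears the history when a storm ends; both are assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the commit message; the names are hypothetical. */
#define STORM_BEGIN_THRESHOLD     5   /* errors in history that start a storm */
#define STORM_END_POLL_THRESHOLD 30   /* consecutive clean polls that end it */

/* Hypothetical per-bank state: bit 0 of history is the most recent poll. */
struct bank_storm_state {
	uint64_t history;   /* 1 = a corrected error was logged on that poll */
	bool in_storm;
};

/* Count set bits (stand-in for the kernel's hweight64()). */
static int count_bits(uint64_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;   /* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Record one poll of a bank.
 * shift: how many history slots have elapsed since the previous poll
 *        (assumed to be 1 while a storm is in progress).
 */
static void bank_track(struct bank_storm_state *b, bool error_seen,
		       unsigned int shift)
{
	assert(b);

	/* Shifting a 64-bit value by 64 or more is undefined in C. */
	b->history = (shift >= 64) ? 0 : b->history << shift;
	if (error_seen)
		b->history |= 1;

	if (b->in_storm) {
		/* Storm ends once the last 30 polls were all clean. */
		uint64_t mask = (1ULL << STORM_END_POLL_THRESHOLD) - 1;

		if ((b->history & mask) == 0) {
			b->in_storm = false;
			b->history = 0;   /* sketch-only: avoid re-triggering */
		}
	} else if (count_bits(b->history) > STORM_BEGIN_THRESHOLD) {
		/* Vendor-specific storm-start action would go here. */
		b->in_storm = true;
	}
}
```

With a shift of 1 per poll, six polls that each log an error put the bank into
storm mode in this sketch, and thirty consecutive clean polls take it back out.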

Thanks,
Yazen