[PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

From: Naik, Avadhut
Date: Thu Feb 15 2024 - 15:15:22 EST


Hi,

On 2/12/2024 02:58, Borislav Petkov wrote:
> On Sun, Feb 11, 2024 at 08:54:29PM -0600, Naik, Avadhut wrote:
>> Okay. Will make changes to allocate memory and set size of the pool
>> when it is created. Also, will remove the command line parameter and
>> resubmit.
>
> Before you do, go read that original thread again but this time take
> your time to grok it.
>
> And then try answering those questions:
>
> * Why are *you* fixing this? I know what the AWS reason is, what is
> yours?
>
I think the issue of the genpool filling up with MCE records can also
occur on AMD systems, since the pool does not scale with the number of
CPUs or the amount of memory in the system. The probability of hitting
it only increases as the CPU count and memory increase. I feel that the
genpool size should be proportional to, at least, the CPU count of the
system.
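
To illustrate what "proportional to the CPU count" could look like, here
is a rough userspace sketch. The function name mce_pool_size(), the
two-page floor and the two-records-per-CPU budget are all hypothetical
values picked for illustration. They are not taken from the kernel or
from any posted patch.

```c
#include <stddef.h>

/* All constants below are assumptions made for this sketch only. */
#define PAGE_SIZE_BYTES   4096u  /* assumed page size */
#define MIN_POOL_PAGES    2u     /* assumed fixed floor for the pool */
#define RECORDS_PER_CPU   2u     /* assumed per-CPU record budget */

/* Round n up to the next multiple of PAGE_SIZE_BYTES. */
static size_t round_up_to_page(size_t n)
{
	return (n + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES * PAGE_SIZE_BYTES;
}

/*
 * Return a pool size in bytes that grows linearly with the CPU count,
 * is a whole number of pages, and never drops below the fixed floor.
 */
static size_t mce_pool_size(unsigned int num_cpus, size_t record_size)
{
	size_t min_size = (size_t)MIN_POOL_PAGES * PAGE_SIZE_BYTES;
	size_t scaled = round_up_to_page((size_t)num_cpus *
					 RECORDS_PER_CPU * record_size);

	return scaled > min_size ? scaled : min_size;
}
```

With these assumed values, small systems keep the two-page floor and
only larger CPU counts grow the pool.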

> * Can you think of a slick deduplication scheme instead of blindly
> raising the buffer size?
>
> * What's wrong with not logging some early errors, can we live with that
> too? If it were firmware-first, it cannot simply extend its buffer size
> because it has limited space. So what does firmware do in such cases?
>
I think we can live with not logging some early errors, as long as they
are correctable.
I am not sure what you mean by firmware-first, though. Do you mean
handling of MCEs through HEST and GHES? Or something else?

> Think long and hard about the big picture, analyze the problem properly
> and from all angles before you go and do patches.
>
> Thx.
>

--
Thanks,
Avadhut Naik