Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

From: Yazen Ghannam
Date: Mon Oct 16 2023 - 10:14:37 EST


On 10/12/23 11:49 AM, Dave Hansen wrote:
> On 10/12/23 04:46, Sironi, Filippo wrote:
>> There's correlation across the errors that we're seeing, indeed,
>> we're looking at the same row being responsible for multiple CPUs
>> tripping and running into #MC. I still don't like the full lack of
>> visibility; it's not uncommon in a large fleet to take a server out
>> of production, replace a DIMM, and shortly after take it out of
>> production again to replace another DIMM just because some of the
>> errors weren't properly logged.
>
> So you had two nearly simultaneous DIMM failures. The first failed,
> filled up the buffer and then the second failed, but there was no room.
> The second failed *SO* soon after the first that there was no
> opportunity to empty the buffer between.
>
> Right?
>
> How do you know that storing 8 pages of records will catch this case as
> opposed to storing 2?
>
>>> Is there any way that the size of the pool can be more automatically
>>> determined? Is the likelihood of a bunch of errors proportional to the
>>> number of CPUs or amount of RAM or some other aspect of the hardware?
>>>
>>> Could the pool be emptied more aggressively so that it does not fill up?
>
> You didn't really address the additional questions I posed there.
>
> I'll add one more: how many of the messages are duplicates or
> *effectively* duplicates? Or is it hard to determine, at the time the
> entries are being made, whether they are duplicates?
>
> It _should_ also be fairly easy to enlarge the buffer on demand, say, if
> it got half full. What's the time scale over which the buffer filled
> up? Did a single #MC fill it up?
>
> I really think we need to understand what the problem is and have _some_
> confidence that the proposed solution will fix that, even if we're just
> talking about a new config option.

I've seen a similar issue, and it's not just related to memory errors.
In my experience, the problem was MCA errors from a variety of hardware
blocks. For example, a bad link internal to an SoC could spew MCA
errors regardless of the amount of RAM or the number of CPUs. The same
is possible for a bad cache, etc.

These cases were seen during pre-production testing, and the easy
workaround was to increase the MCE genpool size at build time.
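
For context, today the pool is backed by a static, build-time buffer,
roughly along these lines (paraphrased from
arch/x86/kernel/cpu/mce/genpool.c):

  #define MCE_POOLSZ	(2 * PAGE_SIZE)

  static char gen_pool_buf[MCE_POOLSZ];
  ...
  gen_pool_add(mce_evt_pool, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);

So bumping MCE_POOLSZ works, but it costs every system the extra pages
whether they're needed or not.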

I don't think this needs to be the default though.

How about this to start?

1) Keep the current config size for boot time.
2) Add a kernel parameter and/or sysfs file to allow users to request
additional genpool capacity.
3) Use gen_pool_add(), or whichever, to add the capacity based on user
input (a rough sketch of this is below).
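
For step 3, something like the sketch below is what I have in mind. To
be clear, mce_gen_pool_extend() and its "pages" argument are names I'm
making up for illustration; the existing APIs it would build on are
gen_pool_add() and alloc_pages_exact():

  #include <linux/genalloc.h>
  #include <linux/mm.h>

  /*
   * Sketch only: grow the MCE event pool from process context (e.g. a
   * sysfs write handler), never from #MC context.
   */
  static int mce_gen_pool_extend(struct gen_pool *pool, unsigned int pages)
  {
          size_t len = pages * PAGE_SIZE;
          void *buf;

          buf = alloc_pages_exact(len, GFP_KERNEL);
          if (!buf)
                  return -ENOMEM;

          /* Hand the new chunk to the existing pool; -1 = any node. */
          return gen_pool_add(pool, (unsigned long)buf, len, -1);
  }

The kernel parameter / sysfs plumbing would just parse the requested
size and call a helper like that once at a safe point.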

Maybe this can be expanded later to be automatic, but I think it's
simpler to start with explicit user input.

Thanks,
Yazen