[PATCH 1/2] x86/MCE: Extend size of the MCE Records pool

From: Naik, Avadhut
Date: Fri Feb 09 2024 - 14:52:32 EST


Hi,

On 2/8/2024 15:09, Sohil Mehta wrote:
> On 2/7/2024 2:56 PM, Avadhut Naik wrote:
>
>> Extend the size of the MCE records pool to better serve modern systems.
>> The increase in size depends on the CPU count of the system. Currently,
>> since the size of struct mce is 124 bytes, each logical CPU of the system
>> will have space for at least 2 MCE records in the pool. To get around
>> allocation woes during early boot, the extension is undertaken from a
>> late_initcall().
>>
>>
>
> I guess making this proportional to the number of CPUs is probably fine
> assuming CPUs and memory capacity *would* generally increase in sync.
>
> But is there some logic to having 2 MCE records per logical CPU, or is
> it just a heuristic approach? In practice, the pool is shared amongst
> all MCE sources and can be filled by anyone, right?
>
Yes, the pool is shared among all MCE sources, but the logic behind 256 is
that the genpool was set to 2 pages, i.e. 8192 bytes, back in 2015.
Around that time, AFAIK, the maximum number of logical CPUs on a system was
32. So, in the worst case, each CPU had around 256 bytes (8192/32) of space
in the pool. With today's 124-byte struct mce, that equates to approximately
2 MCE records per CPU (closer to 3 back when sizeof(struct mce) was 88
bytes).
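
To spell out the arithmetic (an illustrative comment, not part of the
patch):

	/*
	 * Sizing rationale behind CPU_GEN_MEMSZ == 256:
	 *   2 pages                        = 8192 bytes
	 *   8192 bytes / 32 CPUs           = 256 bytes of pool space per CPU
	 *   256 / 124 (struct mce today)  ~= 2 records per CPU
	 *   256 / 88  (struct mce in 2015) ~= 3 records per CPU
	 */
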
>> Signed-off-by: Avadhut Naik <avadhut.naik@xxxxxxx>
>> ---
>> arch/x86/kernel/cpu/mce/core.c | 3 +++
>> arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++++++++
>> arch/x86/kernel/cpu/mce/internal.h | 1 +
>> 3 files changed, 26 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index b5cc557cfc37..5d6d7994d549 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void)
>> if (mca_cfg.recovery)
>> enable_copy_mc_fragile();
>>
>> + if (mce_gen_pool_extend())
>> + pr_info("Couldn't extend MCE records pool!\n");
>> +
>
> Why do this unconditionally? For the vast majority of low core-count,
> low-memory systems, the default 2 pages would be good enough.
>
> Should there be a threshold beyond which the extension becomes active?
> Let's say, for example, a check for num_present_cpus() > 32 (roughly
> based on the 8KB pool and the 124B*2 estimate per logical CPU).
>
> Whatever you choose, a comment above the code would be helpful
> describing when the extension is expected to be useful.
>
I put it in unconditionally because, IMO, the memory increase even for
low-core systems didn't seem substantial: just one additional page for
systems with 16 or fewer CPUs.

But I do get your point. Will add a check in mcheck_late_init() for the
number of present CPUs. Something like below:

@@ -2901,7 +2901,7 @@ static int __init mcheck_late_init(void)
if (mca_cfg.recovery)
enable_copy_mc_fragile();

- if (mce_gen_pool_extend())
+ if ((num_present_cpus() > 32) && mce_gen_pool_extend())
pr_info("Couldn't extend MCE records pool!\n");

Does this look good? The genpool will then be extended only on systems with
more than 32 CPUs. I will explain this in a comment above the check; a
possible shape is sketched below.
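
A minimal sketch of the check together with such a comment (illustrative;
the final wording is still open):

	/*
	 * The default 2-page pool gives each logical CPU space for about
	 * 2 records (at 124 bytes each) on systems with up to 32 CPUs.
	 * Only extend the pool beyond that threshold.
	 */
	if ((num_present_cpus() > 32) && mce_gen_pool_extend())
		pr_info("Couldn't extend MCE records pool!\n");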

>> mcheck_debugfs_init();
>>
>> /*
>> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
>> index fbe8b61c3413..aed01612d342 100644
>> --- a/arch/x86/kernel/cpu/mce/genpool.c
>> +++ b/arch/x86/kernel/cpu/mce/genpool.c
>> @@ -20,6 +20,7 @@
>> * 2 pages to save MCE events for now (~80 MCE records at most).
>> */
>> #define MCE_POOLSZ (2 * PAGE_SIZE)
>> +#define CPU_GEN_MEMSZ 256
>>
>
> The comment above MCE_POOLSZ probably needs a complete re-write. Right
> now, it reads as follows:
>
> * This memory pool is only to be used to save MCE records in MCE context.
> * MCE events are rare, so a fixed size memory pool should be enough. Use
> * 2 pages to save MCE events for now (~80 MCE records at most).
>
> Apart from the numbers being incorrect since sizeof(struct mce) has
> increased, this patch is based on the assumption that the current MCE
> memory pool is no longer enough in certain cases.
>
Yes, will change the comment to something like below:

 * This memory pool is only to be used to save MCE records in MCE context.
 * Though MCE events are rare, their frequency typically scales with the
 * system's memory and CPU count.
 * Allocate 2 pages for the MCE records pool during early boot, with the
 * option to extend the pool, as needed, through the command line on
 * systems with more than 32 CPUs.
 * By default, each logical CPU can have around 2 MCE records in the pool
 * at the same time.

Sounds good?

>> static struct gen_pool *mce_evt_pool;
>> static LLIST_HEAD(mce_event_llist);
>> @@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce)
>> return 0;
>> }
>>
>> +int mce_gen_pool_extend(void)
>> +{
>> + unsigned long addr, len;
>
> s/len/size/
>
Noted.
>> + int ret = -ENOMEM;
>> + u32 num_threads;
>> +
>> + num_threads = num_present_cpus();
>> + len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
>
> Nit: Can the use of the num_threads variable be avoided?
> How about:
>
> size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ);
>
Will do.
>
>
>> + addr = (unsigned long)kzalloc(len, GFP_KERNEL);
>
> Also, shouldn't the new allocation be incremental to the 2 pages already
> present?
>
> Let's say, for example, that you have a 40-CPU system and the calculated
> size in this case comes out to 40 * 2 * 124B = 9920 bytes, i.e. 3 pages.
> You only need to allocate 1 additional page to add to mce_evt_pool
> instead of the 3 pages that the current code does.
>
Will make the allocation incremental when the genpool is extended through
the default path (rather than through the command line). Something like
below:

@@ -129,6 +134,7 @@ int mce_gen_pool_extend(void)
} else {
num_threads = num_present_cpus();
len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
+ len -= MCE_POOLSZ;

Does this sound good?
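
Folding in all three suggestions (s/len/size/, dropping num_threads, and
the incremental allocation), the default path of the function could end up
looking roughly like this; a sketch, not the final patch:

	int mce_gen_pool_extend(void)
	{
		unsigned long addr, size;
		int ret = -ENOMEM;

		/* ~2 records (CPU_GEN_MEMSZ bytes) per present CPU. */
		size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ);

		/*
		 * The initial 2 pages already back the pool; only add the
		 * difference. The caller gates this on more than 32 CPUs,
		 * so size stays positive.
		 */
		size -= MCE_POOLSZ;

		addr = (unsigned long)kzalloc(size, GFP_KERNEL);
		if (!addr)
			return ret;

		ret = gen_pool_add(mce_evt_pool, addr, size, -1);
		if (ret)
			kfree((void *)addr);

		return ret;
	}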

--
Thanks,
Avadhut Naik

> Sohil
>
>> +
>> + if (!addr)
>> + goto out;
>> +
>> + ret = gen_pool_add(mce_evt_pool, addr, len, -1);
>> + if (ret)
>> + kfree((void *)addr);
>> +
>> +out:
>> + return ret;
>> +}
>> +
>> static int mce_gen_pool_create(void)
>> {
>> struct gen_pool *tmpp;
>
>