Re: [PATCH v2 0/2] Update mce_record tracepoint

From: Naik, Avadhut
Date: Thu Jan 25 2024 - 15:27:36 EST


Hi,

On 1/25/2024 1:19 PM, Luck, Tony wrote:
>>> The first patch adds PPIN (Protected Processor Inventory Number) field to
>>> the tracepoint.
>>>
>>> The second patch adds the microcode field (Microcode Revision) to the
>>> tracepoint.
>>
>> This is a lot of static information to add to *every* MCE.
>
> 8 bytes for PPIN, 4 more for microcode.
>
> Number of recoverable machine checks per system: I hope the monthly rate is
> countable on my fingers. If a system is getting more than that, then people should
> be looking at fixing the underlying problem.
>
> Corrected errors are much more common, though Linux takes action to limit the
> rate when storms occur. So maybe hundreds or small numbers of thousands of
> error trace records? The increase in trace buffer consumption is still measured
> in Kbytes, not Mbytes. Server systems that do machine check reporting now start
> at tens of GBytes of memory.
>
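
To put rough numbers on the estimate above, here is a minimal sketch of the
arithmetic. The per-field sizes (8 bytes for PPIN, 4 for microcode) are the ones
quoted above; the record counts are illustrative assumptions standing in for
"hundreds or small numbers of thousands", and any alignment padding in the
trace record is ignored.

```python
# Extra bytes the two proposed fields add to each mce_record trace event.
PPIN_BYTES = 8        # from the thread: 8 bytes for PPIN
MICROCODE_BYTES = 4   # from the thread: 4 more for microcode


def extra_trace_bytes(records: int) -> int:
    """Added trace buffer consumption for `records` events (padding ignored)."""
    return records * (PPIN_BYTES + MICROCODE_BYTES)


# Record counts below are assumptions chosen for illustration only.
for records in (100, 1000, 5000):
    extra = extra_trace_bytes(records)
    print(f"{records:>5} records: {extra} bytes (~{extra / 1024:.1f} KiB)")
```

Even at the assumed upper end of 5000 records, the added consumption stays
under 60 KiB, which is consistent with the "Kbytes, not Mbytes" estimate.
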
>> And where does it end? Stick full dmesg in the tracepoint too?
>
> Seems like overkill.
>
>> What is the real-life use case here?
>
> Systems using rasdaemon to track errors will be able to track both of these
> (I assume that Naik has plans to update rasdaemon to capture and save these
> new fields).
>
Yes, I do intend to submit a pull request to rasdaemon to parse and log these
new fields.

> PPIN is useful when talking to the CPU vendor about patterns of similar errors
> seen across a cluster.
>
> MICROCODE - gives a fast path to root cause problems that have already
> been fixed in a microcode update.
>
> -Tony

--
Thanks,
Avadhut Naik