RE: [PATCH v2 0/2] Update mce_record tracepoint

From: Luck, Tony
Date: Fri Jan 26 2024 - 12:10:35 EST


> > 8 bytes for PPIN, 4 more for microcode.
>
> I know, nothing leads to bloat like 0.01% here, 0.001% there...

12 extra bytes divided by (say) 64GB (a very small server these days, may laptop has that much)
= 0.00000001746%

We will need 57000 changes like this one before we get to 0.001% :-)

> > Number of recoverable machine checks per system .... I hope the
> > monthly rate should be countable on my fingers...
>
> That's not the point. Rather, when you look at MCE reports, you pretty
> much almost always go and collect additional information from the target
> machine because you want to figure out what exactly is going on.
>
> So what's stopping you from collecting all that static information
> instead of parrotting it through the tracepoint with each error?

PPIN is static. So with good tracking to keep source platform information
attached to the error record as it gets passed around people trying to triage
the problem, you could say it can be retrieved later (or earlier when setting
up a database of attributes of each machine in the fleet.

But the key there is keeping the details of the source machine attached to
the error record. My first contact with machine check debugging is always
just the raw error record (from mcelog, rasdaemon, or console log).

> > PPIN is useful when talking to the CPU vendor about patterns of
> > similar errors seen across a cluster.
>
> I guess that is perhaps the only thing of the two that makes some sense
> at least - the identifier uniquely describes which CPU the error comes
> from...
>
> > MICROCODE - gives a fast path to root cause problems that have already
> > been fixed in a microcode update.
>
> But that, nah. See above.

Knowing which microcode version was loaded on a core *at the time of the error*
is critical. You've spent enough time with Ashok and Thomas tweaking the Linux
microcode driver to know that going back to the machine the next day to ask about
microcode version has a bunch of ways to get a wrong answer.

-Tony