RE: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info Structure

From: Ghannam, Yazen
Date: Tue Feb 27 2018 - 16:32:26 EST


> -----Original Message-----
> From: Borislav Petkov [mailto:bp@xxxxxxx]
> Sent: Tuesday, February 27, 2018 2:10 PM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: linux-efi@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> ard.biesheuvel@xxxxxxxxxx; x86@xxxxxxxxxx; Tony Luck
> <tony.luck@xxxxxxxxx>
> Subject: Re: [PATCH v2 3/8] efi: Decode IA32/X64 Processor Error Info
> Structure
>
...
> > 1) No one except debug and HW design folks, who will eventually get a
> > report from a user.
>
> Hahahha, yeah right.
>
> The only people who get those reports are the maintainers of the code in
> the kernel and the distro people who get all the bugs assigned to them.
>
> And if they can't decode the error - it is Tony and me.
>
> HW folks hear about it from us. And we go and decode the damn crap
> *every* time. Do you catch my drift now?
>

That's not true. You may get reports, but platform and silicon vendors get them
directly as well.

> > [ 1.990948] [Hardware Error]: Error 1, type: corrected
> > [ 1.995789] [Hardware Error]: fru_text: ProcessorError
> > [ 2.000632] [Hardware Error]: section_type: IA32/X64 processor error
> > [ 2.005796] [Hardware Error]: Validation Bits: 0x0000000000000207
> > [ 2.010953] [Hardware Error]: Local APIC_ID: 0x0
> > [ 2.015991] [Hardware Error]: CPUID Info:
> > [ 2.020747] [Hardware Error]: 00000000: 00800f12 00000000 00400800 00000000
> > [ 2.025595] [Hardware Error]: 00000010: 76d8320b 00000000 178bfbff 00000000
> > [ 2.030423] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
> > [ 2.035198] [Hardware Error]: Error Information Structure 0:
> > [ 2.039961] [Hardware Error]: Error Structure Type: a55701f5-e3ef-43de-ac72-249b573fad2c
> > [ 2.049608] [Hardware Error]: Error Structure Type: cache error
> > [ 2.054344] [Hardware Error]: Validation Bits: 0x0000000000000001
> > [ 2.059046] [Hardware Error]: Check Information: 0x0000000020540087
> > [ 2.063625] [Hardware Error]: Validation Bits: 0x0087
> > [ 2.068032] [Hardware Error]: Transaction Type: 0, Instruction
> > [ 2.072423] [Hardware Error]: Operation: 5, instruction fetch
> > [ 2.076776] [Hardware Error]: Level: 1
> > [ 2.081073] [Hardware Error]: Overflow: true
> > [ 2.085360] [Hardware Error]: Context Information Structure 0:
> > [ 2.089661] [Hardware Error]: Register Context Type: MSR Registers (Machine Check and other MSRs)
> > [ 2.098487] [Hardware Error]: Register Array Size: 0x0050
> > [ 2.103113] [Hardware Error]: MSR Address: 0xc0002011
> > [ 2.107742] [Hardware Error]: Register Array:
> > [ 2.112270] [Hardware Error]: 00000000: d8200000000a0151 0000000000000000
> > [ 2.116845] [Hardware Error]: 00000010: d010000000000000 0000000300000031
> > [ 2.121228] [Hardware Error]: 00000020: 000100b000000000 000000004a000000
> > [ 2.125514] [Hardware Error]: 00000030: 0000000000000000 0000000000000000
> > [ 2.129747] [Hardware Error]: 00000040: 0000000000000000 0000000000000000
>
> Lemme simplify that error record:
>
> [Hardware Error]: Corrected Processor Error
> [Hardware Error]: APIC_ID: 0x0 | CPUID: 0x17|0x1|0x2
> [Hardware Error]: Type: cache error during instruction fetch
> [Hardware Error]: cache level 1
> [Hardware Error]: Overflow: true
>
> See how much more readable it got! And it is only 5 lines. I can make it
> even smaller.
>

Not much more readable. It's still vague and confusing to a user, and it's devoid
of any real info, so an expert can't help either. And now the information is
printed arbitrarily, so someone has to read the kernel source to figure out what
it really means.

> If it were a critical, uncorrectable error, every line counts: imagine
> you do the above fat record and the machine freezes at line 5.
>

Maybe. But these records are generated by Platform Firmware. Why would the
firmware report an error when it knows the system is about to die? Most likely
we'll only see CPERs via HEST if the OS can do something about them, or we'll
see them in BERT after recovering from a hang/reset.

> Now, I admit that my version of the record is not enough to debug it
> but it needs to contain only information which is clear and humanly
> readable to debug. You can always dump the raw data underneath from the
> tracepoint but make the beginning human readable.
>
> Do you know what users say about your error record?
>
> "Err, it says hardware error, is my machine broken? I need to replace my
> CPU."
>

Your example still says "Hardware Error", and odds are that general users won't
understand what the error type means or what a corrected error is. So it's not
much better.

> I read that on a weekly basis.
>
> Do you know how expensive support calls are about such errors which are
> completely unreadable to people? 20 engineers need to get on a call to
> realize it was a dumb correctable error? Btw, this is one of the reasons
> why we did the error collector.
>

Exactly! The more info that's available, the more quickly the error can
(usually) be diagnosed.

Hardware errors are generally rare and hard to reproduce. So when one
does occur we should capture the data and provide it.

Here are a couple of scenarios based on similar experiences I've had:

Scenario 1 (with minimal info)
User: "I have a hardware error. What does it mean?"

Debugger: "Do you have any more info?"

User: "Corrected Processor Error; cache error during instruction fetch"

Debugger: "'Corrected' sounds okay, but not sure about 'cache error during
fetch'. Linux dev, what does that mean?"

Linux dev: "It means this field and that field equal this."

Debugger: "What about the rest of the data?"

Linux dev: "It's in a trace buffer. Please use this tool or set this sysfs file to
this and that to collect the data."

Debugger: "There aren't any errors in the trace buffer. User, when did
this occur?"

User: "This error occurred last week. We've reset the machine a few times
since then. The error seems to happen every few days under a very
specific workload."

Debugger: "Okay, let's try to reproduce this."
*wait a few days, run a lot of systems to up the odds*

Debugger: "We finally have data and can now help you"

Scenario 2 (with all platform provided data)
User: "I have a hardware error. What does it mean?"

Debugger: "Do you have any more info?"

User: *provides complete error record*

Debugger: "Thank you, we have everything we need to help you."


I'll send a V3 set with the following changes:
1) Fix the table numbering in the commit messages.
2) Remove the "Validation Bits" lines.
3) Only print the error type GUID for unknown types (rough sketch below).
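
Just to illustrate what I mean by 3) -- this is only a rough sketch, not the
actual patch; the helper name is made up and only the cache-check GUID (taken
from the record above) is filled in:

#include <linux/printk.h>
#include <linux/uuid.h>

/* Cache check structure type, as seen in the example record above. */
static const guid_t cper_ia_cache_chk_guid =
        GUID_INIT(0xa55701f5, 0xe3ef, 0x43de,
                  0xac, 0x72, 0x24, 0x9b, 0x57, 0x3f, 0xad, 0x2c);

/* Hypothetical helper: name the type if we know it, else dump the GUID. */
static void cper_print_ia_err_type(const char *pfx, const guid_t *type)
{
        if (guid_equal(type, &cper_ia_cache_chk_guid)) {
                printk("%sError Structure Type: cache error\n", pfx);
                return;
        }
        /* TLB, bus and MS check GUIDs would be matched the same way. */

        /* Unknown type: the raw GUID is the only useful thing to print. */
        printk("%sError Structure Type: %pUl\n", pfx, type);
}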

I think this set should focus on printing the x86 CPER based on the UEFI spec
and the conventions of the other CPER code. CPERs are generated by Platform
Firmware, so these errors are explicitly intended to be viewed by the user and
all of the info should be displayed. Errors not intended to be seen by the user
may be thresholded or masked by the Platform.
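
For reference, here's roughly the section layout being printed, as I read the
spec -- the field names below are illustrative, not necessarily what the patch
uses:

#include <linux/types.h>
#include <linux/uuid.h>

/* IA32/X64 processor error section header (per my reading of the spec). */
struct ia_proc_sec_hdr {
        u64     validation_bits;
        u64     lapic_id;               /* the "Local APIC_ID" line above */
        u8      cpuid[48];              /* raw CPUID dump */
};

/* One "Error Information Structure"; there can be several per section. */
struct ia_proc_err_info {
        guid_t  err_type;               /* cache/TLB/bus/MS check GUID */
        u64     validation_bits;
        u64     check_info;
        u64     target_id;
        u64     requestor_id;
        u64     responder_id;
        u64     ip;
};

/* One "Context Information Structure", followed by its register array. */
struct ia_proc_ctx {
        u16     reg_ctx_type;           /* e.g. MSR registers */
        u16     reg_arr_size;
        u32     msr_addr;
        u64     mm_reg_addr;
};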

I *have* been thinking that it would be nice to take the CPER and pipe it
through the MCA decoding in arch/x86 and EDAC. This would be especially useful
when the CPER includes the MCA registers in the Context info, since we'd get
our usual MCA decoding instead of a binary blob of registers.
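
Something like this, for example, building on what the APEI memory error path
already does with mce_setup()/mce_log(). The helper name and the assumption
that the register array starts with the bank's CTL/STATUS/ADDR/MISC are mine:

#include <linux/types.h>
#include <asm/mce.h>

/*
 * Hypothetical: build a struct mce from the MSR values in the CPER context
 * register array and hand it to the usual decoders.
 */
static void cper_ia_report_mca(const u64 *regs, u8 bank)
{
        struct mce m;

        mce_setup(&m);          /* fills in CPU number, time, APIC ID, ... */

        m.bank   = bank;
        m.status = regs[1];     /* MCA_STATUS, assuming regs[0] is MCA_CTL */
        m.addr   = regs[2];     /* MCA_ADDR */
        m.misc   = regs[3];     /* MCA_MISC */

        mce_log(&m);            /* EDAC, mce_amd, etc. take it from here */
}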

I was thinking that the MCA decoding would be in addition to this, but based on
Boris's comments maybe we can make it the default: if MCA/EDAC decoding is
available, use it; otherwise, print the CPER fields in a generic way like we do
for the other CPER types (sketched below).
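
Very roughly, and building on the sketches above (the ctx_holds_mca_msrs(),
mca_bank_from_msr() and cper_ia_print_ctx_generic() names are all hypothetical):

/*
 * Prefer the arch MCA decoding when it's built in and the context holds
 * MCA MSRs, otherwise fall back to generic field-by-field CPER printing.
 */
static void cper_ia_handle_ctx(const char *pfx, struct ia_proc_ctx *ctx,
                               const u64 *regs)
{
#if IS_ENABLED(CONFIG_X86_MCE)
        if (ctx_holds_mca_msrs(ctx)) {
                cper_ia_report_mca(regs, mca_bank_from_msr(ctx->msr_addr));
                return;
        }
#endif
        cper_ia_print_ctx_generic(pfx, ctx, regs);
}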

That would be a separate set though.

Thanks,
Yazen