答复: 答复: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don' t participate in rendezvous process

From: Tony W Wang-oc
Date: Fri May 31 2019 - 00:45:22 EST


On Fri, May 31, 2019, Raj, Ashok wrote:
> On Thu, May 30, 2019 at 09:13:39AM +0000, Tony W Wang-oc wrote:
> > On Thu, May 30, 2019, Tony W Wang-oc wrote:
> > > Hi Ashok,
> > > I have two questions about this patch, could you help to check:
> > >
> > > 1, for broadcast #MC exceptions, this patch seems require #MC exception
> > > errors
> > > set MCG_STATUS_RIPV = 1.
> > > But for Intel CPU, some #MC exception errors set MCG_STATUS_RIPV = 0
> > > (like "Recoverable-not-continuable SRAR Type" Errors), for these errors
> > > the patch doesn't seem to work, is that okay?
> > >
> > > 2, for LMCE exceptions, this patch seems require #MC exception errors
> > > set MCG_STATUS_RIPV = 0 to make sure LMCE be handled normally even
> > > on offline CPU.
> > > For LMCE errors set MCG_STAUS_RIPV = 1, the patch prevents offline CPU
> > > handle these LMCE errors, is that okay?
> > >
> >
> > More specifically, this patch seems require #MC exceptions meet the
> condition
> > "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1"; But on a Xeon X5650
> machine (SMP),
>
> The offline CPU will never get a LMCE=1, since those only happen on the CPU
> that's doing active work. Offline CPUs just sitting in idle.
>
> The specific error here is a PCC=1, so irrespective of what happens
> We do capture the errors in the per-cpu log, and kernel would panic.
>
> What specifically this patch tries to achieve is to leave an error
> sitting with MCG-STATUS.MCIP=1 and another recoverable error would shut
> the
> system dowm.
Yes, agree with you for this point.

But for question 1, When some #MC exception errors broadcast to offline CPU,
like "Recoverable-not-continuable SRAR Type" Errors, set MCG_STATUS_RIPV = 0,
PCC = 0, is there also the problem : " Kernel panic - not syncing: Timeout: Not all CPUs
entered broadcast exception handler"?

Thanks
>
> I don't see anything wrong with what this patch does..
>
> > "Data CACHE Level-2 Generic Error" does not meet this condition.
> >
> > I got below message from:
> https://www.centos.org/forums/viewtopic.php?p=292742
> >
> > Hardware event. This is not a software error.
> > MCE 0
> > CPU 4 BANK 6 TSC b7065eeaa18b0
> > TIME 1545643603 Mon Dec 24 10:26:43 2018
> > MCG status:MCIP
> > MCi status:
> > Uncorrected error
> > Error enabled
> > Processor context corrupt
> > MCA: Data CACHE Level-2 Generic Error
> > STATUS b200000080000106 MCGSTATUS 4
> > MCGCAP 1c09 APICID 4 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44
> >
> > > Thanks
> > > Tony W Wang-oc