Re: [tip:x86/mce] x86, mce: Make xeon75xx memory driver dependent on PCI

From: Thomas Gleixner
Date: Fri Feb 19 2010 - 05:51:28 EST


On Tue, 16 Feb 2010, Andi Kleen wrote:
> >
> > Please work with Mauro on the Nehalem EDAC bits, they seem rather advanced
> > to me for v2.6.34, and _far_ cleaner and more capable as well. See those
> > Intel support bits at:
>
> Hi Ingo,
>
> core_i7 and EDAC have nothing to do with this code and
> it has nothing to do with the problem this patch is
> solving.
>
> This is for a different chip (xeon75xx)
> which has a completely different memory subsystem
> and reports memory errors in a completely different way
> than xeon55xx/core_i7.
>
> For core_i7/xeon55xx there is no additional event interface needed;
> it's all supplied by the hardware on the existing interfaces.
>
> The point of this code is to annotate the CE events on Xeon 75xx
> and to implement specific backend actions (page offlining, triggers)
> based on specific events. These backend actions are already implemented
> on 55xx without additional changes (no need for EDAC)
>
> EDAC does not provide an event interface that can
> be polled, just counts, so this cannot be done with EDAC.
> It's simply a topology enumeration with error counts.
> mcelog is not a topology interface, it's an event
> notification mechanism.
>
> EDAC and mcelog are orthogonal, they don't solve the same
> problem.
>
> So your nack is based on incorrect assumptions and doesn't make
> sense. What you're asking for cannot be done with current
> EDAC as far as I know.

It does not matter at all that current EDAC cannot do that right
now. Fact is that you are stubbornly ignoring any request from the x86
maintainers to rework MCE, consolidate it with EDAC and integrate it
into perf as the suitable event logging mechanism.

MCE has no design at all, it's a specialized hack which is limited to
a specific subset of the overall machine health monitoring and
reporting facilities.

You refuse to even think about consolidating the handling of all
health monitoring and reporting facilities into a well designed and
integrated framework.

Your sole argument is that MCE can do it and EDAC or whatever
cannot. That's not a technical argument at all. MCE does not become a
better design just because you hacked another feature into it.

Ingo's NAK is completely correct and he has my full support for it.

We do not want new crap in the already horrible MCE code. We simply
request a consolidation of machine health monitoring/reporting
facilities before adding new stuff.

You have been ignoring our technical requests for more than a
year. You are refusing to work with other people on a well designed
solution. You just follow your own agenda and try to squeeze more
stuff into MCE.

tglx