Re: -mm merge plans for 2.6.21

From: Andi Kleen
Date: Sat Feb 10 2007 - 09:43:34 EST


On Saturday 10 February 2007 14:48, Alan wrote:
> > Well it's a technical issue -- it conflicts with the machine check
> > code that is already implemented by stealing away its events.
> > You would first need a CONFIG_MCE=n on x86-64 at least, otherwise
> > one of them will be unhappy.
>
> That is a fair point, albeit after far too long sitting stuck in the tree.

I'm pretty sure I mentioned it earlier already.

> I'll have a look into this a bit further, probably MCE_K8 should feed off
> the output of the MCE driver.

You could probably write a simple edac driver that just feeds of the events
mce.c generates and presents that as the EDAC drivers. Currently
the user space device is the only consumer, but it wouldn't be hard
to full it off to other kernel users.

Possibly keeping the existing code that reads the DIMMs and
then reads the address map of the NB and match that, then you
get the same output.

[However see below -- i think matching it to SMBIOS using the address
is a lot more useful for the user in the end]

>
> > The other issue is that the existing code does everything EDAC
> > K8 does already perfectly fine, just in a small fraction of
> > the code.
>
> Which is false. It provided a totally inconsistent interface to user
> space, while the EDAC layer provides the consistency many people need and
> want.

I still don't get that argument because unlike mce.c+mcelog EDAC cannot
actually tell you where the DIMM that failed is located on your motherboard.

As far as i can make it out it's only useful for people who either
have full schematics of their motherboard and know how to read them
or did a full try'n'error of all combinations run once to
figure out which channel is located where.

Somehow I cannot imagine either of that is very common in "enterprises" ;-)

Ok one argument was that some LinuxBIOS users don't seem to
set up these tables and have the necessary information at hand
for their custom cluster hardware, but that's still a weird uncommon
case and it would be far better if they just fixed LB.

> Also unless it has changed remarkably the MCE driver doesn't do
> scrubbing which is needed in software on some chip revisions.

True, adding that might be a good idea.

[Although I guess that could be done with a small user utility
using /dev/mem though if you really wanted it?]

One thing EDAC also does that mcelog doesn't is decode the ECC
syndromes -- but I haven't figured out a use case yet where that
is actually useful to know afterwards.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/