Re: [RFC] How drivers notice a HW error?

From: Andi Kleen
Date: Thu Nov 27 2003 - 06:39:46 EST


Hidetoshi Seto <seto.hidetoshi@xxxxxxxxxxxxxx> writes:

> On some platforms, for example IA64, the chipset detects an error caused by
> a driver's operation, such as an I/O read, and reports it to the kernel. The
> Linux kernel analyzes the error and decides to kill the driver or, at worst,
> reboot. I want to convey the error information to the offending driver so
> that the driver can recover from the failed operation.
>
> So, just as a plan, I am thinking about a readb_check function that can
> return an error value if an error occurs on the read. Drivers could use
> readb_check instead of the usual readb, and could determine from its return
> value whether a retry is required.

I don't think that's a good portable API. On many architectures it is hard to
associate an MCE with a specific instruction because the MCE
happens asynchronously. All the MCE handler gets is an address. Also,
adding error checks to every read* would make the driver source quite
unreadable.

I also think most drivers would not attempt to handle every access
specially, but would just implement a generic handler that shuts down
the device (otherwise it would be a testing nightmare).

So better would be:

Add a callback to the pci_dev/device. When an error occurs in an mmio
area associated with a driver, call that callback.

Add another function to register other memory areas (in case a driver
does mmio that is not visible in PCI config space) for error handling.
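
A rough user-space sketch of the lookup this implies, only to illustrate the
idea: each registered area carries a callback, and the MCE handler, which
has nothing but an address, finds the matching area and calls it. All names
here (mmio_err_register, mmio_err_dispatch, toy_dev) are made up for the
example and are not an existing kernel interface. A real version would also
need locking and would hang the callback off pci_dev/device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: invoked with the device and faulting address. */
typedef void (*mmio_err_fn)(void *dev, uintptr_t addr);

/* One registered mmio area and the handler responsible for it. */
struct mmio_err_region {
    uintptr_t start;
    size_t len;
    mmio_err_fn handler;
    void *dev;
};

#define MAX_REGIONS 16
static struct mmio_err_region regions[MAX_REGIONS];
static size_t nregions;

/* Register an area for error handling. Returns 0 on success, -1 on failure. */
int mmio_err_register(uintptr_t start, size_t len, mmio_err_fn handler, void *dev)
{
    if (nregions == MAX_REGIONS || len == 0 || handler == NULL)
        return -1;
    regions[nregions++] = (struct mmio_err_region){ start, len, handler, dev };
    return 0;
}

/*
 * What the MCE handler would call: it only knows the address.
 * Returns 1 if some driver's callback was invoked, 0 if nobody owns it.
 */
int mmio_err_dispatch(uintptr_t addr)
{
    for (size_t i = 0; i < nregions; i++) {
        const struct mmio_err_region *r = &regions[i];
        if (addr >= r->start && addr - r->start < r->len) {
            r->handler(r->dev, addr);
            return 1;
        }
    }
    return 0;
}

/* Toy driver state with a generic handler that just shuts the device down. */
struct toy_dev {
    int shut_down;
    uintptr_t last_err;
};

void toy_dev_error(void *dev, uintptr_t addr)
{
    struct toy_dev *d = dev;
    d->shut_down = 1;   /* generic policy: stop using the device */
    d->last_err = addr; /* keep the address for diagnostics */
}
```

The driver's normal read* paths stay untouched here. Only the one handler
has to be written and tested.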

-Andi
