Re: [PATCH/RFC] I/O-check interface for driver's error handling

From: Hidetoshi Seto
Date: Fri Mar 04 2005 - 07:56:49 EST

Next message: Richard Purdie: "Re: RFD: Kernel release numbering"
Previous message: Eugeny S. Mints: "Re: [patch] Real-Time Preemption, deactivate() scheduling issue"
In reply to: Benjamin Herrenschmidt: "Re: [PATCH/RFC] I/O-check interface for driver's error handling"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Thanks for all comments!

OK, I'd like to sort our situation:

################

$ Here are 2 features:
- iochk_clear/read() interface for error "detection"
by Seto ... me :-)
- callback, thread, and event notification for error "recovery"
by Linas ... expert in PPC64

$ What will "detection" interface provides?
- allow drivers to get error information
- device/bus was isolated/going-reset/re-enabled/etc.
- error status which hardware and PCI subsystem provides
- allow drivers to do "simple retry" easily
- major soft errors(*1) would be recovered by a simple retry
- in cases that device/bus was re-enabled but a retry is required

$ What will "recovery" infrastructure provides?
- allow drivers to help OS's recovery
- usually OS cannot re-enable affected devices by itself
- allow drivers to respond asynchronous error event
- allow drivers to implement "device specific recovery"

$ Difference of stance
- "detection"
- Assume that the number of soft error is far more than that of
hard error. (PCI-Express has ECC, but traditional PCI does not.)
- Assume that it isn't too late that attempt of device isolation
and/or recovery comes after a simple retry(*2), and that a retry
would be required even if the recovery had go well.
- It isn't matter whether device isolation is actually possible or
not for the arch. The fundamental intention of this interface is
prevent user applications from data pollution.
- Currently DMA and asynchronous I/O is not target.
- "recovery"
- (I'd appreciate it if Linas could fill here by his suitable words.)
- (Maybe,) it is based on assuming that erroneous device should be
isolated immediately irrespective of type of the error.
- (I guess that) once a device was isolated, it become harder to
re-enable it. It seems like a kind of hotplug feature.
- Currently there are few platform which can isolate devices and
attempt to recover from the I/O error.

$ How to use
- "detection" ... easy.
- clip I/Os by iochk_clear() and iochk_read()
- if iochk_read() returns non-0, retry once and/or notify the error
to user application.
- "recovery" ... rather hard.
- (I'd appreciate it if Linas could fill here by his suitable words.)
- write callback function for each event(*3)

-----

*1:
Traditionally, there are 2 types of error:
- soft error:
data was broken (ex. due to low voltage, natural radiation etc.)
temporary error
- hard error:
device or bus was physically broken (i.e. uncorrectable)
permanent error

*2:
it's difficult to distinguish hard errors from soft errors, without
any retry.

*3:
Linas, how many stages/events would you prefer to be there? is 3 enough?

ex. IMHO:

IOERR_DETECTED
- An error was detected, so error logging or device isolation would be
major request. On PPC64, isolation would be already done by hardware.
IOERR_PREPARE_RECOVERY
- Require preparation before attempting error recovery by OS.
IOERR_DO_RECOVERY
- Require device specific recovery and result of the recovery.
OS will gather all results and will decide recovered or not.
IOERR_RECOVERED
- OS recovery was succeeded.
IOERR_DEAD
- OS recovery was failed.

And as Ben said and as you already proposed, I also think only one callback
is enough and better, like:
int pci_emergency_callback(pci_dev *dev, err_event event, void *extra)

It allows us to add new event if desired.

################

Thanks,
H.Seto

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Richard Purdie: "Re: RFD: Kernel release numbering"
Previous message: Eugeny S. Mints: "Re: [patch] Real-Time Preemption, deactivate() scheduling issue"
In reply to: Benjamin Herrenschmidt: "Re: [PATCH/RFC] I/O-check interface for driver's error handling"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]