Re: libata error handling

From: Tejun Heo
Date: Fri Aug 19 2005 - 00:41:03 EST



Hi, Jeff.

Jeff Garzik wrote:

Tejun,

In an email I cannot find anymore, you asked why I was interested in converting libata to use the fine-grained EH hooks in the SCSI layer, rather than continuing with the current ->eh_strategy_handler() method.

Several reasons:

1) The fine-grained hooks of the SCSI layer are somewhat standard for block devices. The events they signify -- timeout, abort cmd, dev reset, bus reset, and host reset -- map precisely to the events that we must deal with at the ATA level.

I generally agree that the events are somewhat standard for block devices, but IMHO SCSI EH also carries a fair number of SCSI-specific assumptions, and ATA is a bit too different from SCSI to fit cleanly into it. For example, when handling NCQ errors, the whole task set is aborted and the status is retrieved with read log page. This can be worked around by emulating SCSI behavior in one of the hooks, but it just doesn't really fit well. And I think that recovering via the translation layer is a bit too much translation.

So, my thought is that SCSI EH assumptions are a bit too specific to be used as a standard for block devices.
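
To make the NCQ point above concrete, here is a rough user-space sketch in C. Everything in it is hypothetical (struct mock_port, mock_read_ncq_error_log(), mock_ncq_recover()); none of it is libata code. It assumes only what is described above: an NCQ error aborts the whole outstanding task set, and the failed tag has to be fetched from the device's log page afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_TAGS 32

/* Hypothetical per-port state; not libata's real structures. */
struct mock_port {
    uint32_t active_tags; /* bit n set => command with tag n is outstanding */
    uint8_t  failed_tag;  /* what the mock "device" reports in its log page */
};

/*
 * Stand-in for reading the NCQ error log page from the device.
 * A real device returns a full log page; only the tag is modelled here.
 */
static int mock_read_ncq_error_log(const struct mock_port *port)
{
    return port->failed_tag;
}

/*
 * After an NCQ error every outstanding command has been aborted by the
 * device, so the handler cannot work on "the failed command" directly.
 * It first asks the device which tag failed, then marks all the other
 * aborted tags for retry. Returns the failed tag, or -1 if the reported
 * tag is out of range.
 */
static int mock_ncq_recover(struct mock_port *port, uint32_t *retry_mask)
{
    assert(port != NULL && retry_mask != NULL);

    int bad = mock_read_ncq_error_log(port);
    if (bad < 0 || bad >= MOCK_MAX_TAGS)
        return -1;

    uint32_t aborted = port->active_tags; /* the whole task set is gone */

    port->active_tags = 0;
    *retry_mask = aborted & ~(UINT32_C(1) << bad);
    return bad;
}
```

Note that nothing here is per-command in the way the SCSI abort hook expects: the recovery unit is the whole port, which is the mismatch described above.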

But be warned of false sharing, as I talk about in #2...

2) When the libata SAT translation layer becomes optional, and libata drives a "true" block device, use of ->eh_strategy_handler() will actually be an obstacle due to false sharing of code paths. ->eh_strategy_handler() is indeed a single "do it all" EH entrypoint, but within that entrypoint you must perform several SCSI-specific tasks.

It's true that we must do SCSI-specific tasks inside libata if we use eh_strategy_handler, but I don't think switching to fine-grained EH will reduce the amount of SCSI-specific code inside libata. I think as long as we can insulate LLDDs from the SCSI layer, either way should be okay later.
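
For readers following along, the two styles under discussion can be sketched roughly as below. This is a simplified user-space model in C, not the kernel's host template: the hook names echo the ones in this thread, but the struct, the signatures (the real hooks take SCSI-layer types), and the helper mock_run_eh() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct mock_cmd { int id; };

enum mock_eh_result { MOCK_EH_FAILED = 0, MOCK_EH_SUCCESS = 1 };

/* Simplified stand-in for a host template; not the kernel's struct. */
struct mock_host_template {
    /* fine-grained hooks, tried in escalating order */
    int (*eh_abort_handler)(struct mock_cmd *);
    int (*eh_device_reset_handler)(struct mock_cmd *);
    int (*eh_bus_reset_handler)(struct mock_cmd *);
    int (*eh_host_reset_handler)(struct mock_cmd *);
    /* single "do it all" entrypoint */
    int (*eh_strategy_handler)(struct mock_cmd *);
};

/*
 * Hypothetical midlayer dispatch: if the driver supplies a strategy
 * handler it owns all of recovery; otherwise escalate through the
 * fine-grained hooks until one of them reports success.
 */
static int mock_run_eh(const struct mock_host_template *t, struct mock_cmd *cmd)
{
    assert(t != NULL && cmd != NULL);

    if (t->eh_strategy_handler)
        return t->eh_strategy_handler(cmd);

    int (*steps[])(struct mock_cmd *) = {
        t->eh_abort_handler,
        t->eh_device_reset_handler,
        t->eh_bus_reset_handler,
        t->eh_host_reset_handler,
    };

    for (size_t i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
        if (steps[i] && steps[i](cmd) == MOCK_EH_SUCCESS)
            return MOCK_EH_SUCCESS;
    }
    return MOCK_EH_FAILED;
}

/* Toy handlers so the dispatch can be exercised; they only count calls. */
static int mock_calls;

static int mock_abort_fails(struct mock_cmd *c)
{
    (void)c;
    mock_calls++;
    return MOCK_EH_FAILED;
}

static int mock_dev_reset_ok(struct mock_cmd *c)
{
    (void)c;
    mock_calls++;
    return MOCK_EH_SUCCESS;
}

static int mock_strategy(struct mock_cmd *c)
{
    (void)c;
    mock_calls += 10;
    return MOCK_EH_SUCCESS;
}
```

Either way the handlers themselves still receive SCSI-shaped arguments, so the amount of SCSI-specific glue on the driver side stays about the same; only who drives the escalation changes.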


3) ->eh_strategy_handler() has continually proven to be a method of error handling poorly supported by the SCSI layer. There are many assumptions coded into the SCSI layer that this is -not- the path taken by LLD EH code, and libata must constantly work around these assumptions.

4) libata is the -only- user of ->eh_strategy_handler(), and oddballs must be stomped out. It creates a maintenance burden on the SCSI layer that should be eliminated.

I agree that being the only user does incur difficulties, but my very subjective feeling is that the original libata EH implementation was just a bit too fragile to start with. For example, it did not grab the host lock on EH entry, which let command completion race with EH, and it handled errors in several different ways.

Heh... Maybe I'm just reluctant to let go of my patches. Anyways, I'll now stand down and see how things go and try to help.

Thanks, always.

--
tejun