Re: [PATCH 2/2] mmc_block: ensure all sectors that do not have errors are read

From: Adrian Hunter
Date: Mon Nov 10 2008 - 03:13:42 EST


Adrian Hunter wrote:
Pierre Ossman wrote:
On Thu, 16 Oct 2008 16:26:57 +0300
Adrian Hunter <ext-adrian.hunter@xxxxxxxxx> wrote:

If a card encounters an ECC error while reading a sector it will
time out. Instead of reporting the entire I/O request as having
an error, redo the I/O one sector at a time so that all readable
sectors are provided to the upper layers.

Signed-off-by: Adrian Hunter <ext-adrian.hunter@xxxxxxxxx>
---

We actually had something like this on the table some time ago. It got
scrapped because of data integrity problems. This is just for reads
though, so I guess it should be safe.

@@ -278,6 +279,9 @@ static int mmc_blk_issue_rq(struct mmc_queue *mq, struct request *req)
 		brq.stop.flags = MMC_RSP_SPI_R1B | MMC_RSP_R1B | MMC_CMD_AC;
 		brq.data.blocks = req->nr_sectors;
+		if (disable_multi && brq.data.blocks > 1)
+			brq.data.blocks = 1;
+

A comment here would be nice.

Ok

You also need to adjust the sg list when you change the block count.
There was code there that did that previously, but it got removed in
2.6.27-rc1.

That is not necessary. It is an optimisation. In general, optimising an
error path serves no purpose.

@@ -312,6 +318,13 @@ static int mmc_blk_issue_rq(struct mmc_queue *mq, struct request *req)
 		mmc_queue_bounce_post(mq);
+		if (multi && rq_data_dir(req) == READ &&
+		    brq.data.error == -ETIMEDOUT) {
+			/* Redo read one sector at a time */
+			disable_multi = 1;
+			continue;
+		}
+

Some concerns here:

1. "brq.data.blocks > 1" doesn't need to be optimised into its own
variable. It just obscures things.

But then you have to assume that no driver changes the 'blocks' variable,
e.g. by counting it down. The separate variable is not an optimisation; it
is there to improve reliability and readability. What does it obscure?

2. A comment here as well. Explain what this does and why it is safe
(so people don't try to extend it to writes)

ok

3. You should check all errors, not just data.error and ETIMEDOUT.

No. Data timeout is a special case. The other errors are system errors.
If there is a command error or stop error (which is also a command error)
it means either there is a bug in the kernel or the controller or card
has failed to follow the specification. Under those circumstances

Data timeout on the other hand just means the data could not be retrieved
- in the case we have seen because of ECC error.
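
The distinction argued above can be sketched as a small decision function. This is only an illustration of the stated position, not code from the patch: the struct, the enum and the name classify() are hypothetical, and the sketch checks command and stop errors first, whereas the posted hunk tests the data timeout before the cmd_err check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for brq.cmd.error, brq.data.error, brq.stop.error. */
struct req_errors {
	int cmd_error;
	int data_error;
	int stop_error;
};

/* Hypothetical outcomes; the names are not taken from the driver. */
enum req_action {
	REQ_OK,			/* transfer succeeded */
	REQ_RETRY_SINGLE,	/* redo the read one sector at a time */
	REQ_FAIL_SECTOR,	/* report -EIO for this one sector, carry on */
	REQ_ABORT		/* system error: give up on the request */
};

static enum req_action classify(const struct req_errors *e,
				bool is_read, bool multi)
{
	assert(e != NULL);

	/* Command and stop errors are treated as system errors. */
	if (e->cmd_error || e->stop_error)
		return REQ_ABORT;
	if (e->data_error == 0)
		return REQ_OK;

	/* A data timeout on a read is the special case: data unreadable. */
	if (e->data_error == -ETIMEDOUT && is_read)
		return multi ? REQ_RETRY_SINGLE : REQ_FAIL_SECTOR;

	/* Any other data error, and any error on a write, aborts. */
	return REQ_ABORT;
}
```

Writes never reach the retry branch in this sketch, which matches the concern that the approach should not be extended to writes.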

4. You should first report the successfully transferred blocks as ok.

That is another optimisation of the error path i.e. not necessary. It
is simpler to just start processing the request again - which the patch
does.
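
For illustration, the "start processing the request again" approach can be modelled in user space. This is a minimal sketch under stated assumptions: struct sim_card, sim_read() and issue_read() are hypothetical names, and the simulated card is assumed to time out on any read that covers a bad sector. It is not the driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SECTORS 64

/* Hypothetical simulated card: bad[i] marks a sector with an ECC error. */
struct sim_card {
	bool bad[MAX_SECTORS];
};

/* Simulated transfer: a read that covers any bad sector times out. */
static int sim_read(const struct sim_card *card, size_t start, size_t count)
{
	for (size_t i = start; i < start + count; i++)
		if (card->bad[i])
			return -ETIMEDOUT;
	return 0;
}

/*
 * Read 'count' sectors from 'start'. status[i] becomes 0 for a sector that
 * was read and -EIO for one that was not. Returns the number of good sectors.
 */
static size_t issue_read(const struct sim_card *card, size_t start,
			 size_t count, int *status)
{
	bool disable_multi = false;
	size_t done = 0, good = 0;

	assert(start + count <= MAX_SECTORS);

	while (done < count) {
		size_t blocks = count - done;

		if (disable_multi && blocks > 1)
			blocks = 1;

		int err = sim_read(card, start + done, blocks);

		if (err == -ETIMEDOUT && blocks > 1) {
			/* Redo the remaining sectors one at a time. */
			disable_multi = true;
			continue;
		}

		/* Complete this chunk: all good, or one failed sector. */
		for (size_t i = 0; i < blocks; i++)
			status[done + i] = err ? -EIO : 0;
		if (!err)
			good += blocks;
		done += blocks;
	}
	return good;
}
```

With one bad sector in a five-sector request, the multi-sector attempt fails once, and the retry then completes four sectors successfully and reports -EIO for the remaining one.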

@@ -360,14 +373,21 @@ static int mmc_blk_issue_rq(struct mmc_queue *mq, struct request *req)
 #endif
 		}
-		if (brq.cmd.error || brq.data.error || brq.stop.error)
+		if (brq.cmd.error || brq.stop.error)
 			goto cmd_err;

Move your code to inside this if clause and you'll solve 3. and 4. in a
neat manner.

Well, I do not agree with 3 and 4.

You might also want to print something so that it is
visible that the driver retried the transfer.

There are already two error messages per sector (one from this function
and one from '__blk_end_request()'), so another message is too much.

-		/*
-		 * A block was successfully transferred.
-		 */
+		if (brq.data.error) {
+			if (brq.data.error == -ETIMEDOUT &&
+			    rq_data_dir(req) == READ) {
+				err = -EIO;
+				brq.data.bytes_xfered = brq.data.blksz;
+			} else
+				goto cmd_err;
+		} else
+			err = 0;
+
 		spin_lock_irq(&md->lock);
-		ret = __blk_end_request(req, 0, brq.data.bytes_xfered);
+		ret = __blk_end_request(req, err, brq.data.bytes_xfered);
 		spin_unlock_irq(&md->lock);
 	} while (ret);

Instead of this big song and dance routine, just have a dedicated piece
of code for calling __blk_end_request() for the single sector failure.

Ok

Amended patch follows:

What is the status of this patch?
