Re: [PATCH v5 1/4] mtd: nand: increase ready wait timeout and report timeouts

From: Brian Norris
Date: Wed Sep 09 2015 - 19:49:48 EST


+ Niklas

On Tue, Sep 08, 2015 at 10:10:50AM +0100, Alex Smith wrote:
> If nand_wait_ready() times out, this is silently ignored, and its
> caller will then proceed to read from/write to the chip before it is
> ready. This can potentially result in corruption with no indication as
> to why.
>
> While a 20ms timeout seems like it should be plenty, certain
> behaviour can cause it to time out much earlier than expected. The
> situation which prompted this change was that CPU 0, which is
> responsible for updating jiffies, was holding interrupts disabled
> for a fairly long time while writing to the console during a printk,
> causing several jiffies updates to be delayed. If CPU 1 happens to
> enter the timeout loop in nand_wait_ready() just before CPU 0 re-
> enables interrupts and updates jiffies, CPU 1 will immediately time
> out when the delayed jiffies updates are made. The result of this is
> that nand_wait_ready() actually waits less time than the NAND chip
> would normally take to be ready, and then read_page() proceeds to
> read out bad data from the chip.
>
> The situation described above may seem unlikely, but in fact it can be
> reproduced almost every boot on the MIPS Creator Ci20.
>
> Debugging this was made more difficult by the misleading comment above
> nand_wait_ready() stating "The timeout is caught later" - no timeout
> was ever reported, leading me away from the real source of the problem.
>
> Therefore, this patch increases the timeout to 200ms. This should be
> enough to cover cases where jiffies updates get delayed. Additionally,
> add a pr_warn() when a timeout does occur so that it is easier to
> pinpoint any problems in future caused by the chip not becoming ready.
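
For anyone following along, the race described above can be modelled in
userspace. This is only a sketch under stated assumptions, not the kernel
code: simulate_wait(), fake_time_before() and POLLS_PER_TICK are
hypothetical names, and the tick counts assume HZ=100 (so 20ms is 2 ticks
and 200ms is 20 ticks).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 100 polls of the ready pin fit into one timer tick. */
#define POLLS_PER_TICK 100u

/* Wrap-safe "a is before b", the same idea as the kernel's time_before(). */
static bool fake_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Model of a jiffies-based polling loop (hypothetical helper).
 *
 * timeout_ticks: timeout expressed in ticks
 * jump_ticks:    size of the delayed tick update that lands all at once
 * jump_at_poll:  poll number at which that update lands (0 = never)
 * polls_needed:  number of polls before the simulated chip reports ready
 *
 * Returns true if the chip was seen ready, false if the loop timed out.
 */
static bool simulate_wait(unsigned long timeout_ticks, unsigned long jump_ticks,
			  unsigned int jump_at_poll, unsigned int polls_needed)
{
	unsigned long now = 0;			/* stand-in for jiffies */
	unsigned long timeo = now + timeout_ticks;
	unsigned int polls = 0;

	assert(polls_needed > 0);

	do {
		if (polls >= polls_needed)
			return true;		/* chip is ready */
		polls++;
		if (polls % POLLS_PER_TICK == 0)
			now++;			/* normal tick */
		if (polls == jump_at_poll)
			now += jump_ticks;	/* delayed updates arrive together */
	} while (fake_time_before(now, timeo));

	return false;				/* gave up before the chip was ready */
}
```

With these assumed numbers, simulate_wait(2, 5, 1, 3) gives up after a
single poll because the 5-tick jump overshoots the 2-tick deadline, while
simulate_wait(20, 5, 1, 3) absorbs the same jump and sees the chip ready.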

Did you examine other solutions? I've seen patches for hrtimer support
previously:

http://patchwork.ozlabs.org/patch/160333/
http://patchwork.ozlabs.org/patch/431066/

A few things have been cleaned up since then, so some of the initial
objections to the hrtimer patch don't make sense anymore, I believe.

Anyway, I think just increasing the timeout looks OK to me (as long as
we never have a 200ms jiffies jump... can this happen??), so hrtimer may
be over-engineering. I just want to make sure both options have been
considered before officially choosing one over the other.
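
To make the alternative concrete, here is a rough userspace sketch of a
wait loop whose deadline does not depend on tick accounting. It is not
taken from either linked patch and is not kernel code: it uses POSIX
clock_gettime(CLOCK_MONOTONIC) as a stand-in for a high-resolution kernel
clock, and ready_fn, wait_ready_mono(), ready_after_n_polls() and
never_ready() are hypothetical names. The final readiness check after the
loop is an extra assumption, added so that a late wakeup does not get
reported as a timeout when the chip is in fact ready.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback type standing in for chip->dev_ready(). */
typedef bool (*ready_fn)(void *ctx);

/* Current monotonic time in nanoseconds. */
static int64_t mono_ns(void)
{
	struct timespec ts;
	int ret = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(ret == 0);
	(void)ret;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Poll ready() until it succeeds or timeout_ms of monotonic time elapse. */
static bool wait_ready_mono(ready_fn ready, void *ctx, int64_t timeout_ms)
{
	int64_t deadline = mono_ns() + timeout_ms * 1000000LL;

	do {
		if (ready(ctx))
			return true;
	} while (mono_ns() < deadline);

	/* Last look: only report a timeout if the chip is still busy. */
	return ready(ctx);
}

/* Example callback: reports ready once it has been polled *ctx times. */
static bool ready_after_n_polls(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0)
		(*remaining)--;
	return *remaining == 0;
}

/* Example callback: a chip that never becomes ready. */
static bool never_ready(void *ctx)
{
	(void)ctx;
	return false;
}
```

A caller would pass a 200 (ms) timeout with ready_after_n_polls and a
counter of 3 and get true back almost immediately, while never_ready with
a short timeout returns false once the deadline passes.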

Brian

> Signed-off-by: Alex Smith <alex.smith@xxxxxxxxxx>
> Reviewed-by: Ezequiel Garcia <ezequiel@xxxxxxxxxxxxxxxxxxxx>
> Cc: Zubair Lutfullah Kakakhel <Zubair.Kakakhel@xxxxxxxxxx>
> Cc: David Woodhouse <dwmw2@xxxxxxxxxxxxx>
> Cc: Brian Norris <computersforpeace@xxxxxxxxx>
> Cc: linux-mtd@xxxxxxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> ---
> v4 -> v5:
> - Remove spurious change.
> - Add Ezequiel's Reviewed-by.
>
> v3 -> v4:
> - New patch to fix issue encountered in external Ci20 3.18 kernel
> branch which also applies upstream.
> ---
> drivers/mtd/nand/nand_base.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
> index ceb68ca8277a..07b831b94e5c 100644
> --- a/drivers/mtd/nand/nand_base.c
> +++ b/drivers/mtd/nand/nand_base.c
> @@ -543,11 +543,16 @@ static void panic_nand_wait_ready(struct mtd_info *mtd, unsigned long timeo)
>  	}
>  }
>
> -/* Wait for the ready pin, after a command. The timeout is caught later. */
> +/**
> + * nand_wait_ready - [GENERIC] Wait for the ready pin after commands.
> + * @mtd: MTD device structure
> + *
> + * Wait for the ready pin after a command, and warn if a timeout occurs.
> + */
>  void nand_wait_ready(struct mtd_info *mtd)
>  {
>  	struct nand_chip *chip = mtd->priv;
> -	unsigned long timeo = jiffies + msecs_to_jiffies(20);
> +	unsigned long timeo = jiffies + msecs_to_jiffies(200);
>
>  	/* 400ms timeout */
>  	if (in_interrupt() || oops_in_progress)
> @@ -557,9 +562,12 @@ void nand_wait_ready(struct mtd_info *mtd)
>  	/* Wait until command is processed or timeout occurs */
>  	do {
>  		if (chip->dev_ready(mtd))
> -			break;
> +			goto out;
>  		touch_softlockup_watchdog();
>  	} while (time_before(jiffies, timeo));
> +
> +	pr_warn("timeout while waiting for chip to become ready\n");
> +out:
>  	led_trigger_event(nand_led_trigger, LED_OFF);
>  }
>  EXPORT_SYMBOL_GPL(nand_wait_ready);
> --
> 2.5.0
>