Re: [PATCH] mtd:nand: Increment IFC_TIMEOUT_MSECS for nand controller response

From: Miquel Raynal
Date: Mon Nov 13 2023 - 08:44:52 EST


Hi Ronald,

Thanks for the patch, here are a couple of comments I'd like you to
address before taking the fix.

debug.penguin32@xxxxxxxxx wrote on Mon, 13 Nov 2023 23:16:28 +1000:

The title prefix needs to be aligned with today's policy, you can check
it with git log --oneline -- <your file>.

> The nand controller not responding scenario occurs causing blocked tasks
> and rcu_prempt warnings of stall on cpus. Incrementing the
> IFC_TIMEOUT_MSECS appears to solve the nand controller not responding
> issue.

I would rephrase a bit this paragraph with more confidence. Under heavy
load it is likely that the controller is done with its own task but the
thread unlocking the wait look is never scheduled (or not in time)
resulting in such kind of error. Maybe there is something else wrong in
the code which stalls the CPU in this case, (hence the first message).

Enlarging the timeout to 1s in this case is fine, but maybe there is
still something wrong aside.

> ** ID_531 main/smrcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P116/2:b..l
> (detected by 1, t=2102 jiffies, g=12653, q=518)
> task:irq/31-arm-irq1 state:D stack: 0 pid: 116 ppid: 2 flags:0x00000000
> [<8064b97f>] (__schedule) from [<8064bb01>] (schedule+0x8d/0xc2)
> [<8064bb01>] (schedule) from [<8064fa65>] (schedule_timeout+0x6d/0xa0)
> [<8064fa65>] (schedule_timeout) from [<804ba353>] (fsl_ifc_run_command+0x6f/0x178)
> [<804ba353>] (fsl_ifc_run_command) from [<804ba72f>] (fsl_ifc_cmdfunc+0x203/0x2b8)
> [<804ba72f>] (fsl_ifc_cmdfunc) from [<804b135f>] (nand_status_op+0xaf/0xe0)
> [<804b135f>] (nand_status_op) from [<804b13b3>] (nand_check_wp+0x23/0x48)
> [<804b13b3>] (nand_check_wp) from [<804b231d>] (nand_do_write_ops+0x99/0x2b8)
> [<804b231d>] (nand_do_write_ops) from [<804b5355>] (nand_write_oob+0x3b/0x4a)
> [<804b5355>] (nand_write_oob) from [<804a3585>] (mtd_write+0x41/0x5c)
> [<804a3585>] (mtd_write) from [<804c1d47>] (ubi_io_write+0x17f/0x22c)
> [<804c1d47>] (ubi_io_write) from [<804c047b>] (ubi_eba_write_leb+0x5b/0x1d0)
> [<804c047b>] (ubi_eba_write_leb) from [<804bf62d>] (ubi_leb_write+0x75/0x90)
> [<804bf62d>] (ubi_leb_write) from [<803745b7>] (ubifs_leb_write+0x4b/0x8c)
> [<803745b7>] (ubifs_leb_write) from [<80374bbb>] (ubifs_wbuf_sync_nolock+0x10f/0x1a4)
> [<80374bbb>] (ubifs_wbuf_sync_nolock) from [<8036c6dd>] (ubifs_jnl_update+0x1e9/0x36c)
> [<8036c6dd>] (ubifs_jnl_update) from [<80370933>] (ubifs_create+0xb3/0x130)
> [<80370933>] (ubifs_create) from [<802cf0c7>] (lookup_open+0x173/0x1c4)
> [<802cf0c7>] (lookup_open) from [<802cf8a3>] (open_last_lookups+0xd7/0x16c)
> [<802cf8a3>] (open_last_lookups) from [<802d08e5>] (path_openat+0x91/0x104)
> [<802d08e5>] (path_openat) from [<802d0989>] (do_filp_open+0x31/0x74)
> [<802d0989>] (do_filp_open) from [<802c4fb3>] (file_open_name+0x33/0x48)
> [<802c4fb3>] (file_open_name) from [<802c4fe9>] (filp_open+0x21/0x2e)
> [<802c4fe9>] (filp_open) from [<80490bd3>] (irq1_handler+0x53/0xa4)
> [<80490bd3>] (irq1_handler) from [<80247dd7>] (irq_forced_thread_fn+0x1f/0x4c)
> [<80247dd7>] (irq_forced_thread_fn) from [<80247cd9>] (irq_thread+0x89/0x114)
> [<80247cd9>] (irq_thread) from [<8022ca67>] (kthread+0xcf/0xe4)
> [<8022ca67>] (kthread) from [<80200149>] (ret_from_fork+0x11/0x28)
> Exception stack(0x822bbfb0 to 0x822bbff8)
> bfa0: 00000000 00000000 00000000 00000000
> bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
>
> fsl,ifc-nand 7e800000.nand: Controller is not responding
> ID_531 main/smp_fsm.c:1884 <inrcu: rcu_preempt detected stalls on CPUs:
> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P116/2:b..l
> (detected by 1, t=2102 jiffies, g=7729, q=754)
> task:irq/31-arm-irq1 state:D stack: 0 pid: 116 ppid: 2 flags:0x00000000
> [<8064b97f>] (__schedule) from [<8064bb01>] (schedule+0x8d/0xc2)
> [<8064bb01>] (schedule) from [<8064dacd>] (rt_mutex_slowlock_block.con)
> [<8064dacd>] (rt_mutex_slowlock_block.constprop.0) from [<8064db57>]
> [<8064db57>] (__rt_mutex_slowlock.constprop.0) from [<8064dbf7>]
> [<8064dbf7>] (rt_mutex_slowlock.constprop.0) from [<804b2047>]
> [<804b2047>] (nand_get_device) from [<804b5335>] (nand_write_oob+0x1b/0x4a)
> [<804b5335>] (nand_write_oob) from [<804a3585>] (mtd_write+0x41/0x5c)
> [<804a3585>] (mtd_write) from [<804c1d47>] (ubi_io_write+0x17f/0x22c)
> [<804c1d47>] (ubi_io_write) from [<804c047b>] (ubi_eba_write_leb+0x5b/0x1d0)
> [<804c047b>] (ubi_eba_write_leb) from [<804bf62d>] (ubi_leb_write+0x75/0x90)
> [<804bf62d>] (ubi_leb_write) from [<803745b7>] (ubifs_leb_write+0x4b/0x8c)
> [<803745b7>] (ubifs_leb_write) from [<80374bbb>] (ubifs_wbuf_sync_nolock+0x10f/0x1a4)
> [<80374bbb>] (ubifs_wbuf_sync_nolock) from [<8036c6dd>] (ubifs_jnl_update+0x1e9/0x36c)

You can trim down the traces to only show the interesting part.

Here you need a Fixes: and Cc: stable tag.

> Signed-off-by: Ronald Monthero <debug.penguin32@xxxxxxxxx>
> ---
> drivers/mtd/nand/raw/fsl_ifc_nand.c | 2 +-
> drivers/mtd/nand/raw/nand_base.c | 5 ++++-
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/mtd/nand/raw/fsl_ifc_nand.c b/drivers/mtd/nand/raw/fsl_ifc_nand.c
> index 20bb1e0cb5eb..42f8ea46b6a8 100644
> --- a/drivers/mtd/nand/raw/fsl_ifc_nand.c
> +++ b/drivers/mtd/nand/raw/fsl_ifc_nand.c
> @@ -21,7 +21,7 @@
>
> #define ERR_BYTE 0xFF /* Value returned for read
> bytes when read failed */
> -#define IFC_TIMEOUT_MSECS 500 /* Maximum number of mSecs to wait
> +#define IFC_TIMEOUT_MSECS 1000 /* Maximum number of mSecs to wait
> for IFC NAND Machine */
>
> struct fsl_ifc_ctrl;
> diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
> index 9e24bedffd89..05b52ed41f4c 100644
> --- a/drivers/mtd/nand/raw/nand_base.c
> +++ b/drivers/mtd/nand/raw/nand_base.c
> @@ -363,8 +363,11 @@ static int nand_check_wp(struct nand_chip *chip)
> int ret;
>
> /* Broken xD cards report WP despite being writable */
> - if (chip->options & NAND_BROKEN_XD)
> + if (chip->options & NAND_BROKEN_XD) {
> + pr_info("nand_chip->options indicates NAND_BROKEN_XD %d\n",
> + (chip->options & NAND_BROKEN_XD));
> return 0;
> + }

This is an unrelated debug message and should be dropped.

>
> /* Check the WP bit */
> ret = nand_status_op(chip, &status);

Thanks,
Miquèl