Re: [PATCH] mtd:nand: Increment IFC_TIMEOUT_MSECS for nand controller response

From: Ronald Monthero
Date: Mon Nov 13 2023 - 10:17:41 EST


On Mon, Nov 13, 2023 at 11:44 PM Miquel Raynal
<miquel.raynal@xxxxxxxxxxx> wrote:
>
> Hi Ronald,
>
> Thanks for the patch, here are a couple of comments I'd like you to
> address before taking the fix.
>
> debug.penguin32@xxxxxxxxx wrote on Mon, 13 Nov 2023 23:16:28 +1000:
>
> The title prefix needs to be aligned with today's policy, you can check
> it with git log --oneline -- <your file>.
>
> > The nand controller not responding scenario occurs causing blocked tasks
> > and rcu_prempt warnings of stall on cpus. Incrementing the
> > IFC_TIMEOUT_MSECS appears to solve the nand controller not responding
> > issue.
>
> I would rephrase a bit this paragraph with more confidence. Under heavy
> load it is likely that the controller is done with its own task but the
> thread unlocking the wait look is never scheduled (or not in time)
> resulting in such kind of error. Maybe there is something else wrong in
> the code which stalls the CPU in this case, (hence the first message).
>
> Enlarging the timeout to 1s in this case is fine, but maybe there is
> still something wrong aside.
>
> > ** ID_531 main/smrcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P116/2:b..l
> > (detected by 1, t=2102 jiffies, g=12653, q=518)
> > task:irq/31-arm-irq1 state:D stack: 0 pid: 116 ppid: 2 flags:0x00000000
> > [<8064b97f>] (__schedule) from [<8064bb01>] (schedule+0x8d/0xc2)
> > [<8064bb01>] (schedule) from [<8064fa65>] (schedule_timeout+0x6d/0xa0)
> > [<8064fa65>] (schedule_timeout) from [<804ba353>] (fsl_ifc_run_command+0x6f/0x178)
> > [<804ba353>] (fsl_ifc_run_command) from [<804ba72f>] (fsl_ifc_cmdfunc+0x203/0x2b8)
> > [<804ba72f>] (fsl_ifc_cmdfunc) from [<804b135f>] (nand_status_op+0xaf/0xe0)
> > [<804b135f>] (nand_status_op) from [<804b13b3>] (nand_check_wp+0x23/0x48)
> > [<804b13b3>] (nand_check_wp) from [<804b231d>] (nand_do_write_ops+0x99/0x2b8)
> > [<804b231d>] (nand_do_write_ops) from [<804b5355>] (nand_write_oob+0x3b/0x4a)
> > [<804b5355>] (nand_write_oob) from [<804a3585>] (mtd_write+0x41/0x5c)
> > [<804a3585>] (mtd_write) from [<804c1d47>] (ubi_io_write+0x17f/0x22c)
> > [<804c1d47>] (ubi_io_write) from [<804c047b>] (ubi_eba_write_leb+0x5b/0x1d0)
> > [<804c047b>] (ubi_eba_write_leb) from [<804bf62d>] (ubi_leb_write+0x75/0x90)
> > [<804bf62d>] (ubi_leb_write) from [<803745b7>] (ubifs_leb_write+0x4b/0x8c)
> > [<803745b7>] (ubifs_leb_write) from [<80374bbb>] (ubifs_wbuf_sync_nolock+0x10f/0x1a4)
> > [<80374bbb>] (ubifs_wbuf_sync_nolock) from [<8036c6dd>] (ubifs_jnl_update+0x1e9/0x36c)
> > [<8036c6dd>] (ubifs_jnl_update) from [<80370933>] (ubifs_create+0xb3/0x130)
> > [<80370933>] (ubifs_create) from [<802cf0c7>] (lookup_open+0x173/0x1c4)
> > [<802cf0c7>] (lookup_open) from [<802cf8a3>] (open_last_lookups+0xd7/0x16c)
> > [<802cf8a3>] (open_last_lookups) from [<802d08e5>] (path_openat+0x91/0x104)
> > [<802d08e5>] (path_openat) from [<802d0989>] (do_filp_open+0x31/0x74)
> > [<802d0989>] (do_filp_open) from [<802c4fb3>] (file_open_name+0x33/0x48)
> > [<802c4fb3>] (file_open_name) from [<802c4fe9>] (filp_open+0x21/0x2e)
> > [<802c4fe9>] (filp_open) from [<80490bd3>] (irq1_handler+0x53/0xa4)
> > [<80490bd3>] (irq1_handler) from [<80247dd7>] (irq_forced_thread_fn+0x1f/0x4c)
> > [<80247dd7>] (irq_forced_thread_fn) from [<80247cd9>] (irq_thread+0x89/0x114)
> > [<80247cd9>] (irq_thread) from [<8022ca67>] (kthread+0xcf/0xe4)
> > [<8022ca67>] (kthread) from [<80200149>] (ret_from_fork+0x11/0x28)
> > Exception stack(0x822bbfb0 to 0x822bbff8)
> > bfa0: 00000000 00000000 00000000 00000000
> > bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
> >
> > fsl,ifc-nand 7e800000.nand: Controller is not responding
> > ID_531 main/smp_fsm.c:1884 <inrcu: rcu_preempt detected stalls on CPUs:
> > rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P116/2:b..l
> > (detected by 1, t=2102 jiffies, g=7729, q=754)
> > task:irq/31-arm-irq1 state:D stack: 0 pid: 116 ppid: 2 flags:0x00000000
> > [<8064b97f>] (__schedule) from [<8064bb01>] (schedule+0x8d/0xc2)
> > [<8064bb01>] (schedule) from [<8064dacd>] (rt_mutex_slowlock_block.con)
> > [<8064dacd>] (rt_mutex_slowlock_block.constprop.0) from [<8064db57>]
> > [<8064db57>] (__rt_mutex_slowlock.constprop.0) from [<8064dbf7>]
> > [<8064dbf7>] (rt_mutex_slowlock.constprop.0) from [<804b2047>]
> > [<804b2047>] (nand_get_device) from [<804b5335>] (nand_write_oob+0x1b/0x4a)
> > [<804b5335>] (nand_write_oob) from [<804a3585>] (mtd_write+0x41/0x5c)
> > [<804a3585>] (mtd_write) from [<804c1d47>] (ubi_io_write+0x17f/0x22c)
> > [<804c1d47>] (ubi_io_write) from [<804c047b>] (ubi_eba_write_leb+0x5b/0x1d0)
> > [<804c047b>] (ubi_eba_write_leb) from [<804bf62d>] (ubi_leb_write+0x75/0x90)
> > [<804bf62d>] (ubi_leb_write) from [<803745b7>] (ubifs_leb_write+0x4b/0x8c)
> > [<803745b7>] (ubifs_leb_write) from [<80374bbb>] (ubifs_wbuf_sync_nolock+0x10f/0x1a4)
> > [<80374bbb>] (ubifs_wbuf_sync_nolock) from [<8036c6dd>] (ubifs_jnl_update+0x1e9/0x36c)
>
> You can trim down the traces to only show the interesting part.
>
> Here you need a Fixes: and Cc: stable tag.
>
> > Signed-off-by: Ronald Monthero <debug.penguin32@xxxxxxxxx>
> > ---
> > drivers/mtd/nand/raw/fsl_ifc_nand.c | 2 +-
> > drivers/mtd/nand/raw/nand_base.c | 5 ++++-
> > 2 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/mtd/nand/raw/fsl_ifc_nand.c b/drivers/mtd/nand/raw/fsl_ifc_nand.c
> > index 20bb1e0cb5eb..42f8ea46b6a8 100644
> > --- a/drivers/mtd/nand/raw/fsl_ifc_nand.c
> > +++ b/drivers/mtd/nand/raw/fsl_ifc_nand.c
> > @@ -21,7 +21,7 @@
> >
> > #define ERR_BYTE 0xFF /* Value returned for read
> > bytes when read failed */
> > -#define IFC_TIMEOUT_MSECS 500 /* Maximum number of mSecs to wait
> > +#define IFC_TIMEOUT_MSECS 1000 /* Maximum number of mSecs to wait
> > for IFC NAND Machine */
> >
> > struct fsl_ifc_ctrl;
> > diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
> > index 9e24bedffd89..05b52ed41f4c 100644
> > --- a/drivers/mtd/nand/raw/nand_base.c
> > +++ b/drivers/mtd/nand/raw/nand_base.c
> > @@ -363,8 +363,11 @@ static int nand_check_wp(struct nand_chip *chip)
> > int ret;
> >
> > /* Broken xD cards report WP despite being writable */
> > - if (chip->options & NAND_BROKEN_XD)
> > + if (chip->options & NAND_BROKEN_XD) {
> > + pr_info("nand_chip->options indicates NAND_BROKEN_XD %d\n",
> > + (chip->options & NAND_BROKEN_XD));
> > return 0;
> > + }
>
> This is an unrelated debug message and should be dropped.
>
> >
> > /* Check the WP bit */
> > ret = nand_status_op(chip, &status);
>
> Thanks,
> Miquèl

Hi Miquel,
Thanks for the review and suggestions, I will send out a v2 patch shortly.
BR,
Ronald