Re: [RFC PATCH] mtd: rawnand: use mutex to protect access while in suspend

From: Sean Nyekjaer
Date: Mon Oct 04 2021 - 04:55:15 EST


On Mon, Oct 04, 2021 at 10:41:47AM +0200, Boris Brezillon wrote:
> On Mon, 4 Oct 2021 08:56:09 +0200
> Sean Nyekjaer <sean@xxxxxxxxxx> wrote:
>
> > This will prevent nand_get_device() from returning -EBUSY.
> > It will force mtd_write()/mtd_read() to wait for the nand_resume() to unlock
> > access to the mtd device.
> >
> > Then we avoid -EBUSY is returned to ubifsi via mtd_write()/mtd_read(),
> > that will in turn hard error on every error returened.
> > We have seen during ubifs tries to call mtd_write before the mtd device
> > is resumed.
>
> I think the problem is here. Why would UBIFS/UBI try to write something
> to a device that's not resumed yet (or has been suspended already, if
> you hit this in the suspend path).
>
> >
> > Exec_op[0] speed things up, so we see this race when the device is
> > resuming. But it's actually "mtd: rawnand: Simplify the locking" that
> > allows it to return -EBUSY, before that commit it would have waited for
> > the mtd device to resume.
>
> Uh, wait. If nand_resume() was called before any writes/reads this
> wouldn't happen. IMHO, the problem is not that we return -EBUSY without
> blocking, the problem is that someone issues a write/read before calling
> mtd_resume().
>

The commit msg from "mtd: rawnand: Simplify the locking" states this clearly.

"""
Last important change to mention: we now return -EBUSY when someone
tries to access a device that as been suspended, and propagate this
error to the upper layer.
"""

IMHO "mtd: rawnand: Simplify the locking" should never had been merged
before the upper layers was fixed to handle -EBUSY. ;)
Which they still not are...

Yes, guess there is data in the ubifs queue when going into suspend,
then the ubifs kthread is starting writing when the cpu resumes.
Before mtd_resume() and other pm_resume() handles are called.

How would you have ubifs to wait for mtd_resume()? If you don't like
this mutex solution?

> >
> > Tested on a iMX6ULL.
> >
> > [0]:
> > ef347c0cfd61 ("mtd: rawnand: gpmi: Implement exec_op")
> >
> > Fixes: 013e6292aaf5 ("mtd: rawnand: Simplify the locking")
> > Signed-off-by: Sean Nyekjaer <sean@xxxxxxxxxx>
> > ---
> >
> > I did this a RFC as we probably will need to remove the suspended
> > variable as it's kinda made obsolute by this change.
> > Should we introduce a new mutex? Or maybe a spin_lock?
> >
> > drivers/mtd/nand/raw/nand_base.c | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
> > index 3d6c6e880520..0ea343404cac 100644
> > --- a/drivers/mtd/nand/raw/nand_base.c
> > +++ b/drivers/mtd/nand/raw/nand_base.c
> > @@ -4567,7 +4567,6 @@ static int nand_suspend(struct mtd_info *mtd)
> > ret = chip->ops.suspend(chip);
> > if (!ret)
> > chip->suspended = 1;
> > - mutex_unlock(&chip->lock);
>
> Hm, I'm not sure keeping the lock when you're in a suspended state
> is a good idea. It just papers over another bug IMO (see above).
>
> >
> > return ret;
> > }
> > @@ -4580,7 +4579,6 @@ static void nand_resume(struct mtd_info *mtd)
> > {
> > struct nand_chip *chip = mtd_to_nand(mtd);
> >
> > - mutex_lock(&chip->lock);
> > if (chip->suspended) {
> > if (chip->ops.resume)
> > chip->ops.resume(chip);
>