Re: [PATCH] media: cedrus: Add watchdog for job completion

From: Jernej Škrabec
Date: Sat Feb 05 2022 - 03:43:26 EST


Dne sobota, 05. februar 2022 ob 08:40:21 CET je Jernej Škrabec napisal(a):
> Dne petek, 04. februar 2022 ob 10:09:56 CET je Paul Kocialkowski napisal(a):
> > Hi Jernej,
> >
> > On Tue 01 Feb 22, 19:33, Jernej Skrabec wrote:
> > > Currently, if job is not completed for whatever reason, userspace
> > > application can hang on ioctl and thus become unkillable.
> > >
> > > In order to prevent that, implement watchdog, which will complete job
> > > after 2 seconds with error state.
> > >
> > > Concept is borrowed from hantro driver.
> >
> > Good idea to implement a watchdog here, thanks!
> > See comments below.
> >
> > > Signed-off-by: Jernej Skrabec <jernej.skrabec@xxxxxxxxx>
> > > ---
> > >
> > > drivers/staging/media/sunxi/cedrus/cedrus.c | 2 ++
> > > drivers/staging/media/sunxi/cedrus/cedrus.h | 3 +++
> > > .../staging/media/sunxi/cedrus/cedrus_dec.c | 4 +++
> > > .../staging/media/sunxi/cedrus/cedrus_hw.c | 25 +++++++++++++++++++
> > > .../staging/media/sunxi/cedrus/cedrus_hw.h | 2 ++
> > > 5 files changed, 36 insertions(+)
> > >
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.c
> > > b/drivers/staging/media/sunxi/cedrus/cedrus.c index
> > > 4a4b714b0f26..68b3dcdb5df3 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus.c
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus.c
> > > @@ -439,6 +439,8 @@ static int cedrus_probe(struct platform_device
> > > *pdev)
> > >
> > > mutex_init(&dev->dev_mutex);
> > >
> > > + INIT_DELAYED_WORK(&dev->watchdog_work, cedrus_watchdog);
> > > +
> > >
> > > ret = v4l2_device_register(&pdev->dev, &dev->v4l2_dev);
> > > if (ret) {
> > >
> > > dev_err(&pdev->dev, "Failed to register V4L2
>
> device\n");
>
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > b/drivers/staging/media/sunxi/cedrus/cedrus.h index
> > > c345f2984041..3bc094eb497f 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > @@ -24,6 +24,7 @@
> > >
> > > #include <linux/iopoll.h>
> > > #include <linux/platform_device.h>
> > >
> > > +#include <linux/workqueue.h>
> > >
> > > #define CEDRUS_NAME "cedrus"
> > >
> > > @@ -194,6 +195,8 @@ struct cedrus_dev {
> > >
> > > struct reset_control *rstc;
> > >
> > > unsigned int capabilities;
> > >
> > > +
> > > + struct delayed_work watchdog_work;
> > >
> > > };
> > >
> > > extern struct cedrus_dec_ops cedrus_dec_ops_mpeg2;
> > >
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_dec.c
> > > b/drivers/staging/media/sunxi/cedrus/cedrus_dec.c index
> > > a16c1422558f..9c7200299465 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus_dec.c
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_dec.c
> > > @@ -97,4 +97,8 @@ void cedrus_device_run(void *priv)
> > >
> > > v4l2_ctrl_request_complete(src_req, &ctx->hdl);
> > >
> > > dev->dec_ops[ctx->current_codec]->trigger(ctx);
> > >
> > > +
> > > + /* Start the watchdog timer. */
> > > + schedule_delayed_work(&dev->watchdog_work,
> > > + msecs_to_jiffies(2000));
> > >
> > > }
> > >
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
> > > b/drivers/staging/media/sunxi/cedrus/cedrus_hw.c index
> > > 2d7663726467..a6470a89851e 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_hw.c
> > > @@ -118,6 +118,13 @@ static irqreturn_t cedrus_irq(int irq, void *data)
> > >
> > > enum vb2_buffer_state state;
> > > enum cedrus_irq_status status;
> > >
> > > + /*
> > > + * If cancel_delayed_work returns false it means watchdog already
> > > + * executed and finished the job.
> > > + */
> > > + if (!cancel_delayed_work(&dev->watchdog_work))
> > > + return IRQ_HANDLED;
> > > +
> > >
> > > ctx = v4l2_m2m_get_curr_priv(dev->m2m_dev);
> > > if (!ctx) {
> > >
> > > v4l2_err(&dev->v4l2_dev,
> > >
> > > @@ -143,6 +150,24 @@ static irqreturn_t cedrus_irq(int irq, void *data)
> > >
> > > return IRQ_HANDLED;
> > >
> > > }
> > >
> > > +void cedrus_watchdog(struct work_struct *work)
> > > +{
> > > + struct cedrus_dev *dev;
> > > + struct cedrus_ctx *ctx;
> > > +
> > > + dev = container_of(to_delayed_work(work),
> > > + struct cedrus_dev, watchdog_work);
> > > +
> > > + ctx = v4l2_m2m_get_curr_priv(dev->m2m_dev);
> > > + if (!ctx)
> > > + return;
> > > +
> > > + v4l2_err(&dev->v4l2_dev, "frame processing timed out!\n");
> > > + reset_control_reset(dev->rstc);
> >
> > I don't think playing with the reset is the right approach here.
> > First we don't really know if the reset is shared or not, so this might
> > have no effect.
>
> AFAIK only few reset lines are shared in all Allwinner SoC, never for Cedrus
> and even then, this is considered as HW issue. So, I'm good with using
> reset line. This principle is also taken from Hantro driver.
>
> > Then even if it does, wouldn't this just reset the state of the
> > registers to an unconfigured state, making it impossible to decode any
> > future frame in the same context?
>
> Being stateless core, all context is held in auxiliary buffers, reference
> frames and controls, which are not reset with pulsing reset line, so no,
> state is not lost. Anyway, if decoding fails, you're generally screwed
> until next key frame. You'll have to deal with decoding issues/artefacts
> nevertheless.
> > Honestly I'm not sure what a good process would be to get back on track
> > here so I would be tempted to just do nothing an return errors.
> >
> > That's already better than being stuck.
>
> Doing nothing will solve only current job, but HW will still be stuck in
> decoding state. I doubt reprogramming registers and triggering new decoding
> will actually do anything.
>
> I'll check BSP lib sources again. Maybe selecting non-existing decoding mode
> would reset the core. That is already suggested as good thing to do in
> order to put core in low power mode.

BSP kernel driver also pulses reset line:
https://github.com/orangepi-xunlong/linux-orangepi/blob/orange-pi-4.9-sun50iw9/drivers/media/cedar-ve/cedar_ve.c#L848-L851

I still think this is the way to go.

Best regards,
Jernej

>
> IMO we have to do something. Doing nothing will probably just lock up the
> core until next reboot or maybe until trying different decoding mode.
>
> Anyway, I have to find another way to cause decoding job to time out.
> Currently I'm doing this with IOMMU on H6, but that brings down several
> other things, which requires reboot anyway.
>
> Best regards,
> Jernej
>
> > Paul
> >
> > > + v4l2_m2m_buf_done_and_job_finish(ctx->dev->m2m_dev, ctx-
> >
> >fh.m2m_ctx,
> >
> > > + VB2_BUF_STATE_ERROR);
> > > +}
> > > +
> > >
> > > int cedrus_hw_suspend(struct device *device)
> > > {
> > >
> > > struct cedrus_dev *dev = dev_get_drvdata(device);
> > >
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_hw.h
> > > b/drivers/staging/media/sunxi/cedrus/cedrus_hw.h index
> > > 45f641f0bfa2..7c92f00e36da 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus_hw.h
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_hw.h
> > > @@ -28,4 +28,6 @@ int cedrus_hw_resume(struct device *device);
> > >
> > > int cedrus_hw_probe(struct cedrus_dev *dev);
> > > void cedrus_hw_remove(struct cedrus_dev *dev);
> > >
> > > +void cedrus_watchdog(struct work_struct *work);
> > > +
> > >
> > > #endif