Re: [PATCH v3 09/14] remoteproc: Deal with synchronisation when crashing

From: Mathieu Poirier
Date: Thu Apr 30 2020 - 16:11:16 EST


On Wed, Apr 29, 2020 at 09:44:02AM +0200, Arnaud POULIQUEN wrote:
> Hi Mathieu,
>
> On 4/24/20 10:01 PM, Mathieu Poirier wrote:
> > Refactor function rproc_trigger_recovery() in order to avoid
> > reloading the firmware image when synchronising with a remote
> > processor rather than booting it. Also part of the process,
> > properly set the synchronisation flag in order to properly
> > recover the system.
> >
> > Signed-off-by: Mathieu Poirier <mathieu.poirier@xxxxxxxxxx>
> > ---
> > drivers/remoteproc/remoteproc_core.c | 23 ++++++++++++++------
> > drivers/remoteproc/remoteproc_internal.h | 27 ++++++++++++++++++++++++
> > 2 files changed, 43 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > index ef88d3e84bfb..3a84a38ba37b 100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -1697,7 +1697,7 @@ static void rproc_coredump(struct rproc *rproc)
> > */
> > int rproc_trigger_recovery(struct rproc *rproc)
> > {
> > - const struct firmware *firmware_p;
> > + const struct firmware *firmware_p = NULL;
> > struct device *dev = &rproc->dev;
> > int ret;
> >
> > @@ -1718,14 +1718,16 @@ int rproc_trigger_recovery(struct rproc *rproc)
> > /* generate coredump */
> > rproc_coredump(rproc);
> >
> > - /* load firmware */
> > - ret = request_firmware(&firmware_p, rproc->firmware, dev);
> > - if (ret < 0) {
> > - dev_err(dev, "request_firmware failed: %d\n", ret);
> > - goto unlock_mutex;
> > + /* load firmware if need be */
> > + if (!rproc_needs_syncing(rproc)) {
> > + ret = request_firmware(&firmware_p, rproc->firmware, dev);
> > + if (ret < 0) {
> > + dev_err(dev, "request_firmware failed: %d\n", ret);
> > + goto unlock_mutex;
> > + }
>
> If we started in syncing mode then rproc->firmware is NULL, and
> rproc_set_sync_flag(rproc, RPROC_SYNC_STATE_CRASHED) can make rproc_needs_syncing(rproc)
> return false.

You are correct, I will add an additional check in rproc_set_state_machine() to
prevent a situation where rproc_alloc() has been called without ops and any
of the synchronisation flags is set to false.

It is also possible that someone would call rproc_alloc() without ops and
then not call rproc_set_state_machine(), in which case both ops and sync_ops
would be NULL. rproc_add() is probably the best place to catch such a
condition.
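
Something along these lines, again as a userspace sketch under assumptions:
the trimmed-down structs and the sketch_add_check() helper are hypothetical,
and where exactly the test would sit inside rproc_add() is left open.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures, not the real definitions. */
struct rproc_ops {
	int (*start)(void *priv);
};

struct rproc {
	const struct rproc_ops *ops;      /* set by rproc_alloc(), may be NULL */
	const struct rproc_ops *sync_ops; /* set by rproc_set_state_machine() */
};

/*
 * Hypothetical sanity check for registration time: refuse a remote processor
 * that the core can neither boot (no ops) nor synchronise with (no sync_ops).
 */
static int sketch_add_check(const struct rproc *rproc)
{
	if (!rproc)
		return -EINVAL;

	if (!rproc->ops && !rproc->sync_ops)
		return -EINVAL;

	return 0;
}
```

Either set of operations on its own is enough to pass; only the case where
both are missing is rejected.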


> In this case we fail the recovery and leave the remote processor in RPROC_STOP state.
> As you proposed in Loic's RFC [1], what about adding a more explicit message to inform that the recovery
> failed?

Right, that's a different problem.

>
> [1] https://lkml.org/lkml/2020/3/11/402
>
> Regards,
> Arnaud
> > }
> >
> > - /* boot the remote processor up again */
> > + /* boot up or synchronise with the remote processor again */
> > ret = rproc_start(rproc, firmware_p);
> >
> > release_firmware(firmware_p);
> > @@ -1761,6 +1763,13 @@ static void rproc_crash_handler_work(struct work_struct *work)
> > dev_err(dev, "handling crash #%u in %s\n", ++rproc->crash_cnt,
> > rproc->name);
> >
> > + /*
> > + * The remote processor has crashed - tell the core what operation
> > + * to use from here on, i.e. whether an external entity will reboot
> > + * the MCU or it is now the remoteproc core's responsibility.
> > + */
> > + rproc_set_sync_flag(rproc, RPROC_SYNC_STATE_CRASHED);
> > +
> > mutex_unlock(&rproc->lock);
> >
> > if (!rproc->recovery_disabled)
> > diff --git a/drivers/remoteproc/remoteproc_internal.h b/drivers/remoteproc/remoteproc_internal.h
> > index 3985c084b184..61500981155c 100644
> > --- a/drivers/remoteproc/remoteproc_internal.h
> > +++ b/drivers/remoteproc/remoteproc_internal.h
> > @@ -24,6 +24,33 @@ struct rproc_debug_trace {
> > struct rproc_mem_entry trace_mem;
> > };
> >
> > +/*
> > + * enum rproc_sync_states - remote processor sync states
> > + *
> > + * @RPROC_SYNC_STATE_CRASHED state to use after the remote processor
> > + * has crashed but has not been recovered by
> > + * the remoteproc core yet.
> > + *
> > + * Keeping these separate from the enum rproc_state in order to avoid
> > + * introducing coupling between the state of the MCU and the synchronisation
> > + * operation to use.
> > + */
> > +enum rproc_sync_states {
> > + RPROC_SYNC_STATE_CRASHED,
> > +};
> > +
> > +static inline void rproc_set_sync_flag(struct rproc *rproc,
> > + enum rproc_sync_states state)
> > +{
> > + switch (state) {
> > + case RPROC_SYNC_STATE_CRASHED:
> > + rproc->sync_with_rproc = rproc->sync_flags.after_crash;
> > + break;
> > + default:
> > + break;
> > + }
> > +}
> > +
> > /* from remoteproc_core.c */
> > void rproc_release(struct kref *kref);
> > irqreturn_t rproc_vq_interrupt(struct rproc *rproc, int vq_id);
> >