Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

From: Kornel Dulęba
Date: Wed Nov 01 2023 - 07:31:21 EST


On Mon, Oct 30, 2023 at 8:31 PM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>
> On 27/10/23 17:56, Kornel Dulęba wrote:
> > This fix addresses a stale task completion event issued right after the
> > CQE recovery. As it's a hardware issue the fix is done in form of a
> > quirk.
> >
> > When error interrupt is received the driver runs recovery logic is run.
> > It halts the controller, clears all pending tasks, and then re-enables
> > it. On some platforms a stale task completion event is observed,
> > regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
> >
> > This results in either:
> > a) Spurious TC completion event for an empty slot.
> > b) Corrupted data being passed up the stack, as a result of premature
> > completion for a newly added task.
> >
> > To fix that re-enable the controller, clear task completion bits,
> > interrupt status register and halt it again.
> > This is done at the end of the recovery process, right before interrupts
> > are re-enabled.
> >
> > Signed-off-by: Kornel Dulęba <korneld@xxxxxxxxxxxx>
> > ---
> > drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
> > drivers/mmc/host/cqhci.h | 1 +
> > 2 files changed, 43 insertions(+)
> >
> > diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> > index b3d7d6d8d654..e534222df90c 100644
> > --- a/drivers/mmc/host/cqhci-core.c
> > +++ b/drivers/mmc/host/cqhci-core.c
> > @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
> > /* CQHCI could be expected to clear it's internal state pretty quickly */
> > #define CQHCI_CLEAR_TIMEOUT 20
> >
> > +/*
> > + * During CQE recovery all pending tasks are cleared from the
> > + * controller and its state is being reset.
> > + * On some platforms the controller sets a task completion bit for
> > + * a stale(previously cleared) task right after being re-enabled.
> > + * This results in a spurious interrupt at best and corrupted data
> > + * being passed up the stack at worst. The latter happens when
> > + * the driver enqueues a new request on the problematic task slot
> > + * before the "spurious" task completion interrupt is handled.
> > + * To fix it:
> > + * 1. Re-enable controller by clearing the halt flag.
> > + * 2. Clear interrupt status and the task completion register.
> > + * 3. Halt the controller again to be consistent with quirkless logic.
> > + *
> > + * This assumes that there are no pending requests on the queue.
> > + */
> > +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
> > +{
> > + u32 reg;
> > +
> > + WARN_ON(cq_host->qcnt);
> > + cqhci_writel(cq_host, 0, CQHCI_CTL);
> > + if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
> > + pr_err("%s: cqhci: CQE failed to exit halt state\n",
> > + mmc_hostname(cq_host->mmc));
> > + }
> > + reg = cqhci_readl(cq_host, CQHCI_TCN);
> > + cqhci_writel(cq_host, reg, CQHCI_TCN);
> > + reg = cqhci_readl(cq_host, CQHCI_IS);
> > + cqhci_writel(cq_host, reg, CQHCI_IS);
> > +
> > + /*
> > + * Halt the controller again.
> > + * This is only needed so that we're consistent across quirk
> > + * and quirkless logic.
> > + */
> > + cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
> > +}
>
> Thanks a lot for tracking this down!
>
> It could be that the "un-halt" starts a task, so it would be
> better to force the "clear" to work if possible, which
> should be the case if CQE is disabled.
>
> Would you mind trying the code below? Note the increased
> CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
> when CQE has not halted.

Sure, I'll try it out tomorrow, as I don't have access to the DUT today.
BTW do we even need to halt the controller in the recovery_finish logic?
It has already been halted in recovery_start, I guess it could be
there in case the recovery_start halt didn't work.
But in that case shouldn't we do this disable/re-enable dance in recovery_start?

>
>
> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> index b3d7d6d8d654..534c13069833 100644
> --- a/drivers/mmc/host/cqhci-core.c
> +++ b/drivers/mmc/host/cqhci-core.c
> @@ -987,7 +987,7 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
> * layers will need to send a STOP command), so we set the timeout based on a
> * generous command timeout.
> */
> -#define CQHCI_START_HALT_TIMEOUT 5
> +#define CQHCI_START_HALT_TIMEOUT 500
>
> static void cqhci_recovery_start(struct mmc_host *mmc)
> {
> @@ -1075,28 +1075,27 @@ static void cqhci_recovery_finish(struct mmc_host *mmc)
>
> ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
>
> - if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> - ok = false;
> -
> /*
> * The specification contradicts itself, by saying that tasks cannot be
> * cleared if CQHCI does not halt, but if CQHCI does not halt, it should
> * be disabled/re-enabled, but not to disable before clearing tasks.
> * Have a go anyway.
> */
> - if (!ok) {
> - pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
> - cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> - cqcfg &= ~CQHCI_ENABLE;
> - cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> - cqcfg |= CQHCI_ENABLE;
> - cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> - /* Be sure that there are no tasks */
> - ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
> - if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> - ok = false;
> - WARN_ON(!ok);
> - }
> + if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> + ok = false;
> +
> + cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> + cqcfg &= ~CQHCI_ENABLE;
> + cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> +
> + cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> + cqcfg |= CQHCI_ENABLE;
> + cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> +
> + cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
> +
> + if (!ok)
> + cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT);
>
> cqhci_recover_mrqs(cq_host);
>
>