Re: [PATCH] spi: spi-geni-qcom: Fix NULL pointer access in geni_spi_isr

From: Doug Anderson
Date: Mon Dec 14 2020 - 19:33:36 EST


Hi,

On Fri, Dec 11, 2020 at 5:32 PM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
>
> Quoting Doug Anderson (2020-12-10 17:51:53)
> > Hi,
> >
> > On Thu, Dec 10, 2020 at 5:39 PM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
> > >
> > > Quoting Doug Anderson (2020-12-10 17:30:17)
> > > > On Thu, Dec 10, 2020 at 5:21 PM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > Yeah and so if it comes way later because it timed out then what's the
> > > > > point of calling synchronize_irq() again? To make the completion
> > > > > variable set when it won't be tested again until it is reinitialized?
> > > >
> > > > Presumably the idea is to try to recover to a somewhat usable state
> > > > again? We're not rebooting the machine so, even though this transfer
> > > > failed, we will undoubtedly do another transfer later. If that
> > > > "abort" interrupt comes way later while we're setting up the next
> > > > transfer we'll really confuse ourselves.
> > >
> > > The interrupt handler just sets a completion variable. What does that
> > > confuse?
> >
> > The interrupt handler sees a "DONE" interrupt. If we've made it far
> > enough into setting up the next transfer that "cur_xfer" has been set
> > then it might do more, no?
>
> I thought it saw a cancel/abort EN bit?
>
> if (m_irq & M_CMD_CANCEL_EN)
> complete(&mas->cancel_done);
> if (m_irq & M_CMD_ABORT_EN)
> complete(&mas->abort_done)
>
> and only a DONE bit if a transfer happened.

Ah, true. The crazy thing is that since we do abort / cancel with
commands we get them together with "done". That "done" could
potentially confuse the next transfer... In theory we could ignore
DONE if we see ABORT / CANCEL, but I've now spent a bunch of time on
this and I think the best thing is to just make sure we won't start
the next transfer if any IRQs are pending. I'll post patches...


> > > > I guess you could go the route of adding a synchronize_irq() at the
> > > > start of the next transfer, but I'd rather add the overhead in the
> > > > exceptional case (the timeout) than the normal case. In the normal
> > > > case we don't need to worry about random IRQs from the past transfer
> > > > suddenly showing up.
> > > >
> > >
> > > How does adding synchronize_irq() at the end guarantee that the abort is
> > > cleared out of the hardware though? It seems to assume that the abort is
> > > pending at the GIC when it could still be running through the hardware
> > > and not executed yet. It seems like a synchronize_irq() for that is
> > > wishful thinking that the irq is merely pending even though it timed
> > > out and possibly never ran. Maybe it's stuck in a write buffer in the
> > > CPU?
> >
> > I guess I'm asserting that if a full second passed (because we timed
> > out) and after that full second no interrupts are pending then the
> > interrupt will never come. That seems a reasonable assumption to me.
> > It seems hard to believe it'd be stuck in a write buffer for a full
> > second?
> >
>
> Ok, so if we don't expect an irq to come in why are we calling
> synchronize_irq()? I'm lost.

It turns out that synchronize_irq() doesn't do what I thought it did,
actually. :( Despite __synchronize_hardirq() talking about waiting
for "pending" interrupts, it actually passes in "IRQCHIP_STATE_ACTIVE"
and not "IRQCHIP_STATE_PENDING". So much for that.

...but, if it did, I guess my point (which no longer matters) was:

a) If you wait a second but don't wait for pending interrupts to be
done, interrupts might still come later if the CPU servicing
interrupts was blocked.

b) If you don't wait a second but wait for pending interrupts to be
done, interrupts might still come later because maybe the transaction
wasn't finished first.

c) If you wait a second (enough for the transaction to finish) and
then wait for pending interrupts (to handle ISR being blocked) then
you're good.

---

So I got tired of all this conjecture and decided to write some code.
I reproduced the problem with some test code that let me call
local_irq_disable() for a set amount of time based on sysfs.

In terminal 1:
while true; do ectool version > /dev/null; done

In terminal 2, disable interrupts on cpu0 for 2000 ms:
taskset -c 0 echo 2000 > /sys/module/spi_geni_qcom/parameters/doug_test

Of course, I got the timeout and the NULL dereference.


Then I could poke at all the corner cases. Posting patches for what I
think is the best solution...

-Doug