Re: [PATCH] spi: spi-geni-qcom: Fix NULL pointer access in geni_spi_isr

From: Doug Anderson
Date: Thu Dec 10 2020 - 18:09:54 EST


Hi,

On Thu, Dec 10, 2020 at 2:58 PM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
>
> Quoting Doug Anderson (2020-12-10 09:14:15)
> >
> > This is my untested belief of what's happening
> >
> > CPU0 CPU1
> > ---- ----
> > setup_fifo_xfer()
> > ...
> > geni_se_setup_m_cmd()
> > <hardware starts transfer>
> > <unrelated interrupt storm> spin_unlock_irq()
> > <continued interrupt storm> <time passes>
> > <continued interrupt storm> <transfer complets in hardware>
> > <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN>
> > <continued interrupt storm> <time passes>
> > <continued interrupt storm> handle_fifo_timeout()
> > <continued interrupt storm> spin_lock_irq()
> > <continued interrupt storm> mas->cur_xfer = NULL
> > <continued interrupt storm> geni_se_cancel_m_cmd()
> > <continued interrupt storm> spin_unlock_irq()
> > <continued interrupt storm> wait_for_completion_timeout() => timeout
> > <continued interrupt storm> spin_lock_irq()
> > <continued interrupt storm> geni_se_abort_m_cmd()
> > <continued interrupt storm> spin_unlock_irq()
> > <continued interrupt storm> wait_for_completion_timeout() => timeout
> > <interrupt storm ends>
> > geni_spi_isr()
> > spin_lock()
> > if (m_irq & M_RX_FIFO_WATERMARK_EN)
> > geni_spi_handle_rx()
> > mas->cur_xfer NULL derefrence
>
> Ok so the one line summary is "If geni_spi_isr() is sufficiently delayed
> then we may deref NULL in the irq handler because the handler tries to
> deref mas->cur_xfer after the timeout handling code has set it to NULL".

Sure.


> CPU0 CPU1
> ---- ----
> setup_fifo_xfer()
> ...
> geni_se_setup_m_cmd()
> <hardware starts transfer>
> unrelated_irq_handler() spin_unlock_irq()
> ...
> <transfer completes in hardware>
> <hardware sets M_RX_FIFO_WATERMARK_EN>
>
> handle_fifo_timeout()
> spin_lock_irq()
> mas->cur_xfer = NULL
> geni_se_cancel_m_cmd()
> spin_unlock_irq()
> wait_for_completion_timeout() => timeout
> spin_lock_irq()
> geni_se_abort_m_cmd()
> spin_unlock_irq()
> wait_for_completion_timeout() => timeout
> return IRQ_HANDLED;
> gic_handle_irq()
> geni_spi_isr()
> spin_lock()
> if (m_irq & M_RX_FIFO_WATERMARK_EN)
> geni_spi_handle_rx()
> rx_buf = mas->cur_xfer->rx_buf <--- OOPS!
>
> > With my proposed fix, I believe that would transform into:
> >
> > CPU0 CPU1
> > ---- ----
> > setup_fifo_xfer()
> > ...
> > geni_se_setup_m_cmd()
> > <hardware starts transfer>
> > <unrelated interrupt storm> spin_unlock_irq()
> > <continued interrupt storm> <time passes>
> > <continued interrupt storm> <transfer complets in hardware>
> > <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN>
> > <continued interrupt storm> <time passes>
> > <continued interrupt storm> handle_fifo_timeout()
> > <continued interrupt storm> synchronize_irq()
> > <continued interrupt storm> <time passes>
> > <interrupt storm ends>
> > geni_spi_isr()
> > ...
> > <synchronize_irq() finishes>
> > spin_lock_irq()
> > mas->cur_xfer = NULL
> > geni_se_cancel_m_cmd()
> > spin_unlock_irq()
> > geni_spi_isr()
> > ...
> > wait_for_completion_timeout() => success
> >
> > The extra synchronize_irq() I was suggesting at the end of the
> > function would be an extra bit of paranoia. Maybe a new storm showed
> > up while we were processing the timeout?
>
> Shouldn't we check in the timeout logic to see if m_irq has
> M_RX_FIFO_WATERMARK_EN or M_TX_FIFO_WATERMARK_EN set instead? Similarly
> for the CS assert/deassert stuff. If the timeout hits but one of those
> bits are set then it seems we've run into some poor irqsoff section but
> the hardware is still working. Calling synchronize_irq() wouldn't help
> if the CPU handling the irqs (i.e. CPU0) had irqs off for a long time,
> right? It will only ensure that other irq handlers have completed, which
> may be a problem, but not the only one.
>
> TL;DR: Peek at the irq status register in the timeout logic and skip it
> if the irq is pending?

I don't have tons of experience with synchronize_irq(), but the
function comment seems to indicate that as long as the interrupt is
pending synchronize_irq() will do what we want even if the CPU that
should handle the interrupt is in an irqsoff section. Digging a
little bit I guess it relies upon the interrupt controller being able
to read this state, but (hopefully) the GIC can?

If it doesn't work like I think it does, I'd be OK with peeking in the
IRQ status register, but we shouldn't _skip_ the logic IMO. As long
as we believe that an interrupt could happen in the future we
shouldn't return from handle_fifo_timeout(). It's impossible to
reason about how future transfers would work if the pending interrupt
from the previous transfer could fire at any point.

-Doug