RE: [PATCH 1/2] perf/arm-cmn: Cope with spurious IRQs better

From: Geoff Blake
Date: Mon Dec 05 2022 - 10:38:56 EST


> > > > From my perspective, this is a worse solution as now we're sweeping an
> > > > issue under the rug and consuming CPU cycles handling IRQs we should not
> > > > be getting in the first place. While an overflow IRQ from the cmn
> > > > should
> > > > not be high frequency, there is a non-zero chance in the future it could
> > > > be and this could lead to a very hard to debug performance issue instead
> > > > of the current problem, which is discovering we need to clean up better
> > > > from a noisy kernel message.
> > >
> > > Kexec is not the only possible source of spurious IRQs. If they cause a
> > > problem for this driver, that cannot be robustly addressed by trying to
> > > rely on whatever software might happen to run before this driver.
> >
> > Sure, I can agree with the assertion a spurious IRQ could come from
> > anywhere, in that case though, shouldn't the behavior still be to log
> > spurious IRQs as a warning instead of silently sinking them?
>
> We still have to handle the interrupt anyway to avoid it getting
> disabled behind our back, and beyond that it's not really something
> that's actionable by the user. What would we say?
>
> dev_warn(dev, "Something harmless, and in some cases expected,
> happened! If you've just rebooted after a kernel panic, maybe try having
> the kernel not panic?");
>
> Perhaps that should be a core IRQ helper so that many other drivers can
> also call it too?
>
> Furthermore if you're worried about performance implications from a
> theoretical interrupt storm, I can tell you from experience that logging
> to a serial console from a high-frequency interrupt handler is one of
> the best ways to cripple a system to the point where reaching for the
> power switch is the only option.

Logging unexpected events is necessary to give clues of what is going
wrong before they implode on fully remote machines. If you prefer to
handle the IRQ here rather than in the bad_irq section, then can we at
least have a WARN_ON() in the case where a spurious IRQ happens but no
overflow bit is set.

> The DTC_CTL documentation seems fairly unambiguous:
>
> [0] dt_en Enables debug, trace, and PMU features
>
> The design intent is that the PMU counters do not count when the entire
> PMU feature is disabled. I'm pretty sure I did confirm that empirically
> during development too (I recall the sheer number of different "enable"
> bits baffled me at the beginning, and there was actually one that did
> nothing, which I think did eventually get removed from the documentation).
>
> Of course clearing PMCR_PMU_EN is sufficient to simply stop counting,
> which we also depend on for correct operation, but I believe clearing
> DT_EN allows it to put all of the DT logic into a quiescent state.

I took the other patch that writes 0 to DTC_CTL.dt_en only and put it in a
loop of kexec'ing when the PMU is active for a few hours, I did not see
anymore spurious IRQs (whereas with the stock driver I could reproduce in under 10 tries).
You are correct Robin, that is all that is needed, and my code was overly
cautious.

- Geoff