Re: [PATCH 1/2] arm64: smp: Fix pseudo NMI issues w/ broken Mediatek FW

From: Doug Anderson
Date: Thu Oct 05 2023 - 12:16:24 EST


Hi,

On Thu, Oct 5, 2023 at 3:27 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> On Wed, Oct 04, 2023 at 07:04:12AM -0700, Doug Anderson wrote:
> > On Wed, Oct 4, 2023 at 3:15 AM Marc Zyngier <maz@xxxxxxxxxx> wrote:
> > >
> > > On Wed, 04 Oct 2023 10:59:50 +0100,
> > > Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > > >
> > > > Given you haven't seen any issues, I suspect those are getting reset to fixed
> > > > values that happens to work out for us, but it is a bit worrisome more
> > > > generally (e.g. the LPI case above).
> > >
> > > It is likely that these SoCs don't even have an ITS.
> >
> > Right. That was what we decided [1] when Marc pointed this out earlier.
> >
> > Overall: we know that this firmware behavior is not good but we're
> > stuck with it. :( At the very least, any new devices coming out will
> > have this fixed. Presumably if old devices are working OK enough today
> > (as long as you don't enable pseudo-NMI) then they can be made to keep
> > working?
> >
> > So circling back: what patch should we actually land?
>
> For now I'd prefer we took the patch I sent in:
>
> https://lore.kernel.org/linux-arm-kernel/ZRr8r7XMoyDKaitd@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> ... as that leaves us no worse than before this series, and it's pretty simple.

Sounds good to me!

Catalin / Will: Please yell if there's anything you need me to do.
Otherwise I'll assume you'll pick up Mark's patch instead of my patch
#1 and then you'll pick up my patch #2.


> > As of right now only pseudo-NMI is broken, but it would be good to make sure
> > that if the kernel later adds other features that would be broken on this
> > hardware that it gets handled properly...
>
> Going further than the above, I think there are three options here:
>
> 1) Complete fix: depend on a working firmware, and throw this workaround away.
>
> IIUC from the above, that's not something you can commit to.

Right. We've landed the fix in the firmware branch for many of the
devices that had the issue but there's a whole process (and cost
involved) in getting this actually rolled out. Each unique board needs
to kick off a FW qual. Given that nothing is actually broken today
it's hard to justify a FW qual just for this, so we're left
piggybacking on the next reason for a FW qual (if there is one).
...even if we could kick them off all instantly, though, it's always
best not to rely on a FW fix, especially if (as in this case) it's to
keep us from crashing. There'll always be some case where someone
tries to boot a new OS with an old firmware. One such case can happen
when booting recovery images on ChromeOS where the device always boots
from the Read-only (not updatable) firmware.


> 2) Partial fix: have the kernel save/restore everything.
>
> IIUC this is unpalatable.

Yeah, Marc already NAKed it.


> 3) Partial fix: make the ARM64_HAS_GIC_PRIO_MASKING cpucap depend on the
> absence of a "mediatek,broken-save-restore-fw" property in the DT. I believe
> we can check that in early_enable_pseudo_nmi() or can_use_gic_priorities().
>
> That'll avoid potential issues if/when we change the priorities used for
> pNMI (which is something I've been looking at).
>
> I'm happy with (3) if Marc is.

Yeah, that seems best to me as a long term solution, too.

-Doug